com.bigdata.btree
Class BytesUtil

java.lang.Object
  extended by com.bigdata.btree.BytesUtil

public class BytesUtil
extends Object

Class supporting operations on variable length byte[] keys.

Comparison operations that accept a starting offset are used when the byte[]s are known to share a leading prefix that may be skipped during comparison.

Comparison operations that accept a starting offset and length are used when immutable keys are stored in a single byte[] and an index into starting positions in that byte[] is maintained.

JNI methods are provided for unsigned byte[] comparison. However, note that the JNI methods do not appear to be as fast as the pure Java methods - presumably because of the overhead of going from Java to C. In order to execute using the JNI methods you MUST define the optional boolean system property, e.g.,

 java -Dcom.bigdata.btree.BytesUtil.jni=true ...
 
See BytesUtil.c in this package for instructions on compiling the JNI methods.

See main(String[]) which provides a test for the JNI integration and some pointers on how to get this running on your platform.

Version:
$Id: BytesUtil.java 5875 2012-01-25 15:20:16Z thompsonbry $
Author:
Bryan Thompson

Nested Class Summary
static class BytesUtil.UnsignedByteArrayComparator
          Compares two unsigned byte[]s.
 
Field Summary
static byte[] EMPTY
          An empty byte[].
static byte[][] EMPTY2
          An empty byte[][].
static int minlen
          JNI routines are not invoked unless we will compare byte[]s with at least this many potential bytes to compare (the actual# may be much less of course since comparisons may fail fast).
 
Constructor Summary
BytesUtil()
           
 
Method Summary
static int altGetBits32(byte[] a, int off, int len)
           
static long altGetBits64(byte[] a, int off, int len)
           
static int binarySearch(byte[][] keys, int base, int nmem, byte[] key)
          Binary search on an array whose members are variable length unsigned byte[]s.
static int bitFlagByteLength(int nbits)
          Return the #of bytes required to bit code the specified #of bits.
static int byteIndexForBit(long bitIndex)
          Return the index of the byte in which the bit with the given index is encoded.
static boolean bytesEqual(byte[] a, byte[] b)
          True iff the two arrays compare as equal.
static int compareBytes(byte[] a, byte[] b)
          Byte-wise comparison of byte[]s (the arrays are treated as arrays of unsigned bytes).
static int compareBytesWithLenAndOffset(int aoff, int alen, byte[] a, int boff, int blen, byte[] b)
          Byte-wise comparison of byte[]s (the arrays are treated as arrays of unsigned bytes).
static boolean getBit(byte[] buf, long bitIndex)
          Get the value of a bit.
static int getBits(byte[] a, int off, int len)
          Return the n-bit integer corresponding to the inclusive bit range of the byte[].
static int getBits(int a, int off, int len)
          Return the n-bit integer corresponding to the inclusive bit range of the byte[].
static long getBits64(byte[] a, int off, int len)
           
static long getByteCount(String s)
          Decode a string of the form [0-9]+(k|kb|m|mb|g|gb)?, returning the number of bytes.
static byte[] getPrefix(byte[] a, byte[] b)
          Return a new byte[] containing the leading bytes in common between two byte[]s.
static int getPrefixLength(byte[] a, byte[] b)
          Return the #of leading bytes in common.
static byte[] getSeparatorKey(byte[] givenKey, byte[] priorKey)
           The keys in the nodes of a btree are known as separator keys.
static boolean loadJNILibrary()
          Attempt to load the JNI library.
static void main(String[] args)
          This method forces the load of the JNI library and tries to execute the JNI methods.
static int maskOffLSB(int h, int nbits)
          Mask off all but the LSB nbits of the hash value.
static int maskOffMSB(int h, int nbits)
          Mask off all but the MSB nbits of the hash value and shift them down such that the masked bits appear at bits (nbits:0] of the returned value.
static int optGetBits(byte[] a, int off, int len)
          Some benchmarks seem to indicate that altGetBits32 is faster than getBits for smaller byte counts.
static boolean rangeCheck(byte[] key, byte[] fromKey, byte[] toKey)
          Return true if the key lies inside of the optional half-open range constraint.
static boolean setBit(byte[] buf, long bitIndex, boolean value)
          Set the value of a bit - this is NOT thread-safe (contention for the byte in the backing buffer can cause lost updates).
static byte[] successor(byte[] key)
          Computes the successor of a variable length byte array by appending a unsigned zero(0) byte to the end of the array.
static byte[] toArray(ByteBuffer b)
          Return a byte[] having the data in the ByteBuffer from the Buffer.position() to the Buffer.limit().
static byte[] toArray(ByteBuffer b, boolean forceCopy)
          Return a byte[] having the data in the ByteBuffer from the Buffer.position() to the Buffer.limit().
static String toBitString(byte[] b)
          Return the binary representation of the unsigned byte[].
static String toString(byte[] key)
          Formats a key as a series of comma delimited unsigned bytes.
static String toString(byte[][] data)
          Formats the data into a String.
static String toString(byte[] key, int off, int len)
          Formats a key as a series of comma delimited unsigned bytes.
static int withinByteIndexForBit(long bitIndex)
          Return the offset within the byte in which the bit is coded of the bit (this is just the remainder bitIndex % 8).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

EMPTY

public static final byte[] EMPTY
An empty byte[].


EMPTY2

public static final byte[][] EMPTY2
An empty byte[][].


minlen

public static final int minlen
JNI routines are not invoked unless we will compare byte[]s with at least this many potential bytes to compare (the actual# may be much less of course since comparisons may fail fast).

See Also:
Constant Field Values
Constructor Detail

BytesUtil

public BytesUtil()
Method Detail

loadJNILibrary

public static boolean loadJNILibrary()
Attempt to load the JNI library.

Note: this is done automatically if the optional boolean system property com.bigdata.btree.BytesUtil.jni=true is specified, e.g., using

    java -Dcom.bigdata.btree.BytesUtil.jni=true ...
 

Returns:
True iff the JNI library was successfully linked.

bytesEqual

public static final boolean bytesEqual(byte[] a,
                                       byte[] b)
True iff the two arrays compare as equal. This is somewhat optimized in that it tests the array lengths first, assumes that it is being used on sorted data and therefore compares the last bytes first, and does not convert the bytes to unsigned integers before testing for equality.

Parameters:
a - A byte[].
b - Another byte[].
Returns:
If the two arrays have the same reference (including null) or if they have the same data.

compareBytes

public static final int compareBytes(byte[] a,
                                     byte[] b)
Byte-wise comparison of byte[]s (the arrays are treated as arrays of unsigned bytes).

Parameters:
a - A byte[].
b - A byte[].
Returns:
a negative integer, zero, or a positive integer if the first argument is less than, equal to, or greater than the second.
TODO:
Return the index of the byte at which the difference with the sign adjusted to indicate the relative order of the data rather than the difference of the bytes at that index. The index would be negative or positive depending on which way the comparison went. See CustomByteArrayFrontCodedList for an implementation guideline.

Change all implementations in this class and also BytesUtil.c, which needs to be recompiled for Windows. Also makes sure that it gets compiled and linked for Un*x. That should be tested from the ant installer and the result reported. Do the same for ICU4JNI.


compareBytesWithLenAndOffset

public static final int compareBytesWithLenAndOffset(int aoff,
                                                     int alen,
                                                     byte[] a,
                                                     int boff,
                                                     int blen,
                                                     byte[] b)
Byte-wise comparison of byte[]s (the arrays are treated as arrays of unsigned bytes).

Parameters:
aoff - The offset into a at which the comparison will begin.
alen - The #of bytes in a to consider starting at aoff.
a - A byte[].
boff - The offset into b at which the comparison will begin.
blen - The #of bytes in b to consider starting at boff.
b - A byte[].
Returns:
a negative integer, zero, or a positive integer as the first argument is less than, equal to, or greater than the second.

getPrefixLength

public static final int getPrefixLength(byte[] a,
                                        byte[] b)
Return the #of leading bytes in common. This is used to compute the prefix for a node or leaf, which is formed by the leading bytes in common between the first and last key for a node or leaf.

Parameters:
a - A variable length unsigned byte array.
b - A variable length unsigned byte array.
Returns:
The #of leading bytes in common (aka the index of the first byte in which the two arrays differ, although that index could lie beyond the end of one of the arrays).

getPrefix

public static final byte[] getPrefix(byte[] a,
                                     byte[] b)
Return a new byte[] containing the leading bytes in common between two byte[]s. This is often used to compute the minimum length separator key.

Parameters:
a - A variable length unsigned byte array[].
b - A variable length unsigned byte array[].
Returns:
A new byte[] containing the leading bytes in common between the two arrays.

successor

public static final byte[] successor(byte[] key)
Computes the successor of a variable length byte array by appending a unsigned zero(0) byte to the end of the array.

Parameters:
key - A variable length unsigned byte array.
Returns:
A new unsigned byte[] that is the successor of the key.

getSeparatorKey

public static final byte[] getSeparatorKey(byte[] givenKey,
                                           byte[] priorKey)

The keys in the nodes of a btree are known as separator keys. The role of the separator keys is to direct search towards the leaf in which a key exists or would exist by always searching the first child having a separator key that is greater than or equal to the search key.

Separator keys separate leaves and must be choosen with that purpose in mind. The simplest way to choose the separator key is to just take the first key of the leaf - this is always correct. However, shorter separator keys may be choosen by defining the separator key as the shortest key that is less than or equal to the first key of a leaf and greater than the last key of the left sibling of that leaf (that is, the key for the entry that immediately proceeds the first entry on the leaf).

There are several advantages to always choosing the shortest separator key. The original rationale (in "Prefix B-Trees" by Bayer and Unterauer) was to increase the branching factors for fixed size pages. Since we use variable size serialized record, that is not an issue. However, using the shortest separator keys in this implementation provides both smaller serialized records for nodes and faster search since fewer bytes must be tested.

Note that this trick can not be used at higher levels in the btree - separator keys are always formed based on the keys in the leaves and then propagated through the tree.

The rules are simple enough:

  1. The separator contains all bytes in the shared prefix (if any) plus the 1st byte at which the given key differs from the prior key.
  2. If the separator key would equal the given key by value then return the reference to the given key.

Parameters:
givenKey - A key.
priorKey - Another key that proceeds the givenKey.
Returns:
The shortest key that is less than or equal to givenKey and greater than priorKey. This will be a reference to the givenKey iff that is also the shortest separator.
Throws:
IllegalArgumentException - if either key is null.
IllegalArgumentException - if both keys are the same reference.
See Also:
http://portal.acm.org/citation.cfm?doid=320521.320530

toString

public static final String toString(byte[] key)
Formats a key as a series of comma delimited unsigned bytes.

Parameters:
key - The key.
Returns:
The string representation of the array as unsigned bytes.

toString

public static final String toString(byte[] key,
                                    int off,
                                    int len)
Formats a key as a series of comma delimited unsigned bytes.

Parameters:
key - The key.
off - The index of the first byte that will be visited.
len - The #of bytes to visit.
Returns:
The string representation of the array as unsigned bytes.

toString

public static String toString(byte[][] data)
Formats the data into a String.

Parameters:
data - An array of unsigned byte arrays.

binarySearch

public static final int binarySearch(byte[][] keys,
                                     int base,
                                     int nmem,
                                     byte[] key)
Binary search on an array whose members are variable length unsigned byte[]s.

Parameters:
keys - The buffer.
base - The offset of the base of the array within the buffer.
nmem - The #of members in the array. When [nmem == 0], the array is empty.
key - The key for the search.
Returns:
index of the search key, if it is contained in keys; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the array of keys. Note that this guarantees that the return value will be >= 0 if and only if the key is found.

rangeCheck

public static final boolean rangeCheck(byte[] key,
                                       byte[] fromKey,
                                       byte[] toKey)
Return true if the key lies inside of the optional half-open range constraint.

Parameters:
key - The key.
fromKey - The inclusive lower bound -or- null if there is no lower bound.
toKey - The exclusive upper bound -or- null if there is no upper bound.
Returns:
true unless the key is LT [fromKey] or GTE [toKey].

main

public static void main(String[] args)
This method forces the load of the JNI library and tries to execute the JNI methods.

In order to use the JNI library under Windows, you must specify the JNI library location using the PATH environment variable, e.g.,

   cd bigdata
   set PATH=%PATH%;lib
   java -cp bin com.bigdata.btree.BytesUtil
 

In order to use the JNI library under un*x, you must specify the JNI library location

    java -Djava.library.path=lib com.bigdata.btree.BytesUtil
 

Parameters:
args -
Throws:
UnsatisfiedLinkError - if the JNI methods can not be resolved.
AssertionError - if the JNI methods do not produce the expected answers.

bitFlagByteLength

public static final int bitFlagByteLength(int nbits)
Return the #of bytes required to bit code the specified #of bits.

Parameters:
nbits - The #of bit flags.
Returns:
The #of bytes required. This will be zero iff nbits is zero.

byteIndexForBit

public static final int byteIndexForBit(long bitIndex)
Return the index of the byte in which the bit with the given index is encoded.

Parameters:
bitIndex - The bit index.
Returns:
The byte index.

withinByteIndexForBit

public static final int withinByteIndexForBit(long bitIndex)
Return the offset within the byte in which the bit is coded of the bit (this is just the remainder bitIndex % 8).

Note, the computation of the bit offset is intentionally aligned with OutputBitStream and InputBitStream.

Parameters:
bitIndex - The bit index into the byte[].
Returns:
The offset of the bit in the appropriate byte.

getBit

public static final boolean getBit(byte[] buf,
                                   long bitIndex)
Get the value of a bit.

Note, the computation of the bit offset is intentionally aligned with OutputBitStream and InputBitStream.

Parameters:
bitIndex - The index of the bit.
Returns:
The value of the bit.

setBit

public static final boolean setBit(byte[] buf,
                                   long bitIndex,
                                   boolean value)
Set the value of a bit - this is NOT thread-safe (contention for the byte in the backing buffer can cause lost updates).

Note, the computation of the bit offset is intentionally aligned with OutputBitStream and InputBitStream.

Parameters:
bitIndex - The index of the bit.
Returns:
The old value of the bit.

maskOffMSB

public static int maskOffMSB(int h,
                             int nbits)
Mask off all but the MSB nbits of the hash value and shift them down such that the masked bits appear at bits (nbits:0] of the returned value. This is used to index into a dictionary page based on the revealed bits.

Parameters:
h - The hash value.
nbits - The #of bits already accounted for by the path from the root.
Returns:
The hash value considering only the MSB nbits and shifted down into an nbits integer.

maskOffLSB

public static int maskOffLSB(int h,
                             int nbits)
Mask off all but the LSB nbits of the hash value.

Parameters:
h - The hash value.
nbits - The #of LSB bits to be retained.
Returns:
The LSB nbits. TODO unit test.

getBits

public static int getBits(byte[] a,
                          int off,
                          int len)
Return the n-bit integer corresponding to the inclusive bit range of the byte[]. Bit ZERO (0) is the Most Significant Bit (MSB). Bit positions increase from zero up to a.length * 8 - 1. The return value is an int32 and the bit range must not be greater than 32 bits.

For example, given the following data and the bit range (0,2)

 bit index: 01234567 
 ---------+---------- 
 bit value: 10110000
 
TWO (2) bits starting at bit offset ZERO (0) would be extracted and returned as a 2-bit integer. For those data, the return value would be an int32 value whose binary representation was 10 (with leading zeros suppressed).

Note: This method is design for use with the unsigned byte[] keys in a bigdata hash index. All keys in bigdata are internally represented as unsigned byte[]s, which is why this method accepts a byte[] rather than an long[] for the bits. Also, while the length of an unsigned byte[] key can vary, they are never huge and an int32 value is sufficient to index into the bits in the byte[]. Finally, the return value is an int because it will be used in hash table designs to index into a hash table based on those bits in a hash code key which are masked as relevant to that hash table. 32bits is far more than we will need to index into a hash table. For an 8k page, we might expect a fan out of at most 1024 which is only 10 bits.

Parameters:
a - A byte[].
off - The index of the first bit to be included.
len - The number of bits to be returned in [0:32]. However, a bit length of zero will always return zero.
Returns:
The integer extracted from the specified bit range.

altGetBits64

public static long altGetBits64(byte[] a,
                                int off,
                                int len)

altGetBits32

public static int altGetBits32(byte[] a,
                               int off,
                               int len)

getBits64

public static long getBits64(byte[] a,
                             int off,
                             int len)

optGetBits

public static int optGetBits(byte[] a,
                             int off,
                             int len)
Some benchmarks seem to indicate that altGetBits32 is faster than getBits for smaller byte counts. OTOH the cost of the redirection may outweigh any benefit.


getBits

public static int getBits(int a,
                          int off,
                          int len)
Return the n-bit integer corresponding to the inclusive bit range of the byte[]. Bit ZERO (0) is the Most Significant Bit (MSB). Bit positions increase from zero up to 31. The return value is an int32 and the bit range must not be greater than 32 bits.

For example, given the following data and the bit range (0,2)

 bit index: 01234567 
 ---------+---------- 
 bit value: 10110000
 
TWO (2) bits starting at bit offset ZERO (0) would be extracted and returned as a 2-bit integer. For those data, the return value would be an int32 value whose binary representation was 10 (with leading zeros suppressed).

Note: This method is design for use in a bigdata hash index having native int32 keys rather than unsigned byte[] keys.

Parameters:
a - An integer.
off - The index of the first bit to be included.
len - The number of bits to be returned in [0:32]. However, a bit length of zero will always return zero.
Returns:
The integer extracted from the specified bit range.

toBitString

public static String toBitString(byte[] b)
Return the binary representation of the unsigned byte[].

Parameters:
b - The unsigned byte[].
Returns:
The representation of the bits in that unsigned byte[].
Throws:
IllegalArgumentException - if the argument is null.

getByteCount

public static long getByteCount(String s)
Decode a string of the form [0-9]+(k|kb|m|mb|g|gb)?, returning the number of bytes. When a suffix indicates kilobytes, megabytes, or gigabytes then the returned value is scaled accordingly. The suffix is NOT case sensitive.

Parameters:
s - The string value.
Returns:
The byte count.
Throws:
IllegalArgumentException - if there is a problem with the argument (null, ill-formed, etc).

toArray

public static byte[] toArray(ByteBuffer b)
Return a byte[] having the data in the ByteBuffer from the Buffer.position() to the Buffer.limit(). The position, limit, and mark are not affected by this operation. When the ByteBuffer has a backing array, the array offset is ZERO (0), and the Buffer.limit() is equal to the Buffer.capacity() then the backing array is returned. Otherwise, a new byte[] is allocated and the data are copied into that byte[], which is then returned.

Parameters:
b - The ByteBuffer.
Returns:
The byte[].

toArray

public static byte[] toArray(ByteBuffer b,
                             boolean forceCopy)
Return a byte[] having the data in the ByteBuffer from the Buffer.position() to the Buffer.limit(). The position, limit, and mark are not affected by this operation.

Under certain circumstances it is possible and may be desirable to return the backing ByteBuffer.array(). This behavior is enabled by forceCopy := false.

It is possible to return the backing byte[] when the ByteBuffer has a backing array, the array offset is ZERO (0), and the Buffer.limit() is equal to the Buffer.capacity() then the backing array is returned. Otherwise, a new byte[] must be allocated, and the data are copied into that byte[], which may then be returned.

Parameters:
b - The ByteBuffer.
forceCopy - When false, the backing array will be returned if possible.
Returns:
The byte[].


Copyright © 2006-2012 SYSTAP, LLC. All Rights Reserved.