com.bigdata.btree.raba.codec
Class CanonicalHuffmanRabaCoder

java.lang.Object
  extended by com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder
All Implemented Interfaces:
IRabaCoder, Externalizable, Serializable

public class CanonicalHuffmanRabaCoder
extends Object
implements IRabaCoder, Externalizable

This class provides (de-)compression for logical byte[][]s based on canonical Huffman codes. Canonical huffman codes preserve the alphabetic order of the original values. However, they are often used because it is possible to transmit the dictionary using fewer bits by sending only the bitLength[] for the code words. The CanonicalHuffmanRabaCoder can be used for keys or values and supports efficient search of the coded keys.

Record format

The encoded data are represented in the buffer as follows:
 version:byte     The version identifier for this record format (8 bits,
                  which allows for 256 format revisions).
 
 isKeys:1         Bit flag indicates whether the record is coding B+Tree keys
                  or B+Tree values.  When coding values only the non-zero
                  byte frequency counts are used to compute the dictionary
                  and nulls are permitted.  When coding keys, the dictionary
                  includes codes for all 256 possible byte values so we may
                  code search keys and nulls are not permitted.
 
 isSymbolTable:1
                  A bit flag whose value is 1 iff the symbols are given as
                  a packed symbol[] and 0 if all possible byte values were
                  coded.  Note that the implementation MAY choose to code
                  byte values with a zero frequency if MOST byte values are
                  used since that does not degrade the code by much and it
                  allows us to eliminate a ˜256 bytes from the data record.
 
 isOffsetArray:1  A bit flag whose value is 1 iff the codedValueOffset[] is
                  stored.  At this time, this array is always stored.  Either
                  the offset[] or the #of symbols in each coded value MUST be
                  stored in order for us to compute the end of each sequence
                  of codeWord[]s for a given coded value. 
 
 isByteAlignedOffset:1
                  A bit flag whose value is 1 iff the codedValueOffset[] is
                  byte aligned.  When true, the start of the array will fall
                  on a byte boundary and each array element will be a multiple
                  of 8 bits.
                  
 unused:4         unused bit flags.
 
 size:int31       The #of elements in the logical byte[][].
 
 nsymbols:uint9   There are at most 256 distinct symbols, which are the
                  distinct possible byte values (9 bits, which allows
                  for an empty leaf or node with no byte values used as
                  well as a leaf or node with all 256 byte values used).
 
 -- note: at this point the record is still byte aligned --
 
 symbol2byte:byte[]
                  The symbol2byte array.  The index into the array is the
                  symbol, which is correlated with the frequency[] used to
                  build the code.  The value is the byte corresponding to that
                  symbol.  There are nsymbols entries in the array.  The array
                  is present IFF isSymbolTable is true.
                  
                  O_symbols := 48 bits.
 
 -- note: at this point the record is still byte aligned --
 
 codedValueOffsetBits:uint8
                  The width in bits of the integers used to code the
                  codedValueOffset[].
 
                  O_codedValueOffsetBits := [ibs.readBits];
 
 sumCodedValueBitLengths:uint32
                  The sum of the bit lengths of the coded values.  This is
                  a 32bit unsigned integer, which is sufficient to code up
                  bit lengths of up to 512MB.  This field IS NOT present 
                  if the codeOffsetBits is ZERO (0) since the field is only
                  used to compute the bit offset of the codeOffset[].
 
                  O_sumCodedValueBitLengths := O_codedValueOffsetBits + 8;
 
 nulls[]          A vector of [nvalues] bit flags IFF isKeys==false.  A flag
                  is a ONE (1) iff the corresponding byte[] in the logical
                  byte[][] was a null.  A null is coded as a sequence of ZERO
                  (0) code words.  However, empty byte[]s are also permitted 
                  for B+Tree values (but not for B+Tree keys).  Therefore you
                  MUST test the nulls[] to distinguish between a null byte[]
                  and an empty byte[].  This field IS NOT present when keys
                  are coded.
 
                  O_nulls :=  O_codedValueOffsetBits + 8 + (codedValueOffsetBits==0?0:32);
 
 -- note: if nsymbols==0 then this is the end of the record --
 
 decoderInputs:=
 length[],        The bit length of each code word in the assigned order.
 symbol[],        The correlated symbol indices.
 codeWord[0].     The shortest code word.
 
                  These length[] and the symbol[] data are written out together
                  using a compact representation.  The length of the shortest
                  code word length is given by length[0].  Since we are using
                  a canonical huffman code, all we need is the shortest code
                  and the length[] to regenerate the entire code.
                                     
                  O_decoderInputs := O_symbols + (isSymbolTable?0:nsymbols*8)
                  
 codedValue:bit[] The coded values given as a sequence of code words. The bit
                  length of this field is given by sumCodedValueBitLengths.
 
                  O_codedValue[] := O_nulls + size IFF values are coded
                  
                  -or-
                  
                  O_codedValue[] := O_nulls IFF keys are coded.
 
 codeValueOffset[]
                  The bit offset to the start of each coded value in the
                  codedValue[].  The offsets are relative to the start of
                  the first coded value.
                  
                  While the delta in the offsets could be represented more
                  efficiently, the offsets are represented directly so that
                  we may avoid reading the entire codeOffset[] into memory.
                  This array is present iff isOffsetArray is true.
 
                  O_codedValueOffset[] := O_codedValue[] + sumCodedValueBitLengths
 

Version:
$Id: CanonicalHuffmanRabaCoder.java 2547 2010-03-24 20:44:07Z thompsonbry $
Author:
Bryan Thompson
See Also:
HuffmanCodec, http://en.wikipedia.org/wiki/Huffman_coding, http://en.wikipedia.org/wiki/Canonical_Huffman_code, Serialized Form
TODO:
it could be useful to byte align the codedValueOffset[] IF we also coded the offsets using nbits, where nbits was a multiple of 8. FIXME Generalize to allow alphabets of size GT 256 so we can reuse this to directly encode the long[] keys of the leaves in the RDF DB. This will still be an IRaba (a random access byte[]). The difference is that each symbol will correspond to a sequence of 8 bytes. So what we need is a means to decouple the generation of the frequency[], the assumption that alphabets of 256 are "complete", and the application of the codec and the decoder, which must process 8 bytes at a time.

Nested Class Summary
protected static class CanonicalHuffmanRabaCoder.AbstractCodingSetup
          Abstract base class for preparing a logical byte[][] for coding.
static class CanonicalHuffmanRabaCoder.CodedRabaImpl
          Decoder.
protected static class CanonicalHuffmanRabaCoder.RabaCodingSetup
          Sets up for coding an IRaba representing B+Tree values.
 
Field Summary
static CanonicalHuffmanRabaCoder INSTANCE
           
protected static org.apache.log4j.Logger log
           
protected static byte VERSION0
          The original serialization version for the coded data record.
 
Constructor Summary
CanonicalHuffmanRabaCoder()
           
 
Method Summary
 ICodedRaba decode(AbstractFixedByteArrayBuffer data)
          Return an IRaba which can access the coded data.
 AbstractFixedByteArrayBuffer encode(IRaba raba, DataOutputBuffer buf)
          Encode the data.
 ICodedRaba encodeLive(IRaba raba, DataOutputBuffer buf)
          Encode the data, returning an ICodedRaba.
protected  long getSumCodedValueBitLengths(it.unimi.dsi.bits.BitVector[] codeWords, IRaba raba, com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol)
          Deprecated. Leave this field and the #of bits per codedValueOffset[] element blank until we have written out the coded values and then rewind the OBS and fill in those fields. Otherwise we are encoding the same byte[][] data twice, which is wasted effort.
 boolean isKeyCoder()
          Return true if this implementation can code B+Tree keys (supports search on the coded representation).
 boolean isValueCoder()
          Return true if this implementation can code B+Tree values (allows nulls).
protected static it.unimi.dsi.compression.HuffmanCodec.DecoderInputs readDecoderInputs(int nsymbols, it.unimi.dsi.io.InputBitStream ibs, StringBuilder sb)
          Reconstruct the HuffmanCodec.DecoderInputs from the data written by #writeDecoderInputs(BitVector[], OutputBitStream).
 void readExternal(ObjectInput in)
           
protected  long writeCodedValues(it.unimi.dsi.compression.PrefixCoder coder, IRaba raba, com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol, long[] codedValueOffset, it.unimi.dsi.io.OutputBitStream obs)
          Write out the coded values.
protected static void writeDecoderInputs(it.unimi.dsi.compression.HuffmanCodec.DecoderInputs decoderInputs, it.unimi.dsi.io.OutputBitStream obs, StringBuilder sb)
          Write a compact minimum representation of the data required to reconstruct the decoder (bit lengths and correlated symbols).
 void writeExternal(ObjectOutput out)
           
protected  void writeSymbolTable(com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Symbol2Byte symbol2byte, it.unimi.dsi.io.OutputBitStream obs)
          Write out the optional packed symbol table (symbol2byte).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log

protected static final org.apache.log4j.Logger log

VERSION0

protected static final transient byte VERSION0
The original serialization version for the coded data record.

See Also:
Constant Field Values

INSTANCE

public static final transient CanonicalHuffmanRabaCoder INSTANCE
Constructor Detail

CanonicalHuffmanRabaCoder

public CanonicalHuffmanRabaCoder()
Method Detail

readExternal

public void readExternal(ObjectInput in)
                  throws IOException,
                         ClassNotFoundException
Specified by:
readExternal in interface Externalizable
Throws:
IOException
ClassNotFoundException

writeExternal

public void writeExternal(ObjectOutput out)
                   throws IOException
Specified by:
writeExternal in interface Externalizable
Throws:
IOException

isKeyCoder

public final boolean isKeyCoder()
Description copied from interface: IRabaCoder
Return true if this implementation can code B+Tree keys (supports search on the coded representation). Note that some implementations can code either keys or values.

Specified by:
isKeyCoder in interface IRabaCoder

isValueCoder

public final boolean isValueCoder()
Description copied from interface: IRabaCoder
Return true if this implementation can code B+Tree values (allows nulls). Note that some implementations can code either keys or values.

Specified by:
isValueCoder in interface IRabaCoder

writeSymbolTable

protected void writeSymbolTable(com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Symbol2Byte symbol2byte,
                                it.unimi.dsi.io.OutputBitStream obs)
                         throws IOException
Write out the optional packed symbol table (symbol2byte). When present, the symbol table is written as a sequence of the in use byte values in the unsigned byte order (this is the order in which the frequency[] was specified).

Parameters:
symbol2byte - The symbol table.
obs - The symbol table is written on this bit stream.
Throws:
IOException

writeDecoderInputs

protected static void writeDecoderInputs(it.unimi.dsi.compression.HuffmanCodec.DecoderInputs decoderInputs,
                                         it.unimi.dsi.io.OutputBitStream obs,
                                         StringBuilder sb)
                                  throws IOException
Write a compact minimum representation of the data required to reconstruct the decoder (bit lengths and correlated symbols). The data are written out as follows:
 min
 max
 count, symbol(s)
 
where min is the bit length of the shortest code word; max is the bit length of the longest code word; count is the #of code words of a given length; and symbol(s) are the symbols associated with each code word. All values are nibble coded and will generally fit in 1-2 bytes each.

Parameters:
decoderInputs - This contains both the bit lengths of the canonical huffman code and the symbols assigned to each code word.
obs - The output bit stream.
sb - Debugging information is added to this buffer (optional).
Throws:
IOException
See Also:
DecoderInputs}

readDecoderInputs

protected static it.unimi.dsi.compression.HuffmanCodec.DecoderInputs readDecoderInputs(int nsymbols,
                                                                                       it.unimi.dsi.io.InputBitStream ibs,
                                                                                       StringBuilder sb)
                                                                                throws IOException
Reconstruct the HuffmanCodec.DecoderInputs from the data written by #writeDecoderInputs(BitVector[], OutputBitStream).

Parameters:
nsymbols - The #of symbols.
ibs - The input bit stream.
sb - Debugging information is added to this buffer (optional).
Returns:
The decoded bit lengths and the corresponding symbol indices for the canonical huffman code.
Throws:
IOException

getSumCodedValueBitLengths

protected long getSumCodedValueBitLengths(it.unimi.dsi.bits.BitVector[] codeWords,
                                          IRaba raba,
                                          com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol)
Deprecated. Leave this field and the #of bits per codedValueOffset[] element blank until we have written out the coded values and then rewind the OBS and fill in those fields. Otherwise we are encoding the same byte[][] data twice, which is wasted effort.

Return the cumulative bit length of the coded values.

Parameters:
coder - The coder.
raba - The logical byte[][].
byte2symbol - The mapping from byte values to symbol indices.
Returns:
The total bit length of the coded values.

writeCodedValues

protected long writeCodedValues(it.unimi.dsi.compression.PrefixCoder coder,
                                IRaba raba,
                                com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol,
                                long[] codedValueOffset,
                                it.unimi.dsi.io.OutputBitStream obs)
                         throws IOException
Write out the coded values.

Parameters:
coder - The coder.
raba - The logical byte[][].
byte2symbol - The mapping from byte values to symbol indices.
codedValueOffset - An optional array dimensioned to nvalues+1, where nvalues is #of values in the logical byte[][]. When given, the array will be populated with the relative bit offset of the start of each coded value. The offsets are relative to the start of the first coded value.
Returns:
The #of bits written (sum total of the bit lengths of the coded values).
Throws:
IOException

encode

public AbstractFixedByteArrayBuffer encode(IRaba raba,
                                           DataOutputBuffer buf)
Description copied from interface: IRabaCoder
Encode the data.

Note: Implementations of this method are typically heavy. While it is always valid to IRabaCoder.encode(IRaba, DataOutputBuffer) an IRaba , DO NOT invoke this arbitrarily on data which may already be coded. The ICodedRaba interface will always be implemented for coded data.

Specified by:
encode in interface IRabaCoder
Parameters:
raba - The data.
buf - A buffer on which the coded data will be written.
Returns:
A slice onto the post-condition state of the caller's buffer whose view corresponds to the coded record. This may be written directly onto an output stream or the slice may be converted to an exact fit byte[].

encodeLive

public ICodedRaba encodeLive(IRaba raba,
                             DataOutputBuffer buf)
Description copied from interface: IRabaCoder
Encode the data, returning an ICodedRaba. Implementations of this method should be optimized for the very common use case where the caller requires immediate access to the coded data record. In that case, many of the IRabaCoder implementations can be optimized by passing the underlying decoding object directly into an alternative constructor for the ICodedRaba. The byte[] slice for the coded data record is available from ICodedRaba.data().

This method covers the vast major of the use cases for coding data, which is to code B+Tree keys or values for a node or leaf that has been evicted from the AbstractBTree's write retention queue. The common use case is to wrap a coded record that was read from an IRawStore. The IndexSegmentBuilder is a special case, since the coded record will not be used other than to write it on the disk.

Specified by:
encodeLive in interface IRabaCoder

decode

public ICodedRaba decode(AbstractFixedByteArrayBuffer data)
Description copied from interface: IRabaCoder
Return an IRaba which can access the coded data. In general, implementations SHOULD NOT materialize a backing byte[][]. Instead, the implementation should access the data in place within the caller's buffer. Frequently used fields MAY be cached, but the whole point of the IRabaCoder is to minimize the in-memory footprint for the B+Tree by using a coded (aka compressed) representation of the keys and values whenever possible.

Specified by:
decode in interface IRabaCoder
Parameters:
data - The record containing the coded data.
Returns:
A view of the coded data.


Copyright © 2006-2011 SYSTAP, LLC. All Rights Reserved.