com.bigdata.rdf.internal.encoder
Class IVSolutionSetEncoder

java.lang.Object
  extended by com.bigdata.rdf.internal.encoder.IVSolutionSetEncoder
All Implemented Interfaces:
IBindingSetEncoder

public class IVSolutionSetEncoder
extends Object
implements IBindingSetEncoder

This class provides fast, efficient serialization for solution sets. Each solution must be an IBindingSet whose bound values are IVs and their cached BigdataValues. The IVs and the cached BigdataValues are efficiently and compactly represented in a format suitable for chunked messages or streaming. Decode is a fast online process. Both encode and decode require the maintenance of a map from the IVs having cached BigdataValues to those cached values.

Record Format

The format is as follows:
 nbound
 nvars
 ncached 
 (namespace)
 var[0]...var[nvars-1]
 bitmap-for-bound-variables
 bitmap-for-IVs-with-cached-Values
 IV[0] ... IV[nbound-1]
 Value[0] ... Value[ncached-1]
 
where nbound is the #of bindings in the binding set. When zero, the rest of the record is omitted.

where nvars is the #of new variables in this binding set. The "schema" used to encode the bindings is based on the ordered set of variables for which bindings are observed. The encoder writes this information out incrementally. The decoder builds up this information as it decodes solutions.

where ncached is the #of bindings in the binding set for which there is a cached BigdataValue which has not already been written into a previous record. Even if the IV has a cached BigdataValue, if the IV has been previously written into a record then the IV is NOT recorded in this record with a cached Value. Further, if the IV appears more than once in a given record, the cached value is only marked in the bitmap for the first such occurrence and the cached value is only written into the record once.

where namespace is the namespace of the lexicon relation. This is written out for the first solution having an IVCache association. It is assumed that all Values are BigdataValue for the same lexicon relation. If no solutions have an IVCache association, then the namespace will never be written into the encoded output.

where var is the name of a variable for which a binding was first observed for the current solution. The names of the variables are written in the order in which they are first observed. This forms the implicit "schema" required to decode the IV[].

where bitmap-for-bound-variables is zero or more bytes providing a bit map indicating those variables which are bound in this solution out of the total set of variables which have been observed in the solutions presented to this encoder.

where bitmap-for-IVs-with-cached-Values is zero or more bytes providing a bit map indicating which IVs are associated with cached values written into the record. Whether or not an IV has a cached value must be decided by the caller after processing the record and consulting an (IV,Value) cache which they maintain over the set of records processed to date. Cached values are written out (and the bit set) only the first time a given IV with a cached Value is observed.

where IV[n] is an IV as encoded by IVUtility.

where Value[n] is a BigdataValue (an RDF Value) serialized using the BigdataValueSerializer for the namespace of the lexicon.
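
As a rough illustration of the incremental "schema" and the bound-variable bitmap described above, the following sketch tracks variables in first-observed order and sets one bit per schema variable. It is not the encoder's actual implementation: the class name SchemaSketch, the method names, the use of plain strings for variable names, and the bit order within each byte (least significant bit first) are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: hypothetical names, and the bit order is an assumption. */
public class SchemaSketch {

    // Variable name -> index in the schema, in order of first observation.
    private final Map<String, Integer> schema = new LinkedHashMap<>();

    /** Adds unseen variables to the schema and returns only the new ones (var[0..nvars-1]). */
    public List<String> observe(Iterable<String> boundVars) {
        List<String> added = new ArrayList<>();
        for (String v : boundVars) {
            if (!schema.containsKey(v)) {
                schema.put(v, schema.size());
                added.add(v);
            }
        }
        return added;
    }

    /** One bit per schema variable; bit i is set iff schema variable i is bound. */
    public byte[] boundBitmap(Set<String> boundVars) {
        byte[] bits = new byte[(schema.size() + 7) / 8];
        for (Map.Entry<String, Integer> e : schema.entrySet()) {
            if (boundVars.contains(e.getKey())) {
                int i = e.getValue();
                bits[i / 8] |= (byte) (1 << (i % 8));
            }
        }
        return bits;
    }
}
```

A decoder could mirror observe() to rebuild the same ordered schema from the var[] entries of each record, which is what lets the bitmap be interpreted without resending the variable names.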

Decode

The namespace of the lexicon is required in order to obtain the BigdataValueFactory and BigdataValueSerializer used to decode and materialize the cached BigdataValues. This information can be sent before the records if it is not known to the caller.

The decoder materializes the cached values into a map (either a HashMap or HTree, as appropriate for the data scale) as the records are processed. Only one solution needs to be decoded at a time, but the decoder must maintain the (IV,Value) cache across all decoded records. There is no need to indicate the #of records, but IChunkMessage#getSolutionCount() in fact reports exactly that information.
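
The cache maintenance on both sides can be sketched as follows. This is a minimal illustration under stated assumptions, not the class's real code: strings stand in for IVs and for serialized BigdataValues, the class ValueCacheSketch and its methods mark() and resolve() are hypothetical, and the encoder-side and decoder-side state are held in one object only to keep the example short. mark() applies the first-occurrence rule from the record format, and resolve() shows how a decoder can recover every value from its own (IV,Value) cache.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: strings stand in for IVs and for serialized BigdataValues. */
public class ValueCacheSketch {

    /** The cache-related part of one record. */
    public static final class Marks {
        public final boolean[] flags;                          // one flag per IV position
        public final List<String> values = new ArrayList<>();  // Value[0..ncached-1]
        Marks(int n) { flags = new boolean[n]; }
    }

    // Encoder side: IVs whose cached value has already been written.
    private final Set<String> written = new HashSet<>();

    // Decoder side: (IV, Value) cache kept across all decoded records.
    private final Map<String, String> cache = new HashMap<>();

    /** Flags only the first occurrence of each IV that has a cached value. */
    public Marks mark(List<String> ivs, Map<String, String> cachedValues) {
        Marks m = new Marks(ivs.size());
        for (int i = 0; i < ivs.size(); i++) {
            String iv = ivs.get(i);
            String value = cachedValues.get(iv);
            // Set.add returns false for an IV that was written before.
            if (value != null && written.add(iv)) {
                m.flags[i] = true;
                m.values.add(value);
            }
        }
        return m;
    }

    /** Adds newly sent values to the cache, then resolves every IV (null if none is known). */
    public List<String> resolve(List<String> ivs, Marks m) {
        int next = 0;
        List<String> resolved = new ArrayList<>();
        for (int i = 0; i < ivs.size(); i++) {
            String iv = ivs.get(i);
            if (m.flags[i]) {
                cache.put(iv, m.values.get(next++));
            }
            resolved.add(cache.get(iv));
        }
        return resolved;
    }
}
```

In this sketch ncached corresponds to the size of Marks.values. A HashMap is used for the decoder cache; the text above notes that an HTree may be the better fit at larger data scales.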

Each solution can be turned into an IBindingSet at the time that it is decoded. If we use a standard ListBindingSet, then we need to resolve each IV against the IV cache, setting its RDF Value as a side effect before returning the IBindingSet to the caller. If we do a custom IBindingSet implementation, then the cached BigdataValue could be lazily materialized by hooking IVCache.getValue(). Either way, the life cycle of the materialized objects will be very short unless they are propagated into new solutions. Short life cycle objects entail very little heap burden.

Version:
$Id: IVSolutionSetEncoder.java 6032 2012-02-16 12:48:04Z thompsonbry $
Author:
Bryan Thompson
See Also:
Optimize serialization for query messages on cluster
TODO There could be a completely different encoding when only a single variable is bound (column projection), especially if there are likely to be duplicate IVs. However, we still have to pass through the cached Value associations, which this encoder does fairly efficiently.

Constructor Summary
IVSolutionSetEncoder()
           
 
Method Summary
 void encodeSolution(DataOutputBuffer out, IBindingSet bset)
          Encode the solution on the stream.
 byte[] encodeSolution(IBindingSet bset)
          Encode the solution as an IV[], collecting updates for the internal IV to BigdataValue cache.
 byte[] encodeSolution(IBindingSet bset, boolean updateCacheIsIgnored)
          Encode the solution as an IV[].
 void flush()
          Flush any updates.
 boolean isValueCache()
          Return true iff the IVCache associations are preserved by the encoder.
 void release()
          Release the state associated with the IVBindingSetEncoder.
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

IVSolutionSetEncoder

public IVSolutionSetEncoder()
Method Detail

toString

public String toString()
Overrides:
toString in class Object

encodeSolution

public void encodeSolution(DataOutputBuffer out,
                           IBindingSet bset)
Encode the solution on the stream.

Parameters:
out - The stream.
bset - The solution.

encodeSolution

public byte[] encodeSolution(IBindingSet bset)
Description copied from interface: IBindingSetEncoder
Encode the solution as an IV[], collecting updates for the internal IV to BigdataValue cache.

Specified by:
encodeSolution in interface IBindingSetEncoder
Parameters:
bset - The solution to be encoded.
Returns:
The encoded solution.

encodeSolution

public byte[] encodeSolution(IBindingSet bset,
                             boolean updateCacheIsIgnored)
Encode the solution as an IV[].

Note: The IVCache associations may be buffered by this method. Use IBindingSetEncoder.flush() to vector any buffered associations.

TODO We typically use a ListBindingSet. If the IBindingSet is large enough, then it would be more efficient to create an IVariable to IV map within this method since we have to look up bindings by variable more than once.

Specified by:
encodeSolution in interface IBindingSetEncoder
Parameters:
bset - The solution to be encoded.
updateCacheIsIgnored - When true, updates are accumulated for the IV to BigdataValue cache. You must still use IBindingSetEncoder.flush() to vector the accumulated updates.

If you are only generating the encoding in order to resolve a key in a hash index, then you would use false since you do not need to maintain the IVCache association for the given IBindingSet.

Returns:
The encoded solution.

release

public void release()
Description copied from interface: IBindingSetEncoder
Release the state associated with the IVBindingSetEncoder.

Specified by:
release in interface IBindingSetEncoder

flush

public void flush()
Description copied from interface: IBindingSetEncoder
Flush any updates. This allows for vectored operations when updating the IVCache associations.

Specified by:
flush in interface IBindingSetEncoder
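
The buffer-then-flush pattern described above can be illustrated with a small stand-in. This is only a sketch of the general idea, assuming that updates collected during encoding are applied to the cache in one batch: the class BufferedCacheSketch and its methods are hypothetical, strings stand in for IVs and BigdataValues, and nothing here reflects how the real encoder stores its IVCache associations.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: buffers cache updates and applies them in one batch on flush. */
public class BufferedCacheSketch {

    // Associations that have already been flushed.
    private final Map<String, String> cache = new HashMap<>();

    // Associations accumulated since the last flush.
    private final Map<String, String> pending = new LinkedHashMap<>();

    /** Buffers an (IV, Value) association when the caller asked for cache updates. */
    public void encode(String iv, String value, boolean updateCache) {
        if (updateCache && value != null && !cache.containsKey(iv)) {
            pending.put(iv, value);
        }
    }

    /** Applies all buffered associations at once and returns how many were applied. */
    public int flush() {
        int n = pending.size();
        cache.putAll(pending);
        pending.clear();
        return n;
    }

    /** True once an association for the IV has been flushed into the cache. */
    public boolean isCached(String iv) {
        return cache.containsKey(iv);
    }
}
```

Batching in this way means a cache backed by a slower structure is touched once per chunk of solutions rather than once per binding.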

isValueCache

public boolean isValueCache()
Return true iff the IVCache associations are preserved by the encoder.

Always returns true.

Specified by:
isValueCache in interface IBindingSetEncoder


Copyright © 2006-2012 SYSTAP, LLC. All Rights Reserved.