com.bigdata.bop.join
Class HTreeHashJoinUtility

java.lang.Object
  extended by com.bigdata.bop.join.HTreeHashJoinUtility
All Implemented Interfaces:
IHashJoinUtility

public class HTreeHashJoinUtility
extends Object
implements IHashJoinUtility

Utility methods to support hash index builds and hash index joins using a scalable native memory data structures.

Vectoring and IV encoding

In order to provide efficient encoding and persistence of solutions on the HTree, this class is written directly to the RDF data model. Rather than POJO serialization, solutions are encoded as logical IV[]s in a manner very similar to how we represent the keys of the statement indices.

Since this encoding does not persist the cache, a separate mapping must be maintained from IV to BigdataValue for those IVs which have a materialized BigdataValue. TODO Do a 64-bit hash version which could be used for hash indices having more than 500M distinct join variable combinations. Note that at 500M distinct join variable combinations we have a 1 in 4 chance of a hash collision. Whether or not that turns into a cost really depends on the cardinality of the solutions per distinct combination of the join variables. If there is only one solution per join variable combination, then those collisions will cause basically no increase in the work to be done. However, if there are 50,000 solutions per distinct combination of the join variables then we would be better off using a 64-bit hash code. FIXME Vector resolution of ivCache. Various methods use IVBindingSetEncoderWithIVCache.resolveCachedValues(IBindingSet)

Version:
$Id: HTreeHashJoinUtility.java 5568 2011-11-07 19:39:12Z thompsonbry
Author:
Bryan Thompson

Constructor Summary
HTreeHashJoinUtility(IMemoryManager mmgr, PipelineOp op, JoinTypeEnum joinType)
           
 
Method Summary
 long acceptSolutions(ICloseableIterator<IBindingSet[]> itr, BOpStats stats)
          Buffer solutions on a hash index.
 void checkpointJoinSet()
          Checkpoint the join set (used to buffer the optional solutions).
 long filterSolutions(ICloseableIterator<IBindingSet[]> itr, BOpStats stats, IBuffer<IBindingSet> sink)
          Filter solutions, writing only the DISTINCT solutions onto the sink.
 IVariable<?> getAskVar()
          The variable bound based on whether or not a solution survives an "EXISTS" graph pattern (optional).
 IConstraint[] getConstraints()
          The join constraints (optional).
 JoinTypeEnum getJoinType()
          Return the type safe enumeration indicating what kind of operation is to be performed.
 IVariable<?>[] getJoinVars()
          The join variables.
 long getRightSolutionCount()
          Return the #of solutions in the hash index.
 IVariable<?>[] getSelectVars()
          The variables to be retained (optional, all variables are retained if not specified).
 IRawStore getStore()
          The backing IRawStore.
 void hashJoin(ICloseableIterator<IBindingSet> leftItr, IBuffer<IBindingSet> outputBuffer)
          Do a hash join between a stream of source solutions (left) and a hash index (right).
 void hashJoin2(ICloseableIterator<IBindingSet> leftItr, IBuffer<IBindingSet> outputBuffer, IConstraint[] constraints)
          Variant hash join method allows the caller to impose different constraints or additional constraints.
 boolean isEmpty()
          Return true iff there are no solutions in the hash index.
 void mergeJoin(IHashJoinUtility[] others, IBuffer<IBindingSet> outputBuffer, IConstraint[] constraints, boolean optional)
          Perform an N-way merge join.
 void outputJoinSet(IBuffer<IBindingSet> out)
          Output the solutions which joined.
 void outputOptionals(IBuffer<IBindingSet> outputBuffer)
          Identify and output the optional solutions.
 void outputSolutions(IBuffer<IBindingSet> out)
          Output the solutions buffered in the hash index.
 void release()
          Discard the hash index.
 void saveSolutionSet()
          Checkpoint the HTree instance(s) used to buffer the source solutions (rightSolutions and #ivCache) and then re-load the them in a read-only mode from their checkpoint(s).
 String toString()
          Human readable representation of the IHashJoinUtility metadata (but not the solutions themselves).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

HTreeHashJoinUtility

public HTreeHashJoinUtility(IMemoryManager mmgr,
                            PipelineOp op,
                            JoinTypeEnum joinType)
Parameters:
mmgr - The IMemoryManager which will back the named solution set.
op - The operator whose annotation will inform construction the hash index. The HTreeAnnotations may be specified for this operator and will control the initialization of the various HTree instances.
joinType - The type of join to be performed.
See Also:
HTreeHashJoinAnnotations
Method Detail

toString

public String toString()
Human readable representation of the IHashJoinUtility metadata (but not the solutions themselves).

Overrides:
toString in class Object

isEmpty

public boolean isEmpty()
Description copied from interface: IHashJoinUtility
Return true iff there are no solutions in the hash index.

Specified by:
isEmpty in interface IHashJoinUtility

getRightSolutionCount

public long getRightSolutionCount()
Description copied from interface: IHashJoinUtility
Return the #of solutions in the hash index.

Specified by:
getRightSolutionCount in interface IHashJoinUtility

getJoinType

public JoinTypeEnum getJoinType()
Description copied from interface: IHashJoinUtility
Return the type safe enumeration indicating what kind of operation is to be performed.

Specified by:
getJoinType in interface IHashJoinUtility

getAskVar

public IVariable<?> getAskVar()
Description copied from interface: IHashJoinUtility
The variable bound based on whether or not a solution survives an "EXISTS" graph pattern (optional).

Specified by:
getAskVar in interface IHashJoinUtility
See Also:
HashJoinAnnotations.ASK_VAR

getJoinVars

public IVariable<?>[] getJoinVars()
Description copied from interface: IHashJoinUtility
The join variables.

Specified by:
getJoinVars in interface IHashJoinUtility
See Also:
HashJoinAnnotations.JOIN_VARS

getSelectVars

public IVariable<?>[] getSelectVars()
Description copied from interface: IHashJoinUtility
The variables to be retained (optional, all variables are retained if not specified).

Specified by:
getSelectVars in interface IHashJoinUtility
See Also:
JoinAnnotations.SELECT

getConstraints

public IConstraint[] getConstraints()
Description copied from interface: IHashJoinUtility
The join constraints (optional).

Specified by:
getConstraints in interface IHashJoinUtility
See Also:
JoinAnnotations.CONSTRAINTS

getStore

public IRawStore getStore()
The backing IRawStore.


saveSolutionSet

public void saveSolutionSet()
Checkpoint the HTree instance(s) used to buffer the source solutions (rightSolutions and #ivCache) and then re-load the them in a read-only mode from their checkpoint(s). This exposes a view of the HTree which is safe for concurrent readers.


checkpointJoinSet

public void checkpointJoinSet()
Checkpoint the join set (used to buffer the optional solutions).

Note: Since we always output the solutions which did not join from a single thread as part of last pass evaluation there is no need to checkpoint the joinSet.


release

public void release()
Description copied from interface: IHashJoinUtility
Discard the hash index.

Specified by:
release in interface IHashJoinUtility

acceptSolutions

public long acceptSolutions(ICloseableIterator<IBindingSet[]> itr,
                            BOpStats stats)
Description copied from interface: IHashJoinUtility
Buffer solutions on a hash index.

When optional:=true, solutions which do not have a binding for one or more of the join variables will be inserted into the hash index anyway using hashCode:=1. This allows the solutions to be discovered when we scan the hash index and the set of solutions which did join to identify the optional solutions.

Specified by:
acceptSolutions in interface IHashJoinUtility
Parameters:
itr - The source from which the solutions will be drained.
stats - The statistics to be updated as the solutions are buffered on the hash index.
Returns:
The #of solutions that were buffered.

filterSolutions

public long filterSolutions(ICloseableIterator<IBindingSet[]> itr,
                            BOpStats stats,
                            IBuffer<IBindingSet> sink)
Description copied from interface: IHashJoinUtility
Filter solutions, writing only the DISTINCT solutions onto the sink.

Specified by:
filterSolutions in interface IHashJoinUtility
Parameters:
itr - The source solutions.
stats - The stats to be updated.
sink - The sink.
Returns:
The #of source solutions which pass the filter.

hashJoin

public void hashJoin(ICloseableIterator<IBindingSet> leftItr,
                     IBuffer<IBindingSet> outputBuffer)
Description copied from interface: IHashJoinUtility
Do a hash join between a stream of source solutions (left) and a hash index (right). For each left solution, the hash index (right) is probed for possible matches (solutions whose as-bound values for the join variables produce the same hash code). Possible matches are tested for consistency and the constraints (if any) are applied. Solutions which join are written on the caller's buffer.

Note: Some JoinTypeEnums have side-effects on the join state. For this joins, once method has been invoked for the final time, you must then invoke either IHashJoinUtility.outputOptionals(IBuffer) (Optional or NotExists) or IHashJoinUtility.outputJoinSet(IBuffer) (Exists).

Specified by:
hashJoin in interface IHashJoinUtility
Parameters:
leftItr - A stream of solutions to be joined against the hash index (left).
outputBuffer - Where to write the solutions which join.

hashJoin2

public void hashJoin2(ICloseableIterator<IBindingSet> leftItr,
                      IBuffer<IBindingSet> outputBuffer,
                      IConstraint[] constraints)
Description copied from interface: IHashJoinUtility
Variant hash join method allows the caller to impose different constraints or additional constraints. This is used to impose join constraints when a solution set is joined back into a query based on the join filters in the join group in which the solution set is included.

Note: Some JoinTypeEnums have side-effects on the join state. For this joins, once method has been invoked for the final time, you must then invoke either IHashJoinUtility.outputOptionals(IBuffer) (Optional or NotExists) or IHashJoinUtility.outputJoinSet(IBuffer) (Exists).

Specified by:
hashJoin2 in interface IHashJoinUtility
Parameters:
leftItr - A stream of solutions to be joined against the hash index (left).
outputBuffer - Where to write the solutions which join.
constraints - Constraints attached to this join (optional). Any constraints specified here are combined with those specified in the constructor.

outputOptionals

public void outputOptionals(IBuffer<IBindingSet> outputBuffer)
Description copied from interface: IHashJoinUtility
Identify and output the optional solutions. This is used with OPTIONAL and NOT EXISTS.

Optionals are identified using a joinSet containing each right solution which joined with at least one left solution. The total set of right solutions is then scanned once. For each right solution, we probe the joinSet. If the right solution did not join, then it is output now as an optional join.

Specified by:
outputOptionals in interface IHashJoinUtility
Parameters:
outputBuffer - Where to write the optional solutions.

outputSolutions

public void outputSolutions(IBuffer<IBindingSet> out)
Description copied from interface: IHashJoinUtility
Output the solutions buffered in the hash index. This is used when an operator is building a hash index for use by a downstream operator.

Specified by:
outputSolutions in interface IHashJoinUtility
Parameters:
out - Where to write the solutions.

outputJoinSet

public void outputJoinSet(IBuffer<IBindingSet> out)
Description copied from interface: IHashJoinUtility
Output the solutions which joined. This is used with EXISTS.

Specified by:
outputJoinSet in interface IHashJoinUtility
Parameters:
out - Where to write the solutions.

mergeJoin

public void mergeJoin(IHashJoinUtility[] others,
                      IBuffer<IBindingSet> outputBuffer,
                      IConstraint[] constraints,
                      boolean optional)
Perform an N-way merge join. For an OPTIONAL join, this instance is understood to be the index having the "required" solutions.

The merge join takes a set of solution sets in the some order and having the same join variables. It examines the next solution in order for each solution set and compares them. For each solution set which reported a solution having the same join variables as that earliest solution, it outputs the cross product and advances the iterator on that solution set.

The iterators draining the source solution sets need to be synchronized such that we consider only solutions having the same hash code in each cycle of the MERGE JOIN. The synchronization step is different depending on whether or not the MERGE JOIN is OPTIONAL.

If the MERGE JOIN is REQUIRED, then we want to synchronize the source solution iterators on the next lowest key (aka hash code) which they all have in common.

If the MERGE JOIN is OPTIONAL, then we want to synchronize the source solution iterators on the next lowest key (aka hash code) which appears for any source iterator. Solutions will not be drawn from iterators not having that key in that pass.

Note that each hash code may be an alias for solutions having different values for their join variables. Such solutions will not join. However, only solutions having the same values for the hash code can join. Thus, by proceeding with synchronized iterators and operating only on solutions having the same hash code in each round, we will consider all solutions which COULD join with one another in each round.

Note: If the solutions are not in a stable and mutually consistent order by hash code in the hash indices then the solutions in each hash index MUST be SORTED before proceeding. (The HTree maintains solutions in such an order but the JVM collections do not.)

Note: For the HTree, the entries are in key order. Those keys are hash codes computed from the solutions using the join variables. Since the keys are hash codes and not the join variable bindings, each hash code identifies a collision bucket from the perspective of the merge join algorithm. Of course, from the perspective of the HTree those solutions are just consequective tuples readily identified using HTree.lookupAll(int). FIXME Either always project everything or raise [select] into a parameter for this method. We DO NOT want to only project whatever was projected by the first source.

Specified by:
mergeJoin in interface IHashJoinUtility
Parameters:
others - The other solution sets to be joined. All instances must be of the same concrete type as this.
outputBuffer - Where to write the solutions.
constraints - The join constraints.
optional - true iff the join is optional.


Copyright © 2006-2012 SYSTAP, LLC. All Rights Reserved.