|
||||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | |||||||||
java.lang.Objectcom.bigdata.bop.join.HTreeHashJoinUtility
public class HTreeHashJoinUtility
Utility methods to support hash index builds and hash index joins using a scalable native memory data structures.
HTree, this class is written directly to the RDF data model. Rather
than POJO serialization, solutions are encoded as logical IV[]s in a
manner very similar to how we represent the keys of the statement indices.
Since this encoding does not persist the cache, a
separate mapping must be maintained from IV to BigdataValue
for those IVs which have a materialized BigdataValue.
TODO Do a 64-bit hash version which could be used for hash indices having
more than 500M distinct join variable combinations. Note that at 500M
distinct join variable combinations we have a 1 in 4 chance of a hash
collision. Whether or not that turns into a cost really depends on the
cardinality of the solutions per distinct combination of the join variables.
If there is only one solution per join variable combination, then those
collisions will cause basically no increase in the work to be done. However,
if there are 50,000 solutions per distinct combination of the join variables
then we would be better off using a 64-bit hash code.
FIXME Vector resolution of ivCache. Various methods use
IVBindingSetEncoderWithIVCache.resolveCachedValues(IBindingSet)
| Constructor Summary | |
|---|---|
HTreeHashJoinUtility(IMemoryManager mmgr,
PipelineOp op,
JoinTypeEnum joinType)
|
|
| Method Summary | |
|---|---|
long |
acceptSolutions(ICloseableIterator<IBindingSet[]> itr,
BOpStats stats)
Buffer solutions on a hash index. |
void |
checkpointJoinSet()
Checkpoint the join set (used to buffer the optional solutions). |
long |
filterSolutions(ICloseableIterator<IBindingSet[]> itr,
BOpStats stats,
IBuffer<IBindingSet> sink)
Filter solutions, writing only the DISTINCT solutions onto the sink. |
IVariable<?> |
getAskVar()
The variable bound based on whether or not a solution survives an "EXISTS" graph pattern (optional). |
IConstraint[] |
getConstraints()
The join constraints (optional). |
JoinTypeEnum |
getJoinType()
Return the type safe enumeration indicating what kind of operation is to be performed. |
IVariable<?>[] |
getJoinVars()
The join variables. |
long |
getRightSolutionCount()
Return the #of solutions in the hash index. |
IVariable<?>[] |
getSelectVars()
The variables to be retained (optional, all variables are retained if not specified). |
IRawStore |
getStore()
The backing IRawStore. |
void |
hashJoin(ICloseableIterator<IBindingSet> leftItr,
IBuffer<IBindingSet> outputBuffer)
Do a hash join between a stream of source solutions (left) and a hash index (right). |
void |
hashJoin2(ICloseableIterator<IBindingSet> leftItr,
IBuffer<IBindingSet> outputBuffer,
IConstraint[] constraints)
Variant hash join method allows the caller to impose different constraints or additional constraints. |
boolean |
isEmpty()
Return true iff there are no solutions in the hash index. |
void |
mergeJoin(IHashJoinUtility[] others,
IBuffer<IBindingSet> outputBuffer,
IConstraint[] constraints,
boolean optional)
Perform an N-way merge join. |
void |
outputJoinSet(IBuffer<IBindingSet> out)
Output the solutions which joined. |
void |
outputOptionals(IBuffer<IBindingSet> outputBuffer)
Identify and output the optional solutions. |
void |
outputSolutions(IBuffer<IBindingSet> out)
Output the solutions buffered in the hash index. |
void |
release()
Discard the hash index. |
void |
saveSolutionSet()
Checkpoint the HTree instance(s) used to buffer the source
solutions (rightSolutions and #ivCache) and then
re-load the them in a read-only mode from their checkpoint(s). |
String |
toString()
Human readable representation of the IHashJoinUtility metadata
(but not the solutions themselves). |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Constructor Detail |
|---|
public HTreeHashJoinUtility(IMemoryManager mmgr,
PipelineOp op,
JoinTypeEnum joinType)
mmgr - The IMemoryManager which will back the named solution set.op - The operator whose annotation will inform construction the
hash index. The HTreeAnnotations may be specified for
this operator and will control the initialization of the
various HTree instances.joinType - The type of join to be performed.HTreeHashJoinAnnotations| Method Detail |
|---|
public String toString()
IHashJoinUtility metadata
(but not the solutions themselves).
toString in class Objectpublic boolean isEmpty()
IHashJoinUtilitytrue iff there are no solutions in the hash index.
isEmpty in interface IHashJoinUtilitypublic long getRightSolutionCount()
IHashJoinUtility
getRightSolutionCount in interface IHashJoinUtilitypublic JoinTypeEnum getJoinType()
IHashJoinUtility
getJoinType in interface IHashJoinUtilitypublic IVariable<?> getAskVar()
IHashJoinUtility
getAskVar in interface IHashJoinUtilityHashJoinAnnotations.ASK_VARpublic IVariable<?>[] getJoinVars()
IHashJoinUtility
getJoinVars in interface IHashJoinUtilityHashJoinAnnotations.JOIN_VARSpublic IVariable<?>[] getSelectVars()
IHashJoinUtility
getSelectVars in interface IHashJoinUtilityJoinAnnotations.SELECTpublic IConstraint[] getConstraints()
IHashJoinUtility
getConstraints in interface IHashJoinUtilityJoinAnnotations.CONSTRAINTSpublic IRawStore getStore()
IRawStore.
public void saveSolutionSet()
HTree instance(s) used to buffer the source
solutions (rightSolutions and #ivCache) and then
re-load the them in a read-only mode from their checkpoint(s). This
exposes a view of the HTree which is safe for concurrent
readers.
public void checkpointJoinSet()
Note: Since we always output the solutions which did not join from a
single thread as part of last pass evaluation there is no need to
checkpoint the joinSet.
public void release()
IHashJoinUtility
release in interface IHashJoinUtility
public long acceptSolutions(ICloseableIterator<IBindingSet[]> itr,
BOpStats stats)
IHashJoinUtility
When optional:=true, solutions which do not have a binding
for one or more of the join variables will be inserted into the hash
index anyway using hashCode:=1. This allows the solutions to
be discovered when we scan the hash index and the set of solutions which
did join to identify the optional solutions.
acceptSolutions in interface IHashJoinUtilityitr - The source from which the solutions will be drained.stats - The statistics to be updated as the solutions are buffered on
the hash index.
public long filterSolutions(ICloseableIterator<IBindingSet[]> itr,
BOpStats stats,
IBuffer<IBindingSet> sink)
IHashJoinUtility
filterSolutions in interface IHashJoinUtilityitr - The source solutions.stats - The stats to be updated.sink - The sink.
public void hashJoin(ICloseableIterator<IBindingSet> leftItr,
IBuffer<IBindingSet> outputBuffer)
IHashJoinUtility
Note: Some JoinTypeEnums have side-effects on the join state. For
this joins, once method has been invoked for the final time, you must
then invoke either IHashJoinUtility.outputOptionals(IBuffer) (Optional or
NotExists) or IHashJoinUtility.outputJoinSet(IBuffer) (Exists).
hashJoin in interface IHashJoinUtilityleftItr - A stream of solutions to be joined against the hash index
(left).outputBuffer - Where to write the solutions which join.
public void hashJoin2(ICloseableIterator<IBindingSet> leftItr,
IBuffer<IBindingSet> outputBuffer,
IConstraint[] constraints)
IHashJoinUtility
Note: Some JoinTypeEnums have side-effects on the join state. For
this joins, once method has been invoked for the final time, you must
then invoke either IHashJoinUtility.outputOptionals(IBuffer) (Optional or
NotExists) or IHashJoinUtility.outputJoinSet(IBuffer) (Exists).
hashJoin2 in interface IHashJoinUtilityleftItr - A stream of solutions to be joined against the hash index
(left).outputBuffer - Where to write the solutions which join.constraints - Constraints attached to this join (optional). Any constraints
specified here are combined with those specified in the
constructor.public void outputOptionals(IBuffer<IBindingSet> outputBuffer)
IHashJoinUtilityOptionals are identified using a joinSet containing each right solution which joined with at least one left solution. The total set of right solutions is then scanned once. For each right solution, we probe the joinSet. If the right solution did not join, then it is output now as an optional join.
outputOptionals in interface IHashJoinUtilityoutputBuffer - Where to write the optional solutions.public void outputSolutions(IBuffer<IBindingSet> out)
IHashJoinUtility
outputSolutions in interface IHashJoinUtilityout - Where to write the solutions.public void outputJoinSet(IBuffer<IBindingSet> out)
IHashJoinUtility
outputJoinSet in interface IHashJoinUtilityout - Where to write the solutions.
public void mergeJoin(IHashJoinUtility[] others,
IBuffer<IBindingSet> outputBuffer,
IConstraint[] constraints,
boolean optional)
The merge join takes a set of solution sets in the some order and having the same join variables. It examines the next solution in order for each solution set and compares them. For each solution set which reported a solution having the same join variables as that earliest solution, it outputs the cross product and advances the iterator on that solution set.
The iterators draining the source solution sets need to be synchronized such that we consider only solutions having the same hash code in each cycle of the MERGE JOIN. The synchronization step is different depending on whether or not the MERGE JOIN is OPTIONAL.
If the MERGE JOIN is REQUIRED, then we want to synchronize the source solution iterators on the next lowest key (aka hash code) which they all have in common.
If the MERGE JOIN is OPTIONAL, then we want to synchronize the source solution iterators on the next lowest key (aka hash code) which appears for any source iterator. Solutions will not be drawn from iterators not having that key in that pass.
Note that each hash code may be an alias for solutions having different values for their join variables. Such solutions will not join. However, only solutions having the same values for the hash code can join. Thus, by proceeding with synchronized iterators and operating only on solutions having the same hash code in each round, we will consider all solutions which COULD join with one another in each round.
Note: If the solutions are not in a stable and mutually consistent order
by hash code in the hash indices then the solutions in each hash index
MUST be SORTED before proceeding. (The HTree maintains solutions
in such an order but the JVM collections do not.)
Note: For the HTree, the entries are in key order. Those keys are
hash codes computed from the solutions using the join variables. Since
the keys are hash codes and not the join variable bindings, each hash
code identifies a collision bucket from the perspective of the merge join
algorithm. Of course, from the perspective of the HTree those
solutions are just consequective tuples readily identified using
HTree.lookupAll(int).
FIXME Either always project everything or raise [select] into a parameter
for this method. We DO NOT want to only project whatever was projected by
the first source.
mergeJoin in interface IHashJoinUtilityothers - The other solution sets to be joined. All instances must be of
the same concrete type as this.outputBuffer - Where to write the solutions.constraints - The join constraints.optional - true iff the join is optional.
|
||||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | |||||||||