com.bigdata.bop.join
Interface IHashJoinUtility

All Known Implementing Classes:
HTreeHashJoinUtility, JVMHashJoinUtility

public interface IHashJoinUtility

Interface for hash index build and hash join operations.

Use cases

For a JOIN, there are two core steps, plus one additional step if the join is optional. The hash join logically has a Left Hand Side (LHS) and a Right Hand Side (RHS). The RHS is used to build up a hash index which is then probed for each LHS solution. The LHS is generally an access path scan, which is done once. A hash join therefore provides an alternative to a nested index join in which we visit the access path once, probing the hash index for solutions which join.
Accept solutions
This step builds the hash index, also known as the RHS (Right Hand Side).
Hash join
The hash join considers each left solution in turn and outputs solutions which join. If optionals are required, this step also builds a hash index (the joinSet) over the right solutions which did join.
Output optionals
The RHS hash index is scanned and the joinSet is probed to identify right solutions which did not join with any left solution. Those solutions are output as "optionals".
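
The three steps above can be sketched in plain Java. This is a minimal illustration only, not the implementation behind this interface: the class HashJoinSketch, the use of Map<String, String> as a stand-in for a solution, and the restriction to a single join variable are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the three steps. Hypothetical; not the bigdata implementation. */
final class HashJoinSketch {

    // Single join variable, for brevity (assumption of this sketch).
    private final String joinVar;

    // RHS hash index: join variable value -> right solutions having that value.
    private final Map<String, List<Map<String, String>>> index = new HashMap<>();

    // joinSet: right solutions (by identity) which joined with at least one left solution.
    private final Set<Map<String, String>> joinSet =
            Collections.newSetFromMap(new IdentityHashMap<>());

    HashJoinSketch(String joinVar) {
        this.joinVar = joinVar;
    }

    /** Step 1: buffer the right solutions on the hash index. */
    long acceptSolutions(List<Map<String, String>> right) {
        long n = 0;
        for (Map<String, String> r : right) {
            String key = r.get(joinVar);
            if (key == null) continue; // this sketch skips unbound join variables
            index.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
            n++;
        }
        return n;
    }

    /** Step 2: probe the index for each left solution; emit the solutions which join. */
    List<Map<String, String>> hashJoin(List<Map<String, String>> left) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> l : left) {
            String key = l.get(joinVar);
            if (key == null) continue;
            List<Map<String, String>> bucket =
                    index.getOrDefault(key, Collections.emptyList());
            for (Map<String, String> r : bucket) {
                if (!consistent(l, r)) continue; // shared variables must agree
                Map<String, String> merged = new HashMap<>(l);
                merged.putAll(r);
                out.add(merged);
                joinSet.add(r); // remember that this right solution joined
            }
        }
        return out;
    }

    /** Step 3: scan the index; emit right solutions which never joined. */
    List<Map<String, String>> outputOptionals() {
        List<Map<String, String>> out = new ArrayList<>();
        for (List<Map<String, String>> bucket : index.values()) {
            for (Map<String, String> r : bucket) {
                if (!joinSet.contains(r)) out.add(r);
            }
        }
        return out;
    }

    /** Two solutions are consistent if every variable bound in both has the same value. */
    private static boolean consistent(Map<String, String> l, Map<String, String> r) {
        for (Map.Entry<String, String> e : r.entrySet()) {
            String v = l.get(e.getKey());
            if (v != null && !v.equals(e.getValue())) return false;
        }
        return true;
    }
}
```

In this sketch the joinSet is keyed by object identity, so two right solutions with equal bindings are still tracked separately.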

This interface also supports DISTINCT SOLUTIONS filters. For this use case, the caller uses the filterSolutions(ICloseableIterator, BOpStats, IBuffer) method.

Version:
$Id: IHashJoinUtility.java 5773 2011-12-13 20:51:16Z thompsonbry $
Author:
Bryan Thompson

Method Summary
 long acceptSolutions(ICloseableIterator<IBindingSet[]> itr, BOpStats stats)
          Buffer solutions on a hash index.
 long filterSolutions(ICloseableIterator<IBindingSet[]> itr, BOpStats stats, IBuffer<IBindingSet> sink)
          Filter solutions, writing only the DISTINCT solutions onto the sink.
 IVariable<?> getAskVar()
          The variable bound based on whether or not a solution survives an "EXISTS" graph pattern (optional).
 IConstraint[] getConstraints()
          The join constraints (optional).
 JoinTypeEnum getJoinType()
          Return the type safe enumeration indicating what kind of operation is to be performed.
 IVariable<?>[] getJoinVars()
          The join variables.
 long getRightSolutionCount()
          Return the #of solutions in the hash index.
 IVariable<?>[] getSelectVars()
          The variables to be retained (optional, all variables are retained if not specified).
 void hashJoin(ICloseableIterator<IBindingSet> leftItr, IBuffer<IBindingSet> outputBuffer)
          Do a hash join between a stream of source solutions (left) and a hash index (right).
 void hashJoin2(ICloseableIterator<IBindingSet> leftItr, IBuffer<IBindingSet> outputBuffer, IConstraint[] constraints)
          Variant hash join method allows the caller to impose different constraints or additional constraints.
 boolean isEmpty()
          Return true iff there are no solutions in the hash index.
 void mergeJoin(IHashJoinUtility[] others, IBuffer<IBindingSet> outputBuffer, IConstraint[] constraints, boolean optional)
          Perform an N-way merge join.
 void outputJoinSet(IBuffer<IBindingSet> out)
          Output the solutions which joined.
 void outputOptionals(IBuffer<IBindingSet> outputBuffer)
          Identify and output the optional solutions.
 void outputSolutions(IBuffer<IBindingSet> out)
          Output the solutions buffered in the hash index.
 void release()
          Discard the hash index.
 

Method Detail

getJoinType

JoinTypeEnum getJoinType()
Return the type safe enumeration indicating what kind of operation is to be performed.


getAskVar

IVariable<?> getAskVar()
The variable bound based on whether or not a solution survives an "EXISTS" graph pattern (optional).

See Also:
HashJoinAnnotations.ASK_VAR

getJoinVars

IVariable<?>[] getJoinVars()
The join variables.

See Also:
HashJoinAnnotations.JOIN_VARS

getSelectVars

IVariable<?>[] getSelectVars()
The variables to be retained (optional, all variables are retained if not specified).

See Also:
JoinAnnotations.SELECT

getConstraints

IConstraint[] getConstraints()
The join constraints (optional).

See Also:
JoinAnnotations.CONSTRAINTS

isEmpty

boolean isEmpty()
Return true iff there are no solutions in the hash index.


getRightSolutionCount

long getRightSolutionCount()
Return the #of solutions in the hash index.


release

void release()
Discard the hash index.


acceptSolutions

long acceptSolutions(ICloseableIterator<IBindingSet[]> itr,
                     BOpStats stats)
Buffer solutions on a hash index.

When optional:=true, solutions which do not have a binding for one or more of the join variables will be inserted into the hash index anyway using hashCode:=1. This allows those solutions to be discovered when the hash index is scanned and compared against the set of solutions which did join in order to identify the optional solutions.

Parameters:
itr - The source from which the solutions will be drained.
stats - The statistics to be updated as the solutions are buffered on the hash index.
Returns:
The #of solutions that were buffered.
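
The hash code rule described above can be illustrated as follows. This is a sketch under stated assumptions: the class JoinVarHashSketch, the Map<String, String> solution type, and the particular hash combination are hypothetical. Returning an empty OptionalInt when a join variable is unbound and the join is not optional is also an assumption, since this documentation only describes the optional case.

```java
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical helper; not part of the documented API. */
final class JoinVarHashSketch {

    /**
     * Compute the hash code under which a solution would be indexed.
     *
     * @return the hash code, or empty if the solution would not be indexed
     *         (assumed behavior for a non-optional join with an unbound join variable).
     */
    static OptionalInt hashCodeFor(Map<String, String> solution,
                                   String[] joinVars,
                                   boolean optional) {
        int h = 1;
        for (String v : joinVars) {
            String value = solution.get(v);
            if (value == null) {
                // Unbound join variable: an optional join indexes the
                // solution anyway, under the fixed hash code 1.
                return optional ? OptionalInt.of(1) : OptionalInt.empty();
            }
            // Combine the as-bound values of the join variables (illustrative formula).
            h = 31 * h + value.hashCode();
        }
        return OptionalInt.of(h);
    }
}
```

A fully bound solution may also happen to hash to 1. That is harmless, because a hash code only selects candidates, which are then tested for consistency.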

filterSolutions

long filterSolutions(ICloseableIterator<IBindingSet[]> itr,
                     BOpStats stats,
                     IBuffer<IBindingSet> sink)
Filter solutions, writing only the DISTINCT solutions onto the sink.

Parameters:
itr - The source solutions.
stats - The stats to be updated.
sink - The sink.
Returns:
The #of source solutions which pass the filter.
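
A DISTINCT filter of this kind can be sketched as below. The example is illustrative only: the class DistinctFilterSketch, the Map<String, String> solution type, the List-based chunks and sink, and the explicit list of variables defining distinctness are assumptions. Whether the real implementations emit the projected or the original solution is not stated here; the sketch emits the projection.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical DISTINCT filter over chunked solutions. */
final class DistinctFilterSketch {

    /**
     * Drain the chunks, writing each distinct projection onto the sink once.
     *
     * @return the number of solutions written onto the sink.
     */
    static long filterSolutions(Iterator<List<Map<String, String>>> itr,
                                List<String> vars,
                                List<Map<String, String>> sink) {
        Set<Map<String, String>> seen = new HashSet<>();
        long n = 0;
        while (itr.hasNext()) {
            for (Map<String, String> solution : itr.next()) {
                // Project the solution onto the variables of interest.
                Map<String, String> projected = new LinkedHashMap<>();
                for (String v : vars) {
                    String value = solution.get(v);
                    if (value != null) projected.put(v, value);
                }
                // Set.add returns true only for the first occurrence.
                if (seen.add(projected)) {
                    sink.add(projected);
                    n++;
                }
            }
        }
        return n;
    }
}
```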

hashJoin

void hashJoin(ICloseableIterator<IBindingSet> leftItr,
              IBuffer<IBindingSet> outputBuffer)
Do a hash join between a stream of source solutions (left) and a hash index (right). For each left solution, the hash index (right) is probed for possible matches (solutions whose as-bound values for the join variables produce the same hash code). Possible matches are tested for consistency and the constraints (if any) are applied. Solutions which join are written on the caller's buffer.

Note: Some JoinTypeEnums have side-effects on the join state. For these joins, once this method has been invoked for the final time, you must then invoke either outputOptionals(IBuffer) (Optional or NotExists) or outputJoinSet(IBuffer) (Exists).

Parameters:
leftItr - A stream of solutions to be joined against the hash index (left).
outputBuffer - Where to write the solutions which join.

hashJoin2

void hashJoin2(ICloseableIterator<IBindingSet> leftItr,
               IBuffer<IBindingSet> outputBuffer,
               IConstraint[] constraints)
Variant hash join method allows the caller to impose different constraints or additional constraints. This is used to impose join constraints when a solution set is joined back into a query based on the join filters in the join group in which the solution set is included.

Note: Some JoinTypeEnums have side-effects on the join state. For these joins, once this method has been invoked for the final time, you must then invoke either outputOptionals(IBuffer) (Optional or NotExists) or outputJoinSet(IBuffer) (Exists).

Parameters:
leftItr - A stream of solutions to be joined against the hash index (left).
outputBuffer - Where to write the solutions which join.
constraints - Constraints attached to this join (optional). Any constraints specified here are combined with those specified in the constructor.

mergeJoin

void mergeJoin(IHashJoinUtility[] others,
               IBuffer<IBindingSet> outputBuffer,
               IConstraint[] constraints,
               boolean optional)
Perform an N-way merge join. For an OPTIONAL join, this instance is understood to be the index having the "required" solutions.

The merge join takes a set of solution sets in the same order and having the same join variables. It examines the next solution in order for each solution set and compares them. For each solution set which reported a solution having the same join variables as that earliest solution, it outputs the cross product and advances the iterator on that solution set.

The iterators draining the source solution sets need to be synchronized such that we consider only solutions having the same hash code in each cycle of the MERGE JOIN. The synchronization step is different depending on whether or not the MERGE JOIN is OPTIONAL.

If the MERGE JOIN is REQUIRED, then we want to synchronize the source solution iterators on the next lowest key (aka hash code) which they all have in common.

If the MERGE JOIN is OPTIONAL, then we want to synchronize the source solution iterators on the next lowest key (aka hash code) which appears for any source iterator. Solutions will not be drawn from iterators not having that key in that pass.

Note that each hash code may be an alias for solutions having different values for their join variables. Such solutions will not join. However, only solutions having the same values for the hash code can join. Thus, by proceeding with synchronized iterators and operating only on solutions having the same hash code in each round, we will consider all solutions which COULD join with one another in each round.

Note: If the solutions are not in a stable and mutually consistent order by hash code in the hash indices then the solutions in each hash index MUST be SORTED before proceeding. (The HTree maintains solutions in such an order but the JVM collections do not.)

Parameters:
others - The other solution sets to be joined. All instances must be of the same concrete type as this.
outputBuffer - Where to write the solutions.
constraints - The join constraints.
optional - true iff the join is optional.
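
The difference between the REQUIRED and OPTIONAL synchronization rules can be sketched as follows. This is an illustration under assumptions, not the implementation: the class MergeJoinKeysSketch and the modelling of each hash index as a SortedMap keyed by hash code are hypothetical, and the sketch computes up front the sequence of hash codes that synchronized iterators would arrive at one pass at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeSet;

/** Hypothetical helper showing which hash codes each merge join pass visits. */
final class MergeJoinKeysSketch {

    /**
     * @param sources  one sorted index per solution set, keyed by hash code.
     * @param optional true iff the merge join is optional.
     * @return the hash codes to visit, in ascending order.
     */
    static List<Integer> syncKeys(List<? extends SortedMap<Integer, ?>> sources,
                                  boolean optional) {
        TreeSet<Integer> keys = new TreeSet<>();
        boolean first = true;
        for (SortedMap<Integer, ?> source : sources) {
            if (optional || first) {
                // OPTIONAL: any key appearing in any source is visited.
                // (The first source also seeds the REQUIRED case.)
                keys.addAll(source.keySet());
            } else {
                // REQUIRED: keep only keys which all sources have in common.
                keys.retainAll(source.keySet());
            }
            first = false;
        }
        return new ArrayList<>(keys);
    }
}
```

For example, given sources with hash codes {1, 3, 5} and {3, 4, 5}, the required case visits 3 and 5, while the optional case visits 1, 3, 4 and 5, drawing solutions in each pass only from the sources which have that key.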

outputOptionals

void outputOptionals(IBuffer<IBindingSet> outputBuffer)
Identify and output the optional solutions. This is used with OPTIONAL and NOT EXISTS.

Optionals are identified using a joinSet containing each right solution which joined with at least one left solution. The total set of right solutions is then scanned once. For each right solution, we probe the joinSet. If the right solution did not join, then it is output now as an optional join.

Parameters:
outputBuffer - Where to write the optional solutions.

outputSolutions

void outputSolutions(IBuffer<IBindingSet> out)
Output the solutions buffered in the hash index. This is used when an operator is building a hash index for use by a downstream operator.

Parameters:
out - Where to write the solutions.

outputJoinSet

void outputJoinSet(IBuffer<IBindingSet> out)
Output the solutions which joined. This is used with EXISTS.

Parameters:
out - Where to write the solutions.


Copyright © 2006-2012 SYSTAP, LLC. All Rights Reserved.