com.bigdata.btree
Class IndexSegmentBuilder

java.lang.Object
  extended by com.bigdata.btree.IndexSegmentBuilder
All Implemented Interfaces:
Callable<IndexSegmentCheckpoint>

public class IndexSegmentBuilder
extends Object
implements Callable<IndexSegmentCheckpoint>

Builds an IndexSegment given a source btree and a target branching factor. There are two main use cases:

  1. Evicting a key range of an index into an optimized on-disk index. In this case, the input is a BTree that is ideally backed by a fully buffered IRawStore so that no random reads are required.
  2. Merging index segments. In this case, the input is typically records emerging from a merge-sort. There are two distinct cases here. In one, we simply have raw records that are being merged into an index. This might occur when merging two key ranges or when external data are being loaded. In the other case we are processing two time-stamped versions of an overlapping key range. In this case, the more recent version may have "delete" markers indicating that a key present in an older version has been deleted in the newer version. Also, key-value entries in the newer version replaced (rather than are merged with) key-value entries in the older version. If an entry history policy is defined, then it must be applied here to cause key-value whose retention is no longer required by that policy to be dropped.
Note: In order for the nodes to be written in a contiguous block we either have to buffer them in memory or have to write them onto a temporary file and then copy them into place after the last leaf has been processed. The code abstracts this decision using a TemporaryRawStore for this purpose. This space demand was not present in West's algorithm because it did not attempt to place the leaves contiguously onto the store.

Note: The use of up to three TemporaryRawStores per index segment build can raise the demand on the DirectBufferPool, for example if a number of concurrent index builds are occurring on a data service with a large #of index partitions. One choice is to limit the parallelism of the asynchronous overflow operation.

Version:
$Id: IndexSegmentBuilder.java 2265 2009-10-26 12:51:06Z thompsonbry $
Author:
Bryan Thompson
See Also:
"Post-order B-Tree Construction" by Lawerence West, ACM 1992. Note that West's algorithm is for a b-tree (values are stored on internal stack as well as leaves), not a b+-tree (values are stored only on the leaves). Our implementation is therefore an adaptation., "Batch-Construction of B+-Trees" by Kim and Won, ACM 2001. The approach outlined by Kim and Won is designed for B+-Trees, but it appears to be less efficient on first glance., IndexSegment, IndexSegmentFile, IndexSegmentCheckpoint, IndexSegmentMerger
TODO:
allow builds where the #of index entries would exceed an [int].

Nested Class Summary
protected static class IndexSegmentBuilder.AbstractSimpleNodeData
          Abstract base class for classes used to construct and serialize nodes and leaves written onto the index segment.
protected static class IndexSegmentBuilder.NOPNodeFactory
          Factory does not support node or leaf creation.
protected static class IndexSegmentBuilder.SimpleLeafData
          A class that can be used to (de-)serialize the data for a leaf without any of the logic for operations on the leaf.
protected static class IndexSegmentBuilder.SimpleNodeData
          A class that can be used to (de-)serialize the data for a node without any of the logic for operations on the node.
 
Field Summary
protected  TemporaryRawStore blobBuffer
          The optional buffer used to hold records referenced by index entries.
 long commitTime
          The commit time associated with the view from which the IndexSegment is being generated (from the ctor).
 boolean compactingMerge
          true iff the generated IndexSegment will incorporate all state for the source index (partition) as of the specified commitTime.
 long elapsed
          The process runtime in milliseconds.
 long elapsed_build
          The time to write the nodes and leaves into their respective buffers, not including the time to transfer those buffered onto the output file.
 long elapsed_setup
          The time to setup the index build, including the generation of the index plan and the initialization of some helper objects.
 long elapsed_write
          The time to write the nodes and leaves from their respective buffers onto the output file and synch and close that output file.
 int entryCount
          The value specified to the ctor.
protected static String ERR_NO_TUPLES
          Warning message when the index segment will be empty.
protected static String ERR_TOO_MANY_TUPLES
          Error message when the #of tuples in the IndexSegment would exceed Integer.MAX_VALUE.
protected  TemporaryRawStore leafBuffer
          The buffer used to hold leaves so that they can be evicted en mass onto a region of the outFile.
protected static org.apache.log4j.Logger log
          Logger.
 float mbPerSec
          The data throughput rate in megabytes per second.
 IndexMetadata metadata
          A copy of the metadata object provided to the ctor.
protected  TemporaryRawStore nodeBuffer
          The buffer used to hold nodes so that they can be evicted en mass onto a region of the outFile.
protected  RandomAccessFile out
          The file on which the IndexSegment is written.
 File outFile
          The file specified by the caller on which the IndexSegment is written.
 IndexSegmentPlan plan
          The plan for building the B+-Tree.
 UUID segmentUUID
          The unique identifier for the generated IndexSegment resource.
 
Constructor Summary
IndexSegmentBuilder(File outFile, File tmpDir, int entryCount, ITupleIterator entryIterator, int m, IndexMetadata metadata, long commitTime, boolean compactingMerge)
           Designated constructor sets up a build of an IndexSegment for some caller defined read-only view.
 
Method Summary
protected  void addChild(IndexSegmentBuilder.SimpleNodeData parent, long childAddr, IndexSegmentBuilder.AbstractSimpleNodeData child)
          Record the persistent address of a child on its parent and the #of entries spanned by that child.
protected  void addSeparatorKey(IndexSegmentBuilder.SimpleLeafData leaf)
          Copies the first key of a new leaf as a separatorKey for the appropriate parent (if any) of that leaf.
 IndexSegmentCheckpoint call()
          Build the IndexSegment given the parameters specified to the ctor.
protected  void flush(IndexSegmentBuilder.AbstractSimpleNodeData node)
           Flush a node or leaf that has been closed (no more data will be added).
 IndexSegmentCheckpoint getCheckpoint()
          The IndexSegmentCheckpoint record written on the IndexSegmentStore.
protected  IndexSegmentBuilder.SimpleNodeData getParent(IndexSegmentBuilder.AbstractSimpleNodeData node)
          Return the parent of a node or leaf in the stack.
 IResourceMetadata getSegmentMetadata()
          The description of the constructed IndexSegment resource.
 long getStartTime()
          The timestamp in milliseconds when call() was invoked -or- ZERO (0L) if call() has not been invoked.
static IndexSegmentBuilder newInstance(String name, ILocalBTreeView src, File outFile, File tmpDir, boolean compactingMerge, long createTime, byte[] fromKey, byte[] toKey)
          Builder factory will build an IndexSegment from an index (partition).
protected  IndexSegmentCheckpoint writeIndexSegment(FileChannel outChannel, long commitTime)
           Writes the complete file format for the index segment.
protected  long writeLeaf(IndexSegmentBuilder.SimpleLeafData leaf)
          Serialize and write the leaf onto the leafBuffer.
protected  long writeNode(IndexSegmentBuilder.SimpleNodeData node)
          Code and write the node onto the nodeBuffer.
protected  long writeNodeOrLeaf(IndexSegmentBuilder.AbstractSimpleNodeData node)
          Serialize, compress, and write the node or leaf onto the appropriate output channel.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log

protected static final org.apache.log4j.Logger log
Logger.


ERR_TOO_MANY_TUPLES

protected static final String ERR_TOO_MANY_TUPLES
Error message when the #of tuples in the IndexSegment would exceed Integer.MAX_VALUE.

Note: This is not an inherent limit in the IndexSegment but rather a limit in the IndexSegmentPlan (and perhaps the IndexSegmentBuilder) which presumes that the entry count is an int rather than a long.

See Also:
Constant Field Values

ERR_NO_TUPLES

protected static final String ERR_NO_TUPLES
Warning message when the index segment will be empty.

See Also:
Constant Field Values

outFile

public final File outFile
The file specified by the caller on which the IndexSegment is written.


entryCount

public final int entryCount
The value specified to the ctor.


commitTime

public final long commitTime
The commit time associated with the view from which the IndexSegment is being generated (from the ctor). This value is written into IndexSegmentCheckpoint.commitTime.


compactingMerge

public final boolean compactingMerge
true iff the generated IndexSegment will incorporate all state for the source index (partition) as of the specified commitTime.

Note: This flag is written into the IndexSegmentCheckpoint but it has no other effect on the build process.


metadata

public final IndexMetadata metadata
A copy of the metadata object provided to the ctor. This object is further modified before being written on the IndexSegmentStore.


segmentUUID

public final UUID segmentUUID
The unique identifier for the generated IndexSegment resource.


out

protected RandomAccessFile out
The file on which the IndexSegment is written. The file is closed regardless of the outcome of the operation.


leafBuffer

protected TemporaryRawStore leafBuffer
The buffer used to hold leaves so that they can be evicted en mass onto a region of the outFile.


nodeBuffer

protected TemporaryRawStore nodeBuffer
The buffer used to hold nodes so that they can be evicted en mass onto a region of the outFile.


blobBuffer

protected TemporaryRawStore blobBuffer
The optional buffer used to hold records referenced by index entries. In order to use this buffer the IndexMetadata MUST specify an IOverflowHandler.


plan

public final IndexSegmentPlan plan
The plan for building the B+-Tree.


elapsed_setup

public final long elapsed_setup
The time to setup the index build, including the generation of the index plan and the initialization of some helper objects.


elapsed_build

public long elapsed_build
The time to write the nodes and leaves into their respective buffers, not including the time to transfer those buffered onto the output file.


elapsed_write

public long elapsed_write
The time to write the nodes and leaves from their respective buffers onto the output file and synch and close that output file.


elapsed

public long elapsed
The process runtime in milliseconds.


mbPerSec

public float mbPerSec
The data throughput rate in megabytes per second.

Constructor Detail

IndexSegmentBuilder

public IndexSegmentBuilder(File outFile,
                           File tmpDir,
                           int entryCount,
                           ITupleIterator entryIterator,
                           int m,
                           IndexMetadata metadata,
                           long commitTime,
                           boolean compactingMerge)
                    throws IOException

Designated constructor sets up a build of an IndexSegment for some caller defined read-only view.

Note: The caller must determine whether or not deleted index entries are present in the view. The entryCount MUST be the exact #of index entries that are visited by the given iterator. In general, this is not difficult. However, if a compacting merge is desired (that is, if you are trying to generate a view containing only the non-deleted entries) then you MUST explicitly count the #of entries that will be visited by the iterator, e.g., it will require two passes over the iterator to setup the index build operation.

Note: With a branching factor of 4096 a tree of height 2 (three levels) could address 68,719,476,736 entries - well beyond what we want in a given index segment! Well before that the index segment should be split into multiple files. The split point should be determined by the size of the serialized leaves and nodes, e.g., the amount of data on disk required by the index segment and the amount of memory required to fully buffer the index nodes. While the size of a serialized node can be estimated easily, the size of a serialized leaf depends on the kinds of values stored in that index. The actual sizes are recorded in the IndexSegmentCheckpoint record in the header of the IndexSegment.

Parameters:
outFile - The file on which the index segment is written. The file MAY exist but MUST have zero length if it does exist (this permits you to use the temporary file facility to create the output file).
tmpDir - The temporary directory in data are buffered during the build (optional - the default temporary directory is used if this is null).
entryCount - The #of entries that will be visited by the iterator.
entryIterator - Visits the index entries in key order that will be written onto the IndexSegment.
m - The branching factor for the generated tree. This can be chosen with an eye to minimizing the height of the generated tree. (Small branching factors are permitted for testing, but generally you want something relatively large.)
metadata - The metadata record for the source index. A copy will be made of this object. The branching factor in the generated tree will be overridden to m.
commitTime - The commit time associated with the view from which the IndexSegment is being generated. This value is written into IndexSegmentCheckpoint.commitTime.
compactingMerge - true iff the generated IndexSegment will incorporate all state for the source index (partition) as of the specified commitTime. This flag is written into the IndexSegmentCheckpoint but does not otherwise effect the build process.
Throws:
IOException
Method Detail

getCheckpoint

public IndexSegmentCheckpoint getCheckpoint()
The IndexSegmentCheckpoint record written on the IndexSegmentStore.


getStartTime

public long getStartTime()
The timestamp in milliseconds when call() was invoked -or- ZERO (0L) if call() has not been invoked.


newInstance

public static IndexSegmentBuilder newInstance(String name,
                                              ILocalBTreeView src,
                                              File outFile,
                                              File tmpDir,
                                              boolean compactingMerge,
                                              long createTime,
                                              byte[] fromKey,
                                              byte[] toKey)
                                       throws IOException
Builder factory will build an IndexSegment from an index (partition). Delete markers are propagated to the IndexSegment unless compactingMerge is true.

Parameters:
name - The name of the index (for non-scale-out indices) or the name of the index partition (for scale-out indices). DO NOT specify the name of the scale-out index!
src - A view of the index partition as of the createTime. When compactingMerge is false then this MUST be a single BTree since incremental builds are only support for a BTree source while compacting merges are defined for any IIndex.
outFile - The file on which the IndexSegment will be written. The file MAY exist, but if it exists then it MUST be empty.
compactingMerge - When true the caller asserts that src is a FusedView and deleted index entries WILL NOT be included in the generated IndexSegment. Otherwise, it is assumed that the only select component(s) of the index partition view are being exported onto an IndexSegment and deleted index entries will therefore be propagated to the new IndexSegment (aka an incremental build).
createTime - The commit time associated with the view from which the IndexSegment is being generated. This value is written into IndexSegmentCheckpoint.commitTime.
fromKey - The lowest key that will be included (inclusive). When null there is no lower bound.
toKey - The first key that will be included (exclusive). When null there is no upper bound.
Returns:
An object which can be used to construct the IndexSegment.
Throws:
IOException

call

public IndexSegmentCheckpoint call()
                            throws Exception
Build the IndexSegment given the parameters specified to the ctor.

Specified by:
call in interface Callable<IndexSegmentCheckpoint>
Throws:
Exception

flush

protected void flush(IndexSegmentBuilder.AbstractSimpleNodeData node)
              throws IOException

Flush a node or leaf that has been closed (no more data will be added).

Note: When a node or leaf is flushed we write it out to obtain its address and set that address on its direct parent using #addChild(SimpleNodeData, long). This also updates the per-child counters of the #of entries spanned by a node.

Throws:
IOException

addChild

protected void addChild(IndexSegmentBuilder.SimpleNodeData parent,
                        long childAddr,
                        IndexSegmentBuilder.AbstractSimpleNodeData child)
                 throws IOException
Record the persistent address of a child on its parent and the #of entries spanned by that child. If all children on the parent become assigned then the parent is closed.

Parameters:
parent - The parent.
childAddr - The address of the child (node or leaf).
child - The child reference.
Throws:
IOException

addSeparatorKey

protected void addSeparatorKey(IndexSegmentBuilder.SimpleLeafData leaf)
Copies the first key of a new leaf as a separatorKey for the appropriate parent (if any) of that leaf. This must be invoked when the first key is set on that leaf. However, it must not be invoked on the first leaf.

Parameters:
leaf - The current leaf. The first key on that leaf must be defined.

getParent

protected IndexSegmentBuilder.SimpleNodeData getParent(IndexSegmentBuilder.AbstractSimpleNodeData node)
Return the parent of a node or leaf in the stack.

Parameters:
node - The node or leaf.
Returns:
The parent or null iff node is the root node or leaf.

writeNodeOrLeaf

protected long writeNodeOrLeaf(IndexSegmentBuilder.AbstractSimpleNodeData node)
                        throws IOException
Serialize, compress, and write the node or leaf onto the appropriate output channel.

Returns:
The address that may be used to read the compressed node or leaf from the file. Note that the address of a node is relative to the start of the node channel and therefore must be adjusted before reading the node from the final index segment file.
Throws:
IOException

writeLeaf

protected long writeLeaf(IndexSegmentBuilder.SimpleLeafData leaf)
                  throws IOException
Serialize and write the leaf onto the leafBuffer.

Note: For leaf addresses we know the absolute offset into the IndexSegmentStore where the leaf will wind up so we encode the address of the leaf using the IndexSegmentRegion.BASE region.

Note: In order to write out the leaves using a double-linked list with prior-/next-leaf addresses we have to use a "write behind" strategy. Instead of writing out the leaf as soon as it is serialized, we save the unencoded address and a copy of the serialized record on private member fields. When we serialize the next leaf (or if we learn that we have no more leaves to serialize because IndexSegmentPlan.nleaves EQ nleavesWritten) then we patch the serialized representation of the prior leaf and write it on the store at the previously obtained address, thereby linking the leaves together in both directions. It is definitely confusing.

Returns:
The address that may be used to read the leaf from the file backing the IndexSegmentStore.
Throws:
IOException

writeNode

protected long writeNode(IndexSegmentBuilder.SimpleNodeData node)
                  throws IOException
Code and write the node onto the nodeBuffer.

Returns:
An relative address that must be correctly decoded before you can read the compressed node from the file. This value is also set on SimpleNodeData#addr.
Throws:
IOException
See Also:
IndexSegmentBuilder.AbstractSimpleNodeData, IndexSegmentRegion, IndexSegmentAddressManager

writeIndexSegment

protected IndexSegmentCheckpoint writeIndexSegment(FileChannel outChannel,
                                                   long commitTime)
                                            throws IOException

Writes the complete file format for the index segment. The file is divided up as follows:

  1. fixed length metadata record (required)
  2. leaves (required)
  3. nodes (may be empty)
  4. the bloom filter (optional)
  5. the extension metadata record (required, but extensible)

The index segment metadata is divided into a base IndexSegmentCheckpoint record with a fixed format containing only essential data and additional metadata records written at the end of the file including the optional bloom filter and the required IndexMetadata record. The latter is where we write variable length metadata including the _name_ of the index, or additional metadata defined by a specific class of index.

Once all nodes and leaves have been buffered we are ready to start writing the data. We skip over a fixed size metadata record since otherwise we are unable to pre-compute the offset to the leaves and hence the addresses of the leaves. The node addresses are written in an encoding that requires active translation by the receiver who must be aware of the offset to the start of the node region. We can not write the metadata record until we know the size and length of each of these regions (leaves, nodes, and the bloom filter, or other metadata records) since that information is required in order to be able to form their addresses for insertion in the metadata record.

Parameters:
outChannel -
commitTime -
Throws:
IOException

getSegmentMetadata

public IResourceMetadata getSegmentMetadata()
The description of the constructed IndexSegment resource.

Throws:
IllegalStateException - if requested before the build operation is complete.


Copyright © 2006-2009 SYSTAP, LLC. All Rights Reserved.