com.bigdata.htree
Class AbstractHTree

java.lang.Object
  extended by com.bigdata.htree.AbstractHTree
All Implemented Interfaces:
ICheckpointProtocol, ICounterSetAccess, ICommitter
Direct Known Subclasses:
HTree

public abstract class AbstractHTree
extends Object
implements ICounterSetAccess, ICheckpointProtocol

Abstract base class for a persistence capable extensible hash tree.

Version:
$Id$
Author:
Bryan Thompson

Nested Class Summary
static class AbstractHTree.HTreePageStateException
          Exception that can be thrown for asserts and testing, retaining the source page of the error
 
Field Summary
protected  int addressBits
          The #of bits in the address space for a directory page (from the constructor).
protected  int bucketSlots
          The #of bucket slots in a bucket page.
static org.apache.log4j.Logger dumpLog
          Log for dump(PrintStream) and friends.
protected static String ERROR_CLOSED
          The index is already closed.
protected static String ERROR_LESS_THAN_ZERO
          A parameter was less than zero.
protected static String ERROR_READ_ONLY
          The index is read-only but a mutation operation was requested.
protected static String ERROR_TOO_LARGE
          A parameter was too large.
protected static String ERROR_TRANSIENT
          The index is transient (not backed by persistent storage) but an operation that requires persistence was requested.
protected  HTreeIndexMetadata metadata
          The metadata record for the index.
protected  int ndistinctOnWriteRetentionQueue
          The #of distinct nodes and leaves on the writeRetentionQueue.
protected  NodeSerializer nodeSer
          Used to serialize and de-serialize the nodes and leaves of the tree.
protected  boolean readOnly
          When true the AbstractHTree does not permit mutation.
protected  com.bigdata.htree.DirectoryPage root
          The root directory.
protected  IRawStore store
          The backing store.
protected  IHardReferenceQueue<PO> writeRetentionQueue
          Nodes (that is nodes or leaves) are added to a hard reference queue when they are created or read from the store.
 
Constructor Summary
protected AbstractHTree(IRawStore store, INodeFactory nodeFactory, boolean readOnly, HTreeIndexMetadata metadata, IRecordCompressorFactory<?> recordCompressorFactory)
           
 
Method Summary
protected abstract  void _reopen()
          This method is responsible for setting up the root leaf (either new or read from the store), the bloom filter, etc.
protected  void assertNotReadOnly()
           
protected  void assertNotTransient()
           
 void checkConsistency(com.bigdata.htree.AbstractPage pge, boolean full)
           
 void checkConsistency(boolean full)
          Checks globalDepth of each node and also whether any BucketPages are empty.
 void close()
          The contract for ICheckpointProtocol.close() is to reduce the resource burden of the index while not rendering the index inoperative.
static long decodeRecordAddr(byte[] buf)
          Decodes a signed long value as encoded by #appendSigned(long).
 boolean dump(org.apache.log4j.Level level, PrintStream out, boolean materialize)
           
 boolean dump(PrintStream out)
          Recursive dump of the tree.
static byte[] encodeRecordAddr(ByteArrayBuffer recordAddrBuf, long addr)
          Encode a raw record address into a byte[] suitable for storing in the value associated with a tuple and decoding using decodeRecordAddr(byte[])
 int getAddressBits()
          The #of bits in the address space for a hash directory page.
 BTreeCounters getBtreeCounters()
          Counters tracking various aspects of the btree.
 CounterSet getCounters()
          Return performance counters.
abstract  long getEntryCount()
          The #of tuples in the HTree.
 HTreeIndexMetadata getIndexMetadata()
          Returns the metadata record for this index.
abstract  long getLastCommitTime()
          The timestamp associated with the last IAtomicStore.commit() in which writes buffered by this index were made restart-safe on the backing store.
abstract  long getLeafCount()
          The #of BucketPages in the HTree (not buddy hash buckets, but the pages on which they appear).
abstract  long getNodeCount()
          The #of DirectoryPages in the HTree (not buddy hash tables, but the pages on which they appear).
protected  com.bigdata.htree.DirectoryPage getRoot()
          The root of the HTree.
 IRawStore getStore()
          The backing store.
 boolean isOpen()
          An "open" index has may have some buffered data.
 boolean isReadOnly()
          Return true iff this B+Tree is read-only.
 boolean isTransient()
          Return true iff this is a transient data structure (no backing store).
abstract  long rangeCount()
          The #of index entries.
 ITupleIterator rangeIterator()
          Simple iterator visits all tuples in the HTree in order by the effective prefix of their keys.
protected  com.bigdata.htree.AbstractPage readNodeOrLeaf(long addr)
          Read an AbstractPage from the store.
 void reopen()
          (Re-) open the index.
 void setBTreeCounters(BTreeCounters btreeCounters)
          Replace the BTreeCounters.
 String toString()
          Fast summary information about the B+Tree.
protected  void touch(com.bigdata.htree.AbstractPage node)
           This method is responsible for putting the node or leaf onto the ring buffer which controls (a) how long we retain a hard reference to the node or leaf; and (b) for writes, when the node or leaf is evicted with a zero reference count and made persistent (along with all dirty children).
 Iterator<byte[]> values()
          Visits the values stored in the HTree in order by the effective prefix of their keys.
protected  long writeNodeOrLeaf(com.bigdata.htree.AbstractPage node)
          Codes the node and writes the coded record on the store (non-recursive).
protected  void writeNodeRecursive(com.bigdata.htree.AbstractPage node)
          Write a dirty node and its children using a post-order traversal that first writes any dirty leaves and then (recursively) their parent nodes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface com.bigdata.btree.ICheckpointProtocol
getCheckpoint, getDirtyListener, getMetadataAddr, getRecordVersion, getRootAddr, removeAll, scan, setDirtyListener, setLastCommitTime, writeCheckpoint, writeCheckpoint2
 
Methods inherited from interface com.bigdata.journal.ICommitter
handleCommit
 

Field Detail

ERROR_CLOSED

protected static final String ERROR_CLOSED
The index is already closed.

See Also:
Constant Field Values

ERROR_LESS_THAN_ZERO

protected static final String ERROR_LESS_THAN_ZERO
A parameter was less than zero.

See Also:
Constant Field Values

ERROR_TOO_LARGE

protected static final String ERROR_TOO_LARGE
A parameter was too large.

See Also:
Constant Field Values

ERROR_READ_ONLY

protected static final String ERROR_READ_ONLY
The index is read-only but a mutation operation was requested.

See Also:
Constant Field Values

ERROR_TRANSIENT

protected static final String ERROR_TRANSIENT
The index is transient (not backed by persistent storage) but an operation that requires persistence was requested.

See Also:
Constant Field Values

dumpLog

public static final org.apache.log4j.Logger dumpLog
Log for dump(PrintStream) and friends.


store

protected final IRawStore store
The backing store.


readOnly

protected final boolean readOnly
When true the AbstractHTree does not permit mutation.


addressBits

protected final int addressBits
The #of bits in the address space for a directory page (from the constructor). This constant is specified when the hash tree is created. A directory page has 2^addressBits entries. Those entries are divided up among one or more buddy hash tables on that page.


bucketSlots

protected final int bucketSlots
The #of bucket slots in a bucket page. There is a case for allowing the number of bucket slots to be different than the number of directory page slots (as determined by the number of addressBits) since the storage requirements for bucket values/keys is not set.


root

protected volatile com.bigdata.htree.DirectoryPage root
The root directory.


writeRetentionQueue

protected final IHardReferenceQueue<PO> writeRetentionQueue
Nodes (that is nodes or leaves) are added to a hard reference queue when they are created or read from the store. On eviction from the queue a dirty node is serialized by a listener against the IRawStore. The nodes and leaves refer to their parent with a WeakReferences. Likewise, nodes refer to their children with a WeakReference. The hard reference queue in combination with touch(AbstractPage) and with hard references held on the stack ensures that the parent and/or children remain reachable during operations. Once the node is no longer strongly reachable weak references to that node may be cleared by the VM - in this manner the node will become unreachable by navigation from its ancestors in the btree. The special role of the hard reference queue is to further ensure that dirty nodes remain dirty by defering persistence until the reference count for the node is zero during an eviction from the queue.

Note that nodes are evicted as new nodes are added to the hard reference queue. This occurs in two situations: (1) when a new node is created during a split of an existing node; and (2) when a node is read in from the store. Inserts on this hard reference queue always drive evictions. Incremental writes basically make it impossible for the commit set to get "too large" - the maximum #of nodes to be written is bounded by the size of the hard reference queue. This helps to ensure fast commit operations on the store.

The minimum capacity for the hard reference queue is two (2) so that a split may occur without forcing eviction of either node participating in the split.

Note: The code in AbstractPage.postOrderNodeIterator(boolean, boolean) and DirtyChildIterator MUST NOT touch the hard reference queue since those iterators are used when persisting a node using a post-order traversal. If a hard reference queue eviction drives the serialization of a node and we touch the hard reference queue during the post-order traversal then we break down the semantics of HardReferenceQueue.add(Object) as the eviction does not necessarily cause the queue to reduce in length. Another way to handle this is to have HardReferenceQueue.add(Object) begin to evict objects before is is actually at capacity, but that is also a bit fragile.

Note: The writeRetentionQueue uses a HardReferenceQueue. This is based on a RingBuffer and is very fast. It does not use a HashMap because we can resolve the WeakReference to the child DirectoryPage or BucketPage using top-down navigation as long as the DirectoryPage or BucketPage remains strongly reachable (that is, as long as it is on the writeRetentionQueue or otherwise strongly held). This means that lookup in a map is not required for top-down navigation.


ndistinctOnWriteRetentionQueue

protected int ndistinctOnWriteRetentionQueue
The #of distinct nodes and leaves on the writeRetentionQueue.


metadata

protected HTreeIndexMetadata metadata
The metadata record for the index. This data rarely changes during the life of the HTree object, but it CAN be changed.


nodeSer

protected final NodeSerializer nodeSer
Used to serialize and de-serialize the nodes and leaves of the tree.

Constructor Detail

AbstractHTree

protected AbstractHTree(IRawStore store,
                        INodeFactory nodeFactory,
                        boolean readOnly,
                        HTreeIndexMetadata metadata,
                        IRecordCompressorFactory<?> recordCompressorFactory)
Parameters:
store - The persistence store.
nodeFactory - Object that provides a factory for node and leaf objects.
readOnly - true IFF it is known that the AbstractHTree is read-only.
metadata - The IndexMetadata object for this AbstractHTree.
recordCompressorFactory - Object that knows how to (de-)compress the serialized data records.
Throws:
IllegalArgumentException - if addressBits is LT ONE (1).
IllegalArgumentException - if addressBits is GT (16).
Method Detail

getBtreeCounters

public final BTreeCounters getBtreeCounters()
Counters tracking various aspects of the btree. TODO Refactor / reuse performance counters for HTree and collect counters in the code.


setBTreeCounters

public final void setBTreeCounters(BTreeCounters btreeCounters)
Replace the BTreeCounters.

Note: This is used by the IndexManager to ensure that an index loaded from its backing store uses the BTreeCounters associated with that index since the DataService was last (re-)started.

Parameters:
btreeCounters - The counters to be used.
Throws:
IllegalArgumentException - if the argument is null.

getCounters

public CounterSet getCounters()
Description copied from interface: ICounterSetAccess
Return performance counters.

Specified by:
getCounters in interface ICounterSetAccess

getIndexMetadata

public HTreeIndexMetadata getIndexMetadata()
Returns the metadata record for this index.

Note: If the index is read-only then the metadata object will be cloned to avoid potential modification. However, only a single cloned copy of the metadata record will be shared between all callers for a given instance of this class.

Specified by:
getIndexMetadata in interface ICheckpointProtocol
Returns:
The metadata record for this btree and never null.
See Also:
IIndex.getIndexMetadata()
TODO:
the clone once policy is a compromise between driving the read only semantics into the IndexMetadata object, cloning each time, and leaving it to the caller to clone as required. Either the former or the latter might be a better choice. FIXME if the metadata record is updated (and it can be) then we really need to invalidate the cloned metadata record also.

close

public void close()
The contract for ICheckpointProtocol.close() is to reduce the resource burden of the index while not rendering the index inoperative. An index that has been closed MAY be reopened at any time (conditional on the continued availability of the backing store). Such an index reference remains valid after a ICheckpointProtocol.close(). A closed index is transparently reopened by any access to the index data (scanning the index, probing the index, etc).

Note: A ICheckpointProtocol.close() on a dirty index MUST discard writes rather than flushing them to the store and MUST NOT update its Checkpoint record - (ICheckpointProtocol.close() is used to discard indices with partial writes when an AbstractTask fails). If you are seeking to ICheckpointProtocol.close() a mutable index view that state can be recovered by ICheckpointProtocol.reopen() then you MUST write a new Checkpoint record before closing the index.

Note: CLOSING A TRANSIENT INDEX WILL DISCARD ALL DATA!

This implementation clears the hard reference queue (releasing all node references), releases the hard reference to the root node, and releases the buffers on the NodeSerializer (they will be naturally reallocated on reuse).

Note: AbstractHTree is NOT thread-safe and close() MUST be invoked in a context in which there will not be concurrent threads -- the natural choice being the single-threaded commit service on the journal.

Specified by:
close in interface ICheckpointProtocol
Throws:
IllegalStateException - if the root is null, indicating that the index is already closed.
IllegalStateException - if the root is dirty (this implies that this is a mutable HTree and there are mutations that have not been written through to the store)

reopen

public final void reopen()
(Re-) open the index. This method is part of a ICheckpointProtocol.close() / ICheckpointProtocol.reopen() protocol. That protocol may be used to reduce the resource burden of an index. This method is automatically invoked by a variety of methods that need to ensure that the index is available for use..

This method delegates to _reopen() if double-checked locking demonstrates that the root is null (indicating that the index has been closed). This method is automatically invoked by a variety of methods that need to ensure that the index is available for use.

Specified by:
reopen in interface ICheckpointProtocol
See Also:
ICheckpointProtocol.close(), ICheckpointProtocol.isOpen(), #getRoot()

_reopen

protected abstract void _reopen()
This method is responsible for setting up the root leaf (either new or read from the store), the bloom filter, etc. It is invoked by reopen() once root has been show to be null with double-checked locking. When invoked in this context, the caller is guaranteed to hold a lock on this. This is done to ensure that at most one thread gets to re-open the index from the backing store.


isOpen

public final boolean isOpen()
Description copied from interface: ICheckpointProtocol
An "open" index has may have some buffered data. A closed index will have to re-read any backing data from the backing store.

Specified by:
isOpen in interface ICheckpointProtocol
Returns:
If the index is "open".
See Also:
ICheckpointProtocol.close(), ICheckpointProtocol.reopen()

isTransient

public final boolean isTransient()
Return true iff this is a transient data structure (no backing store).


assertNotTransient

protected final void assertNotTransient()

isReadOnly

public final boolean isReadOnly()
Return true iff this B+Tree is read-only.


assertNotReadOnly

protected final void assertNotReadOnly()
Throws:
UnsupportedOperationException - if the B+Tree is read-only.
See Also:
isReadOnly()

getLastCommitTime

public abstract long getLastCommitTime()
The timestamp associated with the last IAtomicStore.commit() in which writes buffered by this index were made restart-safe on the backing store. The lastCommitTime is set when the index is loaded from the backing store and updated after each commit. It is ZERO (0L) when HTree is first created and will remain ZERO (0L) until the HTree is committed. If the backing store does not support atomic commits, then this value will always be ZERO (0L).


getStore

public IRawStore getStore()
The backing store.


getAddressBits

public final int getAddressBits()
The #of bits in the address space for a hash directory page. This constant is specified to the constructor. The #of child pages is 2^addressBits. When addressBits is 10 we have 2^10 := 1024. If the size of a child address is 4 bytes, then a 10 bit address space implies a 4k page size.


getNodeCount

public abstract long getNodeCount()
The #of DirectoryPages in the HTree (not buddy hash tables, but the pages on which they appear).


getLeafCount

public abstract long getLeafCount()
The #of BucketPages in the HTree (not buddy hash buckets, but the pages on which they appear).


getEntryCount

public abstract long getEntryCount()
The #of tuples in the HTree.


getRoot

protected final com.bigdata.htree.DirectoryPage getRoot()
The root of the HTree. This is always a DirectoryPage.

The hard reference to the root node is cleared if the index is closed. This method automatically reopen()s the index if it is closed, making it available for use.


toString

public String toString()
Fast summary information about the B+Tree.

Overrides:
toString in class Object

dump

public boolean dump(PrintStream out)
Recursive dump of the tree.

Parameters:
out - The dump is written on this stream.
Returns:
true unless an inconsistency is detected.

dump

public boolean dump(org.apache.log4j.Level level,
                    PrintStream out,
                    boolean materialize)

touch

protected void touch(com.bigdata.htree.AbstractPage node)

This method is responsible for putting the node or leaf onto the ring buffer which controls (a) how long we retain a hard reference to the node or leaf; and (b) for writes, when the node or leaf is evicted with a zero reference count and made persistent (along with all dirty children). The concurrency requirements and the implementation behavior and guarentees differ depending on whether the B+Tree is read-only or mutable for two reasons: For writers, the B+Tree is single-threaded so there is no contention. For readers, every touch on the B+Tree goes through this point, so it is vital to make this method non-blocking.

Writers

This method guarantees that the specified node will NOT be synchronously persisted as a side effect and thereby made immutable. (Of course, the node may be already immutable.)

In conjunction with DefaultEvictionListener, this method guarantees that the reference counter for the node will reflect the #of times that the node is actually present on the writeRetentionQueue.

If the node is not found on a scan of the head of the queue, then it is appended to the queue and its AbstractPage.referenceCount is incremented. If a node is being appended to the queue and the queue is at capacity, then this will cause a reference to be evicted from the queue. If the reference counter for the evicted node or leaf is zero and the evicted node or leaf is dirty, then a data record will be coded for the evicted node or leaf and written onto the backing store. A subsequent attempt to modify the node or leaf will force copy-on-write for that node or leaf. Regardless of whether or not the node or leaf is dirty, it is touched on the LRUNexus#getGlobalLRU() when it is evicted from the write retention queue.

For the mutable B+Tree we also track the #of references to the node/leaf on the ring buffer. When that reference count reaches zero we do an eviction and the node/leaf is written onto the backing store if it is dirty. Those reference counting games DO NOT matter for read-only views so we can take a code path which does not update the per-node/leaf reference count and we do not need to use either synchronization or atomic counters to track the reference counts.

Readers

In order to reduce contention for the lock required to update the backing queue, the writeRetentionQueue is configured to collect references for touched nodes or leaves in a thread-local queue and then batch those references through to the backing hard reference queue while holding the lock.

Parameters:
node - The node or leaf.
TODO:
The per-node/leaf reference counts and ndistinctOnWriteRetentionQueue fields are not guaranteed to be consistent for of concurrent readers since no locks are held when those fields are updated. Since the reference counts only effect when a node is made persistent (assuming its dirty) and since we already require single threaded access for writes on the btree, this does not cause a problem but can lead to unexpected values for the reference counters and ndistinctOnWriteRetentionQueue.

writeNodeRecursive

protected final void writeNodeRecursive(com.bigdata.htree.AbstractPage node)
Write a dirty node and its children using a post-order traversal that first writes any dirty leaves and then (recursively) their parent nodes. The parent nodes are guaranteed to be dirty if there is a dirty child so this never triggers copy-on-write. This is used as part of the commit protocol where it is invoked with the root of the tree, but it may also be used to incrementally flush dirty non-root Nodes. Note: This will throw an exception if the backing store is read-only.

Parameters:
node - The root of the hierarchy of nodes to be written. The node MUST be dirty. The node this does NOT have to be the root of the tree and it does NOT have to be a Node.

writeNodeOrLeaf

protected long writeNodeOrLeaf(com.bigdata.htree.AbstractPage node)
Codes the node and writes the coded record on the store (non-recursive). The node MUST be dirty. If the node has a parent, then the parent is notified of the persistent identity assigned to the node by the store. This method is NOT recursive and dirty children of a node will NOT be visited. By coding the nodes and leaves as they are evicted from the writeRetentionQueue, the B+Tree continuously converts nodes and leaves to their more compact coded record forms which results in a smaller in memory footprint.

Note: For a transient B+Tree, this merely codes the node but does not write the node on the store (there is none).

Returns:
The persistent identity assigned by the store.
Throws:
UnsupportedOperationException - if the B+Tree (or the backing store) is read-only.

readNodeOrLeaf

protected com.bigdata.htree.AbstractPage readNodeOrLeaf(long addr)
Read an AbstractPage from the store.

Note: Callers SHOULD be synchronized in order to ensure that only one thread will read the desired node or leaf in from the store and attach the reference to the newly read node or leaf as appropriate into the existing data structures (e.g., as the root reference or as a child of a node or leaf already live in memory).

Note: The caller MUST set the AbstractPage.globalDepth on the returned value.

Parameters:
addr - The address in the store.
Returns:
The AbstractPage.
Throws:
IllegalArgumentException - if the address is IAddressManager.NULL.

encodeRecordAddr

public static byte[] encodeRecordAddr(ByteArrayBuffer recordAddrBuf,
                                      long addr)
Encode a raw record address into a byte[] suitable for storing in the value associated with a tuple and decoding using decodeRecordAddr(byte[])

Parameters:
recordAddrBuf - The buffer that will be used to format the record address.
addr - The raw record address.
Returns:
A newly allocated byte[] which encodes that address.

decodeRecordAddr

public static long decodeRecordAddr(byte[] buf)
Decodes a signed long value as encoded by #appendSigned(long).

Parameters:
buf - The buffer containing the encoded record address.
Returns:
The signed long value.

rangeCount

public abstract long rangeCount()
The #of index entries.

Specified by:
rangeCount in interface ICheckpointProtocol
Returns:
The #of tuples in the index.
See Also:
IRangeQuery.rangeCount()

rangeIterator

public ITupleIterator rangeIterator()
Simple iterator visits all tuples in the HTree in order by the effective prefix of their keys. Since the key is typically a hash of some fields in the associated application data record, this will normally visit application data records in what appears to be an arbitrary order. Tuples within a buddy hash bucket will be delivered in a random order.

Note: The HTree does not currently maintain metadata about the #of spanned tuples in a DirectoryPage. Without that we can not provide fast range counts, linear list indexing, etc.


values

public Iterator<byte[]> values()
Visits the values stored in the HTree in order by the effective prefix of their keys. Since the key is typically a hash of some fields in the associated application data record, this will normally visit application data records in what appears to be an arbitrary order. Tuples within a buddy hash bucket will be delivered in a random order.

Note: This allocates a new byte[] for each visited value. It is more efficient to reuse a buffer for each visited Tuple. This can be done using rangeIterator().


checkConsistency

public void checkConsistency(boolean full)
Checks globalDepth of each node and also whether any BucketPages are empty.

Parameters:
full - indicates whether to check consistency on disk

checkConsistency

public void checkConsistency(com.bigdata.htree.AbstractPage pge,
                             boolean full)


Copyright © 2006-2012 SYSTAP, LLC. All Rights Reserved.