|
||||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | |||||||||
java.lang.Objectcom.bigdata.btree.AbstractBTree
com.bigdata.btree.BTree
public class BTree
This class implements a variant of a B+Tree in which all values are stored in leaves, but the leaves are not connected with prior-next links. This constraint arises from the requirement to support a copy-on-write policy.
Note: No mechanism is exposed for fetching a node or leaf of the tree other than the root by its key. This is because the parent reference on the node (or leaf) can only be set when it is read from the store in context by its parent node.
Note: the leaves can not be stitched together with prior and next references without forming cycles that make it impossible to write out the leaves of the btree. This restriction arises because each time we write out a node or leaf it is assigned a persistent identifier as an unavoidable artifact of providing isolation for the object index.
Note: This implementation is thread-safe for concurrent readers BUT NOT for
concurrent writers. If a writer has access to a BTree then there MUST
NOT be any other reader -or- writer operating on the BTree at the
same time.
IndexSegmentBuilder., reuse a buffer when copying keys and values out of the index, e.g., for
the ITupleIterator, in order to minimize heap churn., The B+Tree implementation does not support limits on the serialized
size of a node or leaf. The design strategy is to allow flexible sizes
for serialized node and leaf records since larger branching factors and
continuous IO are cheaper and nothing in particular requires the use of
fixed size records when accessing disk. However, it would still be
useful to be able to place an upper bound on the serialized size of a
node or leaf. Nodes and leaves are only serialized on eviction, so this
would require the ability to split nodes and/or leaves immediately
before eviction if they would be likely to overflow the upper bound on
the record size. Probably the node or leaf needs to be made immutable,
the serialized size checked, and then a split operation invoked if the
record would exceed the upper bound. If this happens a lot, then the
branching factor is too high for the data and would have to be lowered
to regain performance.
Another use case for limits on the size of a node/leaf is maintaining the data in a serialized form. This can have several benefits, including reducing heap churn by copying keys and values (not their references) into the record structure, and reducing (eliminating) deserialization costs (which are substantially more expensive than serialization). This can be approached using new implementations of the key buffer and a value buffer implementation (one that can be used for nodes or for leaves). If an object implements both then it is a small additional step towards using it for the entire serialized record (minus some header/tailer).
Adaptive packed memory arrays can be used in this context to minimize the amount of copying that needs to be performed on mutation of a node or leaf.
Leading key compression is intrinsically part of how the key buffer maintains its representation so, for example, a micro-index would have to be maintained as part of the key buffer., create ring buffers to track the serialized size of the last 50 nodes and leaves so that we can estimate the serialized size of the total btree based on recent activity. we could use a moving average and persist it as part of the btree metadata. this could be used when making a decision to evict a btree vs migrate it onto a new journal and whether to split or join index segments during a journal overflow event., we could defer splits by redistributing keys to left/right siblings that are under capacity - this makes the tree a b*-tree. however, this is not critical since the journal is designed to be fully buffered and the index segments are read-only but it would reduce memory by reducing the #of nodes -- and we can expect that the siblings will be either resident or in the direct buffer for the journal, evict subranges by touching the node on the way up so that a node that is evicted from the hard reference cache will span a subrange that can be evicted together. this will help to preserve locality on disk. with this approach leaves might be evicted independently, but also as part of a node subrange eviction., since forward scans are much more common, change the post-order iterator for commit processing to use a reverse traversal so that we can write the next leaf field whenever we evict a sequence of leaves as part of a sub-range commit (probably not much of an issue since the journal is normally fully buffered and the perfect index segments will not have this problem)., pre-fetch leaves for range scans? if we define a parallel iterator flag?, Note that efficient support for large branching factors requires a more sophisticated approach to maintaining the key order within a node or leaf. E.g., using a red-black tree or adaptive packed memory array. However, we can use smaller branching factors for btrees in the journal and use a separate implementation for bulk generating and reading "perfect" read-only key range segments., derive a string index that uses patricia trees in the leaves per several published papers.
| Nested Class Summary | |
|---|---|
static class |
BTree.Counter
Mutable counter. |
class |
BTree.LeafCursor
A cursor that may be used to traversal Leafs. |
protected static class |
BTree.NodeFactory
Factory for mutable nodes and leaves used by the NodeSerializer. |
static class |
BTree.PartitionedCounter
Places the counter values into a namespace formed by the partition identifier. |
protected static class |
BTree.Stack
A simple stack based on an array used to maintain hard references for the parent Nodes in the BTree.LeafCursor. |
| Field Summary | |
|---|---|
protected AtomicLong |
counter
The mutable counter exposed by #getCounter()}. |
protected int |
height
The height of the btree. |
protected int |
nentries
The #of entries in the btree. |
protected int |
nleaves
The #of leaf nodes in the btree. |
protected int |
nnodes
The #of non-leaf nodes in the btree. |
| Fields inherited from class com.bigdata.btree.AbstractBTree |
|---|
branchingFactor, debug, DEBUG, dumpLog, ERROR_CLOSED, ERROR_LESS_THAN_ZERO, ERROR_READ_ONLY, ERROR_TOO_LARGE, ERROR_TRANSIENT, INFO, log, metadata, ndistinctOnWriteRetentionQueue, nodeSer, root, store, storeCache, writeRetentionQueue |
| Fields inherited from interface com.bigdata.btree.IRangeQuery |
|---|
ALL, CURSOR, DEFAULT, DELETED, FIXED_LENGTH_SUCCESSOR, KEYS, NONE, PARALLEL, READONLY, REMOVEALL, REVERSE, VALS |
| Constructor Summary | |
|---|---|
BTree(IRawStore store,
Checkpoint checkpoint,
IndexMetadata metadata)
Required constructor form for BTree and any derived subclasses. |
|
| Method Summary | |
|---|---|
protected void |
_reopen()
This method is responsible for setting up the root leaf (either new or read from the store), the bloom filter, etc. |
static BTree |
create(IRawStore store,
IndexMetadata metadata)
Create a new BTree or derived class. |
static BTree |
createTransient(IndexMetadata metadata)
Create a new BTree or derived class that is fully transient (NO
backing IRawStore). |
protected void |
fireDirtyEvent()
Fire an event to the listener (iff set). |
boolean |
flush()
Flush the nodes of the BTree to the backing store. |
BloomFilter |
getBloomFilter()
Lazily reads the bloom filter from the backing store if it exists and is not already in memory. |
Checkpoint |
getCheckpoint()
Returns the most recent Checkpoint record for this BTree. |
ICounter |
getCounter()
Returns a mutable counter. |
IDirtyListener |
getDirtyListener()
Return the IDirtyListener. |
int |
getEntryCount()
The #of entries (aka values) in the AbstractBTree. |
int |
getHeight()
The height of the btree. |
long |
getLastCommitTime()
The timestamp associated with the last IAtomicStore.commit() in
which writes buffered by this index were made restart-safe on the backing
store. |
int |
getLeafCount()
The #of leaf nodes in the AbstractBTree. |
BTree |
getMutableBTree()
The BTree that is absorbing writes for the view. |
int |
getNodeCount()
The #of non-leaf nodes in the AbstractBTree. |
int |
getSourceCount()
Returns ONE (1). |
AbstractBTree[] |
getSources()
An array containing this BTree. |
IRawStore |
getStore()
The backing store. |
long |
handleCommit(long commitTime)
Handle request for a commit by writeCheckpoint()ing dirty nodes
and leaves onto the store, writing a new metadata record, and returning
the address of that metadata record. |
boolean |
isReadOnly()
Return true iff this B+Tree is read-only. |
static BTree |
load(IRawStore store,
long addrCheckpoint)
Load an instance of a BTree or derived class from the store. |
static BTree |
load(IRawStore store,
long addrCheckpoint,
boolean readOnly)
Load an instance of a BTree or derived class from the store. |
boolean |
needsCheckpoint()
Return true iff changes would be lost unless the B+Tree is flushed to the backing store using writeCheckpoint(). |
BTree.LeafCursor |
newLeafCursor(byte[] key)
Return a cursor that may be used to efficiently locate and scan the leaves in the B+Tree. |
BTree.LeafCursor |
newLeafCursor(SeekEnum where)
Return a cursor that may be used to efficiently locate and scan the leaves in the B+Tree. |
protected BloomFilter |
readBloomFilter()
Read the bloom filter from the backing store using the address stored in the last checkpoint record. |
void |
removeAll()
Remove all entries in the B+Tree. |
void |
setDirtyListener(IDirtyListener listener)
Set or clear the listener (there can be only one). |
void |
setIndexMetadata(IndexMetadata indexMetadata)
Method updates the index metadata associated with this BTree. |
void |
setLastCommitTime(long lastCommitTime)
Sets the lastCommitTime. |
void |
setReadOnly(boolean readOnly)
Mark the B+Tree as read-only. |
long |
writeCheckpoint()
Checkpoint operation flush()es dirty nodes, the optional
IBloomFilter (if dirty), the IndexMetadata (if dirty),
and then writes a new Checkpoint record on the backing store,
saves a reference to the current Checkpoint and returns the
address of that Checkpoint record. |
Checkpoint |
writeCheckpoint2()
Checkpoint operation flush()es dirty nodes, the optional
IBloomFilter (if dirty), the IndexMetadata (if dirty),
and then writes a new Checkpoint record on the backing store,
saves a reference to the current Checkpoint and returns the
address of that Checkpoint record. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Methods inherited from interface com.bigdata.btree.IIndex |
|---|
getCounters, getIndexMetadata, getResourceMetadata, submit, submit, submit |
| Methods inherited from interface com.bigdata.btree.ISimpleBTree |
|---|
contains, insert, lookup, remove |
| Methods inherited from interface com.bigdata.btree.IAutoboxBTree |
|---|
contains, insert, lookup, remove |
| Methods inherited from interface com.bigdata.btree.IRangeQuery |
|---|
rangeCount, rangeCount, rangeCountExact, rangeCountExactWithDeleted, rangeIterator, rangeIterator, rangeIterator |
| Field Detail |
|---|
protected int height
height := 0. Note
that the height only changes when we split the root node.
protected int nnodes
protected int nleaves
protected int nentries
protected AtomicLong counter
Note: This is protected so that it will be visible to
Checkpoint which needs to write the actual value store in this
counter into its serialized record (without the partition identifier).
| Constructor Detail |
|---|
public BTree(IRawStore store,
Checkpoint checkpoint,
IndexMetadata metadata)
BTree and any derived subclasses.
This ctor is used both to create a new BTree, and to load a
BTree from the store using a Checkpoint record.
store - The store.checkpoint - A Checkpoint record for that BTree.metadata - The metadata record for that BTree.create(IRawStore, IndexMetadata),
load(IRawStore, long, boolean)| Method Detail |
|---|
public final int getHeight()
AbstractBTreeheight := 0. A btree with a
root node and one level of leaves under it has height := 1.
Note that all leaves of a btree are at the same height (this is what is
means for the btree to be "balanced"). Also note that the height only
changes when we split or join the root node (a btree maintains balance by
growing and shrinking in levels from the top rather than the leaves).
getHeight in class AbstractBTreepublic final int getNodeCount()
AbstractBTreeAbstractBTree. This is zero (0)
for a new btree.
getNodeCount in class AbstractBTreepublic final int getLeafCount()
AbstractBTreeAbstractBTree. This is one (1) for a
new btree.
getLeafCount in class AbstractBTreepublic final int getEntryCount()
AbstractBTreeAbstractBTree. This is zero
(0) for a new B+Tree. Note that this value is tracked explicitly so it
requires no IOs.
getEntryCount in class AbstractBTreepublic final IRawStore getStore()
getStore in class AbstractBTreepublic ICounter getCounter()
ICounters returned by this method
report and increment the same underlying counter.
Note: When the BTree is part of a scale-out index then the
counter will assign values within a namespace defined by the partition
identifier.
getCounter in interface IIndexprotected void _reopen()
AbstractBTreeAbstractBTree.reopen() once AbstractBTree.root has been show to be
null with double-checked locking. When invoked in this
context, the caller is guaranteed to hold a lock on this. This is
done to ensure that at most one thread gets to re-open the index from the
backing store.
_reopen in class AbstractBTreepublic final BloomFilter getBloomFilter()
getBloomFilter in interface ILocalBTreeViewgetBloomFilter in class AbstractBTreenull if there is no bloom
filter (including the case where there is a bloom filter but it
has been disabled since the BTree has grown too large and
the expected error rate of the bloom filter would be too high).protected final BloomFilter readBloomFilter()
checkpoint record. This method will be invoked by
getBloomFilter() when the bloom filter reference is
null but the bloom filter is known to exist and the bloom
filter object is requested.
Note: A bloom filter can be relatively large. The bit length of a bloom filter is approximately one byte per index entry, so a filter for an index with 10M index entries will be on the order of 10mb. Therefore this method will typically have high latency.
Note: the Checkpoint record initially stores 0L
for the bloom filter address. newRootLeaf() is responsible for
allocating the bloom filter (if one is to be used) when the root leaf is
(re-)created. The address then gets stored in the Checkpoint
record by writeCheckpoint() (if invoked and once the bloom
filter is dirty).
IllegalStateException - if the bloom filter does not exist (the caller should check
this first to avoid obtaining a lock).public final boolean isReadOnly()
AbstractBTreetrue iff this B+Tree is read-only.
isReadOnly in class AbstractBTreepublic final void setReadOnly(boolean readOnly)
readOnly - true if you want to mark the B+Tree as
read-only.
UnsupportedOperationException - if the B+Tree is already read-only and you pass
false.public final long getLastCommitTime()
AbstractBTreeIAtomicStore.commit() in
which writes buffered by this index were made restart-safe on the backing
store. The lastCommitTime is set when the index is loaded from the
backing store and updated after each commit. It is ZERO (0L) when
BTree is first created and will remain ZERO (0L) until the
BTree is committed. If the backing store does not support atomic
commits, then this value will always be ZERO (0L).
Note: This is fixed for an IndexSegment.
getLastCommitTime in class AbstractBTreepublic final void setLastCommitTime(long lastCommitTime)
Note: The lastCommitTime is set by a combination of the
AbstractJournal and Name2Addr based on the actual
commitTime of the commit during which an Name2Addr.Entry for that index was
last committed. It is set for both historical index reads and unisolated
index reads using Name2Addr.Entry.commitTime. The lastCommitTime for an
unisolated index will advance as commits are performed with that index.
lastCommitTime - The timestamp of the last committed state of this index.
IllegalArgumentException - if lastCommitTime is ZERO (0).
IllegalStateException - if the timestamp is less than the previous value (it is
permitted to advance but not to go backwards).public final IDirtyListener getDirtyListener()
IDirtyListener.
public final void setDirtyListener(IDirtyListener listener)
listener - The listener.protected final void fireDirtyEvent()
public final boolean flush()
BTree to the backing store. After invoking
this method the root of the BTree will be clean.
Note: This does NOT flush all persistent state. See
writeCheckpoint() which also handles the optional bloom filter,
the IndexMetadata, and the Checkpoint record itself.
true if anything was written.public final long writeCheckpoint()
flush()es dirty nodes, the optional
IBloomFilter (if dirty), the IndexMetadata (if dirty),
and then writes a new Checkpoint record on the backing store,
saves a reference to the current Checkpoint and returns the
address of that Checkpoint record.
Note: A checkpoint by itself is NOT an atomic commit. The commit protocol
is at the store level and uses Checkpoints to ensure that the
state of the BTree is current on the backing store.
Checkpoint record for the
BTree was written onto the store. The BTree can
be reloaded from this Checkpoint record.#writeCheckpoint2(), which returns the {@link Checkpoint} record
itself.,
load(IRawStore, long)public final Checkpoint writeCheckpoint2()
flush()es dirty nodes, the optional
IBloomFilter (if dirty), the IndexMetadata (if dirty),
and then writes a new Checkpoint record on the backing store,
saves a reference to the current Checkpoint and returns the
address of that Checkpoint record.
Note: A checkpoint by itself is NOT an atomic commit. The commit protocol
is at the store level and uses Checkpoints to ensure that the
state of the BTree is current on the backing store.
Checkpoint record for the BTree was written
onto the store. The BTree can be reloaded from this
Checkpoint record.load(IRawStore, long)public final Checkpoint getCheckpoint()
Checkpoint record for this BTree.
Checkpoint record for this BTree
and never null.public boolean needsCheckpoint()
writeCheckpoint().
Note: In order to avoid needless checkpoints this method will return
false if:
null, indicating that the index
is closed (in which case there are no buffered writes);Checkpoint.getRootAddr(), and the
counter value agrees Checkpoint.getCounter().
true true iff changes would be lost unless the
B+Tree was flushed to the backing store using
writeCheckpoint().public final void setIndexMetadata(IndexMetadata indexMetadata)
BTree.
The new metadata record will be written out as part of the next index
writeCheckpoint().
Note: this method should be used with caution.
indexMetadata - The new metadata description for the BTree.
IllegalArgumentException - if the new value is null
IllegalArgumentException - if the new IndexMetadata record has already been
written on the store - see
IndexMetadata.getMetadataAddr()
UnsupportedOperationException - if the index is read-only.public long handleCommit(long commitTime)
writeCheckpoint()ing dirty nodes
and leaves onto the store, writing a new metadata record, and returning
the address of that metadata record.
Note: In order to avoid needless writes the existing metadata record is
always returned if needsCheckpoint() is false.
Note: The address of the existing Checkpoint record is always
returned if autoCommit is disabled.
handleCommit in interface ICommittercommitTime - The timestamp assigned to the commit.
Checkpoint record from which the btree
may be reloaded.public final void removeAll()
When delete markers are not enabled this simply replaces the root with a new root leaf and resets the counters (height, #of nodes, #of leaves, etc). to be consistent with that new root. If the btree is then made restart-safe by the commit protocol of the backing store then the effect is as if all entries had been deleted. Old nodes and leaves will be swept from the store eventually when the journal overflows.
When delete markers are enabled all un-deleted entries in the index are overwritten with a delete marker.
Note: The IIndexManager defines methods for registering (adding)
and dropping indices vs removing the entries in an individual
BTree.
removeAll in class AbstractBTree
public static BTree create(IRawStore store,
IndexMetadata metadata)
BTree or derived class. This method works by writing
the IndexMetadata record on the store and then loading the
BTree from the IndexMetadata record.
store - The store.metadata - The metadata record.
BTree.
IllegalStateException - If you attempt to create two btree objects from the same
metadata record since the metadata address will have
already been noted on the IndexMetadata object.
You can use IndexMetadata.clone() to obtain a new
copy of the metadata object with the metadata address set
to 0L.load(IRawStore, long)public static BTree createTransient(IndexMetadata metadata)
BTree or derived class that is fully transient (NO
backing IRawStore).
Fully transient BTrees provide the functionality of a B+Tree
without a backing persistence store. Internally, reachable nodes and
leaves of the transient BTree use hard references to ensure that
remain strongly reachable. Deleted nodes and leaves simply clear their
references and will be swept by the garbage collector shortly thereafter.
Operations which attempt to write on the backing store will fail.
While nodes and leaves are never persisted, the keys and values of the
transient BTree are unsigned byte[]s. This means that application
keys and values are always converted into unsigned byte[]s before being
stored in the BTree. Hence if an object that is inserted into
the BTree and then looked up using the BTree API, you
WILL NOT get back the same object reference.
Note: CLOSING A TRANSIENT INDEX WILL DISCARD ALL DATA!
metadata - The metadata record.
BTree.
public static BTree load(IRawStore store,
long addrCheckpoint)
BTree or derived class from the store. The
BTree or derived class MUST declare a constructor with the
following signature:
className(IRawStore store, BTreeMetadata metadata)
store - The store.addrCheckpoint - The address of a Checkpoint record for the index.
BTree or derived class loaded from that
Checkpoint record.
public static BTree load(IRawStore store,
long addrCheckpoint,
boolean readOnly)
BTree or derived class from the store. The
BTree or derived class MUST declare a constructor with the
following signature:
className(IRawStore store, BTreeMetadata metadata)
store - The store.addrCheckpoint - The address of a Checkpoint record for the index.readOnly - When true the BTree will be marked as
read-only. Marking the BTree as read-only here rather
than with setReadOnly(boolean) has some
advantages relating to the locking scheme used by
Node.getChild(int) since the root node is known to be
read-only at the time that it is allocated as per-child
locking is therefore in place for all nodes in the read-only
BTree.
BTree or derived class loaded from that
Checkpoint record.public final int getSourceCount()
getSourceCount in interface ILocalBTreeViewpublic final AbstractBTree[] getSources()
BTree.
getSources in interface ILocalBTreeViewpublic final BTree getMutableBTree()
ILocalBTreeViewBTree that is absorbing writes for the view.
getMutableBTree in interface ILocalBTreeViewpublic BTree.LeafCursor newLeafCursor(SeekEnum where)
AbstractBTree
newLeafCursor in class AbstractBTreepublic BTree.LeafCursor newLeafCursor(byte[] key)
AbstractBTree
newLeafCursor in class AbstractBTreekey - The key (required).
|
||||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | |||||||||