com.bigdata.btree
Class BloomFilterFactory

java.lang.Object
  extended by com.bigdata.btree.BloomFilterFactory
All Implemented Interfaces:
Serializable

public class BloomFilterFactory
extends Object
implements Serializable

An interface that is used to generate a bloom filter for an AbstractBTree and which allows the caller to specify the expected number of index entries, the desired error rate for the filter at that #of index entries, and the maximum error rate before the bloom filter will be disabled.

While the bloom filter is always a "perfect" fit for an IndexSegment since the #of index entries is known before we build the IndexSegment, that is not true of a BTree. For a BTree, the performance of the bloom filter will degrade as you exceed the #of index entries for which it was provisioned (at a desired error rate). Therefore the factory also specifies the error rate above which the bloom filter will be disabled for a BTree. The factory works backwards from the actual bit length of the bloom filter and the maximum allowed error rate to derive the #of index entries which would achieve that error rate. The bloom filter is disabled when inserts into the BTree reach that threshold #of index entries.

Once a bloom filter has been disabled for a BTree it will not be re-enabled for that BTree. However, in the scale-out architecture the a new mutable BTree is created each time the journal overflows and the data from the historical view is migrated into IndexSegments. Both the new mutable BTree and IndexSegments will have bloom filters configured according to this factory. The bloom filter on the new BTree will operate until the #of index entries in that BTree (not in the index or the index partition) exceeds the threashold at which the bloom filter would be expected to operate with the specified maximum error rate, at which point it will be disabled.

Version:
$Id: BloomFilterFactory.java 5548 2011-11-05 20:33:08Z thompsonbry $
Author:
Bryan Thompson
See Also:
Serialized Form

Field Summary
static BloomFilterFactory DEFAULT
          The recommenced default factory configuration.
static double DEFAULT_ERROR_RATE
          The default target error rate 0.02.
static double DEFAULT_MAX_ERROR_RATE
          The default maximum error rate 0.15.
static int DEFAULT_N
          The default expected #of index entries 1000000.
 int maxN
          The maximum #of index entries before the expected performance will be worse than the specified maximum error rate.
 double maxP
          The maximum error rate for the bloom filter (it will be disabled for a BTree once the bloom filter can be expected to realize this error rate).
 int n
          The expected #of index entries.
 double p
          The desired error rate for the bloom filter at that #of index entries.
 
Constructor Summary
BloomFilterFactory(int n)
          Configuration with the caller specified #of index entries and having a target error rate of 0.02 and a maximum error rate of 0.15.
BloomFilterFactory(int n, double p, double maxP)
          Core impl.
 
Method Summary
 BloomFilter newBloomFilter()
          Create and return a new (empty) bloom filter for a BTree or IndexSegment.
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

n

public final int n
The expected #of index entries.


p

public final double p
The desired error rate for the bloom filter at that #of index entries.


maxP

public final double maxP
The maximum error rate for the bloom filter (it will be disabled for a BTree once the bloom filter can be expected to realize this error rate).


maxN

public final int maxN
The maximum #of index entries before the expected performance will be worse than the specified maximum error rate. A BTree will automatically disable its bloom filter once this many elements have been inserted. In practice, this is only a constraint on scale-up indices. For scale-out indices the maximum applies only the mutable BTree absorbing writes destined for a specific index partition and the IndexSegments will have "perfect fit" bloom filters.


DEFAULT_N

public static final transient int DEFAULT_N
The default expected #of index entries 1000000.

See Also:
Constant Field Values

DEFAULT_ERROR_RATE

public static final transient double DEFAULT_ERROR_RATE
The default target error rate 0.02.

See Also:
Constant Field Values

DEFAULT_MAX_ERROR_RATE

public static final transient double DEFAULT_MAX_ERROR_RATE
The default maximum error rate 0.15.

See Also:
Constant Field Values

DEFAULT

public static final transient BloomFilterFactory DEFAULT
The recommenced default factory configuration. This configuration is designed to provide a bloom filter with good performance up to ~2M index entries and then shut off automatically. This works for both the scale-up case (good for small indices and turns off for large indices) and for scale-out (just about right sized for scale-out, depending on the split handler).

The space requirements of a BloomFilter are approximately one byte per index entry for a reasonable false positive rate (aka error rate). Use DEFAULT for reasonable defaults. With these values, the bloom filter will be limited to an index with no more than ~2M entries. After than the bloom filter begins to have diminishing results due to an increasing false positive rate and will be automatically disabled. If you increase the expected #of index entries, then you can handle larger indices with the same error rate, but this swiftly gets into diminishing returns again because the latency required to materialize and persist the bloom filter becomes noticeable. Therefore the default configuration for the bloom filter is highly accurate up to 1M index entries (0.02 error rate) and then begins to loose precision. If the index grows large enough then the expected error rate of the bloom filter will degrade to the point where it will be automatically disabled.

While this places an effective upper bound on the size of a scale-up index that can make effective use of a bloom filter, scale-out indices DO NOT share the same limitation. Each time a scale-out index is partitioned, it is broken into a mutable BTree for absorbing writes for an index partition and zero or more IndexSegments. Each time an overflow occurs, index entries are migrated to the IndexSegments, so the #of index entries in the BTree is never that large. Finally, #of index entries in an IndexSegment is always known when the IndexSegment is built, so the BloomFilter for an IndexSegment is always a perfect fit.

Constructor Detail

BloomFilterFactory

public BloomFilterFactory(int n)
Configuration with the caller specified #of index entries and having a target error rate of 0.02 and a maximum error rate of 0.15.

Parameters:
n - The expected #of index entries for a BTree (this value is ignored for IndexSegments).
Throws:
IllegalArgumentException - if n is non-positive.

BloomFilterFactory

public BloomFilterFactory(int n,
                          double p,
                          double maxP)
Core impl.

Parameters:
n - The expected #of index entries (this value is ignored for IndexSegments).
p - The desired error rate for the bloom filter at that #of index entries (or at the actual #of index entries for an IndexSegment).
maxP - The maximum error rate for the bloom filter for a BTree (it will be disabled for a BTree once the bloom filter can be expected to realize this error rate).
Throws:
IllegalArgumentException - if n is non-positive.
IllegalArgumentException - unless p lies in (0:1].
IllegalArgumentException
IllegalArgumentException - unless maxP lies in (p:1].
Method Detail

newBloomFilter

public BloomFilter newBloomFilter()
Create and return a new (empty) bloom filter for a BTree or IndexSegment.

The bloom filter can be provisioned with reference to /architecture/bloomfilter.xls. Let p be the probability of a false positive (aka the error rate) and n be the #of index entries. The values p=.02 and n=1M result in a space requirement of 8656171 bits or approximately 1mb and uses ~ 8.6 bits per element. In order to achieve the same error rate with n=10M, the size requirements of the bloom filter will be approximately 10mb since the filter will still use ~ 8.6 bits per element for that error rate, or roughly one byte per index entry.

The maximum record length for the backing store can easily be exceeded by a large bloom filter, large bloom filters will require significant time to read/write during checkpoints, and large bloom filters will be written each time a dirty index is involved in a commit.

While the scale-out architecture uses group commits and hence can be expected to perform more commits during a bulk data load, it also uses one bloom filter per AbstractBTree so the #of index entries is bounded by the configured ISimpleSplitHandler in an application dependent manner. On the other hand, the bloom filter performance will degrade as a scale-up index grows in size since the bloom filter can not be made very large for a scale-up store (the maximum record size is reduced in order to permit more records) and large indices will therefore experience increasing false positive rates as they grow.

Whether or not a bloom filter is useful depends on the application. The bloom filter will ONLY be used for point tests such as contains(), lookup(), or an IAccessPath that is fully bound and therefore can benefit from testing a bloom filter.


toString

public String toString()
Overrides:
toString in class Object


Copyright © 2006-2011 SYSTAP, LLC. All Rights Reserved.