com.bigdata.search
Class TokenBuffer<V extends Comparable<V>>

java.lang.Object
  extended by com.bigdata.search.TokenBuffer<V>
Type Parameters:
V - The generic type of the document identifier.

public class TokenBuffer<V extends Comparable<V>>
extends Object

A buffer holding tokens extracted from one or more documents / fields. Each entry in the buffer corresponds to the TermFrequencyData extracted from a field of some document. When the buffer overflows it is flush()ed, writing its contents on the indices.

Version:
$Id: TokenBuffer.java 6234 2012-03-31 09:33:43Z mrpersonick $
Author:
Bryan Thompson

Constructor Summary
TokenBuffer(int capacity, FullTextIndex<V> textIndexer)
          Constructs a buffer with the given capacity that writes on the given text indexer.
 
Method Summary
 void add(V docId, int fieldId, String token)
          Adds another token to the current field of the current document.
protected  long deleteFromIndex(int n, byte[][] keys, byte[][] vals)
          Deletes entries from the index.
 void flush()
          Write any buffered data on the indices.
 TermFrequencyData<V> get(int index)
          Return the TermFrequencyData for the specified index.
 void reset()
          Discards all data in the buffer and resets it to a clean state.
 int size()
          The #of entries in the buffer.
protected  long writeOnIndex(int n, byte[][] keys, byte[][] vals)
          Writes on the index.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TokenBuffer

public TokenBuffer(int capacity,
                   FullTextIndex<V> textIndexer)
Constructs a buffer with the given capacity that writes on the given text indexer.

Parameters:
capacity - The #of distinct {document,field} tuples that can be held in the buffer before it will overflow. The buffer will NOT overflow until you exceed this capacity.
textIndexer - The object on which the buffer will write when it overflows or is flush()ed.
Method Detail

reset

public void reset()
Discards all data in the buffer and resets it to a clean state.


size

public int size()
The #of entries in the buffer.


get

public TermFrequencyData<V> get(int index)
Return the TermFrequencyData for the specified index.

Parameters:
index - The index in [0:size()).
Returns:
The TermFrequencyData at that index.
Throws:
IndexOutOfBoundsException

add

public void add(V docId,
                int fieldId,
                String token)
Adds another token to the current field of the current document. If either the field or the document identifier changes, the buffer begins a new field and possibly a new document. If the buffer is full, it is flush()ed before the new field begins.

Note: This method is NOT thread-safe.

Note: There is an assumption that the caller will process all tokens for a given field in the same document at once. Failure to do this will lead to only part of the term-frequency distribution for the field being captured by the indices.

Parameters:
docId - The document identifier.
fieldId - The field identifier.
token - The token.
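
The behaviour described for add() can be sketched with a small stand-in class. This is an illustration under stated assumptions, not the bigdata implementation: SketchTokenBuffer, its Entry class, and flushCount are hypothetical names, and flush() here only counts and clears instead of writing on any index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative stand-in for the buffering behaviour described above.
 * NOT the bigdata implementation: every name here is hypothetical.
 */
class SketchTokenBuffer<V extends Comparable<V>> {

    /** Term counts for one {document, field} pair. */
    static final class Entry<V> {
        final V docId;
        final int fieldId;
        final Map<String, Integer> termFreq = new LinkedHashMap<>();

        Entry(V docId, int fieldId) {
            this.docId = docId;
            this.fieldId = fieldId;
        }
    }

    private final int capacity;
    private final List<Entry<V>> entries = new ArrayList<>();
    private Entry<V> current; // entry receiving tokens right now
    int flushCount = 0;       // how many times flush() ran (for the demo)

    SketchTokenBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    int size() {
        return entries.size();
    }

    Entry<V> get(int index) {
        return entries.get(index); // throws IndexOutOfBoundsException if out of range
    }

    /** Not thread-safe, like the method it imitates. */
    void add(V docId, int fieldId, String token) {
        boolean sameField = current != null
                && current.fieldId == fieldId
                && current.docId.compareTo(docId) == 0;
        if (!sameField) {
            // A new field (and possibly a new document) begins here.
            if (entries.size() == capacity) {
                flush(); // full: write out before starting the new entry
            }
            current = new Entry<>(docId, fieldId);
            entries.add(current);
        }
        current.termFreq.merge(token, 1, Integer::sum);
    }

    /** A real buffer would write on the indices; this sketch only counts and clears. */
    void flush() {
        flushCount++;
        entries.clear();
        current = null;
    }
}

class SketchTokenBufferDemo {
    public static void main(String[] args) {
        SketchTokenBuffer<Long> buf = new SketchTokenBuffer<>(2);
        buf.add(1L, 0, "quick");
        buf.add(1L, 0, "quick"); // same field: count goes up, no new entry
        buf.add(1L, 1, "fox");   // field changed: second entry
        buf.add(2L, 0, "dog");   // buffer full: flush, then a fresh entry
        System.out.println(buf.size() + " entry, " + buf.flushCount + " flush");
    }
}
```

In this sketch, a caller that interleaves fields (field 0, then field 1, then field 0 again for the same document) ends up with two separate entries for field 0, each holding only part of the term counts. That is one way the partial term-frequency distribution mentioned in the note could arise.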

flush

public void flush()
Write any buffered data on the indices.

Note: The writes on the terms index are scattered since the key for the index is {term, docId, fieldId}. This method will batch up and then apply a set of updates, but the total operation is not atomic. Therefore searches which run concurrently with indexing may not see the full data for the documents being indexed. This issue may be resolved by allowing the indexer to write ahead and using a historical commit time for the search.

Note: If a document is pre-existing, then the existing data for that document MUST be removed unless you know that the fields to be found in the document will not have changed (they may have different contents, but the same fields exist in the old and new versions of the document).
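
To see why a {term, docId, fieldId} key order scatters the writes for a single document, consider the following sketch. It is an illustration only: ScatteredKeySketch, key(), and positionsOf() are hypothetical, and the readable string key is an assumption made for the example (the methods on this page take byte[] keys).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

/**
 * Illustration only: shows why keys ordered as {term, docId, fieldId}
 * scatter one document's entries across a sorted index.
 * The string key format is made up for readability.
 */
class ScatteredKeySketch {

    /** Hypothetical readable key; fixed-width ids keep string order consistent. */
    static String key(String term, long docId, int fieldId) {
        return String.format(Locale.ROOT, "%s|%010d|%04d", term, docId, fieldId);
    }

    /** Positions (in sorted order) of every key that belongs to docId. */
    static List<Integer> positionsOf(TreeSet<String> index, long docId) {
        String marker = String.format(Locale.ROOT, "|%010d|", docId);
        List<Integer> positions = new ArrayList<>();
        int i = 0;
        for (String k : index) {
            if (k.contains(marker)) {
                positions.add(i);
            }
            i++;
        }
        return positions;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>();
        // Document 1 was indexed earlier.
        index.add(key("apple", 1, 0));
        index.add(key("mango", 1, 0));
        index.add(key("zebra", 1, 0));
        // Document 2 is flushed now: its terms land between document 1's.
        index.add(key("banana", 2, 0));
        index.add(key("plum", 2, 0));
        System.out.println(positionsOf(index, 2));
    }
}
```

Because the entries for document 2 are not adjacent in key order, a batch of updates touches several separate regions of the index. A reader that scans the index partway through such a batch could find some of the document's terms and miss others, which is the non-atomic behaviour the note describes.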


writeOnIndex

protected long writeOnIndex(int n,
                            byte[][] keys,
                            byte[][] vals)
Writes on the index.

Parameters:
n -
keys -
vals -
Returns:
The #of pre-existing records that were updated.

deleteFromIndex

protected long deleteFromIndex(int n,
                               byte[][] keys,
                               byte[][] vals)
Deletes entries from the index.

Parameters:
n -
keys -
vals -
Returns:
The #of pre-existing records that were updated.


Copyright © 2006-2011 SYSTAP, LLC. All Rights Reserved.