com.bigdata.search
Class FullTextIndex

java.lang.Object
  extended by com.bigdata.relation.AbstractResource<IRelation<E>>
      extended by com.bigdata.relation.AbstractRelation
          extended by com.bigdata.search.FullTextIndex
All Implemented Interfaces:
IMutableRelation, IMutableResource, IRelation, ILocatableResource

public class FullTextIndex
extends AbstractRelation

Full text indexing and search support.

The basic data model consists of documents, fields in documents, and tokens extracted by an analyzer from those fields.

The frequency distributions may be normalized to account for a variety of effects producing "term weights". For example, normalizing for document length or relative frequency of a term in the overall collection. Therefore the logical model is:

 
             token : {docId, freq?, weight?}+
 
 
(For RDF, docId is the term identifier as assigned by the term:id index.)

The freq and weight are optional values that are representative of the kinds of statistical data that are kept on a per-token-document basis. The freq is the token frequency (the frequency of occurrence of the token in the document). The weight is generally a normalized token frequency weight for the token in that document in the context of the overall collection.

In fact, we actually represent the data as follows:

 
             {sortKey(token), docId, fldId} : {freq?, weight?, sorted(pos)+}
 
 
That is, there is a distinct entry in the full text B+Tree for each field in each document in which a given token was recognized. The text of the token is not stored in the key, just the Unicode sort key generated from the token text. The value associated with the B+Tree entry is optional - it is simply not used unless we are storing statistics for the token-document pair. The advantages of this approach are: (a) it reuses the existing B+Tree data structures efficiently; (b) we are never faced with the possibility overflow when a token is used in a large number of documents. The entries for the token will simply be spread across several leaves in the B+Tree; (c) leading key compression makes the resulting B+Tree very efficient; and (d) in a scale-out range partitioned index we can load balance the resulting index partitions by choosing the partition based on an even token boundary.

A field is any pre-identified text container within a document. Field identifiers are integers, so there are 32^2 distinct possible field identifiers. It is possible to manage the field identifiers through a secondary index, but that has no direct bearing on the structure of the full text index itself. Field identifies appear after the token in the key so that queries may be expressed that will be matched against any field in the document. Likewise, field identifiers occur before the document identifier in the key since we always search across documents (in a search key, the document identifier is always Long.MIN_VALUE and the field identifier is always Integer.MIN_VALUE). There are many applications for fields: for example, distinct fields may be used for the title, abstract, and full text of a document or for the CDATA section of each distinct element in documents corresponding to some DTD. The application is responsible for recognizing the fields in the document and producing the appropriate token stream, each of which must be tagged by the field.

A query is tokenized, producing a (possibly normalized) token-frequency vector. The relevance of documents to the query is generally taken as the cosine between the query's and each document's (possibly normalized) token-frequency vectors. The main effort of search is assembling a token frequency vector for just those documents with which there is an overlap with the query. This is done using a key range scan for each token in the query against the full text index.

             fromKey := token, Long.MIN_VALUE
             toKey   := successor(token), Long.MIN_VALUE
 
and extracting the appropriate token frequency, normalized token weight, or other statistic. When no value is associated with the entry we follow the convention of assuming a token frequency of ONE (1) for each document in which the token appears.

Tokenization is informed by the language code (when declared) and by the configured Locale for the database otherwise. An appropriate Analyzer is chosen based on the language code or Locale and the "document" is broken into a token-frequency distribution (alternatively a set of tokens). The same process is used to tokenize queries, and the API allows the caller to specify the language code used to select the Analyzer to tokenize the query.

Once the tokens are formed the language code / Locale used to produce the token is discarded (it is not represented in the index). The reason for this is that we never utilize the total ordering of the full text index, merely the manner in which it groups tokens that map onto the same Unicode sort key together. Further, we use only a single Unicode collator configuration regardless of the language family in which the token was originally expressed. Unlike the collator used by the terms index (which often is set at IDENTICAL strength), the collector used by the full text index should be chosen such that it makes relatively few distinctions in order to increase recall (e.g., set at PRIMARY strength). Since a total order over the full text index is not critical from the perspective of its IR application, the Locale for the collator is likewise not critical and PRIMARY strength will produce significantly shorter Unicode sort keys.

The term frequency within that literal is an optional property associated with each term identifier, as is the computed weight for the token in the term.

Note: Documents should be tokenized using an Analyzer appropriate for their declared language code (if any). However, once tokenized, the language code is discarded and we perform search purely on the Unicode sort keys resulting from the extracted tokens.

Version:
$Id: FullTextIndex.java 2265 2009-10-26 12:51:06Z thompsonbry $
Author:
Bryan Thompson
TODO:
The key for the terms index is {term,docId,fieldId}. Since the data are not pre-aggregated by {docId,fieldId} we can not easily remove only those tuples corresponding to some document (or some field of some document).

In order to removal of the fields for a document we need to know either which fields were indexed for the document and the tokens found in those fields and then scatter the removal request (additional space requirements) or we need to flood a delete procedure across the terms index (expensive)., provide M/R alternatives for indexing or computing/updating global weights., Consider model in which fields are declared and then a "Document" is indexed. This lets us encapsulate the "driver" for indexing. The "field" can be a String or a Reader, etc.

Note that lucene handles declaration of the data that will be stored for a field on a per Document basis {none, character offsets, character offsets + token positions}. There is also an option to store the term vector itself. Finally, there are options to store, compress+store, or not store the field value. You can also choose {None, IndexTokenized, IndexUntokenized} and an option dealing with norms., lucene Analyzers may be problematic. For example, it is difficult to tokenize numbers. consider replacing the lucene analyzer/tokenizer with our own stuff. this might help with tokenization of numbers, etc. and with tokenization of native html or xml with intact offsets., lucene analyzers will strip stopwords by default. There should be a configuration option to strip out stopwords and another to enable stemming. how we do that should depend on the language family. Likewise, there should be support for language family specific stopword lists and language family specific exclusions., key compression: prefix compression, possibly with hamming codes for the docId and fieldId., support more term weighting schemes and make them easy to configure., is there an appropriate generic type for the full text index? Basically, it contains "document fields", or at least their indexed terms.


Nested Class Summary
static interface FullTextIndex.Options
          Options understood by the FullTextIndex.
 
Field Summary
protected static org.apache.log4j.Logger log
           
static String NAME_SEARCH
          The basename of the search index.
 
Constructor Summary
FullTextIndex(IIndexManager indexManager, String namespace, Long timestamp, Properties properties)
          Ctor specified by DefaultResourceLocator.
 
Method Summary
protected  void assertWritable()
           
 void create()
          Conditionally registers the necessary index(s).
 long delete(IChunkedOrderedIterator itr)
          Remove elements from the relation.
 void destroy()
          Destroy any logically contained resources (relations, indices).
 IAccessPath getAccessPath(IPredicate predicate)
          Return the best IAccessPath for a relation given a predicate with zero or more unbound variables.
protected  org.apache.lucene.analysis.Analyzer getAnalyzer(String languageCode)
          Return the token analyzer to be used for the given language code.
 Class<?> getElementClass()
          Return the class for the generic type of this relation.
 IIndex getIndex()
          The index used to associate term identifiers with tokens parsed from documents.
 Set<String> getIndexNames()
          Return the fully qualified name of each index maintained by this relation.
protected  IKeyBuilder getKeyBuilder()
          Return a ThreadLocal IKeyBuilder instance configured to support full text indexing and search.
protected static byte[] getTokenKey(IKeyBuilder keyBuilder, String termText, boolean successor, long docId, int fieldId)
          Create a key for a term.
protected  org.apache.lucene.analysis.TokenStream getTokenStream(String languageCode, Reader r)
          Tokenize text using an Analyzer that is appropriate to the specified language family.
protected  byte[] getTokenValue(ByteArrayBuffer buf, TermMetadata metadata)
          Return the byte[] that is the encoded value for per-{token,docId,fieldId} entry in the index.
 void index(TokenBuffer buffer, long docId, int fieldId, String languageCode, Reader r)
          Index a field in a document.
 long insert(IChunkedOrderedIterator itr)
          Write elements on the relation.
 boolean isOverwrite()
          Return the value configured by the FullTextIndex.Options.OVERWRITE property.
 boolean isReadOnly()
          true unless {AbstractResource.getTimestamp() is ITx.UNISOLATED.
 Object newElement(IPredicate predicate, IBindingSet bindingSet)
          Create and return a new element.
 Hiterator search(String query, String languageCode)
          Performs a full text search against indexed documents returning a hit list using a default minCosine of 0.4, a default maxRank of 10,000, and the configured default timeout.
 Hiterator search(String query, String languageCode, boolean prefixMatch)
           
 Hiterator search(String query, String languageCode, boolean prefixMatch, double minCosine, int maxRank, long timeout, TimeUnit unit)
          Performs a full text search against indexed documents returning a hit list.
 Hiterator search(String query, String languageCode, double minCosine, int maxRank)
          Performs a full text search against indexed documents returning a hit list using the configured default timeout.
 
Methods inherited from class com.bigdata.relation.AbstractRelation
getFQN, getIndex, getIndex, newIndexMetadata
 
Methods inherited from class com.bigdata.relation.AbstractResource
acquireExclusiveLock, getChunkCapacity, getChunkOfChunksCapacity, getChunkTimeout, getContainer, getContainerNamespace, getExecutorService, getFullyBufferedReadThreshold, getIndexManager, getMaxParallelSubqueries, getNamespace, getProperties, getProperty, getProperty, getTimestamp, isForceSerialExecution, isNestedSubquery, toString, unlock
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface com.bigdata.relation.IRelation
getExecutorService, getIndexManager
 
Methods inherited from interface com.bigdata.relation.locator.ILocatableResource
getContainerNamespace, getNamespace, getTimestamp
 

Field Detail

log

protected static final transient org.apache.log4j.Logger log

NAME_SEARCH

public static final transient String NAME_SEARCH
The basename of the search index.

See Also:
Constant Field Values
Constructor Detail

FullTextIndex

public FullTextIndex(IIndexManager indexManager,
                     String namespace,
                     Long timestamp,
                     Properties properties)
Ctor specified by DefaultResourceLocator.

Parameters:
client - The client. Configuration information is obtained from the client. See FullTextIndex.Options.
See Also:
FullTextIndex.Options
TODO:
Customize a ISplitHandler such that we never split a {term,doc} tuple?, Make sure that the defaults for the split points (in terms of the #of entries) are reasonable. The #of entries per split could be smaller if we know that we are storing more data in the values.
Method Detail

getIndex

public IIndex getIndex()
The index used to associate term identifiers with tokens parsed from documents.


isOverwrite

public boolean isOverwrite()
Return the value configured by the FullTextIndex.Options.OVERWRITE property.


isReadOnly

public final boolean isReadOnly()
true unless {AbstractResource.getTimestamp() is ITx.UNISOLATED.


assertWritable

protected void assertWritable()

create

public void create()
Conditionally registers the necessary index(s).

Specified by:
create in interface IMutableResource
Overrides:
create in class AbstractResource
Throws:
IllegalStateException - if the client does not have write access.
TODO:
this is not using AbstractResource.acquireExclusiveLock() since I generally allocate the text index inside of another relation and AbstractResource.acquireExclusiveLock() is not reentrant for zookeeper.

destroy

public void destroy()
Description copied from interface: IMutableResource
Destroy any logically contained resources (relations, indices).

Specified by:
destroy in interface IMutableResource
Overrides:
destroy in class AbstractResource

getAnalyzer

protected org.apache.lucene.analysis.Analyzer getAnalyzer(String languageCode)
Return the token analyzer to be used for the given language code.

Parameters:
languageCode - The language code or null to use the default Locale.
Returns:
The token analyzer best suited to the indicated language family.

getKeyBuilder

protected final IKeyBuilder getKeyBuilder()
Return a ThreadLocal IKeyBuilder instance configured to support full text indexing and search.

See Also:
FullTextIndex.Options.INDEXER_COLLATOR_STRENGTH

index

public void index(TokenBuffer buffer,
                  long docId,
                  int fieldId,
                  String languageCode,
                  Reader r)
Index a field in a document.

Note: This method does NOT force a write on the indices. If the buffer overflows, then there will be an index write. Once the caller is done indexing, they MUST invoke TokenBuffer.flush() to force any data remaining in their buffer to the indices.

Note: If a document is pre-existing, then the existing data for that document MUST be removed unless you know that the fields to be found in the will not have changed (they may have different contents, but the same fields exist in the old and new versions of the document).

Parameters:
buffer - Used to buffer writes onto the text index.
docId - The document identifier.
fieldId - The field identifier.
languageCode - The language code -or- null to use the default Locale.
r - A reader on the text to be indexed.
See Also:
TokenBuffer.flush()

getTokenStream

protected org.apache.lucene.analysis.TokenStream getTokenStream(String languageCode,
                                                                Reader r)
Tokenize text using an Analyzer that is appropriate to the specified language family.

Parameters:
languageCode - The language code -or- null to use the default Locale).
r - A reader on the text to be indexed.
Returns:
The extracted token stream.

getTokenKey

protected static byte[] getTokenKey(IKeyBuilder keyBuilder,
                                    String termText,
                                    boolean successor,
                                    long docId,
                                    int fieldId)
Create a key for a term.

Parameters:
keyBuilder - Used to construct the key.
token - The token whose key will be formed.
successor - When true the successor of the token's text will be encoded into the key. This is useful when forming the toKey in a search.
docId - The document identifier - use Long.MIN_VALUE when forming a search key.
fieldId - The field identifier - use Integer.MIN_VALUE when forming a search key.
Returns:
The key.

getTokenValue

protected byte[] getTokenValue(ByteArrayBuffer buf,
                               TermMetadata metadata)
Return the byte[] that is the encoded value for per-{token,docId,fieldId} entry in the index.

Parameters:
buf - Used to encode the value.
metadata - Metadata about the term.
Returns:
The encoded value.
TODO:
optionally record the token position metadata (sequence of token positions in the source) and the token offset (character offsets for the inclusive start and exclusive end of each token)., value compression:, value compression: code "position" as delta from last position in the same field or from 0 if the first token of a new document; code "offsets" as deltas - the maximum offset between tokens will be quite small (it depends on the #of stopwords) so use a nibble format for this.

search

public Hiterator search(String query,
                        String languageCode)
Performs a full text search against indexed documents returning a hit list using a default minCosine of 0.4, a default maxRank of 10,000, and the configured default timeout.

Parameters:
query - The query (it will be parsed into tokens).
languageCode - The language code that should be used when tokenizing the query (an empty string will be interpreted as the default Locale).
Returns:
A Iterator which may be used to traverse the search results in order of decreasing relevance to the query.
See Also:
FullTextIndex.Options.INDEXER_TIMEOUT

search

public Hiterator search(String query,
                        String languageCode,
                        boolean prefixMatch)

search

public Hiterator search(String query,
                        String languageCode,
                        double minCosine,
                        int maxRank)
Performs a full text search against indexed documents returning a hit list using the configured default timeout.

Parameters:
query - The query (it will be parsed into tokens).
languageCode - The language code that should be used when tokenizing the query -or- null to use the default Locale).
minCosine - The minimum cosine that will be returned.
maxRank - The upper bound on the #of hits in the result set.
Returns:
The hit list.
See Also:
FullTextIndex.Options.INDEXER_TIMEOUT

search

public Hiterator search(String query,
                        String languageCode,
                        boolean prefixMatch,
                        double minCosine,
                        int maxRank,
                        long timeout,
                        TimeUnit unit)
Performs a full text search against indexed documents returning a hit list.

The basic algorithm computes cosine between the term-frequency vector of the query and the indexed "documents". The cosine may be directly interpreted as the "relevance" of a "document" to the query. The query and document term-frequency vectors are normalized, so the cosine values are bounded in [0.0:1.0]. The higher the cosine the more relevant the document is to the query. A cosine of less than .4 is rarely of any interest.

The implementation creates and runs a set parallel tasks, one for each distinct token found in the query, and waits for those tasks to complete or for a timeout to occur. Each task uses a key-range scan on the terms index, collecting metadata for the matching "documents" and aggregating it on a "hit" for that document. Since the tasks run concurrently, there are concurrent writers on the "hits". On a timeout, the remaining tasks are interrupted.

The collection of hits is scored and hits that fail a threshold are discarded. The remaining hits are placed into a total order and the caller is returned an iterator which can read from that order. If the operation is interrupted, then only those IHits that have already been computed will be returned.

Parameters:
query - The query (it will be parsed into tokens).
languageCode - The language code that should be used when tokenizing the query -or- null to use the default Locale).
minCosine - The minimum cosine that will be returned.
maxRank - The upper bound on the #of hits in the result set.
prefixMatch - Option is not implemented yet
timeout - The timeout -or- ZERO (0) for NO timeout (this is equivalent to using Long.MAX_VALUE).
unit - The unit in which the timeout is expressed.
Returns:
The hit list.
TODO:
note that we can not incrementally materialize the search results since they are being delivered in rank order and the search algorithm achieves rank order by a post-search sort. mg4j supports incremental evaluation and would be a full-featured replacement for this package., manage the life of the result sets and perhaps serialize them onto an index backed by a TemporaryStore. The fromIndex/toIndex might be with respect to that short-term result set. Reclaim result sets after N seconds., consider other kinds of queries that we might write here. For example, full text search should support AND OR NOT operators for tokens., allow search within field(s). This will be a filter on the range iterator that is sent to the data service such that the search terms are visited only when they occur in the matching field(s).

delete

public long delete(IChunkedOrderedIterator itr)
Description copied from interface: IMutableRelation
Remove elements from the relation.

Parameters:
itr - An iterator visiting the elements to be removed. Existing elements in the relation having a key equal to the key formed from the visited elements will be removed from the relation.
Returns:
The #of elements that were actually removed from the relation.

insert

public long insert(IChunkedOrderedIterator itr)
Description copied from interface: IMutableRelation
Write elements on the relation.

Parameters:
itr - An iterator visiting the elements to be written.
Returns:
The #of elements that were actually written on the relation.

getAccessPath

public IAccessPath getAccessPath(IPredicate predicate)
Description copied from interface: IRelation
Return the best IAccessPath for a relation given a predicate with zero or more unbound variables.

If there is an IIndex that directly corresponeds to the natural order implied by the variable pattern on the predicate then the access path should use that index. Otherwise you should choose the best index given the constraints and make sure that the IAccessPath incorporates additional filters that will allow you to filter out the irrelevant ITuples during the scan - this is very important when the index is remote!

If there are any IElementFilters then the access path MUST incorporate those constraints such that only elements that satisify the constraints may be visited.

Whether the constraints arise because of the lack of a perfect index for the access path or because they were explicitly specified for the IPredicate, those constraints should be translated into constraints imposed on the underlying ITupleIterator and sent with it to be evaluated local to the data.

Note: Filters should be specified when the IAccessPath is constructed so that they will be evalated on the data service rather than materializing the elements and then filtering then. This can be accomplished by adding the filter as a constraint on the predicate when specifying the access path.

Parameters:
predicate - The constraint on the elements to be visited.
Returns:
The best IAccessPath for that IPredicate.

getIndexNames

public Set<String> getIndexNames()
Description copied from interface: IRelation
Return the fully qualified name of each index maintained by this relation.

Returns:
An immutable set of the index names for the relation.

newElement

public Object newElement(IPredicate predicate,
                         IBindingSet bindingSet)
Description copied from interface: IRelation
Create and return a new element. The element is constructed from the predicate given the bindings. Typically, this is used when generating an ISolution for an IRule during either a query or mutation operations. The element is NOT inserted into the relation.

Parameters:
predicate - The predicate that is the head of some IRule.
bindingSet - A set of bindings for that IRule.
Returns:
The new element.

getElementClass

public Class<?> getElementClass()
Description copied from interface: IRelation
Return the class for the generic type of this relation. This information is used to dynamically create arrays of that generic type.



Copyright © 2006-2009 SYSTAP, LLC. All Rights Reserved.