|
||||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | |||||||||
java.lang.Objectcom.bigdata.relation.AbstractResource<IRelation<E>>
com.bigdata.relation.AbstractRelation
com.bigdata.search.FullTextIndex
public class FullTextIndex
Full text indexing and search support.
The basic data model consists of documents, fields in documents, and tokens extracted by an analyzer from those fields.
The frequency distributions may be normalized to account for a variety of effects producing "term weights". For example, normalizing for document length or relative frequency of a term in the overall collection. Therefore the logical model is:
token : {docId, freq?, weight?}+
(For RDF, docId is the term identifier as assigned by the term:id index.)
The freq and weight are optional values that are representative of the kinds of statistical data that are kept on a per-token-document basis. The freq is the token frequency (the frequency of occurrence of the token in the document). The weight is generally a normalized token frequency weight for the token in that document in the context of the overall collection.
In fact, we actually represent the data as follows:
{sortKey(token), docId, fldId} : {freq?, weight?, sorted(pos)+}
That is, there is a distinct entry in the full text B+Tree for each field in
each document in which a given token was recognized. The text of the token is
not stored in the key, just the Unicode sort key generated from the token
text. The value associated with the B+Tree entry is optional - it is simply
not used unless we are storing statistics for the token-document pair. The
advantages of this approach are: (a) it reuses the existing B+Tree data
structures efficiently; (b) we are never faced with the possibility overflow
when a token is used in a large number of documents. The entries for the
token will simply be spread across several leaves in the B+Tree; (c) leading
key compression makes the resulting B+Tree very efficient; and (d) in a
scale-out range partitioned index we can load balance the resulting index
partitions by choosing the partition based on an even token boundary.
A field is any pre-identified text container within a document. Field
identifiers are integers, so there are 32^2 distinct possible
field identifiers. It is possible to manage the field identifiers through a
secondary index, but that has no direct bearing on the structure of the full
text index itself. Field identifies appear after the token in the key so that
queries may be expressed that will be matched against any field in the
document. Likewise, field identifiers occur before the document identifier in
the key since we always search across documents (in a search key, the
document identifier is always Long.MIN_VALUE and the field identifier
is always Integer.MIN_VALUE). There are many applications for fields:
for example, distinct fields may be used for the title, abstract, and full
text of a document or for the CDATA section of each distinct element in
documents corresponding to some DTD. The application is responsible for
recognizing the fields in the document and producing the appropriate token
stream, each of which must be tagged by the field.
A query is tokenized, producing a (possibly normalized) token-frequency vector. The relevance of documents to the query is generally taken as the cosine between the query's and each document's (possibly normalized) token-frequency vectors. The main effort of search is assembling a token frequency vector for just those documents with which there is an overlap with the query. This is done using a key range scan for each token in the query against the full text index.
fromKey := token, Long.MIN_VALUE
toKey := successor(token), Long.MIN_VALUE
and extracting the appropriate token frequency, normalized token weight, or
other statistic. When no value is associated with the entry we follow the
convention of assuming a token frequency of ONE (1) for each document in
which the token appears.
Tokenization is informed by the language code (when declared) and by the
configured Locale for the database otherwise. An appropriate
Analyzer is chosen based on the language code or Locale and
the "document" is broken into a token-frequency distribution (alternatively a
set of tokens). The same process is used to tokenize queries, and the API
allows the caller to specify the language code used to select the
Analyzer to tokenize the query.
Once the tokens are formed the language code / Locale used to produce
the token is discarded (it is not represented in the index). The reason for
this is that we never utilize the total ordering of the full text index,
merely the manner in which it groups tokens that map onto the same Unicode
sort key together. Further, we use only a single Unicode collator
configuration regardless of the language family in which the token was
originally expressed. Unlike the collator used by the terms index (which
often is set at IDENTICAL strength), the collector used by the full text
index should be chosen such that it makes relatively few distinctions in
order to increase recall (e.g., set at PRIMARY strength). Since a total order
over the full text index is not critical from the perspective of its IR
application, the Locale for the collator is likewise not critical and
PRIMARY strength will produce significantly shorter Unicode sort keys.
The term frequency within that literal is an optional property associated with each term identifier, as is the computed weight for the token in the term.
Note: Documents should be tokenized using an Analyzer appropriate for
their declared language code (if any). However, once tokenized, the language
code is discarded and we perform search purely on the Unicode sort keys
resulting from the extracted tokens.
In order to removal of the fields for a document we need to know either which fields were indexed for the document and the tokens found in those fields and then scatter the removal request (additional space requirements) or we need to flood a delete procedure across the terms index (expensive)., provide M/R alternatives for indexing or computing/updating global weights., Consider model in which fields are declared and then a "Document" is indexed. This lets us encapsulate the "driver" for indexing. The "field" can be a String or a Reader, etc.
Note that lucene handles declaration of the data that will be stored
for a field on a per Document basis {none, character offsets, character
offsets + token positions}. There is also an option to store the term
vector itself. Finally, there are options to store, compress+store, or
not store the field value. You can also choose {None, IndexTokenized,
IndexUntokenized} and an option dealing with norms., lucene Analyzers may be problematic. For example, it is
difficult to tokenize numbers. consider replacing the lucene
analyzer/tokenizer with our own stuff. this might help with
tokenization of numbers, etc. and with tokenization of native html or
xml with intact offsets., lucene analyzers will strip stopwords by default. There should be a
configuration option to strip out stopwords and another to enable
stemming. how we do that should depend on the language family.
Likewise, there should be support for language family specific stopword
lists and language family specific exclusions., key compression: prefix compression, possibly with hamming codes for
the docId and fieldId., support more term weighting schemes and make them easy to configure., is there an appropriate generic type for the full text index?
Basically, it contains "document fields", or at least their indexed
terms.
| Nested Class Summary | |
|---|---|
static interface |
FullTextIndex.Options
Options understood by the FullTextIndex. |
| Field Summary | |
|---|---|
protected static org.apache.log4j.Logger |
log
|
static String |
NAME_SEARCH
The basename of the search index. |
| Constructor Summary | |
|---|---|
FullTextIndex(IIndexManager indexManager,
String namespace,
Long timestamp,
Properties properties)
Ctor specified by DefaultResourceLocator. |
|
| Method Summary | |
|---|---|
protected void |
assertWritable()
|
void |
create()
Conditionally registers the necessary index(s). |
long |
delete(IChunkedOrderedIterator itr)
Remove elements from the relation. |
void |
destroy()
Destroy any logically contained resources (relations, indices). |
IAccessPath |
getAccessPath(IPredicate predicate)
Return the best IAccessPath for a relation given a predicate with
zero or more unbound variables. |
protected org.apache.lucene.analysis.Analyzer |
getAnalyzer(String languageCode)
Return the token analyzer to be used for the given language code. |
Class<?> |
getElementClass()
Return the class for the generic type of this relation. |
IIndex |
getIndex()
The index used to associate term identifiers with tokens parsed from documents. |
Set<String> |
getIndexNames()
Return the fully qualified name of each index maintained by this relation. |
protected IKeyBuilder |
getKeyBuilder()
Return a ThreadLocal IKeyBuilder instance configured to
support full text indexing and search. |
protected static byte[] |
getTokenKey(IKeyBuilder keyBuilder,
String termText,
boolean successor,
long docId,
int fieldId)
Create a key for a term. |
protected org.apache.lucene.analysis.TokenStream |
getTokenStream(String languageCode,
Reader r)
Tokenize text using an Analyzer that is appropriate to the
specified language family. |
protected byte[] |
getTokenValue(ByteArrayBuffer buf,
TermMetadata metadata)
Return the byte[] that is the encoded value for per-{token,docId,fieldId} entry in the index. |
void |
index(TokenBuffer buffer,
long docId,
int fieldId,
String languageCode,
Reader r)
Index a field in a document. |
long |
insert(IChunkedOrderedIterator itr)
Write elements on the relation. |
boolean |
isOverwrite()
Return the value configured by the FullTextIndex.Options.OVERWRITE property. |
boolean |
isReadOnly()
true unless {AbstractResource.getTimestamp() is ITx.UNISOLATED. |
Object |
newElement(IPredicate predicate,
IBindingSet bindingSet)
Create and return a new element. |
Hiterator |
search(String query,
String languageCode)
Performs a full text search against indexed documents returning a hit list using a default minCosine of 0.4, a default maxRank
of 10,000, and the configured default timeout. |
Hiterator |
search(String query,
String languageCode,
boolean prefixMatch)
|
Hiterator |
search(String query,
String languageCode,
boolean prefixMatch,
double minCosine,
int maxRank,
long timeout,
TimeUnit unit)
Performs a full text search against indexed documents returning a hit list. |
Hiterator |
search(String query,
String languageCode,
double minCosine,
int maxRank)
Performs a full text search against indexed documents returning a hit list using the configured default timeout. |
| Methods inherited from class com.bigdata.relation.AbstractRelation |
|---|
getFQN, getIndex, getIndex, newIndexMetadata |
| Methods inherited from class com.bigdata.relation.AbstractResource |
|---|
acquireExclusiveLock, getChunkCapacity, getChunkOfChunksCapacity, getChunkTimeout, getContainer, getContainerNamespace, getExecutorService, getFullyBufferedReadThreshold, getIndexManager, getMaxParallelSubqueries, getNamespace, getProperties, getProperty, getProperty, getTimestamp, isForceSerialExecution, isNestedSubquery, toString, unlock |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Methods inherited from interface com.bigdata.relation.IRelation |
|---|
getExecutorService, getIndexManager |
| Methods inherited from interface com.bigdata.relation.locator.ILocatableResource |
|---|
getContainerNamespace, getNamespace, getTimestamp |
| Field Detail |
|---|
protected static final transient org.apache.log4j.Logger log
public static final transient String NAME_SEARCH
| Constructor Detail |
|---|
public FullTextIndex(IIndexManager indexManager,
String namespace,
Long timestamp,
Properties properties)
DefaultResourceLocator.
client - The client. Configuration information is obtained from the
client. See FullTextIndex.Options.FullTextIndex.OptionsISplitHandler such that we never split a
{term,doc} tuple?, Make sure that the defaults for the split points (in terms of the
#of entries) are reasonable. The #of entries per split could be
smaller if we know that we are storing more data in the values.| Method Detail |
|---|
public IIndex getIndex()
public boolean isOverwrite()
FullTextIndex.Options.OVERWRITE property.
public final boolean isReadOnly()
true unless {AbstractResource.getTimestamp() is ITx.UNISOLATED.
protected void assertWritable()
public void create()
create in interface IMutableResourcecreate in class AbstractResourceIllegalStateException - if the client does not have write access.AbstractResource.acquireExclusiveLock() since I generally
allocate the text index inside of another relation and
AbstractResource.acquireExclusiveLock() is not reentrant for zookeeper.public void destroy()
IMutableResource
destroy in interface IMutableResourcedestroy in class AbstractResourceprotected org.apache.lucene.analysis.Analyzer getAnalyzer(String languageCode)
languageCode - The language code or null to use the default
Locale.
protected final IKeyBuilder getKeyBuilder()
ThreadLocal IKeyBuilder instance configured to
support full text indexing and search.
FullTextIndex.Options.INDEXER_COLLATOR_STRENGTH
public void index(TokenBuffer buffer,
long docId,
int fieldId,
String languageCode,
Reader r)
Note: This method does NOT force a write on the indices. If the buffer
overflows, then there will be an index write. Once the caller is done
indexing, they MUST invoke TokenBuffer.flush() to force any data
remaining in their buffer to the indices.
Note: If a document is pre-existing, then the existing data for that document MUST be removed unless you know that the fields to be found in the will not have changed (they may have different contents, but the same fields exist in the old and new versions of the document).
buffer - Used to buffer writes onto the text index.docId - The document identifier.fieldId - The field identifier.languageCode - The language code -or- null to use the default
Locale.r - A reader on the text to be indexed.TokenBuffer.flush()
protected org.apache.lucene.analysis.TokenStream getTokenStream(String languageCode,
Reader r)
Analyzer that is appropriate to the
specified language family.
languageCode - The language code -or- null to use the default
Locale).r - A reader on the text to be indexed.
protected static byte[] getTokenKey(IKeyBuilder keyBuilder,
String termText,
boolean successor,
long docId,
int fieldId)
keyBuilder - Used to construct the key.token - The token whose key will be formed.successor - When true the successor of the token's text
will be encoded into the key. This is useful when forming the
toKey in a search.docId - The document identifier - use Long.MIN_VALUE when forming a
search key.fieldId - The field identifier - use Integer.MIN_VALUE when forming a
search key.
protected byte[] getTokenValue(ByteArrayBuffer buf,
TermMetadata metadata)
buf - Used to encode the value.metadata - Metadata about the term.
public Hiterator search(String query,
String languageCode)
0.4, a default maxRank
of 10,000, and the configured default timeout.
query - The query (it will be parsed into tokens).languageCode - The language code that should be used when tokenizing the
query (an empty string will be interpreted as the default
Locale).
Iterator which may be used to traverse the search
results in order of decreasing relevance to the query.FullTextIndex.Options.INDEXER_TIMEOUT
public Hiterator search(String query,
String languageCode,
boolean prefixMatch)
public Hiterator search(String query,
String languageCode,
double minCosine,
int maxRank)
query - The query (it will be parsed into tokens).languageCode - The language code that should be used when tokenizing the
query -or- null to use the default
Locale).minCosine - The minimum cosine that will be returned.maxRank - The upper bound on the #of hits in the result set.
FullTextIndex.Options.INDEXER_TIMEOUT
public Hiterator search(String query,
String languageCode,
boolean prefixMatch,
double minCosine,
int maxRank,
long timeout,
TimeUnit unit)
The basic algorithm computes cosine between the term-frequency vector of the query and the indexed "documents". The cosine may be directly interpreted as the "relevance" of a "document" to the query. The query and document term-frequency vectors are normalized, so the cosine values are bounded in [0.0:1.0]. The higher the cosine the more relevant the document is to the query. A cosine of less than .4 is rarely of any interest.
The implementation creates and runs a set parallel tasks, one for each distinct token found in the query, and waits for those tasks to complete or for a timeout to occur. Each task uses a key-range scan on the terms index, collecting metadata for the matching "documents" and aggregating it on a "hit" for that document. Since the tasks run concurrently, there are concurrent writers on the "hits". On a timeout, the remaining tasks are interrupted.
The collection of hits is scored and hits that fail a threshold are
discarded. The remaining hits are placed into a total order and the
caller is returned an iterator which can read from that order. If the
operation is interrupted, then only those IHits that have
already been computed will be returned.
query - The query (it will be parsed into tokens).languageCode - The language code that should be used when tokenizing the
query -or- null to use the default
Locale).minCosine - The minimum cosine that will be returned.maxRank - The upper bound on the #of hits in the result set.prefixMatch - Option is not implemented yettimeout - The timeout -or- ZERO (0) for NO timeout (this is equivalent
to using Long.MAX_VALUE).unit - The unit in which the timeout is expressed.
TemporaryStore. The fromIndex/toIndex
might be with respect to that short-term result set. Reclaim result
sets after N seconds., consider other kinds of queries that we might write here. For
example, full text search should support AND OR NOT operators for
tokens., allow search within field(s). This will be a filter on the range
iterator that is sent to the data service such that the search
terms are visited only when they occur in the matching field(s).public long delete(IChunkedOrderedIterator itr)
IMutableRelation
itr - An iterator visiting the elements to be removed. Existing
elements in the relation having a key equal to the key formed
from the visited elements will be removed from the relation.
public long insert(IChunkedOrderedIterator itr)
IMutableRelation
itr - An iterator visiting the elements to be written.
public IAccessPath getAccessPath(IPredicate predicate)
IRelationIAccessPath for a relation given a predicate with
zero or more unbound variables.
If there is an IIndex that directly corresponeds to the natural
order implied by the variable pattern on the predicate then the access
path should use that index. Otherwise you should choose the best index
given the constraints and make sure that the IAccessPath
incorporates additional filters that will allow you to filter out the
irrelevant ITuples during the scan - this is very important when
the index is remote!
If there are any IElementFilters then the access path MUST
incorporate those constraints such that only elements that satisify the
constraints may be visited.
Whether the constraints arise because of the lack of a perfect index for
the access path or because they were explicitly specified for the
IPredicate, those constraints should be translated into
constraints imposed on the underlying ITupleIterator and sent
with it to be evaluated local to the data.
Note: Filters should be specified when the IAccessPath is
constructed so that they will be evalated on the data service rather than
materializing the elements and then filtering then. This can be
accomplished by adding the filter as a constraint on the predicate when
specifying the access path.
predicate - The constraint on the elements to be visited.
IAccessPath for that IPredicate.public Set<String> getIndexNames()
IRelation
public Object newElement(IPredicate predicate,
IBindingSet bindingSet)
IRelationISolution for an IRule during either a query or mutation
operations. The element is NOT inserted into the relation.
predicate - The predicate that is the head of some IRule.bindingSet - A set of bindings for that IRule.
public Class<?> getElementClass()
IRelation
|
||||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | |||||||||