com.bigdata.rdf.lexicon
Class LexiconRelation

java.lang.Object
  extended by com.bigdata.relation.AbstractResource<IRelation<E>>
      extended by com.bigdata.relation.AbstractRelation<BigdataValue>
          extended by com.bigdata.rdf.lexicon.LexiconRelation
All Implemented Interfaces:
IMutableRelation<BigdataValue>, IMutableResource<IRelation<BigdataValue>>, IRelation<BigdataValue>, ILocatableResource<IRelation<BigdataValue>>

public class LexiconRelation
extends AbstractRelation<BigdataValue>

The LexiconRelation handles all things related to the indices mapping RDF Values onto internal 64-bit term identifiers.

The term2id index has all the distinct terms ever asserted. Those "terms" include {s:p:o} keys for statements IFF statement identifiers are in use. However, BNodes are NOT stored in the forward index, even though the forward index is used to assign globally unique term identifiers for blank nodes. See BigdataValueFactoryImpl.createBNode().

The id2term index only has URIs and Literals. It CANNOT be used to resolve either BNodes or statement identifiers. In fact, there is NO means to resolve either a statement identifier or a blank node. Both are always assigned (consistently) within a context in which their referent (if any) is defined. For a statement identifier the referent MUST be defined by an instance of the statement itself. The RIO parser integration and the IStatementBuffer implementations handle these assignments.

See KeyBuilder.Options for properties that control how the sort keys are generated for the URIs and Literals.

Version:
$Id: LexiconRelation.java 2265 2009-10-26 12:51:06Z thompsonbry $
Author:
Bryan Thompson

Nested Class Summary
 
Nested classes/interfaces inherited from class com.bigdata.relation.AbstractResource
AbstractResource.Options
 
Field Summary
protected static org.apache.log4j.Logger log
           
static String NAME_LEXICON_RELATION
          Constant for the LexiconRelation namespace component.
 
Constructor Summary
LexiconRelation(IIndexManager indexManager, String namespace, Long timestamp, Properties properties)
          Note: The term:id and id:term indices MUST use unisolated write operations to ensure consistency without write-write conflicts.
 
Method Summary
 void addStatementIdentifiers(ISPO[] a, int n)
          Assign unique statement identifiers to triples.
 void addTerms(BigdataValue[] terms, int numTerms, boolean readOnly)
          Batch insert of terms into the database.
 void create()
          Create any logically contained resources (relations, indices).
 long delete(IChunkedOrderedIterator<BigdataValue> itr)
          Note: This method is part of the mutation API.
 void destroy()
          Destroy any logically contained resources (relations, indices).
 StringBuilder dumpTerms()
          Dumps the lexicon in a variety of ways.
 boolean exists()
           
 KVO<BigdataValue>[] generateSortKeys(LexiconKeyBuilder keyBuilder, BigdataValue[] terms, int numTerms)
          Generate the sort keys for the terms.
 IAccessPath<BigdataValue> getAccessPath(IPredicate<BigdataValue> predicate)
          Return the best IAccessPath for a relation given a predicate with zero or more unbound variables.
 AbstractTripleStore getContainer()
          Strengthens the return type.
 Class<BigdataValue> getElementClass()
          Return the class for the generic type of this relation.
 IIndex getId2TermIndex()
           
protected  IndexMetadata getId2TermIndexMetadata(String name)
           
 IIndex getIndex(IKeyOrder<? extends BigdataValue> keyOrder)
          Overridden to return the hard reference for the index.
 Set<String> getIndexNames()
          Return the fully qualified name of each index maintained by this relation.
 FullTextIndex getSearchEngine()
          A factory returning the softly held singleton for the FullTextIndex.
 BigdataValue getTerm(long id)
          Note: BNodes are not stored in the reverse lexicon and are recognized using AbstractTripleStore.isBNode(long).
 IIndex getTerm2IdIndex()
           
protected  IndexMetadata getTerm2IdIndexMetadata(String name)
           
 long getTermId(Value value)
          Note: If BigdataValue.getTermId() is set, then returns that value immediately.
 int getTermIdBitsToReverse()
          The #of low bits from the term identifier that are reversed and rotated into the high bits when it is assigned.
 Map<Long,BigdataValue> getTerms(Collection<Long> ids)
          Batch resolution of term identifiers to BigdataValues.
 BigdataValueFactoryImpl getValueFactory()
          The canonical BigdataValueFactoryImpl reference (JVM wide) for the lexicon namespace.
 Iterator<Value> idTermIndexScan()
          Iterator visits all terms in order by their assigned term identifiers (efficient index scan, but the terms are not in term order).
protected  void indexTermText(int capacity, Iterator<BigdataValue> itr)
           Add the terms to the full text index so that we can do fast lookup of the corresponding term identifiers.
 long insert(IChunkedOrderedIterator<BigdataValue> itr)
          Note: This method is part of the mutation API.
 boolean isStoreBlankNodes()
          true iff blank nodes are being stored in the lexicon's forward index.
 boolean isTextIndex()
          true iff the full text index is enabled.
 BigdataValue newElement(IPredicate<BigdataValue> predicate, IBindingSet bindingSet)
          Note: This method is part of the mutation API.
 Iterator<Long> prefixScan(Literal lit)
          A scan of all literals having the given literal as a prefix.
 Iterator<Long> prefixScan(Literal[] lits)
          A scan of all literals having any of the given literals as a prefix.
 Iterator<Long> termIdIndexScan()
          Iterator visits all term identifiers in order by the term key (efficient index scan).
 Iterator<Value> termIterator()
          Visits all terms in term key order (random index operation).
 
Methods inherited from class com.bigdata.relation.AbstractRelation
getFQN, getIndex, newIndexMetadata
 
Methods inherited from class com.bigdata.relation.AbstractResource
acquireExclusiveLock, getChunkCapacity, getChunkOfChunksCapacity, getChunkTimeout, getContainerNamespace, getExecutorService, getFullyBufferedReadThreshold, getIndexManager, getMaxParallelSubqueries, getNamespace, getProperties, getProperty, getProperty, getTimestamp, isForceSerialExecution, isNestedSubquery, toString, unlock
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface com.bigdata.relation.IRelation
getExecutorService, getIndexManager
 
Methods inherited from interface com.bigdata.relation.locator.ILocatableResource
getContainerNamespace, getNamespace, getTimestamp
 

Field Detail

log

protected static final org.apache.log4j.Logger log

NAME_LEXICON_RELATION

public static final transient String NAME_LEXICON_RELATION
Constant for the LexiconRelation namespace component.

Note: To obtain the fully qualified name of an index in the LexiconRelation you need to append a "." to the relation's namespace, then this constant, then a "." and then the local name of the index.
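
The naming rule in the note above can be sketched as a simple string concatenation. This is an illustration only: the value "lex" for the constant, the namespace "kb", and the local index name "TERM2ID" are assumptions made for the example, and AbstractRelation.getFQN(IKeyOrder) remains the authoritative way to obtain the name.

```java
// Sketch only: composes an index name following the rule stated in the note.
final class LexiconIndexNames {

    // Assumption: placeholder value. Check "Constant Field Values" for the real one.
    static final String NAME_LEXICON_RELATION = "lex";

    /** Returns namespace + "." + NAME_LEXICON_RELATION + "." + localName. */
    static String fullyQualifiedName(String namespace, String localName) {
        return namespace + "." + NAME_LEXICON_RELATION + "." + localName;
    }

    public static void main(String[] args) {
        // "kb" and "TERM2ID" are hypothetical values used for illustration.
        System.out.println(fullyQualifiedName("kb", "TERM2ID"));
    }
}
```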

See Also:
AbstractRelation.getFQN(IKeyOrder), Constant Field Values
Constructor Detail

LexiconRelation

public LexiconRelation(IIndexManager indexManager,
                       String namespace,
                       Long timestamp,
                       Properties properties)
Note: The term:id and id:term indices MUST use unisolated write operations to ensure consistency without write-write conflicts. The only exception would be a read-historical view.

Parameters:
indexManager -
namespace -
timestamp -
properties -
Method Detail

getValueFactory

public BigdataValueFactoryImpl getValueFactory()
The canonical BigdataValueFactoryImpl reference (JVM wide) for the lexicon namespace.


getContainer

public AbstractTripleStore getContainer()
Strengthens the return type.

Overrides:
getContainer in class AbstractResource<IRelation<BigdataValue>>
Returns:
The container -or- null if there is no container.

exists

public boolean exists()

create

public void create()
Description copied from interface: IMutableResource
Create any logically contained resources (relations, indices).

Specified by:
create in interface IMutableResource<IRelation<BigdataValue>>
Overrides:
create in class AbstractResource<IRelation<BigdataValue>>

destroy

public void destroy()
Description copied from interface: IMutableResource
Destroy any logically contained resources (relations, indices).

Specified by:
destroy in interface IMutableResource<IRelation<BigdataValue>>
Overrides:
destroy in class AbstractResource<IRelation<BigdataValue>>

getTermIdBitsToReverse

public final int getTermIdBitsToReverse()
The #of low bits from the term identifier that are reversed and rotated into the high bits when it is assigned.
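
The transformation described above can be pictured with the following sketch. It is not taken from the implementation: the class and method names are hypothetical, and the real code may treat some bits specially. The general effect of such a transformation is that identifiers drawn from a sequential counter, which differ mostly in their low bits, end up differing in their high bits and are therefore spread across the key range of an index ordered by identifier.

```java
// Sketch only: one way to reverse the n low bits of a 64-bit value and
// rotate them into the n high bits. Names are hypothetical.
final class TermIdBits {

    /**
     * Reverses the n low bits of v and places them in the n high bits.
     * The remaining 64 - n bits shift down to occupy the low positions.
     */
    static long reverseLowBitsIntoHigh(long v, int n) {
        if (n < 0 || n > 63) {
            throw new IllegalArgumentException("n must be in [0, 63]");
        }
        if (n == 0) {
            return v; // nothing to reverse
        }
        long low = v & ((1L << n) - 1);         // isolate the n low bits
        long reversedHigh = Long.reverse(low);  // bit i moves to bit 63 - i
        long rest = v >>> n;                    // the remaining high bits, shifted down
        return reversedHigh | rest;
    }

    public static void main(String[] args) {
        // Consecutive counter values land in different quarters of the key space.
        for (long id = 0; id < 4; id++) {
            System.out.println(id + " -> 0x" + Long.toHexString(reverseLowBitsIntoHigh(id, 2)));
        }
    }
}
```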

See Also:
AbstractTripleStore.Options#TERMID_BITS_TO_REVERSE

isStoreBlankNodes

public final boolean isStoreBlankNodes()
true iff blank nodes are being stored in the lexicon's forward index.

See Also:
AbstractTripleStore.Options#STORE_BLANK_NODES

isTextIndex

public final boolean isTextIndex()
true iff the full text index is enabled.

See Also:
AbstractTripleStore.Options#TEXT_INDEX

getIndex

public IIndex getIndex(IKeyOrder<? extends BigdataValue> keyOrder)
Overridden to return the hard reference for the index.

Overrides:
getIndex in class AbstractRelation<BigdataValue>
Parameters:
keyOrder - The natural index order.
Returns:
The index -or- null iff the index does not exist as of the timestamp for this view of the relation.
See Also:
FIXME For efficiency the concrete implementations need to override this saving a hard reference to the index and then using a switch like construct to return the correct hard reference. This behavior should be encapsulated.

getTerm2IdIndex

public final IIndex getTerm2IdIndex()

getId2TermIndex

public final IIndex getId2TermIndex()

getSearchEngine

public FullTextIndex getSearchEngine()
A factory returning the softly held singleton for the FullTextIndex.

See Also:
Options#TEXT_INDEX
TODO:
Replace with the use of the IResourceLocator, since it already imposes a canonicalizing mapping for the index name and timestamp within a JVM.

getTerm2IdIndexMetadata

protected IndexMetadata getTerm2IdIndexMetadata(String name)

getId2TermIndexMetadata

protected IndexMetadata getId2TermIndexMetadata(String name)

getIndexNames

public Set<String> getIndexNames()
Description copied from interface: IRelation
Return the fully qualified name of each index maintained by this relation.

Returns:
An immutable set of the index names for the relation.

getAccessPath

public IAccessPath<BigdataValue> getAccessPath(IPredicate<BigdataValue> predicate)
Description copied from interface: IRelation
Return the best IAccessPath for a relation given a predicate with zero or more unbound variables.

If there is an IIndex that directly corresponds to the natural order implied by the variable pattern on the predicate then the access path should use that index. Otherwise you should choose the best index given the constraints and make sure that the IAccessPath incorporates additional filters that will allow you to filter out the irrelevant ITuples during the scan - this is very important when the index is remote!

If there are any IElementFilters then the access path MUST incorporate those constraints such that only elements that satisfy the constraints may be visited.

Whether the constraints arise because of the lack of a perfect index for the access path or because they were explicitly specified for the IPredicate, those constraints should be translated into constraints imposed on the underlying ITupleIterator and sent with it to be evaluated local to the data.

Note: Filters should be specified when the IAccessPath is constructed so that they will be evaluated on the data service rather than materializing the elements and then filtering them. This can be accomplished by adding the filter as a constraint on the predicate when specifying the access path.

Parameters:
predicate - The constraint on the elements to be visited.
Returns:
The best IAccessPath for that IPredicate.
Throws:
UnsupportedOperationException
TODO:
Not implemented yet. This could be used for high-level query, but there are no rules written so far that join against the LexiconRelation.

newElement

public BigdataValue newElement(IPredicate<BigdataValue> predicate,
                               IBindingSet bindingSet)
Note: This method is part of the mutation API. It is primarily (at this point, only) invoked by the rule execution layer and, at present, no rules can entail terms into the lexicon.

Parameters:
predicate - The predicate that is the head of some IRule.
bindingSet - A set of bindings for that IRule.
Returns:
The new element.
Throws:
UnsupportedOperationException

getElementClass

public Class<BigdataValue> getElementClass()
Description copied from interface: IRelation
Return the class for the generic type of this relation. This information is used to dynamically create arrays of that generic type.


delete

public long delete(IChunkedOrderedIterator<BigdataValue> itr)
Note: This method is part of the mutation API. It is primarily (at this point, only) invoked by the rule execution layer and, at present, no rules can entail terms into the lexicon.

Parameters:
itr - An iterator visiting the elements to be removed. Existing elements in the relation having a key equal to the key formed from the visited elements will be removed from the relation.
Returns:
The #of elements that were actually removed from the relation.
Throws:
UnsupportedOperationException

insert

public long insert(IChunkedOrderedIterator<BigdataValue> itr)
Note: This method is part of the mutation API. It is primarily (at this point, only) invoked by the rule execution layer and, at present, no rules can entail terms into the lexicon.

Parameters:
itr - An iterator visiting the elements to be written.
Returns:
The #of elements that were actually written on the relation.
Throws:
UnsupportedOperationException

prefixScan

public Iterator<Long> prefixScan(Literal lit)
A scan of all literals having the given literal as a prefix.

Parameters:
lit - A literal.
Returns:
An iterator visiting the term identifiers for the matching Literals.
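
As an illustration of the general technique, a prefix scan over any sorted key-to-identifier mapping can be sketched as below. This is a simplified model, not the implementation: a TreeMap of plain strings stands in for the term:id index, whose real keys are generated sort keys, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch only: prefix scan over a sorted map standing in for the term:id index.
final class PrefixScanSketch {

    /** Visits the identifiers of all keys in the sorted map that start with the prefix. */
    static Iterator<Long> prefixScan(NavigableMap<String, Long> term2id, String prefix) {
        List<Long> ids = new ArrayList<>();
        // tailMap positions the scan at the first key >= prefix.
        for (Map.Entry<String, Long> e : term2id.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // keys sharing a prefix are contiguous, so the scan is done
            }
            ids.add(e.getValue());
        }
        return ids.iterator();
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> index = new TreeMap<>();
        index.put("apple", 1L);
        index.put("apply", 2L);
        index.put("banana", 3L);
        Iterator<Long> itr = prefixScan(index, "app");
        while (itr.hasNext()) {
            System.out.println(itr.next());
        }
    }
}
```

Because the matching keys are contiguous in the sorted order, the scan touches only the matching range plus one extra key.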

prefixScan

public Iterator<Long> prefixScan(Literal[] lits)
A scan of all literals having any of the given literals as a prefix.

Parameters:
lits - An array of literals.
Returns:
An iterator visiting the term identifiers for the matching Literals.
TODO:
The prefix scan can be refactored as an IElementFilter applied to the lexicon. This would let it be used directly from IRules. (There is no direct dependency on this class other than for access to the index, and the rules already provide that).

generateSortKeys

public final KVO<BigdataValue>[] generateSortKeys(LexiconKeyBuilder keyBuilder,
                                                  BigdataValue[] terms,
                                                  int numTerms)
Generate the sort keys for the terms.

Parameters:
keyBuilder - The object used to generate the sort keys.
terms - The terms whose sort keys will be generated.
numTerms - The #of terms in that array.
Returns:
An array of correlated key-value-object tuples.

Note that KVO.val is null until we know that we need to write it on the reverse index.

See Also:
LexiconKeyBuilder

addTerms

public void addTerms(BigdataValue[] terms,
                     int numTerms,
                     boolean readOnly)
Batch insert of terms into the database.

Note: Duplicate BigdataValue references and BigdataValues that already have an assigned term identifier are ignored by this operation.

Note: This implementation is designed to use unisolated batch writes on the terms and ids indices that guarantee consistency.

If the full text index is enabled, then the terms will also be inserted into the full text index.

Parameters:
terms - An array whose elements [0:numTerms-1] will be inserted.
numTerms - The #of terms to insert.
readOnly - When true, unknown terms will not be inserted into the database. Otherwise unknown terms are inserted into the database.
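
The contract described above (duplicates and already-identified values are skipped, and readOnly suppresses inserts) can be modelled with a toy in-memory lexicon. This is a sketch under stated assumptions, not the actual implementation: Term is a hypothetical stand-in for BigdataValue, 0 is assumed to mean "no identifier assigned", identifiers come from a simple counter, and the sketch assumes that known terms are still resolved when readOnly is true.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a toy lexicon that mimics the documented addTerms contract.
final class AddTermsSketch {

    static final long NULL = 0L; // assumption: 0 marks "no identifier assigned"

    /** Hypothetical stand-in for BigdataValue: a label plus a mutable term identifier. */
    static final class Term {
        final String label;
        long termId = NULL;
        Term(String label) { this.label = label; }
    }

    private final Map<String, Long> term2id = new HashMap<>(); // forward index
    private final Map<Long, String> id2term = new HashMap<>(); // reverse index
    private long nextId = 1;

    void addTerms(Term[] terms, int numTerms, boolean readOnly) {
        for (int i = 0; i < numTerms; i++) {
            Term t = terms[i];
            if (t.termId != NULL) {
                continue; // already identified (covers duplicate references too)
            }
            Long id = term2id.get(t.label);
            if (id == null) {
                if (readOnly) {
                    continue; // unknown term: leave it unassigned, do not insert
                }
                id = nextId++;
                term2id.put(t.label, id);
                id2term.put(id, t.label);
            }
            t.termId = id; // identifier is set on the caller's object as a side effect
        }
    }

    int size() { return term2id.size(); }

    public static void main(String[] args) {
        AddTermsSketch lexicon = new AddTermsSketch();
        Term a = new Term("a");
        lexicon.addTerms(new Term[] { a, a }, 2, false);
        System.out.println("a -> " + a.termId + ", distinct terms: " + lexicon.size());
    }
}
```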

addStatementIdentifiers

public void addStatementIdentifiers(ISPO[] a,
                                    int n)
Assign unique statement identifiers to triples.

Each distinct StatementEnum.Explicit {s,p,o} is assigned a unique statement identifier using the LexiconKeyOrder.TERM2ID index. The assignment of statement identifiers is consistent using an unisolated atomic write operation similar to addTerms(BigdataValue[], int, boolean).

Note: Statement identifiers are NOT inserted into the reverse (id:term) index. Instead, they are written into the values associated with the {s,p,o} in each of the statement indices. That is handled by AbstractTripleStore.addStatements(AbstractTripleStore, boolean, IChunkedOrderedIterator, IElementFilter), which is also responsible for invoking this method in order to have the statement identifiers on hand before it writes on the statement indices.

Note: The caller's ISPO[] is sorted into SPO order as a side-effect.

Note: The statement identifiers are assigned to the ISPOs as a side-effect.

Note: SIDs are NOT supported for quads, so this code is never executed for quads.
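
The two side effects noted above (the array is sorted into SPO order, and identifiers are written onto the caller's objects) can be pictured with this simplified model. It is a sketch, not the implementation: Spo is a hypothetical stand-in for ISPO, a HashMap stands in for the TERM2ID index, and identifiers come from a simple counter.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: assigns one identifier per distinct explicit {s,p,o}.
final class StatementIdSketch {

    /** Hypothetical stand-in for ISPO. */
    static final class Spo {
        final long s, p, o;
        final boolean explicit;
        long sid = 0L; // 0 means "no statement identifier" in this sketch
        Spo(long s, long p, long o, boolean explicit) {
            this.s = s; this.p = p; this.o = o; this.explicit = explicit;
        }
    }

    private final Map<List<Long>, Long> sids = new HashMap<>(); // stand-in for TERM2ID
    private long nextSid = 1;

    void addStatementIdentifiers(Spo[] a, int n) {
        // Side effect 1: the caller's array is sorted into SPO order.
        Arrays.sort(a, 0, n, Comparator.<Spo>comparingLong(x -> x.s)
                .thenComparingLong(x -> x.p)
                .thenComparingLong(x -> x.o));
        for (int i = 0; i < n; i++) {
            Spo x = a[i];
            if (!x.explicit) {
                continue; // only explicit statements receive identifiers
            }
            List<Long> key = List.of(x.s, x.p, x.o);
            // Side effect 2: the identifier is set on the caller's object.
            x.sid = sids.computeIfAbsent(key, k -> nextSid++);
        }
    }

    public static void main(String[] args) {
        Spo[] a = { new Spo(2, 1, 1, true), new Spo(1, 1, 1, true) };
        new StatementIdSketch().addStatementIdentifiers(a, a.length);
        for (Spo x : a) {
            System.out.println(x.s + "," + x.p + "," + x.o + " -> " + x.sid);
        }
    }
}
```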


indexTermText

protected void indexTermText(int capacity,
                             Iterator<BigdataValue> itr)

Add the terms to the full text index so that we can do fast lookup of the corresponding term identifiers. Literals that have a language code property are parsed using a tokenizer appropriate for the specified language family. Other literals and URIs are tokenized using the default Locale.

Parameters:
itr - Iterator visiting the terms to be indexed.
See Also:
#textSearch(String, String)
TODO:
Allow registration of datatype specific tokenizers (we already have language family based lookup).

getTerms

public final Map<Long,BigdataValue> getTerms(Collection<Long> ids)
Batch resolution of term identifiers to BigdataValues.

Parameters:
ids - A collection of term identifiers.
Returns:
A map from term identifier to the BigdataValue. If a term identifier was not resolved then the map will not contain an entry for that term identifier.
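
The return contract above means callers must expect missing entries rather than null values. A minimal sketch of that behaviour, using a plain map of strings as a hypothetical stand-in for the id:term index:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: batch resolution that omits identifiers it cannot resolve.
final class GetTermsSketch {

    static Map<Long, String> getTerms(Map<Long, String> id2term, Collection<Long> ids) {
        Map<Long, String> resolved = new HashMap<>();
        for (Long id : ids) {
            String term = id2term.get(id);
            if (term != null) {
                resolved.put(id, term); // unresolved identifiers get no entry at all
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<Long, String> index = new HashMap<>();
        index.put(1L, "http://example.org/a"); // hypothetical data
        Map<Long, String> out = getTerms(index, List.of(1L, 99L));
        System.out.println(out.containsKey(99L)); // the unknown identifier is absent
    }
}
```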
TODO:
performance tuning for statement pattern scans with resolution to terms, e.g., LUBM Q6.

getTerm

public final BigdataValue getTerm(long id)
Note: BNodes are not stored in the reverse lexicon and are recognized using AbstractTripleStore.isBNode(long).

Note: Statement identifiers (when enabled) are not stored in the reverse lexicon and are recognized using AbstractTripleStore.isStatement(long). If the term identifier is recognized as being, in fact, a statement identifier, then it is externalized as a BNode. This fits rather well with the notion in a quad store that the context position may be either a URI or a BNode and the fact that you can use BNodes to "stamp" statement identifiers.

Note: Handles both unisolatable and isolatable indices.

Note: Sets BigdataValue.getTermId() as a side-effect.

Note: this always mints a new BNode instance when the term identifier identifies a BNode or a Statement.

Returns:
The BigdataValue -or- null iff there is no BigdataValue for that term identifier in the lexicon.

getTermId

public final long getTermId(Value value)
Note: If BigdataValue.getTermId() is set, then returns that value immediately. Otherwise looks up the termId in the index and sets the term identifier as a side-effect.


idTermIndexScan

public Iterator<Value> idTermIndexScan()
Iterator visits all terms in order by their assigned term identifiers (efficient index scan, but the terms are not in term order).

See Also:
termIdIndexScan(), termIterator()

termIdIndexScan

public Iterator<Long> termIdIndexScan()
Iterator visits all term identifiers in order by the term key (efficient index scan).


termIterator

public Iterator<Value> termIterator()
Visits all terms in term key order (random index operation).

Note: While this operation visits the terms in their index order it is significantly less efficient than idTermIndexScan(). This is because the keys in the term:id index are formed using a non-reversible technique such that it is not possible to re-materialize the term from the key. Therefore visiting the terms in term order requires traversal of the term:id index (so that you are in term order) plus term-by-term resolution against the id:term index (to decode the term). Since the two indices are not mutually ordered, that resolution will result in random hits on the id:term index.
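
The cost difference described above can be seen in a small model with two sorted maps. This is a sketch under simplifying assumptions, not the implementation: lower-casing stands in for the lossy sort-key generation (terms that collide on the key are ignored here), identifiers come from a counter, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Sketch only: two indices that are not mutually ordered.
final class TermOrderScanSketch {

    final TreeMap<String, Long> term2id = new TreeMap<>(); // lossy sort key -> identifier
    final TreeMap<Long, String> id2term = new TreeMap<>(); // identifier -> original term
    private long nextId = 1;

    void add(String term) {
        // The key cannot be turned back into the term (case is lost).
        String key = term.toLowerCase(Locale.ROOT);
        if (term2id.containsKey(key)) {
            return;
        }
        long id = nextId++;
        term2id.put(key, id);
        id2term.put(id, term);
    }

    /** Terms in identifier order: a single sequential pass over id2term. */
    List<String> idTermIndexScan() {
        return new ArrayList<>(id2term.values());
    }

    /** Terms in term-key order: scan term2id, then one id2term lookup per term. */
    List<String> termIterator() {
        List<String> out = new ArrayList<>();
        for (long id : term2id.values()) {
            out.add(id2term.get(id)); // lookups arrive in no particular identifier order
        }
        return out;
    }

    public static void main(String[] args) {
        TermOrderScanSketch lexicon = new TermOrderScanSketch();
        lexicon.add("Pear");
        lexicon.add("apple");
        lexicon.add("Mango");
        System.out.println(lexicon.idTermIndexScan());
        System.out.println(lexicon.termIterator());
    }
}
```

In this model termIterator() performs one point lookup per term, which is where the random access pattern on the second index comes from.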


dumpTerms

public StringBuilder dumpTerms()
Dumps the lexicon in a variety of ways.



Copyright © 2006-2009 SYSTAP, LLC. All Rights Reserved.