com.bigdata.rdf.rio
Class AbstractStatementBuffer<F extends Statement,G extends BigdataStatement>

java.lang.Object
  extended by com.bigdata.rdf.rio.AbstractStatementBuffer<F,G>
Type Parameters:
F - The generic type of the source Statement added to the buffer by the callers.
G - The generic type of the BigdataStatements stored in the buffer.
All Implemented Interfaces:
IStatementBuffer<F>, IBuffer<F>
Direct Known Subclasses:
AbstractStatementBuffer.StatementBuffer2

public abstract class AbstractStatementBuffer<F extends Statement,G extends BigdataStatement>
extends Object
implements IStatementBuffer<F>

Class for efficiently converting Statements into BigdataStatements, including resolving term identifiers (or adding entries to the lexicon for unknown terms) as required. The class does not write the converted BigdataStatements onto the database, but that can be easily done using a resolving iterator pattern.

Version:
$Id: AbstractStatementBuffer.java 2265 2009-10-26 12:51:06Z thompsonbry $
Author:
Bryan Thompson
TODO:
RIO also keeps a blank node map so that it can reuse the same blank node object if it sees the same ID more than once. The StatementBuffer does not appear to correctly canonicalize terms when statement identifiers are enabled. Per below, this just needs to be rewritten. The code could be simplified dramatically.

If the value is a BNode, then it goes into a map for canonicalizing blank nodes with a life cycle of the document being loaded. If a statement uses blank nodes then it must be deferred (this is true whether or not statement identifiers are in use), so do NOT make the {s,p,o} canonical since the statement and its terms will be processed later. Otherwise the value goes into a canonicalizing Set (add iff not found and return the value, otherwise return the existing Value). The canonicalized value is used by the statement.

An incremental write will cause all terms in the Value[] to be assigned term identifiers, so they should be BigdataValue objects. The statements now have term identifiers and they are written onto the DB.

When the end of the document is reached, there will be deferred statements iff there were blank nodes. Those are then processed per the existing code. (If statement identifiers exist, then unify blank nodes with statement identifiers; otherwise just assign term identifiers to blank nodes.) Note that the Value[] should be empty after each incremental write. If there are deferred statements, then they already have BigdataValue objects binding their term identifiers. When we process the deferred statements we should only be assigning term identifiers for blank nodes; everything else should already have its term identifier assigned.

Nested Class Summary
static class AbstractStatementBuffer.StatementBuffer2<F extends Statement,G extends BigdataStatement>
          Loads Statements into an RDF database.
 
Field Summary
protected static boolean DEBUG
           
protected static boolean INFO
           
protected static org.apache.log4j.Logger log
           
protected  boolean readOnly
          When true, Values will be resolved against the LexiconRelation and Statements will be resolved against the SPORelation, but unknown Values and unknown Statements WILL NOT be inserted into the corresponding relations.
protected  G[] statementBuffer
          Buffer for accepted BigdataStatements.
 
Constructor Summary
AbstractStatementBuffer(AbstractTripleStore db, boolean readOnly, int capacity)
           
 
Method Summary
 void add(F e)
          Imposes a canonical mapping on the subject, predicate, and object of the given Statement and stores a new BigdataStatement instance in the internal buffer.
 void add(Resource s, URI p, Value o)
          Add an "explicit" statement to the buffer with a "null" context.
 void add(Resource s, URI p, Value o, Resource c)
          Add an "explicit" statement to the buffer.
 void add(Resource s, URI p, Value o, Resource c, StatementEnum type)
          Add a statement to the buffer.
protected  void clear()
          Clears the state associated with the BigdataStatements in the internal buffer but does not discard the blank nodes or deferred statements.
protected  BigdataValue convertValue(Value value)
          Return a canonical BigdataValue instance representing the given value.
 long flush()
          Converts any buffered statements and any deferred statements and then invokes overflow() to flush anything remaining in the buffer.
 AbstractTripleStore getDatabase()
          The database from the ctor.
 AbstractTripleStore getStatementStore()
          Note: Returns the same value as getDatabase() since the distinction is not captured by this class.
 BigdataValueFactory getValueFactory()
          The ValueFactory for Statements and Values created by this class.
protected abstract  int handleProcessedStatements(G[] a)
          Invoked by overflow().
 boolean isEmpty()
          true if there are no buffered statements and no buffered deferred statements
protected  void overflow()
          Invoked each time the statementBuffer buffer would overflow.
protected  void processBufferedValues()
          Efficiently resolves/adds term identifiers for the buffered BigdataValues.
protected  void processDeferredStatements()
          Processes any BigdataStatements in the deferredStatementBuffer, adding them to the statementBuffer, which may cause the latter to overflow().
 void reset()
          Discards all state (term map, bnodes, deferred statements, the buffered statements, and the counter whose value is reported by flush()).
 void setBNodeMap(Map<String,BigdataBNodeImpl> bnodes)
          Set the canonicalizing map for blank nodes based on their ID.
 int size()
          #of buffered statements plus the #of buffered statements that are being deferred.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log

protected static final org.apache.log4j.Logger log

INFO

protected static final boolean INFO

DEBUG

protected static final boolean DEBUG

readOnly

protected final boolean readOnly
When true, Values will be resolved against the LexiconRelation and Statements will be resolved against the SPORelation, but unknown Values and unknown Statements WILL NOT be inserted into the corresponding relations.


statementBuffer

protected final G[] statementBuffer
Buffer for accepted BigdataStatements. This buffer is cleared each time it would overflow.

Constructor Detail

AbstractStatementBuffer

public AbstractStatementBuffer(AbstractTripleStore db,
                               boolean readOnly,
                               int capacity)
Parameters:
db - The database against which the Values will be resolved (or added). If this database supports statement identifiers, then statement identifiers for the converted statements will be resolved (or added) to the lexicon.
readOnly - When true, Values (and statement identifiers iff enabled) will be resolved against the LexiconRelation, but entries WILL NOT be inserted into the LexiconRelation for unknown Values (or for statement identifiers for unknown Statements when statement identifiers are enabled).
capacity - The capacity of the backing buffer.
Method Detail

getDatabase

public AbstractTripleStore getDatabase()
The database from the ctor.

Specified by:
getDatabase in interface IStatementBuffer<F extends Statement>

getStatementStore

public AbstractTripleStore getStatementStore()
Note: Returns the same value as getDatabase() since the distinction is not captured by this class. This MUST be overridden in derived classes which make this distinction.

Specified by:
getStatementStore in interface IStatementBuffer<F extends Statement>

getValueFactory

public BigdataValueFactory getValueFactory()
The ValueFactory for Statements and Values created by this class.


setBNodeMap

public void setBNodeMap(Map<String,BigdataBNodeImpl> bnodes)
Description copied from interface: IStatementBuffer
Set the canonicalizing map for blank nodes based on their ID. This allows you to reuse the same map across multiple IStatementBuffer instances. For example, the BigdataSail does this so that the same bnode map is used throughout the life of a SailConnection. While RIO provides blank node correlation within a given source, it does NOT provide blank node correlation across sources. You need to use this method to do that.

Note: It is reasonable to expect that the bnodes map is used by concurrent threads. For this reason, the map SHOULD be thread-safe. This can be accomplished either using Collections.synchronizedMap(Map) or a ConcurrentHashMap. However, implementations MUST still be synchronized on the map reference across operations which conditionally insert into the map in order to make that update atomic and thread-safe. Otherwise a race condition exists for the conditional insert and different threads could get incoherent answers.

Specified by:
setBNodeMap in interface IStatementBuffer<F extends Statement>
Parameters:
bnodes - The blank nodes map.
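
The note on atomic conditional inserts can be illustrated with a minimal sketch. It does not use the bigdata classes: DemoBNode and BNodeCanonicalizer are hypothetical stand-ins (for BigdataBNodeImpl and for an IStatementBuffer implementation that shares the map). It only shows the pattern of synchronizing on the map reference around the get-then-put pair.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for BigdataBNodeImpl; only the ID matters here. */
final class DemoBNode {
    final String id;

    DemoBNode(String id) {
        this.id = id;
    }
}

/** Hypothetical helper that canonicalizes blank nodes through a shared map. */
final class BNodeCanonicalizer {
    private final Map<String, DemoBNode> bnodes;

    BNodeCanonicalizer(Map<String, DemoBNode> bnodes) {
        this.bnodes = bnodes;
    }

    /** Returns the one shared blank node for the ID, creating it if absent. */
    DemoBNode canonical(String id) {
        // Lock on the map reference so that the get-then-put pair is atomic,
        // even though each individual map call is already thread-safe.
        synchronized (bnodes) {
            DemoBNode existing = bnodes.get(id);
            if (existing == null) {
                existing = new DemoBNode(id);
                bnodes.put(id, existing);
            }
            return existing;
        }
    }
}

final class BNodeMapDemo {
    public static void main(String[] args) {
        Map<String, DemoBNode> shared =
                Collections.synchronizedMap(new HashMap<String, DemoBNode>());
        // Two canonicalizers share one map, as two buffers might within a connection.
        BNodeCanonicalizer first = new BNodeCanonicalizer(shared);
        BNodeCanonicalizer second = new BNodeCanonicalizer(shared);
        System.out.println(first.canonical("b1") == second.canonical("b1")); // true
    }
}
```

Collections.synchronizedMap documents the returned map itself as the mutex, so the synchronized block here cooperates with the wrapper's own locking. With a ConcurrentHashMap, the block would only exclude other callers that follow the same convention.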

convertValue

protected BigdataValue convertValue(Value value)
Return a canonical BigdataValue instance representing the given value. The scope of the canonical instance is until the next internal buffer overflow (URIs and Literals) or until flush() (BNodes, since blank nodes are global for a given source). The purpose of the canonicalizing mapping is to reduce the buffered BigdataValues to the minimum variety required to represent the buffered BigdataStatements, which improves throughput significantly (40%) when resolving terms to the corresponding term identifiers using the LexiconRelation.

Note: This is not a true canonicalizing map when statement identifiers are used since values used in deferred statements will be held over until the buffer is flush()ed. This relaxation of the canonicalizing mapping is not a problem since the purpose of the mapping is to provide better throughput and nothing relies on a pure canonicalization of the Values.

Parameters:
value - A value.
Returns:
The corresponding canonical BigdataValue for the target BigdataValueFactory. This will be null iff the value is null (allows for the context to be undefined).
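
As a rough illustration of the canonicalizing mapping described above, the following sketch keeps one instance per distinct value and hands that instance back for every equal value. ValueCanonicalizer is a hypothetical class written for this example, not part of the bigdata API; it uses plain strings in place of BigdataValues and ignores the separate blank node scope.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical canonicalizer: one retained instance per distinct value. */
final class ValueCanonicalizer<V> {
    private final Map<V, V> distinct = new HashMap<>();

    /** Returns null for null, otherwise the first instance seen that equals the value. */
    V canonical(V value) {
        if (value == null) {
            return null; // allows for an undefined context
        }
        V existing = distinct.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    /** Number of distinct values currently retained. */
    int distinctCount() {
        return distinct.size();
    }

    /** Drops all mappings, as would happen when the internal buffer overflows. */
    void clear() {
        distinct.clear();
    }
}

final class ValueCanonicalizerDemo {
    public static void main(String[] args) {
        ValueCanonicalizer<String> values = new ValueCanonicalizer<>();
        // Two equal but distinct instances, as a parser might produce.
        String a = new String("http://example.org/p");
        String b = new String("http://example.org/p");
        System.out.println(values.canonical(a) == a); // true
        System.out.println(values.canonical(b) == a); // true
        System.out.println(values.distinctCount());   // 1
    }
}
```

Because equal values collapse to one instance, a later bulk resolution step only has to look up each distinct term once.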

isEmpty

public boolean isEmpty()
true if there are no buffered statements and no buffered deferred statements

Specified by:
isEmpty in interface IBuffer<F extends Statement>

size

public int size()
#of buffered statements plus the #of buffered statements that are being deferred.

Specified by:
size in interface IBuffer<F extends Statement>

add

public void add(F e)
Imposes a canonical mapping on the subject, predicate, and object of the given Statement and stores a new BigdataStatement instance in the internal buffer. If the given statement is a BigdataStatement then its StatementEnum will be used. Otherwise the new statement will be StatementEnum.Explicit.

Note: Unlike the Values, a canonicalizing mapping is NOT imposed for the statements. This is because, unlike the Values, there tends to be little duplication in Statements when processing RDF.

Specified by:
add in interface IStatementBuffer<F extends Statement>
Specified by:
add in interface IBuffer<F extends Statement>
Parameters:
e - The statement. If e implements BigdataStatement then its StatementEnum will be used (this makes it possible to load axioms into the database as axioms) but the term identifiers on the statement's values will be ignored.
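
The rule for choosing the statement type can be sketched as follows. All types here (DemoStatementEnum, DemoStatement, DemoTypedStatement, PlainStatement, TypedStatement, StatementTypes) are hypothetical stand-ins for StatementEnum, Statement and BigdataStatement; the sketch only shows the instanceof decision, not the actual conversion.

```java
/** Hypothetical stand-in for StatementEnum. */
enum DemoStatementEnum { Explicit, Axiom, Inferred }

/** Hypothetical stand-in for a plain Statement. */
interface DemoStatement {
    String text();
}

/** Hypothetical stand-in for BigdataStatement: a statement that carries its type. */
interface DemoTypedStatement extends DemoStatement {
    DemoStatementEnum type();
}

final class PlainStatement implements DemoStatement {
    private final String text;

    PlainStatement(String text) {
        this.text = text;
    }

    @Override
    public String text() {
        return text;
    }
}

final class TypedStatement implements DemoTypedStatement {
    private final String text;
    private final DemoStatementEnum type;

    TypedStatement(String text, DemoStatementEnum type) {
        this.text = text;
        this.type = type;
    }

    @Override
    public String text() {
        return text;
    }

    @Override
    public DemoStatementEnum type() {
        return type;
    }
}

final class StatementTypes {
    /** Keeps the type of a typed statement; anything else is treated as explicit. */
    static DemoStatementEnum typeFor(DemoStatement stmt) {
        if (stmt instanceof DemoTypedStatement) {
            return ((DemoTypedStatement) stmt).type();
        }
        return DemoStatementEnum.Explicit;
    }

    public static void main(String[] args) {
        System.out.println(typeFor(new PlainStatement("s p o")));                          // Explicit
        System.out.println(typeFor(new TypedStatement("s p o", DemoStatementEnum.Axiom))); // Axiom
    }
}
```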

add

public void add(Resource s,
                URI p,
                Value o)
Description copied from interface: IStatementBuffer
Add an "explicit" statement to the buffer with a "null" context.

Specified by:
add in interface IStatementBuffer<F extends Statement>
Parameters:
s - The subject.
p - The predicate.
o - The object.

add

public void add(Resource s,
                URI p,
                Value o,
                Resource c)
Description copied from interface: IStatementBuffer
Add an "explicit" statement to the buffer.

Specified by:
add in interface IStatementBuffer<F extends Statement>
Parameters:
s - The subject.
p - The predicate.
o - The object.
c - The context (optional).

add

public void add(Resource s,
                URI p,
                Value o,
                Resource c,
                StatementEnum type)
Description copied from interface: IStatementBuffer
Add a statement to the buffer.

Note: The context parameter (c) is NOT used. The database at this time is either a triple store or a triple store with statement identifiers, and in neither case is the context used.

Specified by:
add in interface IStatementBuffer<F extends Statement>
Parameters:
s - The subject.
p - The predicate.
o - The object.
c - The context (optional).
type - The statement type (optional).

processBufferedValues

protected void processBufferedValues()
Efficiently resolves/adds term identifiers for the buffered BigdataValues.

If readOnly is true, then the term identifier for unknown values will remain IRawTripleStore.NULL.
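
A minimal sketch of the readOnly behavior, assuming a toy in-memory lexicon. DemoLexicon and its NULL constant are hypothetical; the value 0L is an assumption of this sketch and merely plays the role of IRawTripleStore.NULL. The real method resolves the buffered values in bulk against the LexiconRelation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory lexicon mapping terms to term identifiers. */
final class DemoLexicon {
    /** Stand-in for IRawTripleStore.NULL; the value 0L is assumed for this sketch. */
    static final long NULL = 0L;

    private final Map<String, Long> termIds = new HashMap<>();
    private long nextId = 1L;

    /**
     * Returns the term identifier for a known term. An unknown term is
     * assigned a new identifier unless readOnly is true, in which case
     * NULL is returned and nothing is inserted.
     */
    long resolve(String term, boolean readOnly) {
        Long id = termIds.get(term);
        if (id != null) {
            return id;
        }
        if (readOnly) {
            return NULL;
        }
        long assigned = nextId++;
        termIds.put(term, assigned);
        return assigned;
    }

    int size() {
        return termIds.size();
    }
}

final class DemoLexiconDemo {
    public static void main(String[] args) {
        DemoLexicon lexicon = new DemoLexicon();
        long known = lexicon.resolve("http://example.org/a", false);
        // A known term resolves to the same identifier in read-only mode.
        System.out.println(lexicon.resolve("http://example.org/a", true) == known); // true
        // An unknown term stays NULL and is not added.
        System.out.println(lexicon.resolve("http://example.org/b", true) == DemoLexicon.NULL); // true
        System.out.println(lexicon.size()); // 1
    }
}
```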


processDeferredStatements

protected void processDeferredStatements()
Processes any BigdataStatements in the deferredStatementBuffer, adding them to the statementBuffer, which may cause the latter to overflow().


overflow

protected final void overflow()
Invoked each time the statementBuffer buffer would overflow. This method is responsible for bulk resolving / adding the buffered BigdataValues against the db and adding the fully resolved BigdataStatements to the queue on which the iterator() is reading.


handleProcessedStatements

protected abstract int handleProcessedStatements(G[] a)
Invoked by overflow().

Parameters:
a - An array of processed BigdataStatements.
Returns:
The delta that will be added to the counter reported by flush().

flush

public long flush()
Converts any buffered statements and any deferred statements and then invokes overflow() to flush anything remaining in the buffer.

Specified by:
flush in interface IBuffer<F extends Statement>
Returns:
The total #of converted statements processed so far. (The counter is reset to zero as a side-effect.)
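
The interplay of add(), overflow(), handleProcessedStatements() and flush() can be modeled with a small sketch. DemoBuffer and CollectingBuffer are hypothetical classes written for this example; they leave out term resolution and deferred statements entirely and only show the control flow and the counter that flush() reports and resets.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a fixed-capacity buffer with an overflow hook. */
abstract class DemoBuffer<T> {
    private final int capacity;
    private final List<T> buffer;
    private long counter = 0L;

    DemoBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.buffer = new ArrayList<>(capacity);
    }

    void add(T e) {
        if (buffer.size() == capacity) {
            overflow(); // the buffer would overflow, so drain it first
        }
        buffer.add(e);
    }

    /** Hands the buffered elements to the subclass and clears the buffer. */
    protected final void overflow() {
        if (buffer.isEmpty()) {
            return;
        }
        counter += handleProcessed(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** Returns the delta to add to the counter reported by flush(). */
    protected abstract int handleProcessed(List<T> chunk);

    /** Drains what remains, then reports the counter and resets it to zero. */
    long flush() {
        overflow();
        long total = counter;
        counter = 0L;
        return total;
    }

    int size() {
        return buffer.size();
    }
}

/** Hypothetical subclass that simply collects the processed elements. */
final class CollectingBuffer extends DemoBuffer<String> {
    final List<String> sink = new ArrayList<>();

    CollectingBuffer(int capacity) {
        super(capacity);
    }

    @Override
    protected int handleProcessed(List<String> chunk) {
        sink.addAll(chunk);
        return chunk.size();
    }
}

final class DemoBufferDemo {
    public static void main(String[] args) {
        CollectingBuffer buf = new CollectingBuffer(2);
        buf.add("s1");
        buf.add("s2");
        buf.add("s3"); // the third add drains s1 and s2 first
        System.out.println(buf.flush()); // 3
        System.out.println(buf.flush()); // 0, the counter was reset
    }
}
```

A subclass that writes to a database would return the number of statements actually written from its hook, which is why the counter is driven by the hook's return value rather than by the chunk size.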

reset

public void reset()
Discards all state (term map, bnodes, deferred statements, the buffered statements, and the counter whose value is reported by flush()).

Specified by:
reset in interface IBuffer<F extends Statement>

clear

protected void clear()
Clears the state associated with the BigdataStatements in the internal buffer but does not discard the blank nodes or deferred statements.



Copyright © 2006-2009 SYSTAP, LLC. All Rights Reserved.