com.bigdata.rdf.store
Class DataLoader

java.lang.Object
  extended by com.bigdata.rdf.store.DataLoader

public class DataLoader
extends Object

A utility class that loads RDF data into an AbstractTripleStore without using the Sesame API. This class does not parallelize RDF parsing and writing on the database, and it is not efficient for scale-out.

Version:
$Id: DataLoader.java 6045 2012-02-27 17:33:44Z thompsonbry $
Author:
Bryan Thompson

Nested Class Summary
static class DataLoader.ClosureEnum
          A type-safe enumeration of options affecting whether and when entailments are computed as documents are loaded into the database using the DataLoader.
static class DataLoader.CommitEnum
          A type-safe enumeration of options affecting whether and when the database will be committed.
static interface DataLoader.Options
          Options for the DataLoader.
 
Field Summary
protected static org.apache.log4j.Logger log
          Logger.
 
Constructor Summary
DataLoader(AbstractTripleStore database)
          Configure a DataLoader using the properties that were used to configure the database.
DataLoader(Properties properties, AbstractTripleStore database)
          Configure a data loader with overridden properties.
 
Method Summary
 ClosureStats doClosure()
          Compute closure as configured.
 void endSource()
          Flush the StatementBuffer to the backing store.
protected  StatementBuffer<?> getAssertionBuffer()
          Return the assertion buffer.
 DataLoader.ClosureEnum getClosureEnum()
          How the DataLoader will maintain closure on the database.
 DataLoader.CommitEnum getCommitEnum()
          Whether and when the DataLoader will invoke ITripleStore.commit().
 AbstractTripleStore getDatabase()
          The target database.
 boolean getFlush()
          When true (the default) the StatementBuffer is flushed by each loadData(String, String, RDFFormat) or loadData(String[], String[], RDFFormat[]) operation and when doClosure() is requested.
 InferenceEngine getInferenceEngine()
          The object used to compute entailments for the database.
 LoadStats loadData(InputStream is, String baseURL, RDFFormat rdfFormat)
          Load from an input stream.
 LoadStats loadData(Reader reader, String baseURL, RDFFormat rdfFormat)
          Load from a reader.
 LoadStats loadData(String[] resource, String[] baseURL, RDFFormat[] rdfFormat)
          Load a set of RDF resources into the database.
 LoadStats loadData(String resource, String baseURL, RDFFormat rdfFormat)
          Load a resource into the database.
 LoadStats loadData(URL url, String baseURL, RDFFormat rdfFormat)
          Load from a URL.
protected  void loadData2(LoadStats totals, String resource, String baseURL, RDFFormat rdfFormat, boolean endOfBatch)
          Load an RDF resource into the database.
 void loadData3(LoadStats totals, Object source, String baseURL, RDFFormat rdfFormat, String defaultGraph, boolean endOfBatch)
          Loads data from the source.
 LoadStats loadFiles(File file, String baseURI, RDFFormat rdfFormat, String defaultGraph, FilenameFilter filter)
           
protected  void loadFiles(LoadStats totals, int depth, File file, String baseURI, RDFFormat rdfFormat, String defaultGraph, FilenameFilter filter, boolean endOfBatch)
           
static void main(String[] args)
          Utility method that may be used to create and/or load RDF data into a local database instance.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log

protected static final transient org.apache.log4j.Logger log
Logger.

Constructor Detail

DataLoader

public DataLoader(AbstractTripleStore database)
Configure a DataLoader using the properties that were used to configure the database.

Parameters:
database - The database.

DataLoader

public DataLoader(Properties properties,
                  AbstractTripleStore database)
Configure a data loader with overridden properties.

Parameters:
properties - Configuration properties - see DataLoader.Options.
database - The database.
Method Detail

getDatabase

public AbstractTripleStore getDatabase()
The target database.


getInferenceEngine

public InferenceEngine getInferenceEngine()
The object used to compute entailments for the database.


getAssertionBuffer

protected StatementBuffer<?> getAssertionBuffer()
Return the assertion buffer.

The assertion buffer is used to buffer statements that are being asserted so as to maximize the opportunity for batch writes. Truth maintenance (if enabled) will be performed no later than the commit of the transaction.

Note: The same buffer is reused by each loader so that we can on the one hand minimize heap churn and on the other hand disable auto-flush when loading a series of small documents. However, we obtain a new buffer each time we perform incremental truth maintenance.

Note: When non-null and non-empty, the buffer MUST be flushed (a) if a transaction completes (otherwise writes will not be stored on the database); or (b) if there is a read against the database during a transaction (otherwise reads will not see the unflushed statements).

Note: if #truthMaintenance is enabled then this buffer is backed by a temporary store which accumulates the SPOs to be asserted. Otherwise it will write directly on the database each time it is flushed, including when it overflows.

TODO:
This should be refactored as an IStatementBufferFactory, where the appropriate factory is required for TM vs non-TM scenarios (or where the factory is parameterized for TM vs non-TM).

getFlush

public boolean getFlush()
When true (the default) the StatementBuffer is flushed by each loadData(String, String, RDFFormat) or loadData(String[], String[], RDFFormat[]) operation and when doClosure() is requested. When false the caller is responsible for flushing the buffer.

Auto-flush MAY be disabled if you want to chain-load many small documents without flushing to the backing store after each document and loadData(String[], String[], RDFFormat[]) is not well suited to your purposes. This can be much more efficient, approximating the throughput of large document loads. However, the caller MUST invoke endSource() once all documents have been loaded successfully. If an error occurs while processing one or more documents, the entire data load should be discarded.

Returns:
The current value.
See Also:
DataLoader.Options.FLUSH

endSource

public void endSource()
Flush the StatementBuffer to the backing store.

Note: If you disable auto-flush AND you are not using truth maintenance, then you MUST explicitly invoke this method once you are done loading data sets in order to flush the last chunk of data to the store. In all other conditions you do NOT need to call this method. However, it is always safe to invoke this method: if the buffer is empty, the method is a no-op.
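
The need for a final flush can be illustrated with a toy model. The sketch below is not the library's StatementBuffer: ToyStatementBuffer, its capacity parameter, and the in-memory list standing in for the backing store are all hypothetical. It assumes a buffer that writes through only when it fills up, so whatever remains after the last document stays invisible to the store until flush() is called, which is the role endSource() plays for the DataLoader.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; NOT the library's StatementBuffer. All names are hypothetical. */
final class ToyStatementBuffer {
    private final int capacity;
    private final List<String> pending = new ArrayList<>();
    // Stands in for the backing store.
    private final List<String> store = new ArrayList<>();

    ToyStatementBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Buffers a statement and writes the whole batch through once the buffer is full. */
    void add(String statement) {
        pending.add(statement);
        if (pending.size() >= capacity) {
            flush();
        }
    }

    /** Writes any pending statements to the store; does nothing when the buffer is empty. */
    void flush() {
        store.addAll(pending);
        pending.clear();
    }

    int storedCount() {
        return store.size();
    }

    int pendingCount() {
        return pending.size();
    }
}
```

With a capacity of 3, adding five statements leaves three in the store and two still pending; only the explicit flush() makes the last two visible.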


getClosureEnum

public DataLoader.ClosureEnum getClosureEnum()
How the DataLoader will maintain closure on the database.


getCommitEnum

public DataLoader.CommitEnum getCommitEnum()
Whether and when the DataLoader will invoke ITripleStore.commit().


loadData

public final LoadStats loadData(String resource,
                                String baseURL,
                                RDFFormat rdfFormat)
                         throws IOException
Load a resource into the database.

Parameters:
resource -
baseURL -
rdfFormat -
Returns:
Throws:
IOException

loadData

public final LoadStats loadData(String[] resource,
                                String[] baseURL,
                                RDFFormat[] rdfFormat)
                         throws IOException
Load a set of RDF resources into the database.

Parameters:
resource -
baseURL -
rdfFormat -
Returns:
Throws:
IOException

loadData

public LoadStats loadData(Reader reader,
                          String baseURL,
                          RDFFormat rdfFormat)
                   throws IOException
Load from a reader.

Parameters:
reader -
baseURL -
rdfFormat -
Returns:
Throws:
IOException

loadData

public LoadStats loadData(InputStream is,
                          String baseURL,
                          RDFFormat rdfFormat)
                   throws IOException
Load from an input stream.

Parameters:
is -
baseURL -
rdfFormat -
Returns:
Throws:
IOException

loadData

public LoadStats loadData(URL url,
                          String baseURL,
                          RDFFormat rdfFormat)
                   throws IOException
Load from a URL.

Parameters:
url -
baseURL -
rdfFormat -
Returns:
Throws:
IOException

loadData2

protected void loadData2(LoadStats totals,
                         String resource,
                         String baseURL,
                         RDFFormat rdfFormat,
                         boolean endOfBatch)
                  throws IOException
Load an RDF resource into the database.

Parameters:
resource - Either the name of a resource which can be resolved using the CLASSPATH, or the name of a resource in the local file system, or a URL.
baseURL -
rdfFormat -
endOfBatch -
Throws:
IOException - if the resource can not be resolved or loaded.

loadFiles

public LoadStats loadFiles(File file,
                           String baseURI,
                           RDFFormat rdfFormat,
                           String defaultGraph,
                           FilenameFilter filter)
                    throws IOException
Parameters:
file - The file or directory (required).
baseURI - The baseURI (optional; when not specified, the name of each file loaded is converted to a URL and used as the baseURI for that file).
rdfFormat - The format of the file (optional, when not specified the format is deduced for each file in turn using the RDFFormat static methods).
defaultGraph - The value that will be used for the graph/context co-ordinate when loading data represented in a triple format into a quad store.
filter - A filter selecting the file names that will be loaded (optional). When specified, the filter MUST accept directories if directories are to be recursively processed.
Returns:
The aggregated load statistics.
Throws:
IOException
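
The filter requirement above is easy to get wrong: a filter that only matches file extensions rejects directories, so nothing below the top level is visited. The following is a minimal sketch using only java.io; LoadFilesSketch, RDF_FILTER, collect and defaultBaseUri are hypothetical names, the extension list is an assumption, and the walk only approximates how a recursive loader might traverse a directory. It also shows one plausible form of the default baseURI derived from the file itself.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.List;
import java.util.Locale;

/** Illustrative helper; the class and its members are hypothetical, not part of DataLoader. */
final class LoadFilesSketch {

    /** Accepts every directory, plus files with a few assumed RDF extensions. */
    static final FilenameFilter RDF_FILTER = new FilenameFilter() {
        @Override
        public boolean accept(File dir, String name) {
            // Directories must pass, or a recursive walk never descends into them.
            if (new File(dir, name).isDirectory()) {
                return true;
            }
            String lower = name.toLowerCase(Locale.ROOT);
            return lower.endsWith(".rdf") || lower.endsWith(".nt") || lower.endsWith(".ttl");
        }
    };

    /** Recursively collects plain files; the filter is applied to each directory listing. */
    static void collect(File file, FilenameFilter filter, List<File> out) {
        if (file.isDirectory()) {
            File[] children = (filter == null) ? file.listFiles() : file.listFiles(filter);
            if (children == null) {
                return; // the directory could not be read
            }
            for (File child : children) {
                collect(child, filter, out);
            }
        } else {
            out.add(file);
        }
    }

    /** Assumed form of the default base URI: the location of the file itself. */
    static String defaultBaseUri(File file) {
        return file.toURI().toString();
    }
}
```

A filter like RDF_FILTER could be passed as the filter argument of loadFiles; a null filter in collect() mirrors the "optional" case, where every file is taken.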

loadFiles

protected void loadFiles(LoadStats totals,
                         int depth,
                         File file,
                         String baseURI,
                         RDFFormat rdfFormat,
                         String defaultGraph,
                         FilenameFilter filter,
                         boolean endOfBatch)
                  throws IOException
Throws:
IOException

loadData3

public void loadData3(LoadStats totals,
                      Object source,
                      String baseURL,
                      RDFFormat rdfFormat,
                      String defaultGraph,
                      boolean endOfBatch)
               throws IOException
Loads data from the source. The caller is responsible for closing the source if there is an error.

Parameters:
totals - Used to report out the total LoadStats.
source - A Reader or InputStream.
baseURL - The baseURI (optional; when not specified, the name of each file loaded is converted to a URL and used as the baseURI for that file).
rdfFormat - The format of the file (optional, when not specified the format is deduced for each file in turn using the RDFFormat static methods).
defaultGraph - The value that will be used for the graph/context co-ordinate when loading data represented in a triple format into a quad store.
endOfBatch - Signals the end of a batch.
Throws:
IOException

doClosure

public ClosureStats doClosure()
Compute closure as configured. If DataLoader.ClosureEnum.None was selected then this MAY be used to (re-)compute the full closure of the database.

Throws:
IllegalStateException - if the assertion buffer is null.
See Also:
#removeEntailments()

main

public static void main(String[] args)
                 throws IOException
Utility method that may be used to create and/or load RDF data into a local database instance. Directories will be recursively processed. The data files may be compressed using zip or gzip, but the loader does not support multiple data files within a single archive.

Parameters:
args - [-quiet][-closure][-verbose][-namespace namespace] propertyFile (fileOrDir)* where:
  -quiet - Suppress all stdout messages.
  -verbose - Show additional messages detailing the load performance.
  -closure - Compute the RDF(S)+ closure.
  -namespace - The namespace of the KB instance.
  propertyFile - The configuration file for the database instance.
  fileOrDir - Zero or more files or directories containing the data to be loaded.
Throws:
IOException
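
The argument syntax above reads as: options first, then the required property file, then any number of files or directories. A minimal sketch of a parser for that syntax follows. LoaderArgs and its fields are hypothetical and are not part of DataLoader; the actual main method may parse and report errors differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for the documented command-line syntax; not part of DataLoader. */
final class LoaderArgs {
    boolean quiet;
    boolean verbose;
    boolean closure;
    String namespace;      // null when -namespace is not given
    String propertyFile;   // required
    final List<String> sources = new ArrayList<>(); // zero or more files or directories

    /** Parses: [-quiet][-closure][-verbose][-namespace namespace] propertyFile (fileOrDir)* */
    static LoaderArgs parse(String[] args) {
        LoaderArgs a = new LoaderArgs();
        int i = 0;
        // Consume leading options until the first argument that is not an option.
        for (; i < args.length && args[i].startsWith("-"); i++) {
            switch (args[i]) {
                case "-quiet":
                    a.quiet = true;
                    break;
                case "-verbose":
                    a.verbose = true;
                    break;
                case "-closure":
                    a.closure = true;
                    break;
                case "-namespace":
                    if (i + 1 >= args.length) {
                        throw new IllegalArgumentException("-namespace requires a value");
                    }
                    a.namespace = args[++i];
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        if (i >= args.length) {
            throw new IllegalArgumentException("propertyFile is required");
        }
        a.propertyFile = args[i++];
        // Everything after the property file is a file or directory to load.
        for (; i < args.length; i++) {
            a.sources.add(args[i]);
        }
        return a;
    }
}
```

For example, the arguments -verbose -namespace kb db.properties data/ would parse to verbose = true, namespace = "kb", propertyFile = "db.properties" and a single source "data/" (the file names here are made up for illustration).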


Copyright © 2006-2011 SYSTAP, LLC. All Rights Reserved.