Sometimes it is nice to be able to say things about statements, such as where they came from and who asserted them. The RDF data model does not provide a convenient mechanism for assigning identity to particular statements or for making statements about statements. RDF reification is cumbersome, results in a huge expansion in the number of triples in the database, and is incompatible with most inference and rule engines.
Named graphs (quads) are one way to approach provenance. By grouping triples into named graphs and assigning a URI as the graph identifier, you can make statements about the named graph to identify the provenance of the group of triples (the group could even, in theory, contain a single triple). Unfortunately, this approach has a few drawbacks as well. Partitioning the knowledge base into groups creates challenges for inference and rule engines, and full named graph support in bigdata requires twice as many statement indices as triples mode does.
If all you need is an unpartitioned, inference-capable knowledge base with the ability to make assertions about statements, bigdata provides you with a third alternative to simple triples or fully indexed quads: statement identifiers (SIDs). With SIDs, the database acts as if it is in triples mode, but each triple is assigned a statement identifier (on demand) that can be used in additional statements (meta-statements):
(s, p, o, c)
1. (<mike>, <likes>, <RDF>, :sid1)
2. (:sid1, <source>, <http://bigdata.com>)
Statement 1 asserts that “Mike likes RDF” and uses a bnode in the context position to assign identity to that statement. Statement 2 (the meta-statement) asserts the provenance of statement 1. The bnode is not used to represent the SID internally in the database; it only unifies the SID between statements during load and lookup. The SID is available as the context position on a statement via the Sesame API.
To turn on statement identifiers, first set the following property in your bigdata configuration:
com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=true
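If the configuration is assembled in code rather than in a properties file, the same setting can be sketched with java.util.Properties. This is only a minimal sketch: the class name SidConfig and the helper method are hypothetical, and only the property key comes from the line above. How the resulting Properties object is passed to bigdata is not shown here.

```java
import java.util.Properties;

final class SidConfig {

    // Property key quoted from the configuration line above.
    static final String STATEMENT_IDENTIFIERS =
            "com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers";

    // Hypothetical helper: returns a Properties object with SIDs turned on.
    static Properties withStatementIdentifiers() {
        final Properties props = new Properties();
        props.setProperty(STATEMENT_IDENTIFIERS, "true");
        return props;
    }

    public static void main(final String[] args) {
        final Properties props = withStatementIdentifiers();
        System.out.println(STATEMENT_IDENTIFIERS + "="
                + props.getProperty(STATEMENT_IDENTIFIERS));
    }
}
```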
You can then load data with statement identifiers using either a simple RDF/XML extension or via the Sesame API. The RDF/XML extension allows you to assign bnode statement identifiers inside your data file and then use that SID in other statements:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:myns="http://mynamespace.com/"
    xmlns:bigdata="http://www.bigdata.com/rdf#">
  <!-- statement (assigns S1 as the statement’s identifier) -->
  <rdf:Description rdf:about="http://mynamespace.com/Mike">
    <myns:likes bigdata:sid="S1"
        rdf:resource="http://mynamespace.com/RDF"/>
  </rdf:Description>
  <!-- meta-statement (makes a statement about S1) -->
  <rdf:Description rdf:nodeID="S1">
    <myns:source rdf:resource="http://bigdata.com"/>
  </rdf:Description>
</rdf:RDF>
The actual bnode ID is meaningless outside of the context of the load transaction – statement identifiers are represented internally as inline values inside the statement indices themselves (more on that later).
Alternatively, you can use the Sesame API to load data with SIDs:
final BigdataSail sail = getSail();
final BigdataSailRepository repo = new BigdataSailRepository(sail);
final BigdataSailRepositoryConnection cxn =
    (BigdataSailRepositoryConnection) repo.getUnisolatedConnection();
cxn.setAutoCommit(false);
try {
    final ValueFactory vf = sail.getValueFactory();
    final URI mike = vf.createURI("http://mynamespace.com/Mike");
    final URI likes = vf.createURI("http://mynamespace.com/likes");
    final URI rdf = vf.createURI("http://mynamespace.com/RDF");
    final URI source = vf.createURI("http://mynamespace.com/source");
    final URI src1 = vf.createURI("http://bigdata.com");
    // Statement: the fresh bnode in the context position acts as the SID.
    final Statement s1 =
        vf.createStatement(mike, likes, rdf, vf.createBNode());
    cxn.add(s1);
    // Meta-statement: reuse the same bnode as the subject.
    cxn.add(s1.getContext(), source, src1);
    cxn.commit();
    if (log.isInfoEnabled()) {
        log.info(sail.getDatabase().dumpStore());
    }
} finally {
    cxn.close();
}
Once you have some statements and meta-statements loaded, you can run SPARQL queries to do forward and reverse lookups. A forward lookup starts from a statement or set of statements and retrieves the provenance. A reverse lookup finds statements based on some piece of known provenance, such as “get me all statements from a particular source”. To do this, we exploit the context position in SPARQL. By binding the SID to a graph variable, we can achieve both forward and reverse lookups:
Forward lookup (“get me all provenance about a particular statement”):
select ?sid ?p ?o
where {
  graph ?sid { <mike> <likes> <RDF> . }
  ?sid ?p ?o .
}
Reverse lookup (“get me all statements from a particular source”):
select ?s ?p ?o ?sid
where {
  graph ?sid { ?s ?p ?o . }
  ?sid <source> <http://bigdata.com> .
}
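To make the two lookup directions concrete, here is a toy in-memory model in plain Java. It is an illustrative sketch only, not how bigdata evaluates these queries: the class SidLookupSketch, its methods, and the string-based representation of terms are all hypothetical. Statements are keyed by their SID, and meta-statements use the SID as their subject, mirroring the two graph patterns in the queries above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical): SID -> statement, plus meta-statements about SIDs.
final class SidLookupSketch {

    // sid -> the (s, p, o) statement it identifies
    private final Map<String, String[]> statements = new LinkedHashMap<>();
    // meta-statements stored as (sid, p, o)
    private final List<String[]> metaStatements = new ArrayList<>();

    void addStatement(String sid, String s, String p, String o) {
        statements.put(sid, new String[] {s, p, o});
    }

    void addMeta(String sid, String p, String o) {
        metaStatements.add(new String[] {sid, p, o});
    }

    // Forward lookup: provenance ("p o" pairs) for a given statement.
    List<String> provenanceOf(String s, String p, String o) {
        final List<String> out = new ArrayList<>();
        for (Map.Entry<String, String[]> e : statements.entrySet()) {
            final String[] t = e.getValue();
            if (t[0].equals(s) && t[1].equals(p) && t[2].equals(o)) {
                for (String[] m : metaStatements) {
                    if (m[0].equals(e.getKey())) {
                        out.add(m[1] + " " + m[2]);
                    }
                }
            }
        }
        return out;
    }

    // Reverse lookup: statements whose SID carries the given provenance.
    List<String> statementsWith(String metaP, String metaO) {
        final List<String> out = new ArrayList<>();
        for (String[] m : metaStatements) {
            if (m[1].equals(metaP) && m[2].equals(metaO)) {
                final String[] t = statements.get(m[0]);
                if (t != null) {
                    out.add(t[0] + " " + t[1] + " " + t[2]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        final SidLookupSketch kb = new SidLookupSketch();
        kb.addStatement("sid1", "<mike>", "<likes>", "<RDF>");
        kb.addMeta("sid1", "<source>", "<http://bigdata.com>");
        System.out.println(kb.provenanceOf("<mike>", "<likes>", "<RDF>"));
        System.out.println(kb.statementsWith("<source>", "<http://bigdata.com>"));
    }
}
```

The linear scans keep the sketch short; a real store would answer both directions from its indices rather than by iterating over every statement.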
The most recent bigdata release includes a refactoring of our statement identifier mechanism. With this refactoring, SIDs are no longer stored in the lexicon and accessed indirectly via term identifiers. Instead, SIDs are now represented as inline internal values (IVs) inside the statement indices themselves. The statement indices use combinations of IVs in different key orderings to represent statements. A SID IV is directly decodable into both a statement (useful for reverse lookup) and a bnode (useful for serialization of SIDs in result sets). Old journals created in SIDs mode are not compatible with the most recent release, although old journals created without SIDs enabled (the default) will be fine.
This is a bigdata(R) release. This release is capable of loading 1B triples in under one hour on a 15 node cluster. JDK 1.6 is required.
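The idea of an identifier that can be decoded directly into the statement it names can be illustrated with a toy encoding. The sketch below is an assumption-laden illustration, not bigdata's actual IV format (which is not described here): it packs three hypothetical numeric term ids into a single byte array, then decodes that array either back into the (s, p, o) ids or into a bnode-style label. The class InlineSidSketch and all of its methods are invented for this example.

```java
import java.nio.ByteBuffer;

// Toy illustration (hypothetical): a SID that carries its own statement.
final class InlineSidSketch {

    // Packs three hypothetical term ids into one 24-byte inline value.
    static byte[] encode(long s, long p, long o) {
        return ByteBuffer.allocate(3 * Long.BYTES)
                .putLong(s).putLong(p).putLong(o).array();
    }

    // Decodes the inline value back into the (s, p, o) term ids,
    // with no lookup in any separate dictionary.
    static long[] decodeStatement(byte[] sid) {
        final ByteBuffer buf = ByteBuffer.wrap(sid);
        return new long[] {buf.getLong(), buf.getLong(), buf.getLong()};
    }

    // Renders the same bytes as a bnode-style label for result sets.
    static String toBNodeLabel(byte[] sid) {
        final StringBuilder sb = new StringBuilder("_:sid");
        for (byte b : sid) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        final byte[] sid = encode(1L, 2L, 3L);
        final long[] spo = decodeStatement(sid);
        System.out.println(spo[0] + " " + spo[1] + " " + spo[2]);
        System.out.println(toBNodeLabel(sid));
    }
}
```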
See [1,2] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].
Please note that we recommend checking out the code from SVN using the tag for this release. The code will build automatically under Eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script. You can check out this release from the following URL:
https://bigdata.svn.sourceforge.net/svnroot/bigdata/branches/BIGDATA_RELEASE_0_84_0
New features:
- Inlining provenance metadata into the statement indices and fast reverse lookup of provenance metadata using statement identifiers (SIDs).
Significant bug fixes:
- The journal size could double in some cases following a restart due to a typo in the WORMStrategy constructor.
See https://sourceforge.net/apps/trac/bigdata/ticket/236
- Fixed a concurrency hole in the commit protocol for the Journal which could result in a concurrent modification to the B+Tree during the commit protocol.
- Fixed a problem in the abort protocol for the BigdataSail.
- Fixed a problem where the BigdataSail would permit the same thread to obtain more than one UNISOLATED connection:
See https://sourceforge.net/apps/trac/bigdata/ticket/278
See https://sourceforge.net/apps/trac/bigdata/ticket/284
See https://sourceforge.net/apps/trac/bigdata/ticket/288
See https://sourceforge.net/apps/trac/bigdata/ticket/289
The road map [3] for the next releases includes:
- Single machine data storage to 10B+ triples;
- Simple embedded and/or webapp deployment;
- 100% native SPARQL evaluation with lots of query optimizations;
- High-volume analytic query and SPARQL 1.1 query, including aggregations;
- Simplified deployment, configuration, and administration for clusters;
- High availability for the journal and the cluster.
For more information, please see the following links:
[1] https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page
[2] https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted
[3] https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Roadmap
[4] http://www.bigdata.com/bigdata/docs/api/
[5] http://sourceforge.net/projects/bigdata/
[6] http://www.bigdata.com/blog
[7] http://www.systap.com/bigdata.htm
About bigdata:
Bigdata® is a horizontally-scaled, general purpose storage and computing fabric for ordered data (B+Trees), designed to operate on either a single server or a cluster of commodity hardware. Bigdata® uses dynamically partitioned key-range shards in order to remove any realistic scaling limits – in principle, bigdata® may be deployed on 10s, 100s, or even thousands of machines and new capacity may be added incrementally without requiring the full reload of all data. The bigdata® RDF database supports RDFS and OWL Lite reasoning, high-level query (SPARQL), and datum level provenance.