We’ve completed an implementation of SPARQL 1.1 Property Paths to be included in an upcoming bigdata minor release. For an early preview of this feature, grab the latest code from the bigdata 1.2 maintenance branch in SVN (https://bigdata.svn.sourceforge.net/svnroot/bigdata/branches/BIGDATA_RELEASE_1_2_0). Feedback is greatly appreciated!
Did you know that bigdata has a built-in REST API for client-server access to the RDF database? We call this interface the “NanoSparqlServer”, and its API is outlined in detail on the wiki:
https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=NanoSparqlServer
What’s new with the NSS is that we’ve recently added a Java API around it so that you can write client code without having to understand the HTTP API or make HTTP calls directly. This is why there is suddenly a new dependency on Apache’s HTTP Components in the codebase. The Java wrapper is called “RemoteRepository”. If you’re comfortable writing application code against the Sesame SAIL/Repository API you should feel pretty at home with the RemoteRepository class. The two are not identical, but they are very similar.
The class itself is pretty self-explanatory but if you like examples, there is a test case for every API call in RemoteRepository in the class TestNanoSparqlClient. (That test case also conveniently demonstrates how to launch a NanoSparqlServer wrapping a bigdata journal using Jetty, which it does at the beginning of every test.)
I put together a more useful example of how to write a custom SPARQL function with bigdata. It’s up on the wiki here:
https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=CustomFunction
The example details a common use case – filtering out solutions based on security credentials for a particular user. For example, if you wanted to return a list of documents visible to the user “John”, you could do it with a custom SPARQL function:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://www.example.com/>
SELECT ?doc
{
  ?doc rdf:type ex:Document .
  filter(ex:validate(?doc, ?user)) .
}
BINDINGS ?user {
  (ex:John)
}
The function is called by referencing its unique URI, in this case ex:validate. This URI must be registered with bigdata’s FunctionRegistry along with an appropriate factory and operator. The wiki details how to do that. In the query above, the function is called with two arguments: the document to be validated and the user to validate against. The user in this simple example is a constant included in the BINDINGS clause. Always remember that bigdata custom functions are executed one solution at a time – they do not yet benefit from vectored execution and thus are not suitable for reading data from the indices. (A function must be able to evaluate each call without reading from the indices.) A custom service (distinct from a custom function) is a more appropriate choice when execution requires touching indices. This is how we implement SPARQL 1.1 Federation.
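As a rough illustration of the one-solution-at-a-time execution model, here is a plain-Java sketch of the kind of logic that could sit behind ex:validate. It deliberately does not use bigdata’s FunctionRegistry, factory, or operator classes (the wiki covers those). The class name, the grant method, and the in-memory permission table are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the logic behind ex:validate; not the bigdata API. */
final class ValidateSketch {

    // In-memory permissions (user -> visible documents), so no index reads are needed per call.
    private final Map<String, Set<String>> visibleDocs = new HashMap<>();

    void grant(final String user, final String doc) {
        visibleDocs.computeIfAbsent(user, k -> new HashSet<>()).add(doc);
    }

    /** Evaluates a single solution: may this user see this document? */
    boolean validate(final String doc, final String user) {
        return visibleDocs.getOrDefault(user, Collections.<String>emptySet()).contains(doc);
    }

    /** Mimics the FILTER: the check is invoked once per candidate solution. */
    List<String> filter(final List<String> docs, final String user) {
        final List<String> kept = new ArrayList<>();
        for (final String doc : docs) {
            if (validate(doc, user)) {
                kept.add(doc);
            }
        }
        return kept;
    }
}
```

Because the permission table lives in memory, each call is cheap. If the check needed to consult the statement indices instead, a custom service would be the better fit, as noted above.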
Bigdata is a high-performance RDF database that uses B+tree indices to index RDF statements and terms. But did you know that bigdata also uses these same B+tree indices to provide built-in support for Lucene-style free text search? If the text index is enabled (via a property when the database is created), then each literal added to the database (by appearing in the “O” position of a statement) is also added to the text index. This index can then be accessed directly, via the bigdata API, or indirectly, via high-level query. To accomplish this integration of free text search with high-level query, bigdata defines several magic predicates that are given special meaning, and when encountered in a SPARQL query are interpreted as service calls to the text index.
Before you get started, make sure you have enabled the free text index in your properties file:
com.bigdata.rdf.store.AbstractTripleStore.textIndex=true
The full list of magic predicates related to free text search is defined and documented in the class com.bigdata.rdf.store.BD. The simplest way to integrate free text search into a SPARQL query in bigdata is to use the magic predicate “bd:search” inside of a SPARQL join group. The predicate bd:search is used to search the full text index using the pattern in the “O” position of the search and to bind the hits (Literals) to the variable defined in the “S” position of the search. For example:
?lit bd:search "mike" .
will search the full text index for literals that contain the token “mike” and bind those literals onto the ?lit variable for use in subsequent joins. To find statements that use literals that contain the token “mike”, the SPARQL query would look as follows:
prefix bd: <http://www.bigdata.com/rdf#search>
select ?s ?p ?o
where {
  ?o bd:search "mike" .
  ?s ?p ?o .
}
In addition to simple search, additional metadata about the search can be defined inside the SPARQL query using other magic predicates (also defined in the BD class). These predicates, when attached to the same variable as the search, will help narrow the search or bind additional metadata about search hits to other variables. We could expand the SPARQL query as follows:
prefix bd: <http://www.bigdata.com/rdf#search>
select ?s ?p ?o ?score ?rank
where {
  ?o bd:search "mike personick" .
  ?o bd:matchAllTerms "true" .
  ?o bd:minRelevance "0.25" .
  ?o bd:relevance ?score .
  ?o bd:maxRank "1000" .
  ?o bd:rank ?rank .
  ?s ?p ?o .
}
The magic predicate bd:matchAllTerms indicates that only literals that contain all of the specified search terms should be considered. Similarly, literals can be constrained by min and max relevance (a 0 to 1 score signifying how closely the literal matches the search terms) and by min and max rank (hits are ordered by relevance and the rank describes where the literal appears in that ordered list). If the relevance or rank is relevant to the application, those pieces of metadata can be bound to variables in the search results using the predicates bd:relevance and bd:rank.
I’ve been working recently with a customer that is making extensive use of the free text search feature within its application. A number of interesting challenges arise when working with a large database inside an application designed to answer user queries with very low latency. The number of hits bound to a free text search can be very large (unconstrained search) or very small (highly constrained search), and when joined with the statement indices inside a SPARQL query, these hits can produce either huge numbers of relevant statements (a condition we’ve been referring to as “overflow”) or hardly any statements at all (“underflow”). By playing tricks with the SPARQL query we have been cutting the hits into “rank slices” by specifying the min and max ranks to be considered for any particular query, and then running those rank slices through the rest of the joins in the query until we find just enough results to paint the first page of search results, but no more. By starting with very small rank slices, we try to ensure we don’t overload and stall out the application while the user waits. This trick of rank slicing is made possible by caching the ordered list of free text hits temporarily inside the query engine, so that subsequent calls to the text index with the exact same parameters (search and metadata) will be costless.
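The rank-slicing loop described above can be sketched in plain Java. This is only an outline under stated assumptions: runSlice is a hypothetical callback that runs the full SPARQL query with the min and max rank predicates set to the given bounds and returns the joined results, totalHits is assumed to be known, and doubling the slice size on each pass is our choice for the example rather than something the application necessarily does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

/** Sketch of "rank slicing": evaluate small rank ranges until one page is filled. */
final class RankSliceSketch {

    /**
     * @param runSlice hypothetical callback (minRank, maxRank) -> joined results for that slice
     * @param pageSize number of results needed to paint the first page
     * @param initialSliceSize size of the first (small) rank slice, must be >= 1
     * @param totalHits number of free text hits available
     */
    static <T> List<T> firstPage(final BiFunction<Integer, Integer, List<T>> runSlice,
            final int pageSize, final int initialSliceSize, final int totalHits) {
        if (initialSliceSize < 1) {
            throw new IllegalArgumentException("initialSliceSize must be >= 1");
        }
        final List<T> page = new ArrayList<>();
        long minRank = 1;
        long sliceSize = initialSliceSize;
        while (page.size() < pageSize && minRank <= totalHits) {
            final long maxRank = Math.min(minRank + sliceSize - 1, totalHits);
            for (final T result : runSlice.apply((int) minRank, (int) maxRank)) {
                if (page.size() == pageSize) {
                    break; // just enough results for the first page, but no more
                }
                page.add(result);
            }
            minRank = maxRank + 1; // the next slice starts where this one ended
            sliceSize *= 2;        // assumption: grow slices when earlier ones underflow
        }
        return page;
    }
}
```

Starting small keeps the first query cheap in the overflow case, while the growing slice size limits the number of round trips in the underflow case.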
Sometimes it is nice to be able to say things about statements, such as where they came from and who asserted them. The RDF data model does not provide a convenient mechanism for assigning identity to particular statements or for making statements about statements. RDF reification is cumbersome, results in a huge expansion in the number of triples in the database, and is incompatible with most inference and rule engines.
Named graphs (quads) are one way to approach provenance. By grouping triples into named graphs and assigning a URI as the graph identifier, you can then make statements about the named graph to identify the provenance of the group of triples (the group size could even theoretically be one). Unfortunately this approach has a few drawbacks as well. Partitioning the knowledge base into groups creates challenges for inference and rule engines, and full named graph support in bigdata requires twice as many statement indices as triples.
If all you need is an unpartitioned, inference-capable knowledge base with the ability to make assertions about statements, bigdata provides you with a third alternative to simple triples or fully indexed quads: statement identifiers (SIDs). With SIDs, the database acts as if it is in triples mode, but each triple is assigned a statement identifier (on demand) that can be used in additional statements (meta-statements):
(s, p, o, c)
1. (<mike>, <likes>, <RDF>, :sid1)
2. (:sid1, <source>, <http://bigdata.com>)
Statement 1 asserts that “Mike likes RDF”, and uses a bnode in the context position to assign identity to that statement. Statement 2 (the meta-statement) asserts the provenance of statement 1. The bnode is not used to internally represent the SID in the database, only to unify the SID between statements during load and lookup. The SID is available as the context position on a statement via the Sesame API.
To turn on statement identifiers, first set the following property in your bigdata configuration:
com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=true
You can then load data with statement identifiers using either a simple RDF/XML extension or via the Sesame API. The RDF/XML extension allows you to assign bnode statement identifiers inside your data file and then use that SID in other statements:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:myns="http://mynamespace.com/"
    xmlns:bigdata="http://www.bigdata.com/rdf#">
  <!-- statement (assigns S1 as the statement’s identifier)-->
  <rdf:Description rdf:about="http://mynamespace.com/Mike">
    <myns:likes bigdata:sid="S1"
        rdf:resource="http://mynamespace.com/RDF"/>
  </rdf:Description>
  <!-- meta-statement (makes a statement about S1) -->
  <rdf:Description rdf:nodeID="S1">
    <myns:source rdf:resource="http://bigdata.com"/>
  </rdf:Description>
</rdf:RDF>
The actual bnode ID is meaningless outside of the context of the load transaction – statement identifiers are represented internally as inline values inside the statement indices themselves (more on that later).
Alternatively, you can use the Sesame API to load data with SIDs:
final BigdataSail sail = getSail();
final BigdataSailRepository repo = new BigdataSailRepository(sail);
final BigdataSailRepositoryConnection cxn =
    (BigdataSailRepositoryConnection) repo.getUnisolatedConnection();
cxn.setAutoCommit(false);
try {
    final ValueFactory vf = sail.getValueFactory();
    final URI mike = vf.createURI("http://mynamespace.com/Mike");
    final URI likes = vf.createURI("http://mynamespace.com/likes");
    final URI rdf = vf.createURI("http://mynamespace.com/RDF");
    final URI source = vf.createURI("http://mynamespace.com/source");
    final URI src1 = vf.createURI("http://bigdata.com");
    // Statement: a bnode in the context position assigns identity to the statement.
    final Statement s1 =
        vf.createStatement(mike, likes, rdf, vf.createBNode());
    cxn.add(s1);
    // Meta-statement: the SID (the context of s1) is used as the subject.
    cxn.add(s1.getContext(), source, src1);
    cxn.commit();
    if (log.isInfoEnabled()) {
        log.info(sail.getDatabase().dumpStore());
    }
} finally {
    cxn.close();
}
Once you have some statements and meta-statements loaded, you can then run SPARQL queries to do forward and reverse lookups. Forward lookups are when you have a statement or set of statements and want to retrieve the provenance. Reverse lookups are when you want to find statements based on some piece of known provenance, such as “get me all statements from a particular source”. To do this, we exploit the context position in SPARQL. By binding the SID to a graph variable, we can achieve both forward and reverse lookups:
Forward lookup (“get me all provenance about a particular statement”):
select ?sid ?p ?o
where {
  graph ?sid { <mike> <likes> <RDF> . }
  ?sid ?p ?o .
}
Reverse lookup (“get me all statements from a particular source”):
select ?s ?p ?o ?sid
where {
  graph ?sid { ?s ?p ?o . }
  ?sid <source> <http://bigdata.com> .
}
The most recent bigdata release includes a refactoring of our statement identifier mechanism. With the recent refactor, SIDs are no longer stored in the lexicon and accessed indirectly via term identifiers. Instead, SIDs are now represented as inline internal values (IVs) inside the statement indices themselves. The statement indices use combinations of IVs in different key orderings to represent statements. A SID IV is directly decodable into both a statement (useful for reverse lookup) and a bnode (useful for serialization of SIDs in result sets). Old journals created in SIDs mode are not compatible with the most recent release, although old journals created without SIDs enabled (the default) will be fine.
We’ve just released a new version of Bigdata®. This is a bigdata® snapshot release. This release is capable of loading 1B triples in under one hour on a 15 node cluster and has been used to load up to 13B triples on the same cluster. JDK 1.6 is required.
See [1] for instructions on installing bigdata®, [2] for the javadoc and [3] and [4] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata®, see [5].
Please note that we recommend checking out the code from SVN using the tag for this release. The code will build automatically under Eclipse. You can also build the code using the Ant script. The cluster installer requires the use of the Ant script. You can check out this release from the following URL:
https://bigdata.svn.sourceforge.net/svnroot/bigdata/branches/BIGDATA_RELEASE_0_83_2
New features:
- This release provides a bug fix for issue#118. Upgrade to this release is advised. See https://sourceforge.net/apps/trac/bigdata/ticket/118 for details.
- Inlining XSD numerics, xsd:boolean, or custom datatype extensions into the statement indices. Inlining provides a smaller footprint and faster queries for data using XSD numeric datatypes. In order to introduce inlining we were forced to make a change in the physical schema for the RDF database which breaks binary compatibility for existing stores. The recommended migration path is to export the data and import it into a new bigdata instance.
- Refactor of the dynamic sharding mechanism for higher performance.
- The SparseRowStore has been modified to make Unicode primary keys decodable by representing Unicode primary keys using UTF8 rather than Unicode sort keys. This change also allows the SparseRowStore to work with the JDK collator option which embeds nul bytes into Unicode sort keys. This change breaks binary compatibility, but there is an option for historical compatibility.
The roadmap for the next releases includes:
- Query optimizations;
- Support for high-volume analytic query workloads and SPARQL aggregations;
- High availability for the journal and the cluster;
- Simplified deployment, configuration, and administration for clusters.
For more information, please see the following links:
[1] http://bigdata.wiki.sourceforge.net/GettingStarted
[2] http://www.bigdata.com/bigdata/docs/api/
[3] http://sourceforge.net/projects/bigdata/
[4] http://www.bigdata.com/blog
[5] http://www.systap.com/bigdata.htm
About bigdata:
Bigdata® is a horizontally-scaled, general purpose storage and computing fabric for ordered data (B+Trees), designed to operate on either a single server or a cluster of commodity hardware. Bigdata® uses dynamically partitioned key-range shards in order to remove any realistic scaling limits – in principle, bigdata® may be deployed on 10s, 100s, or even thousands of machines and new capacity may be added incrementally without requiring the full reload of all data. The bigdata® RDF database supports RDFS and OWL Lite reasoning, high-level query (SPARQL), and datum level provenance.
Here is the slide deck for those that missed our presentation at SemTech this year on the upcoming high-availability architecture for bigdata. Also see our HA whitepapers:
[1] http://www.bigdata.com/whitepapers/bigdata_ha_whitepaper.pdf
[2] http://www.bigdata.com/whitepapers/Bigdata-HA-Quorum-Detailed-Design.pdf
We’ve just undergone a fairly extensive refactor of how RDF terms are represented in bigdata.
Bigdata is all about high-performance B+Tree indices. With these as basic building blocks, you can model all sorts of interesting data, like RDF. Our RDF store uses two relations – one for statements and one we call the lexicon. Up until now, all RDF terms (i.e. URIs, Literals, and BNodes) were assigned long integer term identifiers to uniquely represent them in the statement indices. The lexicon relation was responsible for keeping this RDF term to long term identifier mapping using two indices, Term2Id (the forward index) and Id2Term (the reverse index). As new terms appeared in statements written to the database, they would first go through a forward mapping process. A new tuple would be inserted into the Term2Id index using a Unicode sort key generated from the string value of the RDF term, along with a new unique long term identifier as the tuple value. This same term identifier would then be used as a key in Id2Term, the reverse index, with the RDF term serialized directly into the tuple value. Once indexed, RDF terms could be quickly resolved (or materialized) to (or from) term identifiers. These lightweight long term identifiers were then used inside the statement indices rather than embedding the much bulkier RDF terms themselves.
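The forward and reverse mapping can be pictured with a small in-memory sketch. This is not the bigdata lexicon: two TreeMaps stand in for the Term2Id and Id2Term B+Tree indices, plain strings stand in for Unicode sort keys and serialized terms, and the class and method names are invented for the example.

```java
import java.util.TreeMap;

/** Toy model of the pre-refactor lexicon: every term gets a long term identifier. */
final class LexiconSketch {

    // Stand-ins for the two B+Tree indices (ordered maps, like the real indices).
    private final TreeMap<String, Long> term2Id = new TreeMap<>();
    private final TreeMap<Long, String> id2Term = new TreeMap<>();
    private long nextId = 1L;

    /** Forward mapping: returns the existing identifier or assigns a new one. */
    long addTerm(final String term) {
        Long id = term2Id.get(term);
        if (id == null) {
            id = nextId++;
            term2Id.put(term, id); // forward index: term -> id
            id2Term.put(id, term); // reverse index: id -> term
        }
        return id;
    }

    /** Reverse mapping: materializes the term for an identifier (null if unknown). */
    String materialize(final long id) {
        return id2Term.get(id);
    }
}
```

Every materialization is an extra lookup against the reverse index, which is the cost the next paragraphs are concerned with.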
While this keeps the statement indices relatively slim, it does necessitate additional index lookups against the lexicon to materialize RDF terms from their term identifiers. This makes it impossible to do numeric filtering during join processing using the statement indices alone, because filters like (?age < 35) are impossible to evaluate when ?age is represented as a term identifier. This becomes especially problematic in scale-out when the RDF term for a particular term identifier might not even be on the same machine as the shard where the filter is being evaluated during join processing.
The lexicon refactor moves from a model where all RDF terms are referenced by term identifiers from the statement indices to one where certain terms are embedded inline directly into the statement indices themselves based on the datatype of the RDF term. Terms that can be represented compactly in statement keys directly are inlined by value rather than assigned a term identifier to reference them. For example, why create a long integer term identifier to reference an RDF Literal that has a datatype of long integer? It makes much more sense to embed that Literal directly, rather than go through an unnecessary mapping to a term identifier in the lexicon. This applies to all numerical RDF Literal values. Once these numerics are inlined directly in the statement indices, it becomes possible to apply numerical filters during join processing without having to materialize terms from the lexicon.
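To make the benefit concrete, here is a minimal sketch of inlining a long value into a key component and filtering on it without any lexicon lookup. The encoding shown (flip the sign bit, write big-endian) is a common way to make unsigned byte order match numeric order; we are assuming it for illustration and not claiming it is the exact bigdata key format. All names are hypothetical.

```java
import java.nio.ByteBuffer;

/** Sketch: an inline long can be decoded and filtered straight from the key bytes. */
final class InlineLongSketch {

    /** Encodes v so that unsigned lexicographic byte order equals numeric order. */
    static byte[] encode(final long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v ^ Long.MIN_VALUE).array();
    }

    /** Recovers the original value directly from the key bytes. */
    static long decode(final byte[] key) {
        return ByteBuffer.wrap(key).getLong() ^ Long.MIN_VALUE;
    }

    /** The filter (?age < 35), evaluated on the inline value alone. */
    static boolean ageLessThan35(final byte[] inlineAge) {
        return decode(inlineAge) < 35;
    }

    /** Unsigned byte-wise comparison, the order a B+Tree would use for keys. */
    static int compareUnsigned(final byte[] a, final byte[] b) {
        final int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            final int c = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return a.length - b.length;
    }
}
```

With a term identifier in the key, the same filter would first require materializing the literal from the reverse index.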
In addition to numerics, it’s also possible to inline some non-numeric datatypes, such as UUIDs. This is useful in “told bnode mode”, a mode where BNode ids are maintained by the database. BNodes using UUIDs for their id can be inlined into the statement indices without the need for term identifiers and the lexicon.
The statement indices use keys that are built from the encoded versions of the RDF terms that compose the particular statement for a given key order. For example, the SPO index uses keys that are built by first encoding the S term, then appending the encoded P term, and finally appending the encoded O term. The OSP index keys use the same encoded terms but in a different order. Before the lexicon refactor, terms were always encoded as simple long integers (8 bytes). Thus statement keys were always a fixed length – 24 bytes for triples (3 terms), and 32 bytes for quads (4 terms).
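The fixed-length key layout from before the refactor is easy to sketch: each term is an 8-byte long, and a key is just the terms concatenated in the order the index requires. The class and method names below are illustrative only.

```java
import java.nio.ByteBuffer;

/** Sketch of pre-refactor statement keys: 8 bytes per term identifier. */
final class StatementKeySketch {

    /** Concatenates the term identifiers in the given order. */
    static byte[] key(final long... termIds) {
        final ByteBuffer buf = ByteBuffer.allocate(termIds.length * Long.BYTES);
        for (final long id : termIds) {
            buf.putLong(id);
        }
        return buf.array();
    }

    /** Key for the SPO index: S, then P, then O. */
    static byte[] spo(final long s, final long p, final long o) {
        return key(s, p, o);
    }

    /** Key for the OSP index: the same terms in a different order. */
    static byte[] osp(final long s, final long p, final long o) {
        return key(o, s, p);
    }
}
```

Three terms give 3 × 8 = 24 bytes and four terms give 4 × 8 = 32 bytes, matching the triple and quad key lengths quoted above.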
With the lexicon refactor, we now use one additional byte to represent metadata about the term – whether or not it is inline (1 bit), what type of term it is, URI, Literal, or BNode (2 bits), what the datatype is (4 bits), and 1 additional bit for extensions (explained later). Currently supported intrinsic datatypes are boolean, byte, short, int, long, float, double, BigInteger, BigDecimal, and UUID. Support is also planned for unsigned byte, unsigned short, unsigned int, and unsigned long. These intrinsic types correspond directly to XML Schema datatypes. RDF Literals that are datatyped using XML Schema datatypes will be automatically inlined using one of the intrinsic types. In addition, the term itself is no longer necessarily exactly 8 bytes – only term identifiers, inline longs, and inline doubles need 8 bytes; most other intrinsic types can be represented using fewer. The exceptions to this are BigInteger, BigDecimal, and UUID, which require more than 8 bytes.
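The metadata byte can be sketched as a simple bit-packing exercise. The field widths (1 + 2 + 4 + 1 bits) come from the description above, but the bit positions, the enum, and the numeric datatype codes are assumptions for this example. The authoritative layout is in the com.bigdata.rdf.internal javadoc.

```java
/** Sketch of a one-byte term header: inline flag, term type, datatype, extension flag. */
final class FlagsByteSketch {

    enum TermType { URI, LITERAL, BNODE }

    /** Assumed layout: bit 7 = inline, bits 6-5 = term type, bits 4-1 = datatype, bit 0 = extension. */
    static byte pack(final boolean inline, final TermType type, final int datatype,
            final boolean extension) {
        if (datatype < 0 || datatype > 15) {
            throw new IllegalArgumentException("datatype code must fit in 4 bits");
        }
        final int bits = ((inline ? 1 : 0) << 7)
                | (type.ordinal() << 5)
                | (datatype << 1)
                | (extension ? 1 : 0);
        return (byte) bits;
    }

    static boolean isInline(final byte flags) {
        return (flags & 0x80) != 0;
    }

    static TermType termType(final byte flags) {
        return TermType.values()[(flags >> 5) & 0x3];
    }

    static int datatype(final byte flags) {
        return (flags >> 1) & 0xF;
    }

    static boolean isExtension(final byte flags) {
        return (flags & 0x1) != 0;
    }
}
```

Four datatype bits allow up to 16 codes, which is enough for the ten intrinsic types listed above plus the four planned unsigned types.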
What this means for developers is that almost everywhere we used to use java.lang.Long, we now use com.bigdata.rdf.internal.IV. IV stands for “internal value”, which covers term identifiers and all varieties of inline values. Read through the javadoc in this package for much more detail on everything related to the refactor.
You can also define custom extensions that are projected onto one of the intrinsic types using the extension bit. For example, a datatype for “milliseconds since the epoch” can be defined and used to datatype literals with a long integer string value. These literals can then be inlined directly using the long integer intrinsic type as a delegate. Custom extensions are Java classes that are responsible for round-tripping an RDF term to and from an inline internal value for a particular datatype. Two such extensions are provided as examples in the bigdata codebase – one for milliseconds since the epoch and one that demonstrates how to map an enum of colors onto the xsd:byte intrinsic type.
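The round-tripping responsibility of an extension can be sketched without the bigdata interfaces. The two mappings below mirror the examples mentioned above (epoch milliseconds onto a long, a color enum onto a byte), but the class, the method names, and the particular colors are invented for illustration. The real example extensions in the codebase are the reference.

```java
import java.util.Locale;

/** Sketch of what a datatype extension does: literal label <-> inline intrinsic value. */
final class ExtensionSketch {

    /** Hypothetical color vocabulary projected onto the byte intrinsic type. */
    enum Color { RED, GREEN, BLUE }

    /** "Milliseconds since the epoch" literal label -> inline long. */
    static long epochToInline(final String label) {
        return Long.parseLong(label);
    }

    /** Inline long -> literal label. */
    static String inlineToEpoch(final long value) {
        return Long.toString(value);
    }

    /** Color literal label -> inline byte (the enum ordinal). */
    static byte colorToInline(final String label) {
        return (byte) Color.valueOf(label.toUpperCase(Locale.ROOT)).ordinal();
    }

    /** Inline byte -> color literal label. */
    static String inlineToColor(final byte value) {
        return Color.values()[value].name().toLowerCase(Locale.ROOT);
    }
}
```

In both cases the database only stores the delegate intrinsic value. The extension supplies the datatype-specific conversion on the way in and out.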
It is also important to note that this change breaks binary compatibility for the RDF store. It was simply not possible to maintain backward compatibility under this refactor without creating very difficult code maintenance problems. The recommended migration path is a full export of old databases and then an import using the new version.
Preliminary testing shows no penalty for inlining in queries that don’t take advantage of numerical filter operators and significant improvement for queries that do (for example BSBM query 4). Inlining is also expected to dramatically improve query performance for scale-out, which we intend to test in the upcoming weeks as we also work on query optimizations for scale-out quad query.
We’ve just released a new version of bigdata®. This is a bigdata® snapshot release. This release is capable of loading 1B triples in under one hour on a 15 node cluster and has been used to load up to 13B triples on the same cluster. JDK 1.6 is required.
See [1] for instructions on installing bigdata®, [2] for the javadoc and [3] and [4] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata®, see [5].
Please note that we recommend checking out the code from SVN using the tag for this release. The code will build automatically under Eclipse. You can also build the code using the Ant script. The cluster installer requires the use of the Ant script. You can check out this release from the following URL:
https://bigdata.svn.sourceforge.net/svnroot/bigdata/branches/BIGDATA_RELEASE_0_83_1
New features:
- Inlining XSD numerics, xsd:boolean, or custom datatype extensions into the statement indices. Inlining provides a smaller footprint and faster queries for data using XSD numeric datatypes. In order to introduce inlining we were forced to make a change in the physical schema for the RDF database which breaks binary compatibility for existing stores. The recommended migration path is to export the data and import it into a new bigdata instance.
- Refactor of the dynamic sharding mechanism for higher performance.
- The SparseRowStore has been modified to make Unicode primary keys decodable by representing Unicode primary keys using UTF8 rather than Unicode sort keys. This change also allows the SparseRowStore to work with the JDK collator option which embeds nul bytes into Unicode sort keys. This change breaks binary compatibility, but there is an option for historical compatibility.
The roadmap for the next releases includes:
- Query optimizations;
- Support for high-volume analytic query workloads and SPARQL aggregations;
- High availability for the journal and the cluster;
- Simplified deployment, configuration, and administration for clusters.
For more information, please see the following links:
[1] http://bigdata.wiki.sourceforge.net/GettingStarted
[2] http://www.bigdata.com/bigdata/docs/api/
[3] http://sourceforge.net/projects/bigdata/
[4] http://www.bigdata.com/blog
[5] http://www.systap.com/bigdata.htm
After reading a couple of papers on research being done on geospatial data with RDF [1,2], we’ve been thinking about how we can best support geospatial RDF data at very large scale.
Bigdata is all about sharded B+Trees. These indices are the basic building blocks that allow us to achieve massive scale for all sorts of data models, most notably RDF. Our RDF database uses two relations to model its data – the lexicon relation and the statement relation. The lexicon relation currently uses two sharded B+Tree indices to forward and reverse map RDF Values (URIs, Literals, and BNodes) to 64-bit long term identifiers. These term identifiers are then used inside of keys in the statement relation indices to model the statements themselves. The statement relation implements a “perfect access path” strategy to efficiently process arbitrary joins, which requires three indices for triples or six for quads as described in the YARS paper on Optimized Index Structures.
In the scale-out architecture, B+Tree indices are sharded and distributed across a cluster of data services. The indices are dynamically key-range partitioned such that each data service in the cluster handles roughly the same amount of data. As shards grow, they are split and possibly moved across the cluster to another data service (depending on relative load). This approach to scale-out implies that B+Tree tuples close to each other in an index key space will be located physically close to each other on the cluster and on disk and gives us good locality for data within individual index shards. It would be nice to take advantage of this property for geospatial data as well – we want data that is “nearby” in physical or temporal space to also be nearby on disk, preferably on the same index shard.
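A toy model of key-range partitioning shows why nearby keys end up together. Here a TreeMap maps each shard’s starting key to the name of a data service, and a lookup finds the shard whose range contains the key. String keys, the service names, and the class itself are assumptions for this sketch, and dynamic splitting and moving of shards is left out.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy key-range partition table: each shard covers [fromKey, next shard's fromKey). */
final class ShardLocatorSketch {

    // Shard start key -> data service hosting that key range.
    private final TreeMap<String, String> shards = new TreeMap<>();

    void addShard(final String fromKey, final String dataService) {
        shards.put(fromKey, dataService);
    }

    /** Finds the data service whose key range contains the given key. */
    String locate(final String key) {
        final Map.Entry<String, String> e = shards.floorEntry(key);
        if (e == null) {
            throw new IllegalStateException("no shard covers key: " + key);
        }
        return e.getValue();
    }
}
```

Two keys that sort next to each other resolve to the same shard unless a shard boundary happens to fall between them, which is the locality property the geospatial index wants to exploit.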
One approach to modeling geospatial data in RDF is to tag RDF Literals with custom data types describing geospatial extents and then use those geospatial literals inside of statements (and queries). An RDF resource can be tagged with a geospatial literal to give it a physical location in space and/or time. Our current set of indices would do nothing to ensure good locality for resources physically or temporally near one another, but if we introduced an additional index that used a space-filling curve (like Z-ordering) as a prefix to its keys, we could cluster nearby geospatial literals together in the index. For example, two geospatial point literals right next to each other in an X-Y plane will tend to have Z-values very close to one another. If the key in the geospatial index was composed of the Z-value plus the long term identifier, this would be a means of clustering those two literals near one another in the index.
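The Z-ordering idea can be sketched in a few lines. The code below interleaves the bits of two grid coordinates into a Z-value (a Morton code) and prepends it to a term identifier to form a key. The 32-bit non-negative grid coordinates, the 16-byte key layout, and the names are assumptions for illustration, not a description of an existing bigdata index.

```java
import java.nio.ByteBuffer;

/** Sketch of a Z-order (Morton) key prefix for a hypothetical geospatial index. */
final class ZOrderSketch {

    /** Interleaves x and y: bit i of x lands at bit 2i, bit i of y at bit 2i + 1. */
    static long interleave(final int x, final int y) {
        return spread(x) | (spread(y) << 1);
    }

    /** Spreads the 32 bits of v (treated as unsigned) so that bit i moves to bit 2i. */
    private static long spread(final int v) {
        long r = v & 0xFFFFFFFFL;
        r = (r | (r << 16)) & 0x0000FFFF0000FFFFL;
        r = (r | (r << 8)) & 0x00FF00FF00FF00FFL;
        r = (r | (r << 4)) & 0x0F0F0F0F0F0F0F0FL;
        r = (r | (r << 2)) & 0x3333333333333333L;
        r = (r | (r << 1)) & 0x5555555555555555L;
        return r;
    }

    /** Hypothetical key: 8-byte Z-value followed by the 8-byte term identifier. */
    static byte[] key(final int x, final int y, final long termId) {
        return ByteBuffer.allocate(2 * Long.BYTES)
                .putLong(interleave(x, y))
                .putLong(termId)
                .array();
    }
}
```

Because the Z-value is the key prefix, points that are close in the X-Y plane tend to share long key prefixes and so sort near each other, although a Z-curve does have occasional jumps where neighboring cells are far apart in key order.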
By using a space-filling curve and an additional geospatial index, we now have a means of achieving good locality for geospatial literals inside our RDF database. We can then use this geospatial index to resolve physical or temporal lookups and queries into sets of geospatial literal term identifiers, which can then be used in joins against the statement indices.
We’ve tried to capture some of this thinking in a geospatial slide deck. Feedback is appreciated!