Mike Personick

MapGraph is a Massively Parallel Graph Processing API (previously known as “MPGraph”) that lets you express graph analytics (e.g. BFS, Shortest Path) in a vertex-centric programming abstraction known as GAS (Gather-Apply-Scatter). The API is based on the same Gather-Apply-Scatter model used in GraphLab. MapGraph comes in two flavors – a CPU version that is currently integrated into bigdata and a standalone GPU version that delivers up to 3 billion traversed edges per second on a single GPU. MapGraph on the GPU is up to two orders of magnitude faster than parallel CPU implementations on up to 24 CPU cores and has performance comparable to a state-of-the-art manually optimized GPU implementation of the same analytic. MapGraph’s easy-to-use GAS API allows new algorithms to be implemented in a few hours, and those algorithms can then fully exploit the data-level parallelism of the GPU.
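To make the GAS abstraction concrete, here is a conceptual sketch of BFS decomposed into gather, apply, and scatter phases. Note that this is plain Java over an in-memory adjacency list, not the actual MapGraph API; it is only meant to illustrate the vertex-centric pattern:

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual GAS-style BFS. Not the MapGraph API, just an illustration
// of the vertex-centric model described above.
public class GasBfsSketch {

    // Returns the BFS depth of every vertex reachable from the source.
    public static Map<Integer, Integer> bfs(
            final Map<Integer, List<Integer>> adj, final int source) {

        final Map<Integer, Integer> depth = new HashMap<Integer, Integer>();
        depth.put(source, 0);
        Set<Integer> frontier = Collections.singleton(source);

        while (!frontier.isEmpty()) {

            // Gather: each vertex touched by the frontier collects the
            // minimum candidate depth from its frontier in-neighbors.
            final Map<Integer, Integer> gathered = new HashMap<Integer, Integer>();
            for (final int u : frontier) {
                final List<Integer> edges = adj.get(u);
                if (edges == null)
                    continue;
                for (final int v : edges) {
                    final Integer prior = gathered.get(v);
                    final int candidate = depth.get(u) + 1;
                    if (prior == null || candidate < prior)
                        gathered.put(v, candidate);
                }
            }

            // Apply: an unvisited vertex adopts its gathered depth.
            // Scatter: vertices whose state changed form the next frontier.
            final Set<Integer> next = new HashSet<Integer>();
            for (final Map.Entry<Integer, Integer> e : gathered.entrySet()) {
                if (!depth.containsKey(e.getKey())) {
                    depth.put(e.getKey(), e.getValue());
                    next.add(e.getKey());
                }
            }
            frontier = next;
        }
        return depth;
    }
}

The point of the abstraction is that you supply only these three phases; the engine owns frontier management and parallel execution.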

The CPU version of MapGraph operates over graph data inside bigdata and is exposed via a SPARQL 1.1 Service Call:


PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?s ?p ?o ?depth {
  # run the Shortest Path algorithm
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.SSSP" .
    gas:program gas:in <:112.14.24.90> . # starting point
    gas:program gas:target <:135.174.12.12> . # target vertices
    gas:program gas:out ?s . # bound to the visited vertices.
    gas:program gas:out1 ?depth . # bound to the depth of the visited vertices.
  }
  # join with the statement indices
  ?s ?p ?o . # extract all links and attributes for the visited vertices.
}

This query combines a shortest-path operation, which finds all vertices on the shortest path between two nodes, with a join against the statement indices that fills in all the edges along that path. The example above assumes a connected graph of IP addresses, such as one you might build from traceroute data.

We’ve completed an implementation of SPARQL 1.1 Property Paths to be included in an upcoming bigdata minor release. For an early preview of this feature, grab the latest code from the bigdata 1.2 maintenance branch in SVN (https://bigdata.svn.sourceforge.net/svnroot/bigdata/branches/BIGDATA_RELEASE_1_2_0). Feedback is greatly appreciated!
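To give a quick taste of the feature before the release, here is the kind of query property paths make possible, sketched as a Java query string. The ex: vocabulary is made up for illustration:

// Property path sketch. The "+" operator matches one or more ex:knows
// edges, so this finds everyone transitively reachable from ex:Alice.
public class PathQuerySketch {
    public static void main(final String[] args) {
        final String query =
                "PREFIX ex: <http://www.example.com/> \n" +
                "SELECT ?person \n" +
                "WHERE { ex:Alice ex:knows+ ?person }";
        System.out.println(query); // submit via your SPARQL client of choice
    }
}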

See https://sourceforge.net/apps/trac/bigdata/ticket/495.

Did you know that bigdata has a built-in REST API for client-server access to the RDF database? We call this interface the “NanoSparqlServer”, and its API is outlined in detail on the wiki:

https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=NanoSparqlServer

What’s new with the NSS is that we’ve recently added a Java API around it so that you can write client code without having to understand the HTTP API or make HTTP calls directly. (This is why there is suddenly a new dependency on Apache’s HTTP Components in the codebase.) The Java wrapper is called “RemoteRepository”. If you’re comfortable writing application code against the Sesame SAIL/Repository API, you should feel right at home with the RemoteRepository class. It’s not exactly the same, but it is very similar.
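Here is a rough sketch of what client code against RemoteRepository might look like. The constructor and method signatures below are my approximations of the Sesame-like API described above, so treat them as assumptions and check the class itself for the real thing:

import org.openrdf.query.BindingSet;
import org.openrdf.query.TupleQueryResult;

import com.bigdata.rdf.sail.webapp.client.RemoteRepository;

public class ClientSketch {

    public static void main(final String[] args) throws Exception {

        // Assumption: a NanoSparqlServer is running at this endpoint and
        // the client can be constructed directly from the endpoint URL.
        final RemoteRepository repo =
                new RemoteRepository("http://localhost:9999/sparql");

        final TupleQueryResult result =
                repo.prepareTupleQuery("SELECT * { ?s ?p ?o } LIMIT 10")
                        .evaluate();
        try {
            while (result.hasNext()) {
                final BindingSet bs = result.next();
                System.out.println(bs);
            }
        } finally {
            result.close();
        }
    }
}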

The class itself is pretty self-explanatory, but if you like examples, there is a test case for every API call in RemoteRepository in the class TestNanoSparqlClient. (That test class also conveniently demonstrates how to launch a NanoSparqlServer wrapping a bigdata journal using Jetty, which it does at the beginning of every test.)
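For completeness, an embedded launch might look roughly like this. The newInstance overload and init parameters shown are assumptions on my part; TestNanoSparqlClient has the authoritative setup:

import java.util.LinkedHashMap;

import org.eclipse.jetty.server.Server;

import com.bigdata.rdf.sail.webapp.NanoSparqlServer;

public class ServerSketch {

    public static void main(final String[] args) throws Exception {

        // Assumption: this overload wires the webapp to a journal
        // described by a bigdata property file.
        final Server server = NanoSparqlServer.newInstance(
                9999, // port
                "/path/to/bigdata.properties", // journal configuration
                new LinkedHashMap<String, String>()); // init params

        server.start();
        server.join(); // block until the server is shut down
    }
}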

I put together a more useful example of how to write a custom SPARQL function with bigdata. It’s up on the wiki here:

https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=CustomFunction

The example details a common use case – filtering out solutions based on security credentials for a particular user. For example, if you wanted to return a list of documents visible to the user “John”, you could do it with a custom SPARQL function:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://www.example.com/>
SELECT ?doc
{
  ?doc rdf:type ex:Document .
  FILTER(ex:validate(?doc, ?user)) .
}
BINDINGS ?user {
  (ex:John)
}

The function is called by referencing its unique URI, in this case ex:validate. This URI must be registered with bigdata’s FunctionRegistry along with an appropriate factory and operator; the wiki details how to do that. In the query above, the function is called with two arguments: the document to be validated and the user to validate against. The user in this simple example is a constant supplied via the BINDINGS clause. Always remember that bigdata custom functions are executed one solution at a time. They do not yet benefit from vectored execution and must operate without reading from the indices on each evaluation call, so they are not suitable for pulling data out of the database. When execution requires touching the indices, a custom service (distinct from a custom function) is the more appropriate choice; custom services are how we implement SPARQL 1.1 Federation.
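To illustrate the one-solution-at-a-time contract, here is what ex:validate conceptually does, written in plain Java. This is not bigdata’s operator API (the wiki shows the real FunctionRegistry and operator code), and the ACL structure is a made-up stand-in for whatever security data you preload:

import java.util.Map;
import java.util.Set;

// Conceptual stand-in for the ex:validate operator. The real thing is a
// bigdata value expression registered via the FunctionRegistry.
public class ValidateSketch {

    // Assumption: an in-memory ACL loaded at startup, since the function
    // may not read from the indices during evaluation.
    private final Map<String, Set<String>> visibleDocsByUser;

    public ValidateSketch(final Map<String, Set<String>> visibleDocsByUser) {
        this.visibleDocsByUser = visibleDocsByUser;
    }

    // Invoked once per solution by the FILTER, with that solution's
    // bindings for ?doc and ?user.
    public boolean validate(final String doc, final String user) {
        final Set<String> visible = visibleDocsByUser.get(user);
        return visible != null && visible.contains(doc);
    }
}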

Bigdata is a high-performance RDF database that uses B+tree indices to index RDF statements and terms. But did you know that bigdata also uses these same B+tree indices to provide built-in support for Lucene-style free text search?

See the wiki for more details.
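To give a flavor of what that looks like (this snippet is my sketch rather than the wiki’s; bds:search is the built-in search predicate, while the rdfs:label join is just an example pattern), a free text search can be embedded like any other query string:

// Free text search via the bds:search "magic predicate": the first
// pattern binds literals matching the search term; the second joins back
// through the statement indices to the subjects those literals describe.
public class SearchQuerySketch {
    public static void main(final String[] args) {
        final String query =
                "PREFIX bds: <http://www.bigdata.com/rdf/search#> \n" +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> \n" +
                "SELECT ?subject ?label \n" +
                "WHERE { \n" +
                "  ?label bds:search \"bigdata\" . \n" +
                "  ?subject rdfs:label ?label . \n" +
                "}";
        System.out.println(query);
    }
}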

Sometimes it is nice to be able to say things about statements, such as where they came from and who asserted them. The RDF data model does not provide a convenient mechanism for assigning identity to particular statements or for making statements about statements. RDF reification is cumbersome, results in a huge expansion in the number of triples in the database, and is incompatible with most inference and rule engines.
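To make the expansion concrete: under standard RDF reification, describing a single triple such as (<mike>, <likes>, <RDF>) takes four additional triples before you can say anything about it:

1. (_:stmt, rdf:type, rdf:Statement)
2. (_:stmt, rdf:subject, <mike>)
3. (_:stmt, rdf:predicate, <likes>)
4. (_:stmt, rdf:object, <RDF>)

Compare that with the two-statement SIDs version of the same information shown below.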

Named graphs (quads) are one way to approach provenance. By grouping triples into named graphs and assigning a URI as the graph identifier, you can then make statements about the named graph to identify the provenance of the group of triples (the group could even, in theory, contain a single triple). Unfortunately this approach has a few drawbacks as well. Partitioning the knowledge base into groups creates challenges for inference and rule engines, and full named graph support in bigdata requires twice as many statement indices as triples mode.

If all you need is an unpartitioned, inference-capable knowledge base with the ability to make assertions about statements, bigdata provides you with a third alternative to simple triples or fully indexed quads: statement identifiers (SIDs). With SIDs, the database acts as if it is in triples mode, but each triple is assigned a statement identifier (on demand) that can be used in additional statements (meta-statements):

(s, p, o, c)
1. (<mike>, <likes>, <RDF>, :sid1)
2. (:sid1, <source>, <http://bigdata.com>)

Statement 1 asserts that <mike> likes <RDF>, and that triple is assigned the statement identifier :sid1. Statement 2 then uses :sid1 to make a statement about statement 1, namely that its source is <http://bigdata.com>.

Here is the slide deck for those who missed our presentation at SemTech this year on the upcoming high-availability architecture for bigdata. Also see our HA whitepapers:

[1] http://www.bigdata.com/whitepapers/bigdata_ha_whitepaper.pdf

[2] http://www.bigdata.com/whitepapers/Bigdata-HA-Quorum-Detailed-Design.pdf

We’ve just released a new version of bigdata.