The Graph Data Management 2012 workshop was last week in Washington, DC. The workshop brought together an interesting mixture of people from several different backgrounds. There were people focused on data mining and prediction, people focused on graph algorithms (iterative algorithms over materialized graphs), and several presentations on “graph databases” (three on RDF databases and one on HyperGraphDB). Many thanks to the workshop organizers for pulling together such an interesting event!

It is clear that the “graph database” space is currently handicapped by a lack of standards. SPARQL can certainly solve many of the problems there, but it lacks a standardized way of dealing with provenance (aka link attributes). We have efficient extensions for this, and it sounds like at least Virtuoso will be picking them up as well, so maybe we can drive standardization that way. SPARQL supports property paths, but it lacks a means to express iterative refinement algorithms so that they can be executed efficiently within the database. SPARQL UPDATE commands can operate iteratively on data sets on the server without round-tripping large graphs to the client, but there is no standardized way to specify the control logic for such updates. And without extensions that declare which graphs or solution sets should be durable and which should be wired into main memory, it is difficult to use SPARQL UPDATE for iterative algorithms that assemble an annotated graph. Equally worrisome, it is not yet possible to create good benchmarks for graph databases, because benchmarking against the low-level APIs wipes out the tremendous advantage gained from vectored evaluation inside the database.
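
To make the round-tripping point concrete, here is one way an iterative computation can be pushed into the database with plain SPARQL 1.1 UPDATE: a transitive-closure sketch over illustrative ex:knows and ex:reaches predicates. What the standard is missing is the loop itself; the client has to keep re-issuing the expansion step until a round inserts no new triples.

```sparql
PREFIX ex: <http://example.org/>

# Seed step: every direct edge is reachable. The ex:knows and
# ex:reaches predicates are illustrative, not part of any standard.
INSERT { ?a ex:reaches ?b }
WHERE  { ?a ex:knows ?b } ;

# One expansion step. SPARQL UPDATE has no standard loop construct,
# so a client must re-issue this request until a round inserts no
# new triples -- that control logic lives outside the database.
INSERT { ?a ex:reaches ?c }
WHERE {
  ?a ex:reaches ?b .
  ?b ex:reaches ?c .
  FILTER NOT EXISTS { ?a ex:reaches ?c }
}
```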

We will be announcing new features over the next few weeks and the coming months designed to address some of these issues. The first feature will extend SPARQL 1.1 UPDATE to let you provision and manage solution sets. A preview of this SPARQL UPDATE extension is published on the bigdata wiki. The extension adds just a little bit of syntax, but a whole lot of power. It was originally envisioned to give people the ability to page through large result sets without re-evaluating complex joins – a use case which is illustrated on the wiki. However, we see lots of opportunities beyond an application-aware SPARQL cache.
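
The wiki preview has the authoritative grammar; as a rough sketch of the idea (the %reviews name, the INSERT INTO form, and the INCLUDE form below are illustrative), the first request materializes an expensive join once into a named solution set, and the second pages through it without re-evaluating the join:

```sparql
PREFIX ex: <http://example.org/>

# Materialize an expensive join once into a named solution set.
# The %reviews name and the INSERT INTO syntax are a sketch; see
# the wiki preview for the actual grammar.
INSERT INTO %reviews
SELECT ?product ?review
WHERE {
  ?product ex:category ex:Widget .
  ?review  ex:reviews  ?product .
}

# Issued as a separate query: page through the cached solutions
# without re-running the join.
SELECT ?product ?review
WHERE { INCLUDE %reviews }
ORDER BY ?product
OFFSET 100 LIMIT 50
```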

Another feature, which will come out later this year, is a distributed client/server graph protocol. It is designed to address the tight coupling of applications with graph databases, provide a fast, scalable object-level cache for graph data, and offer both fast in-memory traversal on the client and efficient subgraph matching on the server. Clients will also be able to create “graph transactions” and post updates back to the server and the write-through cache fabric. We plan to have multiple client language bindings for this, providing graph database access within the browser, in Java, etc. We are even looking at a GPU binding for pure computational speed. The language bindings will be generated from metadata describing the object models.
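
The protocol itself is not yet published, but the division of labor it targets can be sketched with standard SPARQL today (the resource and predicate below are illustrative): the server does the subgraph matching and ships a bounded subgraph to the client, which then traverses it in memory instead of issuing a query per edge.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Server-side subgraph matching: extract everything within two
# foaf:knows hops of a starting resource (both URIs illustrative)
# so the client can traverse the resulting subgraph in memory.
CONSTRUCT { ?s ?p ?o }
WHERE {
  <http://example.org/alice> foaf:knows?/foaf:knows? ?s .
  ?s ?p ?o .
}
```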