I had an interesting conversation with one of the Blueprints developers (Joshua Shinavier) about graph databases during the CSHALS 2012 conference.

Blueprints appears to follow an object model very similar to the one that Martyn Cutcher had developed in the Generic Persistent Object (GPO) / Generic Object Model (GOM). GOM allows schema-flexible objects, object link sets, and link properties via link “reification”. In fact, we have had most of a GPO/GOM implementation for bigdata since 2006. The main hangup has been finding the free time to support a horizontally scaled GPO/GOM model (in particular, horizontal scaling for GPO link sets). A very similar technology was used in the core of the K42 engine by STEP UK (K42 was a high performance object database engine underlying an XML Topic Maps engine back in 2000).

Bigdata also has a native provenance mode for the SAIL interface featuring Statement Identifiers (SIDs). This mode was developed to support the intelligence and topic maps communities and allows statements about statements. We’ve blogged on this in the past. The SIDs mode lets you attach attributes to “links” efficiently, and even lets you attach links to links (statement identifiers can appear in any position of a statement), which is more general than the Blueprints API.

The SIDs mode is extremely efficient. A statement is represented simply by its “{s,p,o}” tuple, using the internal values (IVs) for that statement. This means that there is no indirection through indices when performing reverse traversal from a statement about a statement to the statement being described.
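
To make that concrete, here is a minimal sketch in plain Java (the names are mine and purely illustrative, not the bigdata internals): a statement is itself a term, so it can stand in any position of another statement, and following a description back to the described statement is a direct reference rather than an index lookup.

import java.util.ArrayList;
import java.util.List;

/**
 * Conceptual sketch of statement identifiers (SIDs) in plain Java. This is
 * illustrative only, not the bigdata internals: the point is that a
 * statement's identifier is just its {s,p,o} tuple, so a statement can
 * stand in any position of another statement.
 */
public class SidSketch {

    /** A term is either a constant (stand-in for a URI or literal) or a statement. */
    interface Term {
    }

    static final class Constant implements Term {
        final String value;
        Constant(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    /** A statement is itself a Term, so statements can describe statements. */
    static final class Stmt implements Term {
        final Term s, p, o;
        Stmt(Term s, Term p, Term o) { this.s = s; this.p = p; this.o = o; }
        @Override public String toString() { return "{" + s + " " + p + " " + o + "}"; }
    }

    public static void main(String[] args) {
        List<Stmt> store = new ArrayList<>();

        // An ordinary "link": Alice knows Bob.
        Stmt link = new Stmt(new Constant("Alice"), new Constant("knows"), new Constant("Bob"));
        store.add(link);

        // An attribute on the link: a statement about the statement.
        Stmt about = new Stmt(link, new Constant("source"), new Constant("crawl-2012-03"));
        store.add(about);

        // A link on a link: the provenance assertion itself gets a confidence.
        store.add(new Stmt(about, new Constant("confidence"), new Constant("0.9")));

        // Reverse traversal from the description to the described statement
        // is a direct field access, with no indirection through an index.
        System.out.println("described: " + about.s);
    }
}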

However, all scalable (persistence-class) graph databases use indices. Even if you represent the object identifier as an integer, that integer is still indirected through an object index (and through the file system) in order to resolve the object. The GPO model caches all objects in RAM using weak references, so once an object has been retrieved, traversal is O(1); but access to disk is never less than O(log n), since an updatable data model always implies indices.
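
Here is a minimal sketch of that weak-reference caching pattern (plain Java; the loader and the object identifiers are hypothetical stand-ins for the GPO machinery):

import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Sketch of a weak-reference object cache in the spirit of the GPO model:
 * once an object has been materialized from the index, repeated traversal
 * to it is an O(1) map lookup; the first access still pays the O(log n)
 * index read. Illustrative only, not the bigdata/GOM code.
 */
public class WeakObjectCache<T> {

    private final ConcurrentMap<Long, WeakReference<T>> cache = new ConcurrentHashMap<>();

    public interface Loader<T> {
        /** Resolve the object identifier against the on-disk index: O(log n). */
        T readFromIndex(long id);
    }

    public T get(long id, Loader<T> loader) {
        WeakReference<T> ref = cache.get(id);
        T obj = (ref == null) ? null : ref.get();
        if (obj == null) {
            obj = loader.readFromIndex(id);          // index hit: O(log n)
            cache.put(id, new WeakReference<>(obj)); // GC may reclaim under memory pressure
        }
        return obj;                                  // cached hit: O(1)
    }
}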

There is a long history (read: war) between the network (CODASYL) and relational database camps. To my mind, both groups had useful things to say. The main argument of the relational group was that independence between the logical representation and the physical data model was necessary, and that it allowed for declarative query semantics, which in turn allowed for sophisticated query optimization. Query optimizers can generally produce query plans that do as well as all but the very best hand coded queries. The network database camp pointed out the flexibility of the data model and eventually showed that it was possible to produce declarative query languages for network databases. The issue was eventually settled in the marketplace, with the relational model taking the lead for several decades. See “What Goes Around Comes Around” by Stonebraker and Hellerstein for a somewhat slanted take on all of this.

Object databases and graph databases are very closely related to the earlier network databases. The same benefits (schema flexibility) and cautions (lack of a declarative query model) apply. APIs such as Blueprints can provide great convenience, but they force all query optimization onto the application writer.

Bigdata puts a lot of effort into query optimization. The most obvious place is join ordering: if you want to traverse from some vertex through some edges to some set of vertices, bigdata will take fast range counts on the access paths and choose a join order that can be several orders of magnitude faster than naive traversal. Bigdata can also use hash joins and variable pruning to dramatically speed up queries that visit intermediate vertex sets not required in the final solution set. This is possible through the combination of a high level declarative query language (SPARQL) and dedicated query optimization code. When using bigdata in its SIDs mode (aka provenance mode, aka graph database mode) you get all of that performance for “path traversal” in a “graph”. And you can have 50 billion edges in a graph on a single machine and efficiently scale that graph out across a cluster of machines. All in open source.
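
As a rough illustration of the join ordering idea, consider the following sketch. The range counts and patterns are invented and the greedy strategy is a simplification of what the real optimizer does, but it shows why cheap cardinality estimates from the indices matter:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/**
 * Sketch of static join ordering driven by fast range counts. Everything
 * here is made up for illustration (the counts, the patterns, the greedy
 * strategy); the actual bigdata optimizer is far more sophisticated. The
 * point is only that cheap cardinality estimates let the engine run the
 * most selective access path first.
 */
public class JoinOrderSketch {

    static final class TriplePattern {
        final String s, p, o; // null means "variable"
        TriplePattern(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        @Override public String toString() {
            return "(" + name(s) + " " + name(p) + " " + name(o) + ")";
        }
        private static String name(String t) { return t == null ? "?" : t; }
    }

    /** Stand-in for a fast key-range count against a B+Tree index. */
    static long rangeCount(TriplePattern tp) {
        long n = 1_000_000L;         // hypothetical store size
        if (tp.s != null) n /= 1000; // each bound position narrows the key range
        if (tp.p != null) n /= 10;
        if (tp.o != null) n /= 100;
        return Math.max(n, 1);
    }

    public static void main(String[] args) {
        List<TriplePattern> patterns = new ArrayList<>(Arrays.asList(
                new TriplePattern(null, "knows", null),
                new TriplePattern("Alice", "knows", null),
                new TriplePattern(null, "memberOf", "W3C")));

        // Greedy ordering: evaluate the most selective pattern first.
        patterns.sort(Comparator.comparingLong(JoinOrderSketch::rangeCount));

        for (TriplePattern tp : patterns) {
            System.out.println(rangeCount(tp) + "\t" + tp);
        }
    }
}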

There are graph traversal patterns which do not fit neatly into a high level query language without loop constructs. SPARQL actually provides for some of these via property paths, which we are in the process of building into bigdata. However, you can also drop down to the SAIL API with bigdata and run access paths based on triple patterns which correspond more or less directly to the vertex-edge traversal patterns of Blueprints, and which support not only link attributes but also links on links.
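
Here is a sketch of that kind of triple-pattern traversal coded against the Sesame SAIL API (which BigdataSail implements); the helper method and the vertex URI are illustrative, and error handling is kept minimal:

import info.aduna.iteration.CloseableIteration;
import org.openrdf.model.Statement;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.sail.Sail;
import org.openrdf.sail.SailConnection;
import org.openrdf.sail.SailException;

/**
 * Sketch of vertex-edge style traversal directly against the Sesame SAIL
 * API: a triple pattern with a bound subject is exactly an out-edge scan.
 * Assumes a Sail implementation (e.g. BigdataSail) has been constructed
 * and initialized elsewhere.
 */
public class SailTraversalSketch {

    public static void printOutEdges(Sail sail, String vertexUri) throws SailException {
        ValueFactory vf = sail.getValueFactory();
        URI vertex = vf.createURI(vertexUri);

        SailConnection cxn = sail.getConnection();
        try {
            // getStatements(s, p, o, ...) with only s bound: all out-edges of the vertex.
            CloseableIteration<? extends Statement, SailException> itr =
                    cxn.getStatements(vertex, null, null, false /* includeInferred */);
            try {
                while (itr.hasNext()) {
                    Statement edge = itr.next();
                    System.out.println(edge.getPredicate() + " -> " + edge.getObject());
                }
            } finally {
                itr.close();
            }
        } finally {
            cxn.close();
        }
    }
}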

There is no native Blueprints implementation for bigdata. You can certainly try the Blueprints-to-Sail integration against the BigdataSail, but I would also encourage people to try running bigdata in its SIDs mode and enjoy the performance you can get from optimized high level query against a high performance graph database. If you need vertex/edge traversal, you can get that from the Sail, but you will have much higher performance if you avoid the RDF Value materialization step and stay within the bigdata native API.
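
For completeness, the Blueprints-to-Sail wiring looks roughly like the following sketch. Treat the specifics as assumptions: the SailGraph package name shown is from Blueprints 1.x and has moved between versions, and I have deliberately elided the BigdataSail configuration properties.

import java.util.Properties;

import org.openrdf.sail.Sail;

import com.bigdata.rdf.sail.BigdataSail;
import com.tinkerpop.blueprints.pgm.Graph;
import com.tinkerpop.blueprints.pgm.impls.sail.SailGraph;

/**
 * Sketch of running the Blueprints API over the BigdataSail via the
 * generic Blueprints-to-Sail adapter. Package names and configuration
 * are assumptions; consult the wiki [2] for the real options.
 */
public class BlueprintsOverBigdata {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // props.setProperty(...) : journal file, triples vs. SIDs mode, etc.

        Sail sail = new BigdataSail(props);
        Graph graph = new SailGraph(sail); // SailGraph wraps any Sesame Sail

        try {
            // Generic Blueprints vertex/edge calls work here, but every
            // Vertex/Edge access pays the RDF Value materialization cost
            // that the native bigdata API avoids.
            System.out.println(graph);
        } finally {
            graph.shutdown();
        }
    }
}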

This is a 1.0.x maintenance release of bigdata(R). New users are encouraged to go directly to the 1.1.0 release. Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF, capable of loading 1B triples in under one hour on a 15-node cluster. Bigdata operates in both a single machine mode (Journal) and a cluster mode (Federation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion triples / quads. The Federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead associated with operating a cluster is an acceptable tradeoff for essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single machine RDF database. For custom development and cluster installations we recommend checking out the code from SVN using the tag for this release. The code will build automatically under Eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script.

You can download the WAR from:

http://sourceforge.net/projects/bigdata/

You can checkout this release from:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/BIGDATA_RELEASE_1_0_6

Feature summary:

- Single machine data storage to ~50B triples/quads (RWStore);
- Clustered data storage is essentially unlimited;
- Simple embedded and/or webapp deployment (NanoSparqlServer);
- Triples, quads, or triples with provenance (SIDs);
- 100% native SPARQL 1.0 evaluation with lots of query optimizations;
- Fast RDFS+ inference and truth maintenance;
- Fast statement level provenance mode (SIDs).

Road map [3]:

- High-volume analytic query and SPARQL 1.1 query, including aggregations;
- SPARQL 1.1 Update, Property Paths, and Federation support;
- Simplified deployment, configuration, and administration for clusters; and
- High availability for the journal and the cluster.

Change log:

Note: Versions with (*) require data migration. For details, see [9].

1.0.6

- http://sourceforge.net/apps/trac/bigdata/ticket/473 (PhysicalAddressResolutionException after reopen using RWStore and recycler)

1.0.5

- http://sourceforge.net/apps/trac/bigdata/ticket/362 (Fix incompatible with log4j – slf4j bridge.)
- http://sourceforge.net/apps/trac/bigdata/ticket/440 (BTree can not be cast to Name2Addr)
- http://sourceforge.net/apps/trac/bigdata/ticket/453 (Releasing blob DeferredFree record)
- http://sourceforge.net/apps/trac/bigdata/ticket/467 (IllegalStateException trying to access lexicon index using RWStore with recycling)

1.0.4

- http://sourceforge.net/apps/trac/bigdata/ticket/443 (Logger for RWStore transaction service and recycler)
- http://sourceforge.net/apps/trac/bigdata/ticket/445 (RWStore does not track tx release correctly)
- http://sourceforge.net/apps/trac/bigdata/ticket/437 (Thread-local cache combined with unbounded thread pools causes effective memory leak: termCache memory leak & thread-local buffers)

1.0.3

- http://sourceforge.net/apps/trac/bigdata/ticket/217 (BTreeCounters does not track bytes released)
- http://sourceforge.net/apps/trac/bigdata/ticket/269 (Refactor performance counters using accessor interface)
- http://sourceforge.net/apps/trac/bigdata/ticket/329 (B+Tree should delete bloom filter when it is disabled.)
- http://sourceforge.net/apps/trac/bigdata/ticket/372 (RWStore does not prune the CommitRecordIndex)
- http://sourceforge.net/apps/trac/bigdata/ticket/375 (Persistent memory leaks (RWStore/DISK))
- http://sourceforge.net/apps/trac/bigdata/ticket/385 (FastRDFValueCoder2: ArrayIndexOutOfBoundsException)
- http://sourceforge.net/apps/trac/bigdata/ticket/391 (Release age advanced on WORM mode journal)
- http://sourceforge.net/apps/trac/bigdata/ticket/392 (Add a DELETE by access path method to the NanoSparqlServer)
- http://sourceforge.net/apps/trac/bigdata/ticket/393 (Add “context-uri” request parameter to specify the default context for INSERT in the REST API)
- http://sourceforge.net/apps/trac/bigdata/ticket/394 (log4j configuration error message in WAR deployment)
- http://sourceforge.net/apps/trac/bigdata/ticket/399 (Add a fast range count method to the REST API)
- http://sourceforge.net/apps/trac/bigdata/ticket/422 (Support temp triple store wrapped by a BigdataSail)
- http://sourceforge.net/apps/trac/bigdata/ticket/424 (NQuads support for NanoSparqlServer)
- http://sourceforge.net/apps/trac/bigdata/ticket/425 (Bug fix to DEFAULT_RDF_FORMAT for bulk data loader in scale-out)
- http://sourceforge.net/apps/trac/bigdata/ticket/426 (Support either lockfile (procmail) and dotlockfile (liblockfile1) in scale-out)
- http://sourceforge.net/apps/trac/bigdata/ticket/427 (BigdataSail#getReadOnlyConnection() race condition with concurrent commit)
- http://sourceforge.net/apps/trac/bigdata/ticket/435 (Address is 0L)
- http://sourceforge.net/apps/trac/bigdata/ticket/436 (TestMROWTransactions failure in CI)

1.0.2

- http://sourceforge.net/apps/trac/bigdata/ticket/32 (Query time expansion of (foo rdf:type rdfs:Resource) drags in SPORelation for scale-out.)
- http://sourceforge.net/apps/trac/bigdata/ticket/181 (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
- http://sourceforge.net/apps/trac/bigdata/ticket/356 (Query not terminated by error.)
- http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
- http://sourceforge.net/apps/trac/bigdata/ticket/361 (IRunningQuery not closed promptly.)
- http://sourceforge.net/apps/trac/bigdata/ticket/371 (DataLoader fails to load resources available from the classpath.)
- http://sourceforge.net/apps/trac/bigdata/ticket/376 (Support for the streaming of bigdata IBindingSets into a sparql query.)
- http://sourceforge.net/apps/trac/bigdata/ticket/378 (ClosedByInterruptException during heavy query mix.)
- http://sourceforge.net/apps/trac/bigdata/ticket/379 (NotSerializableException for SPOAccessPath.)
- http://sourceforge.net/apps/trac/bigdata/ticket/382 (Change dependencies to Apache River 2.2.0)

1.0.1 (*)

- http://sourceforge.net/apps/trac/bigdata/ticket/107 (Unicode clean schema names in the sparse row store).
- http://sourceforge.net/apps/trac/bigdata/ticket/124 (TermIdEncoder should use more bits for scale-out).
- http://sourceforge.net/apps/trac/bigdata/ticket/225 (OSX requires specialized performance counter collection classes).
- http://sourceforge.net/apps/trac/bigdata/ticket/348 (BigdataValueFactory.asValue() must return new instance when DummyIV is used).
- http://sourceforge.net/apps/trac/bigdata/ticket/349 (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance).
- http://sourceforge.net/apps/trac/bigdata/ticket/351 (SPO not Serializable exception in SIDS mode (scale-out)).
- http://sourceforge.net/apps/trac/bigdata/ticket/352 (ClassCastException when querying with binding-values that are not known to the database).
- http://sourceforge.net/apps/trac/bigdata/ticket/353 (UnsupportedOperatorException for some SPARQL queries).
- http://sourceforge.net/apps/trac/bigdata/ticket/355 (Query failure when comparing with non materialized value).
- http://sourceforge.net/apps/trac/bigdata/ticket/357 (RWStore reports “FixedAllocator returning null address, with freeBits”.)
- http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
- http://sourceforge.net/apps/trac/bigdata/ticket/362 (log4j – slf4j bridge.)

For more information about bigdata, please see the following links:

[1] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page
[2] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted
[3] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Roadmap
[4] http://www.bigdata.com/bigdata/docs/api/
[5] http://sourceforge.net/projects/bigdata/
[6] http://www.bigdata.com/blog
[7] http://www.systap.com/bigdata.htm
[8] http://sourceforge.net/projects/bigdata/files/bigdata/
[9] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=DataMigration
