This is a major version release of bigdata(R). Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF, capable of loading 1B triples in under one hour on a 15-node cluster. Bigdata operates in both a single-machine mode (Journal) and a cluster mode (Federation). The Journal provides fast, scalable ACID indexed storage for very large data sets, up to 50 billion triples/quads. The Federation provides fast, scalable, shard-wise parallel indexed storage using dynamic sharding, shard-wise ACID updates, and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead of operating a cluster is an acceptable tradeoff for essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single-machine RDF database. For custom development and cluster installations, we recommend checking out the code from SVN using the tag for this release. The code will build automatically under Eclipse. You can also build the code using the Ant script; the cluster installer requires the Ant script. (A minimal sketch of embedded use appears after the download links below.)

You can download the WAR from:

http://sourceforge.net/projects/bigdata/

You can check out this release from:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/BIGDATA_RELEASE_1_1_0
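
For custom development against the embedded database, the following is a minimal sketch of opening a Journal-backed repository through the Sesame (openrdf) API. The BigdataSail and BigdataSailRepository classes are part of the bigdata Sail package; the journal-file property name is an assumption that should be verified against the javadoc [4].

import java.util.Properties;

import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;

import com.bigdata.rdf.sail.BigdataSail;
import com.bigdata.rdf.sail.BigdataSailRepository;

public class EmbeddedJournalExample {

    public static void main(final String[] args) throws Exception {

        final Properties props = new Properties();
        // Assumed property name for the backing file of the single-machine Journal.
        props.setProperty("com.bigdata.journal.AbstractJournal.file", "bigdata.jnl");

        final BigdataSail sail = new BigdataSail(props);
        final BigdataSailRepository repo = new BigdataSailRepository(sail);
        repo.initialize();
        try {
            final RepositoryConnection cxn = repo.getConnection();
            try {
                final TupleQueryResult result = cxn.prepareTupleQuery(
                        QueryLanguage.SPARQL,
                        "SELECT * WHERE { ?s ?p ?o } LIMIT 10").evaluate();
                try {
                    while (result.hasNext()) {
                        System.out.println(result.next());
                    }
                } finally {
                    result.close();
                }
            } finally {
                cxn.close();
            }
        } finally {
            repo.shutDown();
        }
    }
}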

New features:

- Fast, scalable native support for SPARQL 1.1 analytic queries;
- 100% Java memory manager that leverages the JVM native heap (no GC);
- New extensible hash tree (HTree) index structure.

Feature summary:

- Single machine data storage to ~50B triples/quads (RWStore);
- Clustered data storage is essentially unlimited;
- Simple embedded and/or webapp deployment (NanoSparqlServer);
- Triples, quads, or triples with provenance (SIDs);
- Fast 100% native SPARQL 1.0 evaluation;
- Integrated “analytic” query package;
- Fast RDFS+ inference and truth maintenance;
- Fast statement-level provenance mode (SIDs).

Road map [3]:

- Simplified deployment, configuration, and administration for clusters; and
- High availability for the journal and the cluster.

Change log:

Note: Versions with (*) require data migration. For details, see [9].

1.1.0 (*)

- http://sourceforge.net/apps/trac/bigdata/ticket/23 (Lexicon joins)
- http://sourceforge.net/apps/trac/bigdata/ticket/109 (Store large literals as “blobs”)
- http://sourceforge.net/apps/trac/bigdata/ticket/181 (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
- http://sourceforge.net/apps/trac/bigdata/ticket/203 (Implement a persistence-capable hash table to support analytic query)
- http://sourceforge.net/apps/trac/bigdata/ticket/209 (AccessPath should visit binding sets rather than elements for high level query.)
- http://sourceforge.net/apps/trac/bigdata/ticket/227 (SliceOp appears to be necessary when operator plan should suffice without)
- http://sourceforge.net/apps/trac/bigdata/ticket/232 (Bottom-up evaluation semantics)
- http://sourceforge.net/apps/trac/bigdata/ticket/246 (Derived xsd numeric data types must be inlined as extension types.)
- http://sourceforge.net/apps/trac/bigdata/ticket/254 (Revisit pruning of intermediate variable bindings during query execution)
- http://sourceforge.net/apps/trac/bigdata/ticket/261 (Lift conditions out of subqueries.)
- http://sourceforge.net/apps/trac/bigdata/ticket/300 (Native ORDER BY)
- http://sourceforge.net/apps/trac/bigdata/ticket/324 (Inline predeclared URIs and namespaces in 2-3 bytes)
- http://sourceforge.net/apps/trac/bigdata/ticket/330 (NanoSparqlServer does not locate “html” resources when run from jar)
- http://sourceforge.net/apps/trac/bigdata/ticket/334 (Support inlining of unicode data in the statement indices.)
- http://sourceforge.net/apps/trac/bigdata/ticket/364 (Scalable default graph evaluation)
- http://sourceforge.net/apps/trac/bigdata/ticket/368 (Prune variable bindings during query evaluation)
- http://sourceforge.net/apps/trac/bigdata/ticket/370 (Direct translation of openrdf AST to bigdata AST)
- http://sourceforge.net/apps/trac/bigdata/ticket/373 (Fix StrBOp and other IValueExpressions)
- http://sourceforge.net/apps/trac/bigdata/ticket/377 (Optimize OPTIONALs with multiple statement patterns.)
- http://sourceforge.net/apps/trac/bigdata/ticket/380 (Native SPARQL evaluation on cluster)
- http://sourceforge.net/apps/trac/bigdata/ticket/387 (Cluster does not compute closure)
- http://sourceforge.net/apps/trac/bigdata/ticket/395 (HTree hash join performance)
- http://sourceforge.net/apps/trac/bigdata/ticket/401 (inline xsd:unsigned datatypes)
- http://sourceforge.net/apps/trac/bigdata/ticket/408 (xsd:string cast fails for non-numeric data)
- http://sourceforge.net/apps/trac/bigdata/ticket/421 (New query hints model.)
- http://sourceforge.net/apps/trac/bigdata/ticket/431 (Use of read-only tx per query defeats cache on cluster)

1.0.3

- http://sourceforge.net/apps/trac/bigdata/ticket/217 (BTreeCounters does not track bytes released)
- http://sourceforge.net/apps/trac/bigdata/ticket/269 (Refactor performance counters using accessor interface)
- http://sourceforge.net/apps/trac/bigdata/ticket/329 (B+Tree should delete bloom filter when it is disabled.)
- http://sourceforge.net/apps/trac/bigdata/ticket/372 (RWStore does not prune the CommitRecordIndex)
- http://sourceforge.net/apps/trac/bigdata/ticket/375 (Persistent memory leaks (RWStore/DISK))
- http://sourceforge.net/apps/trac/bigdata/ticket/385 (FastRDFValueCoder2: ArrayIndexOutOfBoundsException)
- http://sourceforge.net/apps/trac/bigdata/ticket/391 (Release age advanced on WORM mode journal)
- http://sourceforge.net/apps/trac/bigdata/ticket/392 (Add a DELETE by access path method to the NanoSparqlServer)
- http://sourceforge.net/apps/trac/bigdata/ticket/393 (Add “context-uri” request parameter to specify the default context for INSERT in the REST API)
- http://sourceforge.net/apps/trac/bigdata/ticket/394 (log4j configuration error message in WAR deployment)
- http://sourceforge.net/apps/trac/bigdata/ticket/399 (Add a fast range count method to the REST API)
- http://sourceforge.net/apps/trac/bigdata/ticket/422 (Support temp triple store wrapped by a BigdataSail)
- http://sourceforge.net/apps/trac/bigdata/ticket/424 (NQuads support for NanoSparqlServer)
- http://sourceforge.net/apps/trac/bigdata/ticket/425 (Bug fix to DEFAULT_RDF_FORMAT for bulk data loader in scale-out)
- http://sourceforge.net/apps/trac/bigdata/ticket/426 (Support either lockfile (procmail) or dotlockfile (liblockfile1) in scale-out)
- http://sourceforge.net/apps/trac/bigdata/ticket/427 (BigdataSail#getReadOnlyConnection() race condition with concurrent commit)
- http://sourceforge.net/apps/trac/bigdata/ticket/435 (Address is 0L)
- http://sourceforge.net/apps/trac/bigdata/ticket/436 (TestMROWTransactions failure in CI)

1.0.2

- http://sourceforge.net/apps/trac/bigdata/ticket/32 (Query time expansion of (foo rdf:type rdfs:Resource) drags in SPORelation for scale-out.)
- http://sourceforge.net/apps/trac/bigdata/ticket/181 (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
- http://sourceforge.net/apps/trac/bigdata/ticket/356 (Query not terminated by error.)
- http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
- http://sourceforge.net/apps/trac/bigdata/ticket/361 (IRunningQuery not closed promptly.)
- http://sourceforge.net/apps/trac/bigdata/ticket/371 (DataLoader fails to load resources available from the classpath.)
- http://sourceforge.net/apps/trac/bigdata/ticket/376 (Support for the streaming of bigdata IBindingSets into a sparql query.)
- http://sourceforge.net/apps/trac/bigdata/ticket/378 (ClosedByInterruptException during heavy query mix.)
- http://sourceforge.net/apps/trac/bigdata/ticket/379 (NotSerializableException for SPOAccessPath.)
- http://sourceforge.net/apps/trac/bigdata/ticket/382 (Change dependencies to Apache River 2.2.0)

1.0.1 (*)

- http://sourceforge.net/apps/trac/bigdata/ticket/107 (Unicode clean schema names in the sparse row store)
- http://sourceforge.net/apps/trac/bigdata/ticket/124 (TermIdEncoder should use more bits for scale-out)
- http://sourceforge.net/apps/trac/bigdata/ticket/225 (OSX requires specialized performance counter collection classes)
- http://sourceforge.net/apps/trac/bigdata/ticket/348 (BigdataValueFactory.asValue() must return new instance when DummyIV is used)
- http://sourceforge.net/apps/trac/bigdata/ticket/349 (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance)
- http://sourceforge.net/apps/trac/bigdata/ticket/351 (SPO not Serializable exception in SIDS mode (scale-out))
- http://sourceforge.net/apps/trac/bigdata/ticket/352 (ClassCastException when querying with binding-values that are not known to the database)
- http://sourceforge.net/apps/trac/bigdata/ticket/353 (UnsupportedOperatorException for some SPARQL queries)
- http://sourceforge.net/apps/trac/bigdata/ticket/355 (Query failure when comparing with non-materialized value)
- http://sourceforge.net/apps/trac/bigdata/ticket/357 (RWStore reports “FixedAllocator returning null address, with freeBits”.)
- http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
- http://sourceforge.net/apps/trac/bigdata/ticket/362 (log4j - slf4j bridge)

For more information about bigdata(R), please see the following links:

[1] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page
[2] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted
[3] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Roadmap
[4] http://www.bigdata.com/bigdata/docs/api/
[5] http://sourceforge.net/projects/bigdata/
[6] http://www.bigdata.com/blog
[7] http://www.systap.com/bigdata.htm
[8] http://sourceforge.net/projects/bigdata/files/bigdata/
[9] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=DataMigration

About bigdata:

Starting with our 1.1 release, bigdata includes an optional “analytic query mode”. Enabling analytic query turns on support for the MemoryManager and the HTree and allows bigdata to scale to 4TB of data on the native process heap with zero GC overhead. In the future it will also turn on the runtime query optimizer (RTO). You can get a preview of this release by checking out the TERMS_REFACTOR_BRANCH from SVN.

Someone recently wrote about a query that was doing a SELECT DISTINCT over a large number of solutions. They suspected a memory leak, since the JVM eventually crashed with an OutOfMemoryError. However, this is not really a memory leak. Here is what is going on:

Heap pressure and the JVM

The problem is the Java architecture for managed memory: everything a query buffers on the object heap must eventually be scanned by the garbage collector. What you need to do for this query (and others like it) is turn on the “analytic” mode for bigdata. The easiest way to do this is to check the:

[x] analytic

option on the NanoSparqlServer’s SPARQL query form page. If you are using the NanoSparqlServer, you can also specify the URL query parameter ...&analytic=true. Finally, you can enable this with a magic triple directly in the SPARQL query, for example:

SELECT DISTINCT ?s
WHERE {
  # This magic triple is consumed as a query hint; it is not matched against data.
  hint:Query hint:analytic "true" .
  ?s ?p ?o .
}

Just put that triple somewhere in the WHERE clause and the query will run with the “analytic” options enabled. You do not need to declare the “hint:” prefix, but if you want to, the namespace is “http://www.bigdata.com/queryHints#”.
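
The same hint works over the REST API. Below is a minimal client sketch that runs a query against a local NanoSparqlServer with the analytic=true URL parameter described above; the host, port, and /sparql context path are assumptions for a default local deployment and should be adjusted to match yours.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class AnalyticQueryClient {

    public static void main(final String[] args) throws Exception {

        final String query = "SELECT DISTINCT ?s WHERE { ?s ?p ?o }";

        // Assumed endpoint of a local NanoSparqlServer deployment.
        final URL url = new URL("http://localhost:8080/sparql?analytic=true&query="
                + URLEncoder.encode(query, "UTF-8"));

        final HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/sparql-results+xml");

        final BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            reader.close();
            conn.disconnect();
        }
    }
}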

What the analytic query mode does for you is buffer the data on the native process heap rather than on the JVM object heap. This reduces the GC overhead associated with the query to essentially zero. It performs this feat entirely within Java by leveraging the java.nio package.
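
For readers unfamiliar with the mechanism, java.nio “direct” buffers allocate their storage on the native process heap, where the garbage collector never scans it. The following is a minimal illustration of that underlying JVM facility (it is not bigdata’s MemoryManager code):

import java.nio.ByteBuffer;

public class DirectBufferDemo {

    public static void main(final String[] args) {

        // allocateDirect() takes storage from the native process heap; the JVM
        // object heap holds only the small ByteBuffer wrapper, so the data in
        // the buffer contributes nothing to the GC workload.
        final ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // 1 MiB

        buf.putLong(42L);
        buf.flip();
        System.out.println(buf.getLong()); // prints 42
    }
}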

There are analytic and non-analytic versions of all the join, DISTINCT, etc. operators. The analytic versions use the MemoryManager and the HTree; the non-analytic versions use Java collection classes. The Java collection classes are somewhat faster as long as you are not materializing a lot of data on the Java object heap: for example, for the BSBM “explore” use case the Java operators are about 10% faster overall. DISTINCT is a special case. The Java version of that operator uses a ConcurrentHashMap under the covers and can give you much higher concurrency within the query. But if you are trying to DISTINCT a large number of solutions, you are going to run into trouble with the garbage collector.
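
To make the DISTINCT trade-off concrete, the sketch below shows the general shape of a JVM-heap DISTINCT filter built on a ConcurrentHashMap. This is an illustration of the failure mode, not bigdata’s actual operator: every distinct solution remains strongly reachable until the query completes.

import java.util.concurrent.ConcurrentHashMap;

public class HeapDistinctSketch {

    // Sketch only: the real operator hashes full binding sets, not strings.
    private final ConcurrentHashMap<String, Boolean> seen =
            new ConcurrentHashMap<String, Boolean>();

    /** Returns true the first time a given solution is observed. */
    public boolean accept(final String solution) {
        // putIfAbsent() gives high concurrency across query threads, but every
        // distinct solution stays pinned on the JVM object heap until the query
        // ends; with enough solutions, GC overhead dominates and the JVM dies.
        return seen.putIfAbsent(solution, Boolean.TRUE) == null;
    }
}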

Bottom line: if you are going to materialize a LOT of data, you need to use the analytic operators. Those operators will scale to 4TB of RAM. If you try to materialize a lot of data on the Java object heap instead, you will run into heavy GC overhead; the application will slow to a crawl and then die with “java.lang.OutOfMemoryError: GC overhead limit exceeded”.