This is a 1.0.x maintenance release of bigdata(R). New users are encouraged to go directly to the 1.1.0 release. Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF capable of loading 1B triples in under one hour on a 15 node cluster. Bigdata operates in both a single machine mode (Journal) and a cluster mode (Federation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion triples / quads. The federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead associated with operating a cluster is an acceptable tradeoff to have essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single machine RDF database. For custom development and cluster installations we recommend checking out the code from SVN using the tag for this release. The code will build automatically under eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script.

You can download the WAR from:

http://sourceforge.net/projects/bigdata/

You can checkout this release from:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/BIGDATA_RELEASE_1_0_4

Feature summary:

- Single machine data storage to ~50B triples/quads (RWStore);
- Clustered data storage is essentially unlimited;
- Simple embedded and/or webapp deployment (NanoSparqlServer);
- Triples, quads, or triples with provenance (SIDs);
- 100% native SPARQL 1.0 evaluation with lots of query optimizations;
- Fast RDFS+ inference and truth maintenance;
- Fast statement level provenance mode (SIDs).

Road map [3]:

- High-volume analytic query and SPARQL 1.1 query, including aggregations;
- SPARQL 1.1 Update, Property Paths, and Federation support;
- Simplified deployment, configuration, and administration for clusters; and
- High availability for the journal and the cluster.

Change log:

Note: Versions with (*) require data migration. For details, see [9].

1.0.4

- http://sourceforge.net/apps/trac/bigdata/ticket/443 (Logger for RWStore transaction service and recycler)
- http://sourceforge.net/apps/trac/bigdata/ticket/445 (RWStore does not track tx release correctly)
- http://sourceforge.net/apps/trac/bigdata/ticket/437 (Thread-local cache combined with unbounded thread pools causes effective memory leak: termCache memory leak & thread-local buffers)

1.0.3

– http://sourceforge.net/apps/trac/bigdata/ticket/217 (BTreeCounters does not track bytes released)
– http://sourceforge.net/apps/trac/bigdata/ticket/269 (Refactor performance counters using accessor interface)
– http://sourceforge.net/apps/trac/bigdata/ticket/329 (B+Tree should delete bloom filter when it is disabled.)
– http://sourceforge.net/apps/trac/bigdata/ticket/372 (RWStore does not prune the CommitRecordIndex)
– http://sourceforge.net/apps/trac/bigdata/ticket/375 (Persistent memory leaks (RWStore/DISK))
– http://sourceforge.net/apps/trac/bigdata/ticket/385 (FastRDFValueCoder2: ArrayIndexOutOfBoundsException)
– http://sourceforge.net/apps/trac/bigdata/ticket/391 (Release age advanced on WORM mode journal)
– http://sourceforge.net/apps/trac/bigdata/ticket/392 (Add a DELETE by access path method to the NanoSparqlServer)
– http://sourceforge.net/apps/trac/bigdata/ticket/393 (Add “context-uri” request parameter to specify the default context for INSERT in the REST API)
– http://sourceforge.net/apps/trac/bigdata/ticket/394 (log4j configuration error message in WAR deployment)
– http://sourceforge.net/apps/trac/bigdata/ticket/399 (Add a fast range count method to the REST API)
– http://sourceforge.net/apps/trac/bigdata/ticket/422 (Support temp triple store wrapped by a BigdataSail)
– http://sourceforge.net/apps/trac/bigdata/ticket/424 (NQuads support for NanoSparqlServer)
– http://sourceforge.net/apps/trac/bigdata/ticket/425 (Bug fix to DEFAULT_RDF_FORMAT for bulk data loader in scale-out)
– http://sourceforge.net/apps/trac/bigdata/ticket/426 (Support either lockfile (procmail) and dotlockfile (liblockfile1) in scale-out)
– http://sourceforge.net/apps/trac/bigdata/ticket/427 (BigdataSail#getReadOnlyConnection() race condition with concurrent commit)
– http://sourceforge.net/apps/trac/bigdata/ticket/435 (Address is 0L)
– http://sourceforge.net/apps/trac/bigdata/ticket/436 (TestMROWTransactions failure in CI)

1.0.2

– http://sourceforge.net/apps/trac/bigdata/ticket/32 (Query time expansion of (foo rdf:type rdfs:Resource) drags in SPORelation for scale-out.)
– http://sourceforge.net/apps/trac/bigdata/ticket/181 (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
– http://sourceforge.net/apps/trac/bigdata/ticket/356 (Query not terminated by error.)
– http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– http://sourceforge.net/apps/trac/bigdata/ticket/361 (IRunningQuery not closed promptly.)
– http://sourceforge.net/apps/trac/bigdata/ticket/371 (DataLoader fails to load resources available from the classpath.)
– http://sourceforge.net/apps/trac/bigdata/ticket/376 (Support for the streaming of bigdata IBindingSets into a sparql query.)
– http://sourceforge.net/apps/trac/bigdata/ticket/378 (ClosedByInterruptException during heavy query mix.)
– http://sourceforge.net/apps/trac/bigdata/ticket/379 (NotSerializableException for SPOAccessPath.)
– http://sourceforge.net/apps/trac/bigdata/ticket/382 (Change dependencies to Apache River 2.2.0)

1.0.1 (*)

– http://sourceforge.net/apps/trac/bigdata/ticket/107 (Unicode clean schema names in the sparse row store).
– http://sourceforge.net/apps/trac/bigdata/ticket/124 (TermIdEncoder should use more bits for scale-out).
– http://sourceforge.net/apps/trac/bigdata/ticket/225 (OSX requires specialized performance counter collection classes).
– http://sourceforge.net/apps/trac/bigdata/ticket/348 (BigdataValueFactory.asValue() must return new instance when DummyIV is used).
– http://sourceforge.net/apps/trac/bigdata/ticket/349 (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance).
– http://sourceforge.net/apps/trac/bigdata/ticket/351 (SPO not Serializable exception in SIDS mode (scale-out)).
– http://sourceforge.net/apps/trac/bigdata/ticket/352 (ClassCastException when querying with binding-values that are not known to the database).
– http://sourceforge.net/apps/trac/bigdata/ticket/353 (UnsupportedOperatorException for some SPARQL queries).
– http://sourceforge.net/apps/trac/bigdata/ticket/355 (Query failure when comparing with non materialized value).
– http://sourceforge.net/apps/trac/bigdata/ticket/357 (RWStore reports “FixedAllocator returning null address, with freeBits”.)
– http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– http://sourceforge.net/apps/trac/bigdata/ticket/362 (log4j – slf4j bridge.)

For more information about bigdata, please see the following links:

[1] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page
[2] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted
[3] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Roadmap
[4] http://www.bigdata.com/bigdata/docs/api/
[5] http://sourceforge.net/projects/bigdata/
[6] http://www.bigdata.com/blog
[7] http://www.systap.com/bigdata.htm
[8] http://sourceforge.net/projects/bigdata/files/bigdata/
[9] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=DataMigration

About bigdata:

Bigdata® is a horizontally-scaled, general purpose storage and computing fabric
for ordered data (B+Trees), designed to operate on either a single server or a
cluster of commodity hardware. Bigdata® uses dynamically partitioned key-range
shards in order to remove any realistic scaling limits – in principle, bigdata®
may be deployed on 10s, 100s, or even thousands of machines and new capacity may
be added incrementally without requiring the full reload of all data. The bigdata®
RDF database supports RDFS and OWL Lite reasoning, high-level query (SPARQL),
and datum level provenance.

Ganglia is one of the first and most scalable performance metric platforms (known scaling up to 2000 nodes) and also handles aggregations of clusters (grids). There has also been a lot of work recently on improving the web facing UIs for graphing performance metrics. The data are collected on each host by gmond. By default, each gmond instance discovers and listens to all other hosts on the network using UDP multicast and builds up a soft state model of the metrics for the entire cluster. A gmetad instance running on one (or more) nodes joins the ganglia network, aggregates the metrics, and writes them onto RRDtool files. A web UI may then be used to graph the performance metrics. In the new web UI, this is done using both RRDtool and flot (flot is used for interactive graphics). In addition to performance metric collection, there are a variety of hooks which can be used to integrate ganglia with nagios so you can integrate performance metrics into your monitoring solution as well.

Bigdata has long had an internal hierarchical metrics collection mechanism. Metrics are reported out using XML on both an embedded HTTP server (per service) and aggregated and reported out by the load balancer service (LBS). The aggregated metrics are available for plotting and sample Excel worksheets are available which show you how to do exactly that. You can also run scripts over the historical metrics data to obtain detailed histories of performance metrics. There are pluses and minuses to this approach. On the plus side, bigdata retains detailed performance metrics, collects a LOT of data on disk, ram, cpu and application metrics, and provides some sophisticated table oriented views of those metrics for plotting. Bigdata also collects “events”, which can be graphed by the LBS using flot. On the down side, the Excel “integration” is a clunky so we have been looking around for a way to leverage work that other people have done on metrics collection and reporting.

We’ve recently put together an embedded GangliaService for Java under the Apache 2.0 license. Unlike the other efforts we could find (hadoop-commons, embedded-ganglia), the GangliaService is a full ganglia peer. It both listens to the ganglia 3.1 protocol and reports out metrics from the application and/or host using the same protocol. It could even be used to run ganglia on operating systems where there is no existing port (for example, on Windows where we have a typeperf integration which collects platform level statistics). Since the GangliaService is a ganglia peer, it maintains the soft state of the metrics for the entire cluster and can produce load balanced report for the cluster just as if you were using gstat. However, the load balanced reports produced by the GangliaService can be customized to use different metrics and scoring routines and can be directly accessed and leveraged by the Java application.

The main entry point is GangliaService. It is trivial to setup with defaults and you can easily register your own metrics collection classes to report out on your application.

GangliaServer service = new GangliaService("MyService");
// Register to collect metrics.
service.addMetricCollector(new MyMetricsCollector());
// Join the ganglia network; Start collecting and reporting metrics.
service.run();

The following will return the default load balanced report, which contains exactly the same information that you would get from gstat -a. You can also use an alternative method signature to get a report based on your own list of metrics and/or have the report sorted by the metric (or even a synthetic metric) of your choice.

IHostReport[] hostReport = service.getHostReport();

The bigdata-ganglia module is available in the 1.1.0 branch in SVN. The bigdata-ganglia JAR is included in the bigdata 1.1.1 release. You can also get the bigdata-ganglia source and JAR here.

This version does not yet include the typeperf, vmstat, pidstat and similar host and service level collection modules. We plan to refactor those out of the core bigdata code base soon, but they currently collect and report using bigdata’s internal hierarchical counter set data model rather than the flatter ganglia metrics data model.

The GangliaService is not yet integrated in bigdata 1.1.1, but we will be taking that step shortly as part of our effort to simplify the deployment and management of a bigdata federation. First, we will use the GangliaService to replace the centralized LBS with a ganglia network. We have a bit of work to do still since bigdata also reports events to the LBS and uses a custom load balanced metric. However, ganglia is beginning to offer support for JSON based event descriptions which can then be overlaid on the generated graphs. Beyond replacing the LBS, we plan to leverage this effort again when we decompose the MetadataService (aka MDS aka shard locator) into a P2P service running on the DataService (DS) nodes. This will offer significantly better scaling and will remove one of the main barriers to petabyte scale for bigdata.

This is a major version release of bigdata(R). Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF capable of loading 1B triples in under one hour on a 15 node cluster. Bigdata operates in both a single machine mode (Journal) and a cluster mode (Federation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion triples / quads. The federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead associated with operating a cluster is an acceptable tradeoff to have essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single machine RDF database. For custom development and cluster installations we recommend checking out the code from SVN using the tag for this release. The code will build automatically under eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script.

You can download the WAR from:

http://sourceforge.net/projects/bigdata/

You can checkout this release from:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/BIGDATA_RELEASE_1_1_0

New features:

- Fast, scalable native support for SPARQL 1.1 analytic queries;
- %100 Java memory manager leverages the JVM native heap (no GC);
- New extensible hash tree index structure;

Feature summary:

- Single machine data storage to ~50B triples/quads (RWStore);
- Clustered data storage is essentially unlimited;
- Simple embedded and/or webapp deployment (NanoSparqlServer);
- Triples, quads, or triples with provenance (SIDs);
- Fast 100% native SPARQL 1.0 evaluation;
- Integrated “analytic” query package;
- Fast RDFS+ inference and truth maintenance;
- Fast statement level provenance mode (SIDs).

Road map [3]:

- Simplified deployment, configuration, and administration for clusters; and
- High availability for the journal and the cluster.

Change log:

Note: Versions with (*) require data migration. For details, see [9].

1.1.0 (*)

– http://sourceforge.net/apps/trac/bigdata/ticket/23 (Lexicon joins)
– http://sourceforge.net/apps/trac/bigdata/ticket/109 (Store large literals as “blobs”)
– http://sourceforge.net/apps/trac/bigdata/ticket/181 (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
– http://sourceforge.net/apps/trac/bigdata/ticket/203 (Implement an persistence capable hash table to support analytic query)
– http://sourceforge.net/apps/trac/bigdata/ticket/209 (AccessPath should visit binding sets rather than elements for high level query.)
– http://sourceforge.net/apps/trac/bigdata/ticket/227 (SliceOp appears to be necessary when operator plan should suffice without)
– http://sourceforge.net/apps/trac/bigdata/ticket/232 (Bottom-up evaluation semantics).
– http://sourceforge.net/apps/trac/bigdata/ticket/246 (Derived xsd numeric data types must be inlined as extension types.)
– http://sourceforge.net/apps/trac/bigdata/ticket/254 (Revisit pruning of intermediate variable bindings during query execution)
– http://sourceforge.net/apps/trac/bigdata/ticket/261 (Lift conditions out of subqueries.)
– http://sourceforge.net/apps/trac/bigdata/ticket/300 (Native ORDER BY)
– http://sourceforge.net/apps/trac/bigdata/ticket/324 (Inline predeclared URIs and namespaces in 2-3 bytes)
– http://sourceforge.net/apps/trac/bigdata/ticket/330 (NanoSparqlServer does not locate “html” resources when run from jar)
– http://sourceforge.net/apps/trac/bigdata/ticket/334 (Support inlining of unicode data in the statement indices.)
– http://sourceforge.net/apps/trac/bigdata/ticket/364 (Scalable default graph evaluation)
– http://sourceforge.net/apps/trac/bigdata/ticket/368 (Prune variable bindings during query evaluation)
– http://sourceforge.net/apps/trac/bigdata/ticket/370 (Direct translation of openrdf AST to bigdata AST)
– http://sourceforge.net/apps/trac/bigdata/ticket/373 (Fix StrBOp and other IValueExpressions)
– http://sourceforge.net/apps/trac/bigdata/ticket/377 (Optimize OPTIONALs with multiple statement patterns.)
– http://sourceforge.net/apps/trac/bigdata/ticket/380 (Native SPARQL evaluation on cluster)
– http://sourceforge.net/apps/trac/bigdata/ticket/387 (Cluster does not compute closure)
– http://sourceforge.net/apps/trac/bigdata/ticket/395 (HTree hash join performance)
– http://sourceforge.net/apps/trac/bigdata/ticket/401 (inline xsd:unsigned datatypes)
– http://sourceforge.net/apps/trac/bigdata/ticket/408 (xsd:string cast fails for non-numeric data)
– http://sourceforge.net/apps/trac/bigdata/ticket/421 (New query hints model.)
– http://sourceforge.net/apps/trac/bigdata/ticket/431 (Use of read-only tx per query defeats cache on cluster)

1.0.3

– http://sourceforge.net/apps/trac/bigdata/ticket/217 (BTreeCounters does not track bytes released)
– http://sourceforge.net/apps/trac/bigdata/ticket/269 (Refactor performance counters using accessor interface)
– http://sourceforge.net/apps/trac/bigdata/ticket/329 (B+Tree should delete bloom filter when it is disabled.)
– http://sourceforge.net/apps/trac/bigdata/ticket/372 (RWStore does not prune the CommitRecordIndex)
– http://sourceforge.net/apps/trac/bigdata/ticket/375 (Persistent memory leaks (RWStore/DISK))
– http://sourceforge.net/apps/trac/bigdata/ticket/385 (FastRDFValueCoder2: ArrayIndexOutOfBoundsException)
– http://sourceforge.net/apps/trac/bigdata/ticket/391 (Release age advanced on WORM mode journal)
– http://sourceforge.net/apps/trac/bigdata/ticket/392 (Add a DELETE by access path method to the NanoSparqlServer)
– http://sourceforge.net/apps/trac/bigdata/ticket/393 (Add “context-uri” request parameter to specify the default context for INSERT in the REST API)
– http://sourceforge.net/apps/trac/bigdata/ticket/394 (log4j configuration error message in WAR deployment)
– http://sourceforge.net/apps/trac/bigdata/ticket/399 (Add a fast range count method to the REST API)
– http://sourceforge.net/apps/trac/bigdata/ticket/422 (Support temp triple store wrapped by a BigdataSail)
– http://sourceforge.net/apps/trac/bigdata/ticket/424 (NQuads support for NanoSparqlServer)
– http://sourceforge.net/apps/trac/bigdata/ticket/425 (Bug fix to DEFAULT_RDF_FORMAT for bulk data loader in scale-out)
– http://sourceforge.net/apps/trac/bigdata/ticket/426 (Support either lockfile (procmail) and dotlockfile (liblockfile1) in scale-out)
– http://sourceforge.net/apps/trac/bigdata/ticket/427 (BigdataSail#getReadOnlyConnection() race condition with concurrent commit)
– http://sourceforge.net/apps/trac/bigdata/ticket/435 (Address is 0L)
– http://sourceforge.net/apps/trac/bigdata/ticket/436 (TestMROWTransactions failure in CI)

1.0.2

– http://sourceforge.net/apps/trac/bigdata/ticket/32 (Query time expansion of (foo rdf:type rdfs:Resource) drags in SPORelation for scale-out.)
– http://sourceforge.net/apps/trac/bigdata/ticket/181 (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
– http://sourceforge.net/apps/trac/bigdata/ticket/356 (Query not terminated by error.)
– http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– http://sourceforge.net/apps/trac/bigdata/ticket/361 (IRunningQuery not closed promptly.)
– http://sourceforge.net/apps/trac/bigdata/ticket/371 (DataLoader fails to load resources available from the classpath.)
– http://sourceforge.net/apps/trac/bigdata/ticket/376 (Support for the streaming of bigdata IBindingSets into a sparql query.)
– http://sourceforge.net/apps/trac/bigdata/ticket/378 (ClosedByInterruptException during heavy query mix.)
– http://sourceforge.net/apps/trac/bigdata/ticket/379 (NotSerializableException for SPOAccessPath.)
– http://sourceforge.net/apps/trac/bigdata/ticket/382 (Change dependencies to Apache River 2.2.0)

1.0.1 (*)

– http://sourceforge.net/apps/trac/bigdata/ticket/107 (Unicode clean schema names in the sparse row store).
– http://sourceforge.net/apps/trac/bigdata/ticket/124 (TermIdEncoder should use more bits for scale-out).
– http://sourceforge.net/apps/trac/bigdata/ticket/225 (OSX requires specialized performance counter collection classes).
– http://sourceforge.net/apps/trac/bigdata/ticket/348 (BigdataValueFactory.asValue() must return new instance when DummyIV is used).
– http://sourceforge.net/apps/trac/bigdata/ticket/349 (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance).
– http://sourceforge.net/apps/trac/bigdata/ticket/351 (SPO not Serializable exception in SIDS mode (scale-out)).
– http://sourceforge.net/apps/trac/bigdata/ticket/352 (ClassCastException when querying with binding-values that are not known to the database).
– http://sourceforge.net/apps/trac/bigdata/ticket/353 (UnsupportedOperatorException for some SPARQL queries).
– http://sourceforge.net/apps/trac/bigdata/ticket/355 (Query failure when comparing with non materialized value).
– http://sourceforge.net/apps/trac/bigdata/ticket/357 (RWStore reports “FixedAllocator returning null address, with freeBits”.)
– http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– http://sourceforge.net/apps/trac/bigdata/ticket/362 (log4j – slf4j bridge.)

For more information about bigdata(R), please see the following links:

[1] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page
[2] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted
[3] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Roadmap
[4] http://www.bigdata.com/bigdata/docs/api/
[5] http://sourceforge.net/projects/bigdata/
[6] http://www.bigdata.com/blog
[7] http://www.systap.com/bigdata.htm
[8] http://sourceforge.net/projects/bigdata/files/bigdata/
[9] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=DataMigration

About bigdata:

Bigdata® is a horizontally-scaled, general purpose storage and computing fabric for ordered data (B+Trees), designed to operate on either a single server or a cluster of commodity hardware. Bigdata® uses dynamically partitioned key-range shards in order to remove any realistic scaling limits – in principle, bigdata® may be deployed on 10s, 100s, or even thousands of machines and new capacity may be added incrementally without requiring the full reload of all data. The bigdata® RDF database supports RDFS and OWL Lite reasoning, high-level query (SPARQL), and datum level provenance.

This is a 1.0.x maintenance release of bigdata(R). New users are encouraged to go directly to the 1.1.0 release. Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF capable of loading 1B triples in under one hour on a 15 node cluster. Bigdata operates in both a single machine mode (Journal) and a cluster mode (Federation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion triples / quads. The federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead associated with operating a cluster is an acceptable tradeoff to have essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single machine RDF database. For custom development and cluster installations we recommend checking out the code from SVN using the tag for this release. The code will build automatically under eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script.

You can download the WAR from:

http://sourceforge.net/projects/bigdata/

You can checkout this release from:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/BIGDATA_RELEASE_1_0_3

Feature summary:

- Single machine data storage to ~50B triples/quads (RWStore);
- Clustered data storage is essentially unlimited;
- Simple embedded and/or webapp deployment (NanoSparqlServer);
- Triples, quads, or triples with provenance (SIDs);
- 100% native SPARQL 1.0 evaluation with lots of query optimizations;
- Fast RDFS+ inference and truth maintenance;
- Fast statement level provenance mode (SIDs).

Road map [3]:

- High-volume analytic query and SPARQL 1.1 query, including aggregations;
- SPARQL 1.1 Update, Property Paths, and Federation support;
- Simplified deployment, configuration, and administration for clusters; and
- High availability for the journal and the cluster.

Change log:

Note: Versions with (*) require data migration. For details, see [9].

1.0.3

– http://sourceforge.net/apps/trac/bigdata/ticket/217 (BTreeCounters does not track bytes released)
– http://sourceforge.net/apps/trac/bigdata/ticket/269 (Refactor performance counters using accessor interface)
– http://sourceforge.net/apps/trac/bigdata/ticket/329 (B+Tree should delete bloom filter when it is disabled.)
– http://sourceforge.net/apps/trac/bigdata/ticket/372 (RWStore does not prune the CommitRecordIndex)
– http://sourceforge.net/apps/trac/bigdata/ticket/375 (Persistent memory leaks (RWStore/DISK))
– http://sourceforge.net/apps/trac/bigdata/ticket/385 (FastRDFValueCoder2: ArrayIndexOutOfBoundsException)
– http://sourceforge.net/apps/trac/bigdata/ticket/391 (Release age advanced on WORM mode journal)
– http://sourceforge.net/apps/trac/bigdata/ticket/392 (Add a DELETE by access path method to the NanoSparqlServer)
– http://sourceforge.net/apps/trac/bigdata/ticket/393 (Add “context-uri” request parameter to specify the default context for INSERT in the REST API)
– http://sourceforge.net/apps/trac/bigdata/ticket/394 (log4j configuration error message in WAR deployment)
– http://sourceforge.net/apps/trac/bigdata/ticket/399 (Add a fast range count method to the REST API)
– http://sourceforge.net/apps/trac/bigdata/ticket/422 (Support temp triple store wrapped by a BigdataSail)
– http://sourceforge.net/apps/trac/bigdata/ticket/424 (NQuads support for NanoSparqlServer)
– http://sourceforge.net/apps/trac/bigdata/ticket/425 (Bug fix to DEFAULT_RDF_FORMAT for bulk data loader in scale-out)
– http://sourceforge.net/apps/trac/bigdata/ticket/426 (Support either lockfile (procmail) and dotlockfile (liblockfile1) in scale-out)
– http://sourceforge.net/apps/trac/bigdata/ticket/427 (BigdataSail#getReadOnlyConnection() race condition with concurrent commit)
– http://sourceforge.net/apps/trac/bigdata/ticket/435 (Address is 0L)
– http://sourceforge.net/apps/trac/bigdata/ticket/436 (TestMROWTransactions failure in CI)

1.0.2

– http://sourceforge.net/apps/trac/bigdata/ticket/32 (Query time expansion of (foo rdf:type rdfs:Resource) drags in SPORelation for scale-out.)
– http://sourceforge.net/apps/trac/bigdata/ticket/181 (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
– http://sourceforge.net/apps/trac/bigdata/ticket/356 (Query not terminated by error.)
– http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– http://sourceforge.net/apps/trac/bigdata/ticket/361 (IRunningQuery not closed promptly.)
– http://sourceforge.net/apps/trac/bigdata/ticket/371 (DataLoader fails to load resources available from the classpath.)
– http://sourceforge.net/apps/trac/bigdata/ticket/376 (Support for the streaming of bigdata IBindingSets into a sparql query.)
– http://sourceforge.net/apps/trac/bigdata/ticket/378 (ClosedByInterruptException during heavy query mix.)
– http://sourceforge.net/apps/trac/bigdata/ticket/379 (NotSerializableException for SPOAccessPath.)
– http://sourceforge.net/apps/trac/bigdata/ticket/382 (Change dependencies to Apache River 2.2.0)

1.0.1 (*)

– http://sourceforge.net/apps/trac/bigdata/ticket/107 (Unicode clean schema names in the sparse row store).
– http://sourceforge.net/apps/trac/bigdata/ticket/124 (TermIdEncoder should use more bits for scale-out).
– http://sourceforge.net/apps/trac/bigdata/ticket/225 (OSX requires specialized performance counter collection classes).
– http://sourceforge.net/apps/trac/bigdata/ticket/348 (BigdataValueFactory.asValue() must return new instance when DummyIV is used).
– http://sourceforge.net/apps/trac/bigdata/ticket/349 (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance).
– http://sourceforge.net/apps/trac/bigdata/ticket/351 (SPO not Serializable exception in SIDS mode (scale-out)).
– http://sourceforge.net/apps/trac/bigdata/ticket/352 (ClassCastException when querying with binding-values that are not known to the database).
– http://sourceforge.net/apps/trac/bigdata/ticket/353 (UnsupportedOperatorException for some SPARQL queries).
– http://sourceforge.net/apps/trac/bigdata/ticket/355 (Query failure when comparing with non materialized value).
– http://sourceforge.net/apps/trac/bigdata/ticket/357 (RWStore reports “FixedAllocator returning null address, with freeBits”.)
– http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– http://sourceforge.net/apps/trac/bigdata/ticket/362 (log4j – slf4j bridge.)

For more information about bigdata, please see the following links:

[1] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page
[2] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted
[3] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Roadmap
[4] http://www.bigdata.com/bigdata/docs/api/
[5] http://sourceforge.net/projects/bigdata/
[6] http://www.bigdata.com/blog
[7] http://www.systap.com/bigdata.htm
[8] http://sourceforge.net/projects/bigdata/files/bigdata/
[9] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=DataMigration

About bigdata:

Bigdata® is a horizontally-scaled, general purpose storage and computing fabric
for ordered data (B+Trees), designed to operate on either a single server or a
cluster of commodity hardware. Bigdata® uses dynamically partitioned key-range
shards in order to remove any realistic scaling limits – in principle, bigdata®
may be deployed on 10s, 100s, or even thousands of machines and new capacity may
be added incrementally without requiring the full reload of all data. The bigdata®
RDF database supports RDFS and OWL Lite reasoning, high-level query (SPARQL),
and datum level provenance.

Starting with our 1.1 release, bigdata includes an optional “analytic query mode”. Enabling analytic query turns on support for the MemoryManager and the HTree and allows bigdata to scale to 4TB of data on the native process heap with zero GC overhead. In the future it will also turn on the runtime query optimizer (RTO). You can get a preview of this release by checking out the TERMS_REFACTOR_BRANCH from SVN.

Someone recently wrote about a query which was doing a SELECT DISTINCT on a large number of solutions. They thought that there might be a memory leak since the JVM eventually crashed with an OutOfMemoryError. However, this is not really a memory leak. Here is what is going on:

Heap pressure and the JVM

The problem is the Java architecture for managed memory. You can read about this, and about how we fix it, here. What you need to do for this query (and others like it) is turn on the “analytic” mode for bigdata. The easiest way to do this is to check the:

[x] analytic

option on the NanoSparqlServer’s SPARQL query form page. If you are using the NanoSparqlServer you can also specify the URL query parameter ...&analytic=true. Finally, you can enable this with a magic triple directly in the SPARQL query:

SELECT ...
...
hint:Query hint:analytic "true" .
...

Just put that triple somewhere in the WHERE clause of the query and the query will run with the “analytic” options enabled. You do not need to declare the “hint:” prefix, but if you want to the namespace should be “http://www.bigdata.com/queryHints#”.

What the analytic query mode will do for you is buffer the data on the native process heap rather than on the JVM object heap. This will reduce the GC overhead associated with the query to basically zero. It performs this feat entirely within Java by leveraging the java.nio package.

There are analytic and non-analytic versions of all the joins, distinct, etc. operators. The analytic versions use the MemoryManager and the HTree. The non-analytic versions use Java collection classes. The Java collection classes are somewhat faster as long as you are not materializing a lot of data on the Java object heap. For example, for the BSBM “explore” use case the Java operators are about 10% faster overall. DISTINCT is a special case. The Java version of that operator uses a ConcurrentHashMap under the covers and can give you much higher concurrency in the query. But, if you are trying to DISTINCT a large number solutions then you are going to run into trouble with the Garbage Collector.

Bottom line: if you are going to be materializing a LOT of data then you need to use the analytic operators. Those operators will scale to 4TB of RAM. If you try to materialize a lot of data on the Java object heap, you will run into heavy GC overhead and the application will slow down to a crawl and then die with “java.lang.OutOfMemoryError: GC overhead limit exceeded”.

This is the first of a series of posts on new features that will be introduced with our 1.1 release. That release will feature support for much of SPARQL 1.1 and includes sophisticated new data structures and algorithms for fast, scalable analytic query. The highlights of our forthcoming 1.1 release are:

- SPARQL 1.1 support (Sesame 2.5) (except property paths, minus and update).
- A new query optimizer (the RTO).
- Scalable analytic operators (hash joins).
- New extensible hash tree index (the HTree).
- 100% native Java solution gets data off the JVM object heap (the Memory Manager).

This article will begin at the bottom of the stack, focusing on the memory manager and its role in supporting scalable query.

Due to byte code optimization, Java applications can run as fast as hand coded C application. However, there is a hidden cost associated with Java application — the maintenance of the JVM object heap. For many applications the cost of object creation, retention, and garbage collection may be negligible. However, as illustrated in the diagram below, an application with a high object creation and retention rate can get into trouble with the Garbage Collector.

JVM Heap Pressure

As the application induced “heap pressure” increases, the garbage collector must run more and more frequently. Depending on the mode in which the garbage collector is running, it may either take cores away from the application or freeze out the application entirely during GC cycles. As the application heap pressure increases, the GC cycle time increases as well. Eventually, the garbage collector runs more than the application and application throughput plummets. Larger heaps can only mask this problem temporarily since larger heaps require more GC time.

With our 1.1 release, we are introducing the Bigdata MemoryManager class. We have been using the NIO package and native buffer allocations for some time in the clustered database, but with the introduction of the MemoryManager and the MemStrore, these native heap allocations are now easily reused for other purposes. The memory manager utilizes the Java NIO package to allocate large blocks of RAM on the native process heap using ByteBuffer.allocateDirect(…) and thus remains a 100% Java solution! These allocations are outside of the JVM managed memory and impose NO GC overhead. They are basically regions of the native C heap allocated by malloc inside of the JVM. Such allocations can not be deterministically released, so we maintain a pool of such native buffers and recycle buffers once they are no longer required for a specific purpose.

The memory manager is basically the RWStore technology repackaged for main memory. Like the RWStore, it can scale to terabytes. The key interface is com.bigdata.rwstore.sector.IMemoryManager. The IMemoryManager interface provides for hierarchical nesting of “allocation contexts” which share the same pool of backing buffers. This hierarchical allocation model makes it easier to ensure that allocations for the same purpose are grouped together on the same native buffers, and that all allocations are released no later than when their top allocation context goes out of scope. For example, an IRunningQuery on the QueryEngine has a top-level allocation context. Various operators in the query plan may create inner allocation contexts in which they store data, typically on an HTree instance. Since we always clear the top-level allocation context when the IRunningQuery is done, all such hash indices will be release no later than when the IRunningQuery is done.

The IMemoryManager is a low level interface which manages a logical address space over native buffers and provides methods to allocate, read, and delete slices of data on those buffers. The MemStore class in the same package provides a higher level IRawStore abstraction — the same abstraction which is used by the journal and the B+Tree and HTree classes. The MemStore makes it easy to use persistence capable data structures over native buffers. The main use of the MemStore in the 1.1 release is to support high level language constructs such as DISTINCT, scalable default graph evaluation, and highly efficient hash join operators. All of those features rely on the HTree operating over an MemStore. However, other applications are certainly possible, including very large application specific caches.

Bigdata is a high-performance RDF database that uses B+tree indices to index RDF statements and terms. But did you know that bigdata also uses these same B+tree indices to provide built-in support for Lucene-style free text search? If the text index is enabled (via a property when the database is created), then each literal added to the database (by appearing in the “O” position of a statement) is also added to the text index. This index can then be accessed directly, via the bigdata API, or indirectly, via high-level query. To accomplish this integration of free text search with high-level query, bigdata defines several magic predicates that are given special meaning, and when encountered in a SPARQL query are interpreted as service calls to the text index.

Before you get started, make sure you have enabled the free text index in your properties file:

com.bigdata.rdf.store.AbstractTripleStore.textIndex=true.

The full list of magic predicates related to free text search is defined and documented in the class com.bigdata.rdf.store.BD. The simplest way to integrate free text search into a SPARQL query in bigdata is to use the magic predicate “bd:search” inside of a SPARQL join group. The predicate bd:search is used to search the full text index using the pattern in the “O” position of the search and to bind the hits (Literals) to the variable defined in the “S” position of the search. For example:

?lit bd:search “mike” .

will search the full text index for literals that contain the token “mike” and bind those literals onto the ?lit variable for use in subsequent joins. To find statements that use literals that contain the token “mike”, the SPARQL query would look as follows:

select ?s, ?p, o
where {
?o bd:search “mike” .
?s ?p ?o .
}

In addition to simple search, additional metadata about the search can be defined inside the SPARQL query using other magic predicates (also defined in the BD class). These predicates, when attached to the same variable as the search, will help narrow the search or bind additional metadata about search hits to other variables. We could expand the SPARQL query as follows:

select ?s, ?p, ?o, ?score, ?rank
where {
?o bd:search “mike personick” .
?o bd:matchAllTerms “true” .
?o bd:minRelevance “0.25” .
?o bd:relevance ?score .
?o bd:maxRank “1000” .
?o bd:rank ?rank .
?s ?p ?o .
}

The magic predicate bd:matchAllTerms indicates that only literals that contain all of the specified search terms should be considered. Similarly, literals can be constrained by min and max relevance (a 0 to 1 score signifying how closely the literal matches the search terms) and by min and max rank (hits are ordered by relevance and the rank describes where the literal appears in that ordered list). If the relevance or rank is relevant to the application, those pieces of metadata can be bound to variables in the search results using the predicates bd:relevance and bd:rank.

I’ve been working recently with a customer that is making extensive use of the free text search feature within its application. A number of interesting challenges arise when working with a large database inside an application designed to answer user queries with very low latency. The number of hits bound to a free text search can be very large (unconstrained search) or very small (highly constrained search), and when joined with the statement indices inside a SPARQL query, these hits can produce either huge numbers of relevant statements (a condition we’ve been referring to as “overflow”) or hardly any statements at all (“underflow”). By playing tricks with the SPARQL query we have been cutting the hits into “rank slices” by specifying the min and max ranks to be considered for any particular query, and then running those rank slices through the rest of the joins in the query until we find just enough results to paint the first page of search results, but no more. By starting with very small rank slices, we try to ensure we don’t overload and stall out the application while the user waits. This trick of rank slicing is made possible by caching the ordered list of free text hits temporarily inside the query engine, so that subsequent calls to the text index with the exact same parameters (search and metadata) will be costless.

This is a minor version release of bigdata(R). Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF capable of loading 1B triples in under one hour on a 15 node cluster. Bigdata operates in both a single machine mode (Journal) and a cluster mode (Federation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion triples / quads. The federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead associated with operating a cluster is an acceptable tradeoff to have essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single machine RDF database. For custom development and cluster installations we recommend checking out the code from SVN using the tag for this release. The code will build automatically under eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script.

You can download the WAR from:

https://sourceforge.net/projects/bigdata/

You can checkout this release from:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/BIGDATA_RELEASE_1_0_2

Feature summary:

- Single machine data storage to ~50B triples/quads (RWStore);
- Clustered data storage is essentially unlimited;
- Simple embedded and/or webapp deployment (NanoSparqlServer);
- Triples, quads, or triples with provenance (SIDs);
- 100% native SPARQL 1.0 evaluation with lots of query optimizations;
- Fast RDFS+ inference and truth maintenance;
- Fast statement level provenance mode (SIDs).

The road map [3] for the next releases includes:

- High-volume analytic query and SPARQL 1.1 query, including aggregations;
- Simplified deployment, configuration, and administration for clusters; and
- High availability for the journal and the cluster.

Change log:

1.0.2

– https://sourceforge.net/apps/trac/bigdata/ticket/32 (Query time expansion of (foo rdf:type rdfs:Resource) drags in SPORelation for scale-out.)
– https://sourceforge.net/apps/trac/bigdata/ticket/181 (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
– https://sourceforge.net/apps/trac/bigdata/ticket/356 (Query not terminated by error.)
– https://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– https://sourceforge.net/apps/trac/bigdata/ticket/361 (IRunningQuery not closed promptly.)
– https://sourceforge.net/apps/trac/bigdata/ticket/371 (DataLoader fails to load resources available from the classpath.)
– https://sourceforge.net/apps/trac/bigdata/ticket/376 (Support for the streaming of bigdata IBindingSets into a sparql query.)
– https://sourceforge.net/apps/trac/bigdata/ticket/378 (ClosedByInterruptException during heavy query mix.)
– https://sourceforge.net/apps/trac/bigdata/ticket/379 (NotSerializableException for SPOAccessPath.)
– https://sourceforge.net/apps/trac/bigdata/ticket/382 (Change dependencies to Apache River 2.2.0)

1.0.1

– https://sourceforge.net/apps/trac/bigdata/ticket/107 (Unicode clean schema names in the sparse row store).
– https://sourceforge.net/apps/trac/bigdata/ticket/124 (TermIdEncoder should use more bits for scale-out).
– https://sourceforge.net/apps/trac/bigdata/ticket/225 (OSX requires specialized performance counter collection classes).
– https://sourceforge.net/apps/trac/bigdata/ticket/348 (BigdataValueFactory.asValue() must return new instance when DummyIV is used).
– https://sourceforge.net/apps/trac/bigdata/ticket/349 (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance).
– https://sourceforge.net/apps/trac/bigdata/ticket/351 (SPO not Serializable exception in SIDS mode (scale-out)).
– https://sourceforge.net/apps/trac/bigdata/ticket/352 (ClassCastException when querying with binding-values that are not known to the database).
– https://sourceforge.net/apps/trac/bigdata/ticket/353 (UnsupportedOperatorException for some SPARQL queries).
– https://sourceforge.net/apps/trac/bigdata/ticket/355 (Query failure when comparing with non materialized value).
– https://sourceforge.net/apps/trac/bigdata/ticket/357 (RWStore reports “FixedAllocator returning null address, with freeBits”.)
– https://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– https://sourceforge.net/apps/trac/bigdata/ticket/362 (log4j – slf4j bridge.)

Note: Some of these bug fixes in the 1.0.1 release require data migration.
For details, see https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=DataMigration

For more information about bigdata, please see the following links:

[1] https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page
[2] https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted
[3] https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Roadmap
[4] http://www.bigdata.com/bigdata/docs/api/
[5] http://sourceforge.net/projects/bigdata/
[6] http://www.bigdata.com/blog
[7] http://www.systap.com/bigdata.htm
[8] https://sourceforge.net/projects/bigdata/files/bigdata/

About bigdata:

Bigdata® is a horizontally-scaled, general purpose storage and computing fabric for ordered data (B+Trees), designed to operate on either a single server or a cluster of commodity hardware. Bigdata® uses dynamically partitioned key-range shards in order to remove any realistic scaling limits – in principle, bigdata® may be deployed on 10s, 100s, or even thousands of machines and new capacity may be added incrementally without requiring the full reload of all data. The bigdata® RDF database supports RDFS and OWL Lite reasoning, high-level query (SPARQL), and datum level provenance.

Over the weekend I setup our new “mini” cluster. This is a bit of a play on words. The cluster is actually (8) 2011 Mac Minis. The mini has one great advantage, especially when you drop an SSD into it. It is quiet. Suitable for running a bunch of them in the same room. The same can not be said for server grade hardware. To compensate for the relatively “light” resources in the Mini, we had them outfitted with 8G of RAM and purchased a 256G SSD to be installed into each one. The SSD should nicely complement for the lack on RAM since one of the main uses of RAM is to buffer the disk. We went back and forth on which Mini to get, but finally settled on:

2.7GHz Dual-Core Intel Core i7 (4 cores).
8GB 1333MHz DDR3 SDRAM – 2×4GB
AMD Radeon HD 6630M graphics processor (480 stream processors and 256MB of GDDR5 memory; integrated with the 2.7Ghz mini)
SATA3 256G SSD

The quad core “server” mini was an interesting alternative, but it has the same 1.5Mb cache per core as the non-server version and the core were significantly slower. The clincher for us was actually the AMD GPU in the 2.7Ghz mini. We plan to try out some interesting parallel acceleration concepts on the GPUs in the cluster. Like the mini, this is GPU lite, but it should be sufficient to test out some new ideas.

This totals out to:

8 x 4 = 32 cores @ 2.7Ghz
8 x 8G RAM = 64G RAM
8 x 580M stream processors = 3840 GPU Cores (Open CL 1.1)
8 x 256M GDDR5 memory = 2GB GPU RAM
8 x 256G SSD = 2TB SSD

This works out to a modest cluster, but it has some interesting aspects with the relatively fast SSD and the capabilities of per-node GPU computing. The full installation procedure for the mini cluster is on the wiki.

One of the things that we want to explore with this cluster is a hybrid shared nothing / shared disk architecture suitable for cloud deployments of bigdata. I’ve written about this idea elsewhere, but the main concept is to maintain a compute / storage divide. The compute nodes will use local disk to cache the shards that they are actively using. The storage layer will provide the long term durability guarantee for the read-only journal and index segment files
which make up the shards.

The great advantage of this approach is that you can ramp up and shut down the compute nodes at will since the durable data is all on the storage layer. This also simplifies the HA design since we can leave most of that to the storage layer. Each time a data service goes through a synchronous overflow, we will copy the old journal to the storage layer. Each time we build an index segment, we will copy it to the storage layer as well. The main HA concern for bigdata is reduced to the write replication chain for the mutable journals, basically making sure that the system is durable if a DS quorum member is lost before the next synchronous overflow event. During normal service shutdown, we just do a synchronous overflow and ALL persistent state is on the storage layer. If you were on EC2, you could just shut off the compute nodes without loosing any data. This approach works just fine with SAN, NAS, or a parallel file system as well.

The openstack project is providing an open source version of the EC2/S3 environment. We will probably start by dropping the 5400 RPM HDDs from the minis into some [http://h10010.www1.hp.com/wwpc/us/en/sm/WF05a/15351-15351-4237916-4237918-4237917-4248009.html HP microservers] and running the openstack swift object store on the microservers. There are several development steps to get us to this hybrid architecture, but they will all add significant capabilities to the platform (including exabyte scale).

We will be using this cluster to shake down our 1.1 release. I’ll post some performance numbers soon.

We do not specify a single security model. Our position is that handling security for the semantic web depends on the information architecture of the application (how it models things using RDF), the choice of the data model (triples, triples + statement level provenance, or quads), and the access control model to be imposed (e.g., user/group versus roles).

Many applications, especially those which use the triple store as a schema fluid database for a web application, can explicitly model security in their information architecture and use a triples-mode deployment, which has 1/2 of the #of statement indices of a quads mode deployment. Mike has just finished a security design along these lines for a customer.

If it makes sense to collect statements into named graphs and associate permissions with those named graphs, then that is a different architecture and you would use the quads-mode of bigdata. This is efficient IF you can group statements together into named graphs. However, if your named graphs tend to have only one or two statements each then you are much better off with the statement level provenance approach.

bigdata has a data model specifically for this statement level provenance. It associates a unique statement identifier with each (s,p,o) triple. The statement identifier looks like a blank node and can be used in SPARQL query and interchanged as RDF/XML. This is a great choice if you need statement level provenance and/or security. You can read more about statement level provenance support here [1].

[1] Using Statement Identifiers to Manage Provenance

© 2006-2010 by SYSTAP, LLC bigdata® is a registered trademark of SYSTAP, LLC. Suffusion WordPress theme by Sayontan Sinha