There were two excellent presentations yesterday at ISWC 2009 on using parallel techniques to materialize the RDFS closure at extremely high rates (for example, the RDFS closure of U8000 in 15 minutes). This is something that we are going to try out as soon as possible. Unlike either of the systems described in these papers, bigdata uses automatic dynamic sharding based on key-ranges of the data. We can adapt these techniques by mapping the computation onto the POS index shards, bringing them to fixed point, and then reusing our high-throughput data loader to quickly relocate the entailments onto the distributed indices. There is clearly a surge in parallel and distributed algorithms for the semantic web, which is extremely exciting.
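To make "bringing them to fixed point" concrete, here is a minimal single-process sketch. It is not the algorithm from either paper and it is not bigdata code: it assumes an in-memory set of (subject, predicate, object) tuples, applies only two RDFS rules (subclass transitivity and type inheritance), and uses a hypothetical function name, `rdfs_closure`.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def rdfs_closure(triples):
    """Apply two RDFS rules repeatedly until no new triples appear.

    Rules covered (an illustrative subset only):
      rdfs11: (A subClassOf B), (B subClassOf C) => (A subClassOf C)
      rdfs9:  (x type A), (A subClassOf B)       => (x type B)
    """
    closure = set(triples)
    while True:
        # Index the current subclass edges: class -> set of superclasses.
        superclasses = {}
        for s, p, o in closure:
            if p == SUBCLASS_OF:
                superclasses.setdefault(s, set()).add(o)

        # One pass of rule application over everything known so far.
        inferred = set()
        for s, p, o in closure:
            if p in (SUBCLASS_OF, RDF_TYPE):
                for sup in superclasses.get(o, ()):
                    inferred.add((s, p, sup))

        # Fixed point: the pass produced nothing new.
        if inferred <= closure:
            return closure
        closure |= inferred


data = {
    ("Student", SUBCLASS_OF, "Person"),
    ("Person", SUBCLASS_OF, "Agent"),
    ("alice", RDF_TYPE, "Student"),
}
entailed = rdfs_closure(data) - data
```

In a sharded setting the same loop would run per shard, with the newly inferred triples shipped back to the distributed indices afterwards. That part is deliberately left out of the sketch.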
[1] Jesse Weaver, James A. Hendler. Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples, In Proceedings of the 8th International Semantic Web Conference, pp. 682–697, 2009.
[2] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, and Frank van Harmelen. Scalable Distributed Reasoning using MapReduce, In Proceedings of the 8th International Semantic Web Conference, 2009.
We’ve just released a new version of bigdata. This release is capable of loading 1B triples in under one hour on a 15-node cluster and has been used to load up to 13B triples on the same cluster. JDK 1.6 is required. See [1] for instructions on installing bigdata(R), [2] for the javadoc, and [3] and [4] for news, questions, and the latest developments.
Please note that we recommend checking out the code from SVN using the tag for this release. The code will build automatically under Eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script. You can check out this release from the following URL:
https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/RELEASE_0.81b
New features:
- Support for quads.
- The B+Tree data record is now the same on the disk and in memory, eliminating de-serialization costs for immutable B+Tree records.
- Shared LRU cache for all B+Tree instances in the same JVM, so those B+Tree instances compete with one another for RAM. By default the LRU is configured to use 10% of the JVM heap, but that value may be increased to 20%, even on very heavy workloads with large heaps.
- Parallel iterator for distributed access paths.
- 3x improvement in distributed query performance. We have only just begun to optimize distributed query. This performance improvement is mainly due to the parallel access path iterator, the shared LRU, and the selection of a better chunk size for distributed query. We expect substantial improvements in query performance over the next several months.
- Some query hotspot elimination.
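For readers wondering what a shared LRU across indices means in practice, here is a rough sketch. It is not the bigdata implementation: the class name `SharedLRU` and the `(index_name, addr)` key are hypothetical, and capacity is counted in records here rather than as a fraction of the JVM heap.

```python
from collections import OrderedDict


class SharedLRU:
    """One LRU cache shared by several indices (illustrative only).

    Entries are keyed by (index_name, addr), so records from different
    indices compete for the same fixed number of slots.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first

    def get(self, index_name, addr):
        key = (index_name, addr)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, index_name, addr, record):
        key = (index_name, addr)
        self._entries[key] = record
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used


cache = SharedLRU(capacity=2)
cache.put("SPO", 1, "record-a")
cache.put("POS", 1, "record-b")
cache.get("SPO", 1)              # touch SPO so POS becomes the oldest
cache.put("OSP", 1, "record-c")  # evicts the POS record
```

The point of sharing is that a hot index can claim slots that a cold index is not using, instead of each index holding a fixed private cache.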
The roadmap for the next release includes:
- Full transactions for the SAIL.
- Record level compression.
- Query optimizations.
For more information, please see the following links:
[1] http://bigdata.wiki.sourceforge.net/GettingStarted
[2] http://www.bigdata.com/bigdata/docs/api/
[3] http://sourceforge.net/projects/bigdata/
[4] http://www.bigdata.com/blog
About bigdata:
Bigdata® is a horizontally-scaled, general purpose storage and computing fabric for ordered data (B+Trees), designed to operate on either a single server or a cluster of commodity hardware. Bigdata® uses dynamically partitioned key-range shards in order to remove any realistic scaling limits – in principle, bigdata® may be deployed on 10s, 100s, or even thousands of machines and new capacity may be added incrementally without requiring the full reload of all data. The bigdata® RDF database supports RDFS and OWL Lite reasoning, high-level query (SPARQL), and datum level provenance.
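As a rough illustration of dynamically partitioned key-range shards, here is a toy sketch. It is not how bigdata implements sharding: the class `KeyRangeIndex`, its methods, and the split-at-the-median policy are assumptions made for the example, and everything lives in one process.

```python
import bisect


class KeyRangeIndex:
    """Ordered keys split into key-range shards (illustrative only).

    Shard i holds keys k with separators[i-1] <= k < separators[i].
    A shard that grows past max_shard_size is split at its median key.
    """

    def __init__(self, max_shard_size=4):
        self.max_shard_size = max_shard_size
        self.separators = []  # sorted separator keys
        self.shards = [[]]    # each shard is a sorted list of keys

    def _locate(self, key):
        # Index of the shard whose key range contains `key`.
        return bisect.bisect_right(self.separators, key)

    def shard_for(self, key):
        return self.shards[self._locate(key)]

    def insert(self, key):
        i = self._locate(key)
        shard = self.shards[i]
        pos = bisect.bisect_left(shard, key)
        if pos < len(shard) and shard[pos] == key:
            return  # already present
        shard.insert(pos, key)
        if len(shard) > self.max_shard_size:
            # Split: the median key becomes the new separator.
            mid = len(shard) // 2
            self.separators.insert(i, shard[mid])
            self.shards[i:i + 1] = [shard[:mid], shard[mid:]]

    def keys(self):
        return [k for shard in self.shards for k in shard]


index = KeyRangeIndex(max_shard_size=4)
for k in range(1, 11):
    index.insert(k)
```

Because a split only touches the one shard that overflowed, capacity can grow incrementally without rewriting the rest of the data, which is the property the paragraph above describes.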
We will be presenting on Thurs, 10/29, 11:20am – 11:50am at ISWC 2009 in Washington DC. The talk will cover the bigdata architecture, performance results on a 15-node cluster, and the bigdata roadmap. Come on out!
The support for quads is in the development branch [1], which you can check out from SVN. This branch is fairly solid and we plan to create a release from it soon [2].
[2] https://sourceforge.net/apps/trac/bigdata/roadmap