We’ve just released a new version of Bigdata®. This is a bigdata® snapshot release. This release is capable of loading 1B triples in under one hour on a 15-node cluster and has been used to load up to 13B triples on the same cluster. JDK 1.6 is required.
See [1] for instructions on installing bigdata®, [2] for the javadoc and [3] and [4] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata®, see [5].
Please note that we recommend checking out the code from SVN using the tag for this release. The code will build automatically under Eclipse. You can also build the code using the Ant script. The cluster installer requires the use of the Ant script. You can check out this release from the following URL:
https://bigdata.svn.sourceforge.net/svnroot/bigdata/branches/BIGDATA_RELEASE_0_83_2
New features:
- This release provides a bug fix for issue #118. Upgrading to this release is advised. See https://sourceforge.net/apps/trac/bigdata/ticket/118 for details.
- Inlining XSD numerics, xsd:boolean, or custom datatype extensions into the statement indices. Inlining provides a smaller footprint and faster queries for data using XSD numeric datatypes. In order to introduce inlining we were forced to make a change in the physical schema for the RDF database which breaks binary compatibility for existing stores. The recommended migration path is to export the data and import it into a new bigdata instance.
- Refactor of the dynamic sharding mechanism for higher performance.
- The SparseRowStore has been modified to make Unicode primary keys decodable by representing Unicode primary keys using UTF8 rather than Unicode sort keys. This change also allows the SparseRowStore to work with the JDK collator option which embeds nul bytes into Unicode sort keys. This change breaks binary compatibility, but there is an option for historical compatibility.
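To illustrate why the SparseRowStore change makes primary keys decodable, here is a minimal sketch using only JDK classes. It is not the SparseRowStore code; the class and method names are hypothetical. It contrasts a UTF-8 encoded key, which round-trips, with a collation sort key, whose bytes alone cannot be turned back into the original string.

```java
import java.nio.charset.Charset;
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration; not part of the bigdata API.
class PrimaryKeySketch {
    static final Charset UTF8 = Charset.forName("UTF-8");

    // UTF-8 key: the original string can be recovered from the bytes.
    static byte[] utf8Key(String primaryKey) {
        return primaryKey.getBytes(UTF8);
    }

    static String decodeUtf8Key(byte[] key) {
        return new String(key, UTF8);
    }

    // Collation sort key: ordered by locale rules, but the bytes alone
    // cannot be decoded back into the original string.
    static byte[] sortKey(String primaryKey) {
        return Collator.getInstance(Locale.US).getCollationKey(primaryKey).toByteArray();
    }

    public static void main(String[] args) {
        String pk = "caf\u00e9";
        System.out.println(decodeUtf8Key(utf8Key(pk)).equals(pk));
        System.out.println(sortKey(pk).length + " sort key bytes, not decodable");
    }
}
```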
The roadmap for the next releases includes:
- Query optimizations;
- Support for high-volume analytic query workloads and SPARQL aggregations;
- High availability for the journal and the cluster;
- Simplified deployment, configuration, and administration for clusters.
For more information, please see the following links:
[1] http://bigdata.wiki.sourceforge.net/GettingStarted
[2] http://www.bigdata.com/bigdata/docs/api/
[3] http://sourceforge.net/projects/bigdata/
[4] http://www.bigdata.com/blog
[5] http://www.systap.com/bigdata.htm
About bigdata:
Bigdata® is a horizontally-scaled, general purpose storage and computing fabric for ordered data (B+Trees), designed to operate on either a single server or a cluster of commodity hardware. Bigdata® uses dynamically partitioned key-range shards in order to remove any realistic scaling limits – in principle, bigdata® may be deployed on 10s, 100s, or even thousands of machines and new capacity may be added incrementally without requiring the full reload of all data. The bigdata® RDF database supports RDFS and OWL Lite reasoning, high-level query (SPARQL), and datum level provenance.
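The idea of key-range shards can be sketched in a few lines. This is a toy model under stated assumptions, not bigdata's shard locator: the class name, the node names, and the use of a TreeMap keyed by each shard's first key are all hypothetical. It shows how a key is routed to the shard whose range contains it, and how a split adds capacity without touching keys outside the split range.

```java
import java.util.TreeMap;

// Hypothetical illustration; not part of the bigdata API.
class ShardLocatorSketch {
    // Maps the first key of each shard to the (made-up) node serving it.
    private final TreeMap<String, String> shards = new TreeMap<String, String>();

    ShardLocatorSketch(String initialNode) {
        // The empty string sorts before every other key, so one shard covers everything.
        shards.put("", initialNode);
    }

    // Split: keys >= separatorKey are now served by a new shard on newNode.
    void split(String separatorKey, String newNode) {
        shards.put(separatorKey, newNode);
    }

    // Find the shard whose key range contains the given key.
    String locate(String key) {
        return shards.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        ShardLocatorSketch locator = new ShardLocatorSketch("node1");
        locator.split("k", "node2");
        System.out.println(locator.locate("apple"));
        System.out.println(locator.locate("mango"));
    }
}
```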
Here is the slide deck for those who missed our presentation at SemTech this year on the upcoming high-availability architecture for bigdata. Also see our HA whitepapers:
[1] http://www.bigdata.com/whitepapers/bigdata_ha_whitepaper.pdf
[2] http://www.bigdata.com/whitepapers/Bigdata-HA-Quorum-Detailed-Design.pdf
We’ve just undergone a fairly extensive refactor of how RDF terms are represented in bigdata.
Bigdata is all about high-performance B+Tree indices. With these as basic building blocks, you can model all sorts of interesting data, like RDF. Our RDF store uses two relations – one for statements and one we call the lexicon. Up until now, all RDF terms (i.e. URIs, Literals, and BNodes) were assigned long integer term identifiers to uniquely represent them in the statement indices. The lexicon relation was responsible for keeping this RDF term to long term identifier mapping using two indices, Term2Id (the forward index) and Id2Term (the reverse index). As new terms appeared in statements written to the database, they would first go through a forward mapping process. A new tuple would be inserted into the Term2Id index using a Unicode sort key generated from the string value of the RDF term as the key, along with a new unique long term identifier as the tuple value. This same term identifier would then be used as a key in Id2Term, the reverse index, with the RDF term serialized directly into the tuple value. Once indexed, RDF terms could be quickly resolved (or materialized) to (or from) term identifiers. These lightweight long term identifiers were then used inside the statement indices rather than embedding the much bulkier RDF terms themselves.
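The forward and reverse mapping described above can be sketched with two in-memory sorted maps. This is only a toy model: the class and method names are hypothetical, a plain string stands in for the Unicode sort key, and a TreeMap stands in for a B+Tree index.

```java
import java.util.TreeMap;

// Hypothetical illustration; not the bigdata lexicon implementation.
class ToyLexicon {
    // Forward index: RDF term -> term identifier.
    private final TreeMap<String, Long> term2Id = new TreeMap<String, Long>();
    // Reverse index: term identifier -> RDF term.
    private final TreeMap<Long, String> id2Term = new TreeMap<Long, String>();
    private long nextId = 1L;

    // Forward mapping: return the existing identifier or assign a new one.
    long addTerm(String term) {
        Long existing = term2Id.get(term);
        if (existing != null) {
            return existing.longValue();
        }
        long id = nextId++;
        term2Id.put(term, Long.valueOf(id));
        id2Term.put(Long.valueOf(id), term);
        return id;
    }

    // Reverse mapping: materialize the RDF term for an identifier (null if unknown).
    String materialize(long id) {
        return id2Term.get(Long.valueOf(id));
    }

    public static void main(String[] args) {
        ToyLexicon lexicon = new ToyLexicon();
        long id = lexicon.addTerm("http://example.org/alice");
        System.out.println(id + " -> " + lexicon.materialize(id));
    }
}
```

A statement index in this model would store only the identifiers returned by addTerm, and every read of the actual term costs a lookup through materialize.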
While this keeps the statement indices relatively slim, it does necessitate additional index lookups against the lexicon to materialize RDF terms from their term identifiers. It also makes it impossible to do numeric filtering during join processing using the statement indices alone, because a filter like (?age < 35) cannot be evaluated when ?age is represented as a term identifier. This becomes especially problematic in scale-out, where the RDF term for a particular term identifier might not even be on the same machine as the shard where the filter is being evaluated during join processing.
The lexicon refactor moves from a model where all RDF terms are referenced by term identifiers from the statement indices to one where certain terms are embedded inline directly into the statement indices themselves based on the datatype of the RDF term. Terms that can be represented compactly in statement keys directly are inlined by value rather than assigned a term identifier to reference them. For example, why create a long integer term identifier to reference an RDF Literal that has a datatype of long integer? It makes much more sense to embed that Literal directly, rather than go through an unnecessary mapping to a term identifier in the lexicon. This applies to all numerical RDF Literal values. Once these numerics are inlined directly in the statement indices, it becomes possible to apply numerical filters during join processing without having to materialize terms from the lexicon.
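To make the filter argument concrete, here is a minimal sketch of evaluating (?age < 35) against an inlined long. The encoding shown (big-endian with the sign bit flipped, so that unsigned byte order matches numeric order) is a common technique for sortable keys and is an assumption here, not a statement about bigdata's actual key encoding. All names are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; not part of the bigdata API.
class InlineFilterSketch {
    // Encode a long as 8 big-endian bytes with the sign bit flipped (assumed encoding).
    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(8).putLong(value ^ Long.MIN_VALUE).array();
    }

    // Decode the long stored at the given offset of a key.
    static long decodeLong(byte[] key, int offset) {
        return ByteBuffer.wrap(key, offset, 8).getLong() ^ Long.MIN_VALUE;
    }

    // Evaluate (?age < limit) from the inlined bytes alone; no lexicon lookup is needed.
    static boolean ageLessThan(byte[] inlinedObject, long limit) {
        return decodeLong(inlinedObject, 0) < limit;
    }

    public static void main(String[] args) {
        System.out.println(ageLessThan(encodeLong(29L), 35L));
        System.out.println(ageLessThan(encodeLong(42L), 35L));
    }
}
```

With a term identifier in place of the inlined value, the same check would first require a reverse-index lookup to recover the number.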
In addition to numerics, it’s also possible to inline some non-numeric datatypes, such as UUIDs. This is useful in “told bnode mode”, a mode where BNode ids are maintained by the database. BNodes using UUIDs for their id can be inlined into the statement indices without the need for term identifiers and the lexicon.
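As a rough sketch of what inlining a UUID-based BNode id could look like, the following packs a java.util.UUID into 16 bytes and back. The class name and the byte layout (most significant bits first) are assumptions for illustration, not bigdata's encoding.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical illustration; not part of the bigdata API.
class UuidInlineSketch {
    // Pack the UUID into 16 bytes: most significant long first, then least significant.
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Rebuild the UUID from its 16-byte inline form.
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long most = buf.getLong();
        long least = buf.getLong();
        return new UUID(most, least);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println(fromBytes(toBytes(id)).equals(id));
    }
}
```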
The statement indices use keys that are built from the encoded versions of the RDF terms that compose the particular statement for a given key order. For example, the SPO index uses keys that are built by first encoding the S term, then appending the encoded P term, and finally appending the encoded O term. The OSP index keys use the same encoded terms but in a different order. Before the lexicon refactor, terms were always encoded as simple long integers (8 bytes). Thus statement keys were always a fixed length – 24 bytes for triples (3 terms), and 32 bytes for quads (4 terms).
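The fixed-length key layout from before the refactor can be sketched as follows. The class and method names are hypothetical; only the arithmetic (8 bytes per term, concatenated in the order of the index) comes from the description above.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; not part of the bigdata API.
class StatementKeySketch {
    // Concatenate 8-byte term identifiers in the order given.
    static byte[] key(long... termIds) {
        ByteBuffer buf = ByteBuffer.allocate(8 * termIds.length);
        for (long id : termIds) {
            buf.putLong(id);
        }
        return buf.array();
    }

    // SPO index: subject, then predicate, then object.
    static byte[] spo(long s, long p, long o) {
        return key(s, p, o);
    }

    // OSP index: the same three terms, ordered object, subject, predicate.
    static byte[] osp(long s, long p, long o) {
        return key(o, s, p);
    }

    public static void main(String[] args) {
        System.out.println(spo(1L, 2L, 3L).length + " bytes per triple key");
        System.out.println(key(1L, 2L, 3L, 4L).length + " bytes per quad key");
    }
}
```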
With the lexicon refactor, we now use one additional byte to represent metadata about the term – whether or not it is inline (1 bit), what type of term it is, URI, Literal, or BNode (2 bits), what the datatype is (4 bits), and 1 additional bit for extensions (explained later). Currently supported intrinsic datatypes are boolean, byte, short, int, long, float, double, BigInteger, BigDecimal, and UUID. Support is also planned for unsigned byte, unsigned short, unsigned int, and unsigned long. These intrinsic types correspond directly to XML Schema datatypes. RDF Literals that are datatyped using XML Schema datatypes will be automatically inlined using one of the intrinsic types. In addition, the term itself is no longer necessarily exactly 8 bytes – only term identifiers, inline longs, and inline doubles will need 8 bytes; most other intrinsic types can be represented using fewer bytes. The exceptions to this are BigInteger, BigDecimal, and UUID, which require more than 8 bytes.
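The metadata byte can be sketched as simple bit packing. The field widths (1 + 2 + 4 + 1 bits) come from the description above; the bit positions, the numeric codes for term types, and all names below are assumptions made for illustration. See the javadoc for the real layout.

```java
// Hypothetical illustration; bit positions and codes are assumed, not bigdata's.
class FlagsByteSketch {
    static final int URI = 0;
    static final int LITERAL = 1;
    static final int BNODE = 2;

    // Assumed layout: bit 7 inline, bits 6-5 term type, bits 4-1 datatype, bit 0 extension.
    static byte pack(boolean inline, int termType, int datatype, boolean extension) {
        if (termType < 0 || termType > 3) {
            throw new IllegalArgumentException("termType must fit in 2 bits");
        }
        if (datatype < 0 || datatype > 15) {
            throw new IllegalArgumentException("datatype must fit in 4 bits");
        }
        int bits = ((inline ? 1 : 0) << 7) | (termType << 5) | (datatype << 1) | (extension ? 1 : 0);
        return (byte) bits;
    }

    static boolean isInline(byte flags) { return (flags & 0x80) != 0; }
    static int termType(byte flags) { return (flags >> 5) & 0x3; }
    static int datatype(byte flags) { return (flags >> 1) & 0xF; }
    static boolean isExtension(byte flags) { return (flags & 0x1) != 0; }

    public static void main(String[] args) {
        byte flags = pack(true, LITERAL, 4, false);
        System.out.println(isInline(flags) + " " + termType(flags) + " " + datatype(flags));
    }
}
```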
What this means for developers is that almost everywhere we used to use java.lang.Long, we now use com.bigdata.rdf.internal.IV. IV stands for “internal value”, which covers term identifiers and all varieties of inline values. Read through the javadoc in this package for much more detail on everything related to the refactor.
You can also define custom extensions that are projected onto one of the intrinsic types using the extension bit. For example, a datatype for “milliseconds since the epoch” can be defined and used to datatype literals with a long integer string value. These literals can then be inlined directly using the long integer intrinsic type as a delegate. Custom extensions are Java classes that are responsible for round-tripping an RDF term to and from an inline internal value for a particular datatype. Two such extensions are provided as examples in the bigdata codebase – one for milliseconds since the epoch and one that demonstrates how to map an enum of colors onto the xsd:byte intrinsic type.
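The round-tripping that an extension performs can be sketched like this. It does not use the extension classes shipped in the bigdata codebase; the class name, the datatype URI, and the method names are hypothetical, and the sketch only shows the shape of the mapping onto a long or byte delegate.

```java
// Hypothetical illustration; not the extension API from the bigdata codebase.
class ExtensionSketch {
    // Made-up datatype URI for "milliseconds since the epoch".
    static final String EPOCH_DATATYPE = "http://example.org/datatype/epochMillis";

    // Literal label -> long delegate that could be inlined.
    static long epochToInline(String label) {
        return Long.parseLong(label);
    }

    // Long delegate -> literal in N-Triples-like notation.
    static String epochToLiteral(long inline) {
        return "\"" + inline + "\"^^<" + EPOCH_DATATYPE + ">";
    }

    enum Color { RED, GREEN, BLUE }

    // Enum value -> byte delegate (ordinal), and back.
    static byte colorToInline(Color color) {
        return (byte) color.ordinal();
    }

    static Color colorFromInline(byte inline) {
        return Color.values()[inline];
    }

    public static void main(String[] args) {
        long millis = epochToInline("1279200000000");
        System.out.println(epochToLiteral(millis));
        System.out.println(colorFromInline(colorToInline(Color.BLUE)));
    }
}
```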
It is also important to note that this change breaks binary compatibility for the RDF store. It was simply not possible to maintain backward compatibility under this refactor without creating very difficult code maintenance problems. The recommended migration path is a full export of old databases and then an import using the new version.
Preliminary testing shows no penalty for inlining in queries that don’t take advantage of numerical filter operators and significant improvement for queries that do (for example BSBM query 4). Inlining is also expected to dramatically improve query performance for scale-out, which we intend to test in the coming weeks as we also work on query optimizations for scale-out quad query.
We’ve just released a new version of bigdata®. This is a bigdata® snapshot release. This release is capable of loading 1B triples in under one hour on a 15-node cluster and has been used to load up to 13B triples on the same cluster. JDK 1.6 is required.
See [1] for instructions on installing bigdata®, [2] for the javadoc and [3] and [4] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata®, see [5].
Please note that we recommend checking out the code from SVN using the tag for this release. The code will build automatically under Eclipse. You can also build the code using the Ant script. The cluster installer requires the use of the Ant script. You can check out this release from the following URL:
https://bigdata.svn.sourceforge.net/svnroot/bigdata/branches/BIGDATA_RELEASE_0_83_1
New features:
- Inlining XSD numerics, xsd:boolean, or custom datatype extensions into the statement indices. Inlining provides a smaller footprint and faster queries for data using XSD numeric datatypes. In order to introduce inlining we were forced to make a change in the physical schema for the RDF database which breaks binary compatibility for existing stores. The recommended migration path is to export the data and import it into a new bigdata instance.
- Refactor of the dynamic sharding mechanism for higher performance.
- The SparseRowStore has been modified to make Unicode primary keys decodable by representing Unicode primary keys using UTF8 rather than Unicode sort keys. This change also allows the SparseRowStore to work with the JDK collator option which embeds nul bytes into Unicode sort keys. This change breaks binary compatibility, but there is an option for historical compatibility.
The roadmap for the next releases includes:
- Query optimizations;
- Support for high-volume analytic query workloads and SPARQL aggregations;
- High availability for the journal and the cluster;
- Simplified deployment, configuration, and administration for clusters.
For more information, please see the following links:
[1] http://bigdata.wiki.sourceforge.net/GettingStarted
[2] http://www.bigdata.com/bigdata/docs/api/
[3] http://sourceforge.net/projects/bigdata/
[4] http://www.bigdata.com/blog
[5] http://www.systap.com/bigdata.htm