Schema-flexible mashups with RDF
Google’s bigtable architecture gives you scale-out, schema-flexible, ordered data with high write concurrency. This is great, but is it what you need?
RDF provides not only schema flexibility, but the ability to dynamically federate and align data from a variety of sources (ad hoc reuse of data) together with high-level query (SPARQL). In addition, the bigdata RDF database supports datum-level provenance (woefully missing in the RDF stack). That sounds good, but it comes with a price.
Choosing a data model means trading off write concurrency against expressivity. To get the benefits of RDF, you need to maintain multiple indices over the data (efficient access paths for JOINs), and you need to accept lower write concurrency in order to pre-compute aspects of the entailments (a tradeoff between the costs paid by writers and those paid by readers).
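To make the "multiple indices" cost concrete, here is a minimal sketch of a triple store that keeps three sorted permutations of every statement. The class name `TripleStore`, the choice of the SPO, POS, and OSP orderings, and the in-memory sorted lists are illustrative assumptions, not a description of how bigdata lays out its indices.

```python
import bisect


class TripleStore:
    """Toy triple store (hypothetical) with three sorted key orders."""

    # For each index, the positions of (s, p, o) in the order they are stored.
    ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}

    def __init__(self):
        self.indices = {name: [] for name in self.ORDERS}

    def add(self, s, p, o):
        triple = (s, p, o)
        # Every write touches all three indices: this is the write-side cost.
        for name, order in self.ORDERS.items():
            key = tuple(triple[i] for i in order)
            index = self.indices[name]
            pos = bisect.bisect_left(index, key)
            if pos == len(index) or index[pos] != key:
                index.insert(pos, key)

    def scan(self, name, prefix):
        """Range scan: yield (s, p, o) for every key starting with `prefix`."""
        order = self.ORDERS[name]
        index = self.indices[name]
        pos = bisect.bisect_left(index, prefix)
        while pos < len(index) and index[pos][:len(prefix)] == prefix:
            key = index[pos]
            triple = [None] * 3
            for slot, i in enumerate(order):
                triple[i] = key[slot]  # undo the permutation
            yield tuple(triple)
            pos += 1


store = TripleStore()
store.add("alice", "knows", "bob")
store.add("alice", "name", "Alice")
store.add("bob", "name", "Bob")

by_subject = list(store.scan("spo", ("alice",)))   # all properties of alice
by_predicate = list(store.scan("pos", ("name",)))  # all name statements
```

The subject scan behaves like a row read in a sparse row store, while the POS and OSP orders are what make the other access paths cheap for JOINs.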
Reads are performed against historical commit points, so you have full read concurrency. Reads can be extremely efficient as well – an access path scan for the properties of some subject is every bit as efficient as a read on a sparse row store like bigtable or HBase. However, since SPARQL supports conjunctive query, reads can also be more interesting, involving multiple JOINs over the data.
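As a rough illustration of a conjunctive read against a commit point, the sketch below evaluates two triple patterns with a nested-loop join over an immutable snapshot. The functions `match` and `query`, the `?var` naming convention, and the use of a `frozenset` as a stand-in for a commit point are assumptions made for the example; a real SPARQL engine would plan the joins against the indices instead of scanning everything.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, else return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):           # variable position
            if result.setdefault(term, value) != value:
                return None                # conflicts with an earlier binding
        elif term != value:                # constant position
            return None
    return result


def query(snapshot, patterns):
    """Evaluate a conjunction of triple patterns with nested-loop joins."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in snapshot
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


# A commit point modelled as an immutable set of statements.
commit_1 = frozenset({
    ("alice", "knows", "bob"),
    ("alice", "name", "Alice"),
    ("bob", "name", "Bob"),
})
# A later write produces a new commit point and leaves the old one untouched.
commit_2 = commit_1 | {("alice", "knows", "carol"), ("carol", "name", "Carol")}

# "Names of the people alice knows", read against the older commit point.
old_names = sorted(b["?n"] for b in query(
    commit_1, [("alice", "knows", "?x"), ("?x", "name", "?n")]))
new_names = sorted(b["?n"] for b in query(
    commit_2, [("alice", "knows", "?x"), ("?x", "name", "?n")]))
```

Because a reader holds on to `commit_1`, it never sees a half-applied write, which is the property that gives readers full concurrency.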
RDF has write concurrency only slightly worse than bigtable if you choose your read-behind commit points to correspond to coherent database states. However, maintaining entailments using eager closure imposes strong write concurrency limits on an RDF database. The basic problem is that closure must be computed against a consistent database state, so concurrent closure computations are not allowed. There are several ways to handle this. One is to use query-time inference, e.g., magic sets, in which case all costs are paid when queries are answered. Another is to collect and batch writes, which works well with a map/reduce-style concurrent data loader and with sites that have high-volume updates and can accept some delay between the time the data are submitted and the time the closure of the database is updated.
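The batching option can be sketched as follows: writers only append to a buffer, and closure is computed once per batch against a consistent set of explicit statements. The `BatchingStore` class, the naive fixed-point loop, and the restriction to two RDFS rules (transitivity of `rdfs:subClassOf` and type inheritance) are simplifying assumptions for illustration, not the rule set or algorithm bigdata uses.

```python
def closure(triples):
    """Naive fixed point over two RDFS rules."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        sub = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        new = set()
        # Rule 1: subClassOf is transitive.
        for a, b in sub:
            for c, d in sub:
                if b == c:
                    new.add((a, "rdfs:subClassOf", d))
        # Rule 2: instances inherit the superclasses of their type.
        for s, p, o in inferred:
            if p == "rdf:type":
                for a, b in sub:
                    if o == a:
                        new.add((s, "rdf:type", b))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


class BatchingStore:
    """Hypothetical store: cheap submits, one closure computation per commit."""

    def __init__(self):
        self.explicit = set()         # statements asserted by writers
        self.committed = frozenset()  # last closed state visible to readers
        self.pending = []             # writes waiting for the next batch

    def submit(self, triple):
        self.pending.append(triple)   # no inference work on the write path

    def commit(self):
        self.explicit.update(self.pending)
        self.pending.clear()
        self.committed = frozenset(closure(self.explicit))
        return self.committed


store = BatchingStore()
store.submit(("Dog", "rdfs:subClassOf", "Mammal"))
store.submit(("Mammal", "rdfs:subClassOf", "Animal"))
store.submit(("rex", "rdf:type", "Dog"))
visible_before = store.committed      # readers still see the old state
visible_after = store.commit()        # closure is updated once, for the batch
```

The delay between `submit` and `commit` is exactly the lag the batching approach asks a site to tolerate.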
The most interesting aspect of an RDF database is using owl:equivalentClass, owl:equivalentProperty, and owl:sameAs to align classes and properties from your ontology (aka schema) and instances from your data (aka rows). These predicates provide a declarative mechanism to semantically align federated datasets, making RDF perfect for ad hoc reuse and repurposing of data: what I think of as RDF mashups.
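A small sketch of what that alignment buys you: two datasets with different identifiers and property names are merged by declaring a handful of equivalences. The `crm:` and `bill:` vocabularies, the `canonicalize` function, and the union-find rewriting are hypothetical; treating all three OWL predicates as plain term equivalence is a deliberate simplification of their actual semantics.

```python
ALIGNMENT = {"owl:sameAs", "owl:equivalentClass", "owl:equivalentProperty"}


def canonicalize(triples):
    """Rewrite every term to one representative per equivalence class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Pass 1: merge the terms linked by alignment statements.
    for s, p, o in triples:
        if p in ALIGNMENT:
            a, b = find(s), find(o)
            if a != b:
                # Keep the lexicographically smaller term as representative.
                parent[max(a, b)] = min(a, b)

    # Pass 2: rewrite the data statements using the representatives.
    return {
        (find(s), find(p), find(o))
        for s, p, o in triples
        if p not in ALIGNMENT
    }


crm = [("crm:c42", "crm:fullName", "Ada Lovelace")]
billing = [("bill:9", "bill:balance", "120")]
mapping = [
    ("crm:c42", "owl:sameAs", "bill:9"),                       # same customer
    ("crm:fullName", "owl:equivalentProperty", "bill:name"),   # same property
]

merged = canonicalize(crm + billing + mapping)
```

After the merge, both statements hang off a single subject, so a query written against the billing vocabulary also picks up the name that was only recorded in the CRM data.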
Labels: bigdata, federation and semantic alignment, mashups, RDF
