We presented at a Semantic Data Management workshop last week in Bulgaria [1,2]. It was a great event, bringing together semantic web researchers and industry practitioners concerned with very large scale semantic web applications, drawn from both the traditional and semantic web database communities. There were a number of great presentations and excellent discussions throughout.
A series of sessions on database architecture during the workshop produced several interesting outcomes for us. In summary, we believe that we can realize:
- Significant improvements in write rate and online query answering (order-preserving coding of data type literals directly into the statement indices rather than indirection through forward and reverse dictionaries; see the sketch after this list);
- 2-3 orders of magnitude improvement on relatively unselective aggregation-style queries (GPU-based shard-wise joins at the maximum disk transfer rate, inspired by the cache-conscious processing of MonetDB); and
- 3 orders of magnitude improvement in the potential scale of the database architecture (exabyte scale through the decomposition of the shard locator service across the data services).
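To make the first item concrete, here is a minimal sketch in plain Java of order-preserving key encoding. This is not bigdata's actual implementation, just the standard trick: numeric literals become byte arrays whose unsigned lexicographic order matches their numeric order, so they can be inlined into statement index keys without a round-trip through a dictionary.

```java
import java.util.Arrays;

public class OrderPreservingCodes {

    /** Encode a signed long: flipping the sign bit makes unsigned
     *  byte comparison agree with signed numeric order. */
    public static byte[] encode(long v) {
        long u = v ^ Long.MIN_VALUE; // flip the sign bit
        byte[] key = new byte[8];
        for (int i = 7; i >= 0; i--) {
            key[i] = (byte) u;
            u >>>= 8;
        }
        return key;
    }

    /** Encode a double using the standard IEEE 754 trick: positive
     *  values get the sign bit flipped, negative values are inverted. */
    public static byte[] encode(double d) {
        long bits = Double.doubleToLongBits(d);
        bits = (bits >= 0) ? (bits ^ Long.MIN_VALUE) : ~bits;
        byte[] key = new byte[8];
        for (int i = 7; i >= 0; i--) {
            key[i] = (byte) bits;
            bits >>>= 8;
        }
        return key;
    }

    /** Unsigned lexicographic comparison, as a B+Tree would perform it. */
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < 8; i++) {
            int x = a[i] & 0xFF, y = b[i] & 0xFF;
            if (x != y) return x - y;
        }
        return 0;
    }

    public static void main(String[] args) {
        // -3 < 7 numerically, and the encoded keys sort the same way.
        assert compare(encode(-3L), encode(7L)) < 0;
        assert compare(encode(-2.5), encode(0.5)) < 0;
        System.out.println(Arrays.toString(encode(7L)));
    }
}
```

Because the literal's value is recoverable from the key itself, writes avoid the dictionary lookups and the indices can answer range queries over values directly.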
I’ve committed an integration with the OpenRDF Sesame HTTP Server into the bigdata development branch. There is an installation task built into the top-level ant script, and instructions are up on the wiki.
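Once the server is installed, clients can talk to it through the standard OpenRDF repository API. Here is a minimal sketch, assuming a server at http://localhost:8080/openrdf-sesame and a repository ID of "bigdata" (both placeholders; see the wiki for the values your installation actually uses):

```java
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.http.HTTPRepository;

public class SesameClient {
    public static void main(String[] args) throws Exception {
        // Placeholder server URL and repository id.
        Repository repo = new HTTPRepository(
                "http://localhost:8080/openrdf-sesame", "bigdata");
        repo.initialize();
        RepositoryConnection cxn = repo.getConnection();
        try {
            TupleQueryResult result = cxn.prepareTupleQuery(
                    QueryLanguage.SPARQL,
                    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
                .evaluate();
            try {
                while (result.hasNext()) {
                    System.out.println(result.next());
                }
            } finally {
                result.close();
            }
        } finally {
            cxn.close();
        }
    }
}
```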
Sometimes you just have to ask the question “How did we get here?”.
So what about RDF? RDF rode in on the acceptance of XML and introduced a triple-based data model. Where XML provides hierarchical data modelling, RDF provides (at least conceptually) a simple list of triples. With this simplicity comes the problem: interpretation. XML naturally encapsulates data, whilst RDF permits expansion through the assertion of disconnected triples. We must follow a set of rules to interpret the triples and form coherent data structures, and those rules must be understood whenever data is created or accessed.
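A small sketch using the Sesame API makes the contrast concrete (the example URIs are made up). What XML would encapsulate as a single nested element becomes independent assertions, and nothing in the data itself says which triples belong together as one record:

```java
import org.openrdf.model.Statement;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.impl.ValueFactoryImpl;

public class DisconnectedTriples {
    public static void main(String[] args) {
        ValueFactory vf = ValueFactoryImpl.getInstance();
        URI emp = vf.createURI("http://example.com/employee/42");

        // Each statement is a free-standing assertion; they may be
        // written at different times, even by different parties.
        Statement name = vf.createStatement(emp,
                vf.createURI("http://example.com/name"),
                vf.createLiteral("Jane Doe"));
        Statement dept = vf.createStatement(emp,
                vf.createURI("http://example.com/dept"),
                vf.createURI("http://example.com/dept/engineering"));

        // Only interpretation rules (a schema, an ontology, application
        // convention) tell a consumer to reassemble these into a record.
        System.out.println(name);
        System.out.println(dept);
    }
}
```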
Many years ago someone coined the term “impedance mismatch” to describe the problem of transforming data between the format used by a programming language and that used to represent it externally, for example in a database. A figure I heard repeated was that 90% of computer code was involved with data transformation. I suspect this figure is now much higher, since we no longer have to worry only about data storage transformations but also about other representations. In a recent discussion of this with respect to RDF, the phrase “mother of all impedance mismatches” was coined. We’re going to call this MAIM!
So what does this mean for bigdata? Well, we aim to solve MAIM by providing a toolset (interfaces and metadata) that makes it easy to create domain-specific models. We will use an underlying triple representation, augmented with indices to support efficient access to domain data. The flexibility of the “schema-free” triple-based representation will enable data sharing between different models/ontologies, while the domain-specific metadata will resolve issues of interpretation in defined contexts.
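The sketch below gives the flavour of the idea. It is purely illustrative, not an actual bigdata API: domain metadata (here, hypothetical annotations binding getters to predicate URIs) turns a bag of schema-free triples into a typed domain view, with the objects simplified to strings and the triples for one subject reduced to a map.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

public class DomainModelSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @interface Predicate { String value(); }

    /** A domain-specific model over triples describing an employee. */
    interface Employee {
        @Predicate("http://example.com/name") String getName();
        @Predicate("http://example.com/dept") String getDept();
    }

    /** Project the (predicate -> object) map for one subject through
     *  a domain interface using its @Predicate metadata. */
    static <T> T materialize(final Map<String, String> triples,
                             Class<T> view) {
        return view.cast(Proxy.newProxyInstance(
            view.getClassLoader(), new Class<?>[] { view },
            new InvocationHandler() {
                public Object invoke(Object proxy, Method m, Object[] a) {
                    Predicate p = m.getAnnotation(Predicate.class);
                    return triples.get(p.value());
                }
            }));
    }

    public static void main(String[] args) {
        // Triples about http://example.com/employee/42, keyed by
        // predicate. In bigdata these would come from the indices.
        Map<String, String> triples = new HashMap<String, String>();
        triples.put("http://example.com/name", "Jane Doe");
        triples.put("http://example.com/dept", "Engineering");

        Employee e = materialize(triples, Employee.class);
        System.out.println(e.getName() + " / " + e.getDept());
    }
}
```

The point of the sketch is that the interpretation rules live in the metadata on the interface, not in ad hoc code at every point where the triples are read or written.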
Want to get paid to work on bigdata? Know someone who would? We are hiring! Here is our job listing:
Senior software engineer for parallel semantic web database query optimization.
SYSTAP, LLC is seeking a senior software engineer with experience in the Semantic Web, parallel databases, and distributed systems to help develop bigdata.