We presented at a Semantic Data Management workshop last week in Bulgaria [1,2]. This was a great event which brought together semantic web researchers and industry practitioners concerned with very large scale semantic web applications, drawn from both the traditional and semantic web database communities. There were a number of great presentations and excellent discussions throughout.
There were several interesting outcomes for us from a series of sessions on database architecture during the workshop. In summary, we believe that we can realize:
- Significant improvements in write rate and online query answering (order preserving coding of data type literals into the statement indices rather than indirection through forward and reverse dictionaries);
- 2-3 orders of magnitude improvement on relatively unselective aggregation style queries (GPU based shard-wise joins at the maximum disk transfer rate, which is inspired by the cache conscious processing of MonetDB); and
- 3 orders of magnitude improvement in the potential scale of the database architecture (exabyte scale through the decomposition of the shard locator service across the data services).
Exciting stuff!
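To make the first point a little more concrete, here is a minimal sketch of what "order preserving coding" of data type literals can look like. It is an illustration only, not the bigdata implementation: the function names are hypothetical, and the byte layouts (sign-bit flipping for 64-bit integers and for IEEE 754 doubles) are a common general-purpose technique that we assume here for the sake of the example. The idea is that the encoded bytes sort in the same order as the values, so a literal can be stored inline in a statement index key and range scans work without a dictionary lookup.

```python
import struct

SIGN_BIT = 1 << 63
ALL_BITS = (1 << 64) - 1


def encode_int64(value: int) -> bytes:
    """Encode a signed 64-bit integer so bytewise order equals numeric order."""
    if not -SIGN_BIT <= value < SIGN_BIT:
        raise ValueError("value out of int64 range")
    # Adding 2**63 is the same as flipping the sign bit of the two's
    # complement form, so negative values sort before non-negative ones.
    return struct.pack(">Q", value + SIGN_BIT)


def decode_int64(key: bytes) -> int:
    """Invert encode_int64."""
    return struct.unpack(">Q", key)[0] - SIGN_BIT


def encode_double(value: float) -> bytes:
    """Encode an IEEE 754 double so bytewise order equals numeric order.

    NaN is not handled specially in this sketch.
    """
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    if bits & SIGN_BIT:
        bits ^= ALL_BITS   # negative: flip every bit to reverse the order
    else:
        bits ^= SIGN_BIT   # non-negative: flip only the sign bit
    return struct.pack(">Q", bits)


if __name__ == "__main__":
    values = [42, -7, 0, 1_000_000, -1]
    keys = sorted(encode_int64(v) for v in values)
    # Decoding the sorted keys yields the values in numeric order.
    print([decode_int64(k) for k in keys])
```

Because the keys compare correctly as raw bytes, a B+Tree index over them can answer a range constraint on the literal directly, which is where the win over forward and reverse dictionary indirection would come from.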
[1] http://www.semdata.org/
[2] http://www.semdata.org/events/2010/sofia/
I’ve committed an integration with the OpenRDF Sesame HTTP Server into the bigdata development branch. There is an installation task built into the top-level ant script, and instructions up on the wiki.
Sometimes you just have to ask the question: “How did we get here?”
So what about RDF? RDF takes a ride on XML acceptance and introduces a triple-based data model. Where XML provides hierarchical data modelling, RDF provides a simple list of triples (at least theoretically). With this simplicity comes the problem… interpretation. XML naturally encapsulates data whilst RDF enables expansion with assertion of disconnected triples. We must follow a set of rules to interpret the triples to form coherent data structures, and these rules must be understood whenever data is created or accessed.
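A tiny sketch may help show what "interpretation" means here. The triples and the `ex:` names below are made up for illustration, and the rule used (collect every triple sharing a subject into one multi-valued record) is just one of many possible interpretation rules, not something prescribed by RDF.

```python
from collections import defaultdict

# A flat, unordered list of (subject, predicate, object) triples.
# Statements about the same resource need not be adjacent.
triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "ex:Person"),
]


def describe(triples, subject):
    """Interpretation rule: a record is every triple with the given subject.

    Each predicate maps to a list because nothing in the triple model
    says a predicate occurs only once per subject.
    """
    record = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            record[p].append(o)
    return dict(record)


if __name__ == "__main__":
    print(describe(triples, "ex:alice"))
```

An XML element would carry the name and relationships for a person inside its own tags. With triples, both the writer and the reader have to agree on a rule like `describe` before the same structure emerges.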
Many years ago someone coined the term “impedance mismatch” to describe the problem of transforming data between the format used by a programming language and that used to represent it externally, for example in a database. A figure that I heard repeated was that 90% of computer code was involved with data transformation. I suspect this figure is now much higher since we no longer have to worry only about data storage transformations but also other representations. When we discussed this recently with respect to RDF, the phrase “mother of all impedance mismatches” was coined. We’re going to call this MAIM!
So what does this mean for BigData? Well, we are aiming to solve MAIM by providing a toolset (interfaces and metadata) that enables the easy creation of domain specific models. We will use an underlying triple representation augmented with indices to support efficient access to domain data. The flexibility of the “schema-free” triple-based representation will enable data sharing between different models/ontologies, while the domain specific metadata will resolve issues of interpretation in defined contexts.
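To give a feel for the shape of this idea, here is a rough sketch under our own assumptions. None of the names below (`TripleStore`, `PERSON_MODEL`, `load`, `instances`) are part of any bigdata API; they are hypothetical and exist only to show triples plus indices underneath, with a small piece of domain metadata on top that says how to read them.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory triple store with two access paths (SPO and POS)."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
        self.pos = defaultdict(lambda: defaultdict(set))  # predicate -> object -> subjects

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)

    def objects(self, s, p):
        return sorted(self.spo.get(s, {}).get(p, ()))

    def subjects(self, p, o):
        return sorted(self.pos.get(p, {}).get(o, ()))


# Hypothetical domain metadata: which predicates make up a "Person".
PERSON_MODEL = {
    "type": "ex:Person",
    "fields": {"name": "ex:name", "knows": "ex:knows"},
}


def instances(store, model):
    """All subjects typed as the model's class, found via the POS index."""
    return store.subjects("rdf:type", model["type"])


def load(store, model, subject):
    """Build a domain object for one subject, found via the SPO index."""
    return {
        field: store.objects(subject, predicate)
        for field, predicate in model["fields"].items()
    }


if __name__ == "__main__":
    store = TripleStore()
    store.add("ex:alice", "rdf:type", "ex:Person")
    store.add("ex:alice", "ex:name", "Alice")
    store.add("ex:alice", "ex:knows", "ex:bob")
    store.add("ex:bob", "rdf:type", "ex:Person")
    store.add("ex:bob", "ex:name", "Bob")
    for person in instances(store, PERSON_MODEL):
        print(person, load(store, PERSON_MODEL, person))
```

The triples stay schema-free, so a second model with different fields could read the same store. The metadata is what pins down the interpretation for one defined context.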
Want to get paid to work on bigdata? Know someone who does? We are hiring! Here is our job listing:
Senior software engineer for parallel semantic web database query optimization.
SYSTAP, LLC is seeking a senior software engineer with experience in the Semantic Web, parallel databases, and distributed systems to help develop bigdata®, a Java-based open-source parallel Semantic Web database capable of running on clusters of 100s or 1000s of machines. SYSTAP has been recognized as a top Semantic Web startup company, and this is an exciting opportunity to work on the core development team of a cutting-edge parallel Semantic Web database product.
We are looking for immediate help with query rewrite, optimization and distributed query planning. Significant experience is desired in some or all of the following areas: the Semantic Web stack (RDF, OWL), distributed database architectures, query optimization, performance analysis, distributed query planning, view maintenance and caching, datalog, DL reasoners, parallel file systems, and GIS. Programming skills must include Java. Math and/or logic background is a plus. Many-core programming (GPU) skills a plus. Active security clearance a plus. Candidate must be self-motivated with good written and verbal communication skills and must be able to work independently and from home. Occasional travel required to customer sites and conferences. Competitive salary and benefits. Send cover letter and resume to jobs@systap.com.