Monday, July 27, 2009

Who should use bigdata?

It can be confusing to navigate the various triple store options out there. Which one is best for your application?

Let’s take a step back and look at the history of bigdata. Bigdata was not developed in a vacuum. Bryan and I were building a system for an intelligence community customer that used a triple store as the core of the data layer. This system allowed users to federate and semantically align different structured and unstructured data sets into a single fused view for better analysis. The system had a triple store knowledge base at the data layer, and then user-facing tools that allowed analysts to do things like import structured data, send unstructured documents through a harvest/extract pipeline, search for documents and entities, view graphical link charts, annotate documents, and a host of other things. The system also had an open RESTful service API, which allowed other tools to access the knowledge base to do reads, writes, and queries. The system was multi-user, so it had to handle real-time updates and deletes with concurrent queries. The knowledge base had to be fast enough to keep up with system load and scalable enough to handle lots of data. RDF was a great technology choice for the problem, but we found the RDF database implementations a bit lacking or a bit expensive, or both. And no vendor was tackling scale-out at that time. This was also around the time of Google’s publication on BigTable, and we thought, can we apply these fundamental concepts to RDF data?

Bigdata is not just for applications with multi-billion triple requirements. The single-server version of bigdata is an excellent choice for any system that needs a triple store: it’s robust, fast, and handles concurrency very well. Bigdata handles real-time updates and deletes with incremental inference and incremental truth maintenance. Concurrent writes are serialized, but in the system for which we designed bigdata, these updates and deletes of about 10-1000 triples were absorbed almost instantly. Meanwhile, bigdata’s MVCC concurrency model allows readers to operate totally independently of writers and other readers, so reads and queries never wait on a lock before they execute. And when they do execute, they go through bigdata’s high-performance join engine for lightning-fast query response times.
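To make the concurrency model above a little more concrete, here is a minimal sketch of the general MVCC idea: writers are serialized behind a lock, while each reader pins an immutable snapshot and never waits. This is a toy illustration only, not bigdata's implementation or API; the class name SnapshotStore and its methods are hypothetical, and a real store would use persistent indices rather than copying a set on every commit.

```python
import threading


class SnapshotStore:
    """Toy MVCC triple store (hypothetical, not the bigdata API)."""

    def __init__(self):
        self._write_lock = threading.Lock()  # serializes writers only
        self._committed = frozenset()        # latest committed, immutable view

    def snapshot(self):
        # Readers take no lock: they simply pin the current immutable set.
        return self._committed

    def commit(self, adds=(), removes=()):
        # One writer at a time builds the next version, then publishes it
        # by swapping a single reference.
        with self._write_lock:
            new = (self._committed - frozenset(removes)) | frozenset(adds)
            self._committed = new
            return new


store = SnapshotStore()
store.commit(adds=[("alice", "knows", "bob")])

view = store.snapshot()  # a reader pins this version

# A later write does not disturb the reader's pinned view.
store.commit(
    adds=[("bob", "knows", "carol")],
    removes=[("alice", "knows", "bob")],
)
latest = store.snapshot()
```

The reader holding view still sees the triple that was later deleted, while a fresh snapshot sees only the new state. That is the property that lets queries run without waiting on writers.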

If you are dissatisfied with the performance, robustness, or feature-completeness of your current triple store (as we were), then look no further. Bigdata was born of the same dissatisfaction, and designed and implemented specifically for real-world systems like yours.

Thursday, July 9, 2009

getting started with bigdata

We've taken a couple of steps forward to help you get started with bigdata.

We've published an early draft of a bigdata architecture whitepaper[1]. It's a work in progress as you'll be able to tell by reading it.

Also, we've started sketching out the getting started guide for scale-out on the wiki[2]. We still recommend keeping us involved in the process if you're interested in trying out bigdata on a cluster. There are a lot of dos and don'ts when it comes to configuring and writing performant code for a distributed database. What this guide is currently missing is sample code for distributed data load and query. Keep an eye out for this in the next few days.

[1] Whitepaper
[2] Wiki

Tuesday, July 7, 2009

bigdata, OWL, SWRL

Now that bigdata is handling billions of triples with ease, we are ready to venture into higher expressivity as well. There is always a tradeoff between the expressiveness of the ontology and the computational complexity of computing the entailments. So far, bigdata has focused on data scale; now we are ready to look at reasoner complexity. To do this we are exploring some integration options, including partnering with Clark & Parsia to develop an integration with the Pellet2 OWL reasoner [2].
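As a rough illustration of the cheap end of that tradeoff, the sketch below computes entailments for just two RDFS rules (rdfs9 and rdfs11, both about rdfs:subClassOf) by naive fixed-point forward chaining. It is an assumption-laden toy, not bigdata's reasoner: the function name rdfs_closure and the ex: terms are made up for the example, and the quadratic inner loop is there for clarity, not speed. More expressive languages add rules or constructs that this kind of simple closure cannot handle.

```python
SUB = "rdfs:subClassOf"
TYPE = "rdf:type"


def rdfs_closure(triples):
    """Naive fixed-point forward chaining over two RDFS rules."""
    closure = set(triples)
    while True:
        new = set()
        for (s, p, o) in closure:
            for (s2, p2, o2) in closure:
                if p2 == SUB and o == s2:
                    if p == SUB:
                        # rdfs11: subClassOf is transitive
                        new.add((s, SUB, o2))
                    elif p == TYPE:
                        # rdfs9: instances of a class belong to its superclass
                        new.add((s, TYPE, o2))
        if new <= closure:
            return closure  # nothing new was derived: fixed point reached
        closure |= new


data = {
    ("ex:Dog", SUB, "ex:Mammal"),
    ("ex:Mammal", SUB, "ex:Animal"),
    ("ex:rex", TYPE, "ex:Dog"),
}
inferred = rdfs_closure(data) - data
```

Here inferred holds the three derived triples: ex:Dog is a subclass of ex:Animal, and ex:rex is both an ex:Mammal and an ex:Animal.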

Speak out and let us know what combination of data scale and ontology complexity you need. Do you want datalog [1], OWL2 profiles (RL, QL, EL) [3], Horn-SHIQ [4]? Do you need SWRL [5], and if so, how do you want to use it? Example ontologies, data scale, and use caselets would all help.

[1] http://www.iris-reasoner.org/
[2] http://clarkparsia.com/pellet/
[3] http://www.w3.org/TR/owl2-profiles/
[4] http://logic.aifb.uni-karlsruhe.de/wiki/Horn-SHIQ
[5] http://www.w3.org/Submission/SWRL/