Friday, November 20, 2009

Bigdata, HA, and the clouds

Mike and I have been doing some heavy thinking about the High Availability (HA) architecture for bigdata, and we think that we have a really interesting new dimension to the total story. The basic idea is that we are going to decouple the index read/write service from the distributed file system. This is going to change things in a big way. Here are some for instances:

- Add or drop bigdata services based on demand, dynamically changing the throughput for data write. This is entirely in line with the notions of cloud and will allow resource sharing on private clouds or pay per use on public clouds.

- Many machines can serve from the same shard. This means that we can hit HPC levels of concurrency for query answering. We can even partition the machines so long running queries execute against a dedicated set of hosts without interfering with normal low latency query processing.

- Blindingly fast parallel closure. There are some new algorithms out there which describe simple techniques for computing the RDFS closure of each database partition independently by replicating the "ontology" onto each node. For bigdata, this works out as computing the closure for each POS index shard independently. Using just these techniques we can probably obtain a 10x improvement when computing the database closure. However, by decoupling the index read/write services from the distributed file system, we can actually put as much as an entire machine on each POS shard, computing the closure of an arbitrarily large graph in minutes.

There are a lot of ways to handle the distributed file system, ranging from refactoring our existing code to produce a custom API, to running over any of the various distributed file systems for commodity hardware, to running over a parallel file system for an HPC cluster, SAN or NAS. Our preference is a distributed file system for commodity hardware because that supports the most reuse across both bigdata and other cloud applications, but the distributed file system must support sync to disk.

Bigdata IO is already highly optimized for both write and read operations. When we open an index segment, we fully buffer the B+Tree nodes, which are conveniently arranged in a contiguous region in the file. Likewise, when we read on an index segment, we can go directly to the leaf since the nodes are in memory. The leaves are double-linked as well, and we have bloom filters in from of the B+Tree as well as record level caching.

Journal writes are append only, occur at group commit points, and are generally 1 or more megabytes in size. Still, it might make sense to use a hybrid design where we tie the journal to a data service and use a failover chain for master and secondary data service to absorb writes. Once we close out the journal, we can blast its 200MB extent directly onto the distributed file system. Likewise, when reading against a historical commit point, we could slurp the corresponding journal onto the machine paired with the shard eliminating any remote IO for small data records on the journal. Cold shards can simply be dropped. They can be reassigned to data services on an as needed basis and brought back online nearly instantly. This will allow the number of machines dedicated to bigdata to fluctuate with the demand, which is exactly in keeping with the cloud paradigm.

This really opens up the possibilities that we see for bigdata tremendously. By tossing out the idea that the CPU/RAM of at most a single machine was available to absorb writes or compute closure, we can gain tremendous leverage from pooled computational resources.

Labels: , ,

Monday, November 17, 2008

new algorithm for scale-out joins

The new join strategy (pipeline joins) is now running. Jini join performance 5x better than it used to be (with nested subquery). This puts it at 4x slower than the nested subquery joins for local triple store. Let me put that in perspective -- scale-out distributed joins running at only 1/4th the performance of optimized local joins as measured WITHOUT any additional hardware and on a resource constrained platform (my laptop). I fully expect the pipeline joins to be competetive with local join performance in a scale-out environment.

The "pipeline" join always reads from a local index partition and "joins" intermediate binding sets propagated from the prior join dimension. This means that each join dimension can run in parallel, can optimize reads by re-ordering the binding sets in order to have good locality on the index partition, eliminates access path tests for binding sets that produce the same bindings, etc. When the index is partitioned, there will be one join task per index partition touched by the join and each join task can be running on a different machine. While the original join strategy (nested subquery) is slightly better for the local triple store, but the pipeline join is the hands down winner for scale-out. It completely avoids many of the bottlenecks for jini.

In reality, the numbers may be much better since this is measured on a laptop, which is resource constrained - especially for jini, and the data size used for correctness testing is small so the potential benefits of the new join algorithm are more limited. Also, the new "pipeline" joins are fully distributed while the old join algorithm only distributed the data, but not the computation of the JOIN itself. This means that we can really leverage all the hardware. We already see linear scale-out for data load, and now we hope to see it for RDFS closure and high-level query as well.

The next step is to measure performance on some modest to large data sets and establish a baseline for comparison of scale-out join performance. If the scale-out join performance is good, and it should, then we will go back and test with dynamic index partitioning enabled.

Also, bloom filters are enabled for scale-out and now used by the joins. This is a great win for complex joins, such as LUBM Q9. The bloom filter is an in memory data structure that can very rapidly determine whether a fully bound point test is NOT an index hit. When the bloom filter reports "no", you are done and you do not touch the index. When the bloom filter reports "yes", you have to read the index to verify that there really is a hit.

Bloom filters are a stochastic data structure, require about 1 byte per index entry, and must be provisioned up front for an expected number of index entries. So if you expect 10M triples, that is a 10MB data structure. Since bloom filters do not scale-up, they are automatically disabled once the #of index entries in the mutable B+Tree exceeds about 2M tuples. BUT, they are great for scale-out since the data on the mutable B+Tree is migrated into perfect index segments, and we generate perfect fit bloom filters for those index segments. Every time we overflow a journal, we wind up with a new (empty) B+Tree to absorb writes, so the bloom filter is automatically re-enabled. Further, when index partition splits are also enabled, the #of entries in an index partition should be such that the bloom filters are always on. This should be a drammatic boost for the scale-out system over an equivalent scale-up system and is yet another way in which scale-out can leverage more resources.

Things have been quite, but that's because we've been working quitely! With scale-out join performance nailed, we will be ready for an initial release.

Labels: , , , ,

Tuesday, July 29, 2008

Schema flexible mash ups with RDF

Google’s bigtable architecture gives you scale-out schema flexible ordered data with high write concurrency. This is great, but is it what you need?

RDF provides not only schema flexibility, but the ability to dynamically federate and align data from a variety of sources (ad hoc reuse of data) together with high-level query (SPARQL). In addition, the bigdata RDF database supports datum-level provenance (woefully missing in the RDF stack). That sounds good, but it comes with a price.

There are tradeoffs when choosing a data model between write concurrency and expressivity. In order to get the benefits of RDF, you need to maintain multiple indices over the data (efficient access paths for JOINs) and you need to accept lower write concurrency in order to pre-compute aspects of the entailments (a tradeoff between costs of writers and readers).

Reads are performed against historical commit points, so you have full read-concurrency. Reads can be extremely efficient as well – an access path scan for the properties of some subject is every bit as efficient as a read on a sparse row store like bigtable or HBase. However, since SPARQL supports conjunctive query, reads can also be more interesting involving multiple JOINs on the data.

RDF has write concurrency only slightly worse than bigtable if you choose your read behind commit points to correspond to coherent database states. However, maintaining entailments using eager closure imposes strong write concurrency limits on an RDF database. The basic problem is that closure must be computed against a consistent database state, so concurrent closure computations are not allowed. There are several ways to handle this. One is to use query-time inference, e.g., magic sets, in which case all costs are paid when queries are answered. Another is to collect and batch writes, which works well with a map/reduce style concurrent data loader and sites with high volume updates that can accept some delay between the time when the data are submitted and the time when the closure of the database is updated.

The most interesting aspect of an RDF database is using owl:equivalentClass, owl:equivalentProperty, and owl:sameAs to align classes properties from your ontology (aka schema) and instances from your data (aka rows). These predicates provide a declarative mechanism to semantically align federated datasets making RDF perfect for ad-hoc reuse and repurposing of data, what I think of as RDF mashups.

Labels: , , ,

If the shoe fits

Many open source projects dealing with cloud computing are knockoffs of systems published on by Google (this also appears to be true of commercial businesses trying to compete in the cloud as service space). To my mind, this lack of innovation is problematic. The systems (GFS, map/reduce and bigtable) that Google developed are fairly specialized and are designed for specific problem classes.

However, I did not see much awareness among the potential users of these systems of the tradeoffs that had been considered and accepted by Google in their development. For example, map/reduce is suited to non-transactional, high-latency processing where there is good locality in the inputs while Google’s bigtable (a schema flexible row store) is suited to high concurrency writers (hence the limited ACID guarantee for only a single "logical" row of data), ordered key-range scans, and no secondary indices, no JOINs, and no high-level query). However, people seem to feel that these are general purpose tools that will let them scale-out without further reflection on their data processing problem.

While these systems are gaining popularity right now, I foresee some failed expectations in the next 2-3 years precisely because they are being applied without much consideration to their "fit" for a given problem. This reminds me of the time that I recoded a machine learning algorithm written in C and running on a PC to execute in Fortran on an IBM 3090. We got better performance on the PC.

It is a fair question to ask how bigdata differs. After all, bigdata is at its core a scale-out B+Tree architecture, which is the same architecture underlying Google's bigtable. One way in which bigdata is different is that features such as the distributed file system are built on top of the database architecture rather than the other way around. This provides the same ACID contract for batch updates on an index partition to the distributed file system, the flexible row store, the RDF database, etc. Another benefit of this approach is that we get ACID operations, including file append, on file blocks (chunks of up to 64M each) for "free" from the underlying database semantics.

However, the broader difference is that bigdata is designed to allow for the construction of a variety of data models, each suitable to a different problem space. For example, there is a sparse row store (akin to Google's bigtable, HBase, or hypertable), but there are also other data models, including an RDF database with high-level query and inference, an object database, and the distributed file system as an application of the basic scale-out B+Tree system rather than a substrate for it. While we have not pursued a relation data model, one could. Scale-out can be applied in all of these contexts.

There are tradeoffs in the design space. Typically, you find that you have to trade off (potential) writer concurrency against model complexity. This is because in the more complex models, a writer can have non-local effects (the effects are not restricted to a single key range index partition). The tradeoffs you accept depend on your requirements. Google’s bigtable is an example of one such tradeoff. It accepts limited isolation (a single logical row at a time) in exchange for very high read/write concurrency.

Labels: ,

Monday, July 28, 2008

Cloud Computing with bigdata (OSCON 2008)

Bryan Thompson presented on "Cloud Computing with bigdata" at OSCON 2008. The presentation gave an overview of cloud computing, the bigdata architecture and how it relates to similar systems, why RDF is an interesting technology (fluid schema and declarative schema and instance alignment make for great mashups and facilitates data reuse), some of the additional challenges that need to be addressed for scale-out on RDF (secondary indices and distributed JOINs to support high-level query), and some of the work we have done on scale-out for RDF processing, including statement level provenance and combining map/reduce processing with indexed data. See http://bigdata.sourceforge.net/pubs/bigdata-oscon-7-23-08.pdf for a copy of the slides.

bigdata(R) is a scale-out storage and computing fabric supporting optional transactions, very high concurrency, and very high aggregate IO rates. For more information, see https://sourceforge.net/projects/bigdata/, http://bigdata.sourceforge.net/docs/, and http://www.bigdata.com.

Labels: , , , ,