We have been making steady progress on the bigdata high availability architecture. Bigdata® uses a quorum model for its persistent services. Unlike a simple master election, where any service can become the master, the quorum model ensures that updates cannot be lost by refusing writes unless a quorum is present. The quorum model is built over existing distributed systems infrastructure components, including Jini and ZooKeeper.
Each highly available service has a replication factor k, which is an odd number, and a current replication count n. A quorum “meets” when (k+1)/2 services (e.g., 2 out of 3, or 3 out of 5) agree on their shared state. Each quorum has a leader and zero or more followers. Writes are only permitted for “met” quorums and must be directed to the quorum leader. The services in a quorum are arranged in a write pipeline which efficiently replicates low-level writes (cache blocks) across sockets. All writes are checksummed to detect read errors, and failover reads are supported so bad reads do not cause runtime errors. Hot spares may be automatically recruited to replace failed nodes. New services can inherit from the high availability architecture. For example, a highly available cache fabric could be derived using the non-blocking cache to buffer an RW journal used as a local persistence store.
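To make the quorum-met rule concrete, here is a minimal sketch of the (k+1)/2 test. The class and field names are illustrative only, not the actual bigdata API.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/** Minimal sketch of the quorum "met" test; names are hypothetical. */
public class QuorumState {

    /** Replication factor (always a positive odd number, e.g., 3 or 5). */
    private final int k;

    /** Services currently joined and agreeing on the shared state. */
    private final Set<UUID> joinedServices = new HashSet<>();

    public QuorumState(final int k) {
        if (k < 1 || k % 2 == 0)
            throw new IllegalArgumentException("replication factor must be a positive odd number");
        this.k = k;
    }

    public void join(final UUID serviceId) {
        joinedServices.add(serviceId);
    }

    /** A quorum meets when at least (k+1)/2 services agree, e.g., 2 of 3 or 3 of 5. */
    public boolean isMet() {
        return joinedServices.size() >= (k + 1) / 2;
    }
}
```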
On the code front, we have the low-level write replication pipeline running smoothly and are now working on the ZooKeeper integration to support quorums. In addition to the HA architecture overview, we have also published a detailed description of bigdata quorums. That document goes into detail on quorum management, synchronization for the different kinds of persistence stores, and the integration with ZooKeeper.
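For readers unfamiliar with how ZooKeeper is typically used for this kind of coordination, here is a sketch of the standard ephemeral-sequential-znode recipe for quorum membership and leader election. It is illustrative only; the znode path and class name are hypothetical, and the actual bigdata quorum protocol is described in the quorums document mentioned above.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

/** Sketch of quorum membership via ephemeral sequential znodes (illustrative only). */
public class QuorumMemberSketch {

    private final ZooKeeper zk;
    private final String quorumPath = "/bigdata/quorum"; // hypothetical path

    public QuorumMemberSketch(final ZooKeeper zk) {
        this.zk = zk;
    }

    /** Join the quorum by creating an ephemeral sequential znode for this service. */
    public String join(final byte[] serviceId) throws KeeperException, InterruptedException {
        return zk.create(quorumPath + "/member-", serviceId,
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    }

    /** The member holding the lowest sequence number acts as the quorum leader. */
    public boolean isLeader(final String myZnode) throws KeeperException, InterruptedException {
        final List<String> members = zk.getChildren(quorumPath, false);
        Collections.sort(members);
        return myZnode.endsWith(members.get(0));
    }
}
```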
We are currently implementing a shared-nothing architecture, which is much simpler and offers great performance without dependencies on parallel file systems, SAN, or NAS. However, there is a shared-disk architecture in the works that has a number of benefits, including a separation of concerns between storage management and bigdata service management and the ability to retain infinite history in the database. There is an architectural road map leading from shared-nothing to shared-disk based on the notion of shard affinity and caching resources locally. We expect that the shared-disk architecture will offer better performance when combined with local caching on Solid State Disk (SSD). Enterprise SSD is cost prohibitive today if you need to put all of your data on SSD, but the shared-disk architecture only caches hot journals locally and leaves the index segment files on the managed storage, since they are optimized for efficient remote access and ordered reads. This trick lets a little SSD go a long way.
Comments are welcome either here or on the developer list.
I use the term “cache fabric” to refer to an arrangement of nodes supporting a distributed cache, e.g., Infinispan, Hazelcast, or Coherence. Cache fabrics are essentially distributed maps, and many of them have integration points for local persistence of the hash partition on a given node, replication, and failover (high availability). A cache fabric is great when all you need to do is efficiently deliver some content. For example, a linked data or topic map engine could aggregate everything about a subject under a URI or subject proxy within the cache. However, cache fabrics do not provide mechanisms for join processing to support high-level query.
In order to have join processing directly on the cache fabric, you would set up hash-partitioned indices, with one index for each access path and the hash partitions corresponding to the hosts in the cache fabric. If you use ordered indices (B+Trees) as the local persistence store, you can then write joins which are relatively selective and useful for online query. If you use hash tables for the indices, then you have to read all the data for every join, much like typical map/reduce jobs. Needless to say, this is not well suited for online query.
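Here is a small illustration (not bigdata code) of that difference: an ordered local index supports selective prefix/range probes, while a hash-table local store forces a full scan for every join. A NavigableMap stands in for a B+Tree, and the keys are simple concatenated strings for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

/** Illustrative contrast between ordered and hashed local index stores. */
public class JoinSelectivitySketch {

    /** Ordered (B+Tree-style) index: read only the tuples in the key-prefix range. */
    static Iterable<String> probeOrdered(final NavigableMap<String, String> index,
            final String keyPrefix) {
        return index.subMap(keyPrefix, keyPrefix + "\uffff").keySet();
    }

    /** Hash-table store: no order, so every entry must be visited (map/reduce style). */
    static Iterable<String> probeHashed(final Map<String, String> index,
            final String keyPrefix) {
        final List<String> hits = new ArrayList<>();
        for (final String key : index.keySet()) {
            if (key.startsWith(keyPrefix))
                hits.add(key);
        }
        return hits;
    }
}
```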
A cache fabric generally holds fully materialized objects. However, you typically normalize the representation to support join processing, e.g., using internal 64-bit identifiers to represent string values, URIs, etc. This means that the internal representation cannot be delivered directly to the web, since you must first materialize the external representations of those internal identifiers. This is pretty much how the RDF layer for bigdata® is laid out, except that we dynamically shard the data, which makes it easy for us to change the size of the cluster dynamically. If you use an internal representation which is different from the external representation, then the cache fabric becomes a “query cache.” With a standardized query language (such as SPARQL), you can then implement a cache which knows both how to invalidate itself on update and whether it can respond to a query using data already in the cache (subsumption). There is a good discussion of this for SPARQL online.
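To make the normalization concrete, here is a minimal dictionary-encoding sketch: external terms are mapped to internal 64-bit identifiers for join processing and must be materialized back before results can be delivered. It mirrors the idea behind the bigdata RDF layer, not its actual code.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of term dictionary encoding; names are illustrative. */
public class TermDictionarySketch {

    private final Map<String, Long> term2id = new HashMap<>();
    private final Map<Long, String> id2term = new HashMap<>();
    private long nextId = 1L;

    /** Assign (or look up) the internal 64-bit identifier for an external term. */
    public synchronized long encode(final String term) {
        return term2id.computeIfAbsent(term, t -> {
            final long id = nextId++;
            id2term.put(id, t);
            return id;
        });
    }

    /** Materialize the external representation of an internal identifier. */
    public synchronized String decode(final long id) {
        return id2term.get(id);
    }
}
```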
We will be speaking at the Semantic Technology Conference (Semtech) in San Francisco in June. We’ve got a couple of sessions on the schedule. The first is on Wednesday, June 23rd, at 2:00 PM and will be about the work we are currently doing on our HA architecture. The second session will be on Friday, June 25th, at 11:00 AM and will be on Object Design Patterns with RDF. If you’re interested in learning more about bigdata, giving feedback on your user experience with bigdata, or just meeting us in person to say “hello”, please come to one or both of our sessions!