Mike and I have been doing some heavy thinking about the High Availability (HA) architecture for bigdata, and we think that we have a really interesting new dimension to the total story. The basic idea is that we are going to decouple the index read/write service from the distributed file system. This is going to change things in a big way. Here are some for instances:
- Add or drop bigdata services based on demand, dynamically changing the throughput for data write. This is entirely in line with the notions of cloud and will allow resource sharing on private clouds or pay per use on public clouds.
- Many machines can serve from the same shard. This means that we can hit HPC levels of concurrency for query answering. We can even partition the machines so long running queries execute against a dedicated set of hosts without interfering with normal low latency query processing.
- Blindingly fast parallel closure. There are some new algorithms out there which describe simple techniques for computing the RDFS closure of each database partition independently by replicating the “ontology” onto each node. For bigdata, this works out as computing the closure for each POS index shard independently. Using just these techniques we can probably obtain a 10x improvement when computing the database closure. However, by decoupling the index read/write services from the distributed file system, we can actually put as much as an entire machine on each POS shard, computing the closure of an arbitrarily large graph in minutes.
There are a lot of ways to handle the distributed file system, ranging from refactoring our existing code to produce a custom API, to running over any of the various distributed file systems for commodity hardware, to running over a parallel file system for an HPC cluster, SAN or NAS. Our preference is a distributed file system for commodity hardware because that supports the most reuse across both bigdata and other cloud applications, but the distributed file system must support sync to disk.
Bigdata IO is already highly optimized for both write and read operations. When we open an index segment, we fully buffer the B+Tree nodes, which are conveniently arranged in a contiguous region in the file. Likewise, when we read on an index segment, we can go directly to the leaf since the nodes are in memory. The leaves are double-linked as well, and we have bloom filters in from of the B+Tree as well as record level caching.
Journal writes are append only, occur at group commit points, and are generally 1 or more megabytes in size. Still, it might make sense to use a hybrid design where we tie the journal to a data service and use a failover chain for master and secondary data service to absorb writes. Once we close out the journal, we can blast its 200MB extent directly onto the distributed file system. Likewise, when reading against a historical commit point, we could slurp the corresponding journal onto the machine paired with the shard eliminating any remote IO for small data records on the journal. Cold shards can simply be dropped. They can be reassigned to data services on an as needed basis and brought back online nearly instantly. This will allow the number of machines dedicated to bigdata to fluctuate with the demand, which is exactly in keeping with the cloud paradigm.
This really opens up the possibilities that we see for bigdata tremendously. By tossing out the idea that the CPU/RAM of at most a single machine was available to absorb writes or compute closure, we can gain tremendous leverage from pooled computational resources.