People have been inquiring whether the
bigdata scale-out architecture is enterprise ready, and, if not, what it would take to get there.
Bigdata, as a scale-up platform, has been in operational use for years, but this is still an early adopter phase for the scale-out architecture. What follows is an overview of some enterprise features and their current state of
readiness.
If you need some feature that is not finished yet, consider getting involved or helping out by funding feature development. We can tackle any of these issues within a few months.
Datalog support for query time inference. Bigdata has an efficient internal rule execution model.
SPARQL queries are translated into the internal rule model and then executed using distributed joins. Some
entailments are computed at query time, but generalized query time inference requires a rewrite of the query and the
RDF(S)+ entailment rules into a minimum effort program which computes exactly those
entailments required to answer the query. This is done using a magic sets integration.
Status: Early development.
Transactional semantics. In order to have a semantic web database which is used as a transactional system at scale you must either use NO inference (just triples) or use query time inference rather than eager materialization of the triples. The issue is that eager materialization of inferences requires the total serialization of all transaction commits, which would be an unacceptable performance bottleneck regardless of the rest of the architecture. Either way, the semantics become those of the underlying database concurrency control algorithm, which in this case is
MVCC. Further, in order to avoid race conditions for the lexicon (the mapping from
RDF Values,
URIs, Literals, and blank nodes, onto internal 64-bit unique identifiers) we use an ACID, but non-transactional, consistent write strategy. This guarantees a consistent mapping of
RDF Values onto internal identifiers without limiting concurrency.
Status: Read-only transactions are done and are used to support high-level query. Full distributed read-write transaction support is mostly finished, but we are still working on the distributed commit protocol.
MVCC is fully supported in the data services and the indices.
Concurrency Control. Bigdata uses Multi-Version Concurrency Control. We timestamp all
tuples in the indices. Transactional commits identify write-write conflicts based on those timestamps. If the timestamp has been changed since the ground state from which the transaction is reading, then there is a write-write conflict. For
RDF, we can reconcile write-write conflicts. The typical situation is that two transactions both write the same triple on the database. This is normally not viewed as a conflict since the
RDF data is not typically used to establish ad-
hoc locking protocols! Further, if we are using either query-time or NO inference, then there is no value associated with the
tuples in the statement indices. All of the information is captured by the key. Write-write conflicts do arise, but only when some statements are being retracted.
Status: Done.
Locking. MVCC does not utilize locks for concurrency control. Instead, it does validation during the commit protocol as described above. We do support synchronous distributed locks through an integration with zookeeper, but that is not part of the concurrency control architecture.
Status: Done.
HA architecture. There are two alternatives here. The data service (
DS) is the container for the index partitions (key-range shards). There are logical data services and physical data services. Clients always write on the master
DS for a given logical
DS.
Alternative 1. Clients can read from any physical
DS for a given logical
DS. Storage can be on either local disk or SAN/
NAS. Local disk is acceptable for this alternative because the data are replicated across multiple machines, which provides built in media redundancy. The master (
DS) pipelines writes to a
failover chain of k secondary
DS. That pipeline is flushed during the commit protocol by the master
DS. The commit succeeds once the writes are on stable storage on the master and the secondaries or fails and is rolled back. If the master fails, then the 1st secondary is elected as the new master. We handle master election using zookeeper. Zookeeper was developed by Yahoo! as a distributed lock and configuration management service and is now an Apache
subproject (part of
Hadoop). Among other things, it gets master election protocols right.
Alternative 2. Storage is a shared volume (SAN/
NAS). The secondaries
DS are registered but inactive until the master fails, at which point the 1st secondary in the
failover chain re-opens the same persistence store from the service directory on the SAN/
NAS.
Status:
Failover has not been implemented. Alternative 2 is the easiest to realize and many organizations
perfer to manage storage separately from servers. Alternative 1 probably has the best price/performance for deployments since it can use local disk.
Backup and recovery. Bigdata uses a log structured store known as a "journal" to buffer writes. Periodically, the journal will reach its nominal capacity of ~200 MB. At that point, there is an atomic
cutover ("synchronous overflow") to a new journal and an asynchronous overflow process migrates the buffered writes onto read-optimized B+Tree files ("index segments"). Backup therefore entails copying the index segments, the current (live) journal, and the previous journal (to capture the buffered writes). The best time to do a backup is during the atomic
cutover to the new journal. At that point, write activity on the journal is suspended and it may be snap-copied. In fact, the copy operation only needs to be protected for the root blocks, which are the first page of the journal on the disk. A backup protocol could be integrated into synchronous overflow processing with very low overhead. Backup must also capture new index segments are they are generated, so there is a second integration point for that (the 1st HA strategy already requires the synchronous propagation of index segments to the
failover data services).
Offline recovery is a matter of restoring the persistent state of the services and re-starting the services. Service (re-)start is quite fast, but a total database recovery would not be a fast operation.
Lightweight recovery of data with has been overwritten by transactions that you need to rollback may be achieved using the history retention policy. When you configure a
bigdata federation, you can specify the minimum retention age for historical commit points. This can be hours, days, or weeks.
Bigdata will not release those commit points until their retention age has expired. This makes it possible to perform correcting actions which bring the database back into a desired state. It would be quite feasible to develop a feature where the database was rolled back to a historical commit point and transactions could then be selectively reapplied from a log.
Status: Not implemented yet.