bigdata®

bigdata® is a scale-out data and computing fabric designed for commodity hardware.


Packages
com.bigdata  
com.bigdata.bfs This package provides a scale-out content repository (aka distributed file system) suitable as the backend for a REST-ful service using the bigdata architecture.
com.bigdata.btree The BTree is a scalable B+-Tree with copy-on-write semantics mapping variable length unsigned byte[] keys to variable length byte[] values (null values are allowed).
com.bigdata.btree.data  
com.bigdata.btree.filter  
com.bigdata.btree.isolation  
com.bigdata.btree.keys  
com.bigdata.btree.proc  
com.bigdata.btree.raba  
com.bigdata.btree.raba.codec  
com.bigdata.btree.view  
com.bigdata.cache A canonicalizing object cache may be constructed from an outer weak reference value hash map backed by an inner hard reference LRU policy.
com.bigdata.concurrent This package supports concurrency control using exclusive locks on resources.
com.bigdata.config  
com.bigdata.counters Package supports declaration, collection, and interchange of performance counters for both hosts and processes on supported platforms.
com.bigdata.counters.httpd An httpd service for performance counters, including both live counters and post-mortem counters read from an XML interchange format.
com.bigdata.counters.linux Package provides performance counter collection on Linux platforms using the sysstat suite.
com.bigdata.counters.query  
com.bigdata.counters.render  
com.bigdata.counters.store This package provides a persistence mechanism for performance counters.
com.bigdata.counters.win Package provides collection of performance counters on Windows platforms.
com.bigdata.gis  
com.bigdata.io  
com.bigdata.io.compression  
com.bigdata.jini.lookup.entry  
com.bigdata.jini.start  
com.bigdata.jini.start.config  
com.bigdata.jini.start.process  
com.bigdata.jini.util  
com.bigdata.jmx  
com.bigdata.journal The journal is an append-only persistence capable data structure supporting atomic commit, named indices, and transactions.
com.bigdata.mdi This package provides a metadata index and range partitioned indices managed by that metadata index.
com.bigdata.net  
com.bigdata.rawstore A set of interfaces and some simple implementations for a read-write store without atomic commit or transactions.
com.bigdata.rdf.axioms  
com.bigdata.rdf.inf This package provides an eager closure inference engine for most of the RDF and RDFS entailments and can be used to realize entailments for owl:sameAs, owl:equivalentClass, and owl:equivalentProperty.
com.bigdata.rdf.lexicon  
com.bigdata.rdf.load Support for concurrent loading of RDF data across one or more clients from a variety of input sources.
com.bigdata.rdf.magic This package includes an abstraction layer for declaring containers of relations, relations, the relationships between the relations (foreign keys), and the indices for a relation, including what goes into the key and the value for the index.
com.bigdata.rdf.model This package provides a tuned implementation of the Sesame RDF data model for the RDF database.
com.bigdata.rdf.rio This package provides an integration with the openrdf RIO parser that supports fast data loads.
com.bigdata.rdf.rules  
com.bigdata.rdf.sail This package contains the SAIL that allows bigdata to be used as a backend for the Sesame 2.x platform.
com.bigdata.rdf.spo This package defines a statement model using long term identifiers rather than RDF Value objects.
com.bigdata.rdf.store This package provides several realizations of an RDF database using the bigdata architecture, including one suitable for temporary data, one suitable for local processing (single host), and one designed for scale-out on commodity hardware.
com.bigdata.rdf.vocab  
com.bigdata.relation This package includes an abstraction layer for relations.
com.bigdata.relation.accesspath This package includes an abstraction layer for efficient access paths, including chunked iterators, blocking buffers, and an abstraction corresponding to the natural order of an index.
com.bigdata.relation.locator Abstraction layer for identifying relations and relation containers in much the same manner that indices are identified (unique name and timestamp).
com.bigdata.relation.rule This package includes an abstraction layer for rules.
com.bigdata.relation.rule.eval This package supports rule evaluation.
com.bigdata.relation.rule.eval.pipeline This package implements a pipeline join.
com.bigdata.resources This package provides the logic to manage the live journal and the historical journals and index segments for a DataService.
com.bigdata.samples  
com.bigdata.search This package provides full text indexing and search.
com.bigdata.service This package provides implementations of bigdata services (metadata service, data service, transaction manager service).
com.bigdata.service.jini  
com.bigdata.service.jini.benchmark  
com.bigdata.service.jini.lookup  
com.bigdata.service.jini.master  
com.bigdata.service.jini.util  
com.bigdata.service.mapred This package contains the map/reduce service, including the master and clients.
com.bigdata.service.mapred.jini  
com.bigdata.service.mapred.jobs  
com.bigdata.service.mapred.tasks This package contains a library of map/reduce tasks.
com.bigdata.service.ndx  
com.bigdata.service.ndx.pipeline  
com.bigdata.service.proxy This package provides implementations of proxy objects for commonly used classes.
com.bigdata.sparse This package provides support for treating normal B+Trees using a "sparse row store" pattern and can be applied to both local B+Trees and scale-out indices.
com.bigdata.striterator Streaming iterator patterns based on Martyn Cutcher's striterator design but supporting generics and with extensions for closable, chunked, and ordered streaming iterators.
com.bigdata.util Utility classes.
com.bigdata.util.concurrent Utility classes related to java.util.concurrent.
com.bigdata.util.httpd  
com.bigdata.zookeeper  
org.apache.system This package contains a utility class SystemUtil that is capable of reporting some interesting information about the platform on which it is running, including the number of CPUs.
org.openrdf.rio.rdfxml  

 

bigdata® is a scale-out data and computing fabric designed for commodity hardware. The bigdata architecture provides named scale-out indices that are transparently and dynamically key-range partitioned and distributed across a cluster or grid of commodity server platforms. The scale-out indices are B+Trees and remain balanced under insert and removal operations. Keys and values for btrees are variable length byte[]s (the keys are interpreted as unsigned byte[]s). Atomic "row" operations are supported for very high concurrency. However, full transactions are also available for applications needing less concurrency but requiring atomic operations that read or write on more than one row, index partition, or index. Writes are absorbed on mutable btree instances in append only "journals" of ~200M capacity. On overflow, data in a journal is evicted onto read-optimized, immutable "index segments". The metadata service manages the index definitions, the index partitions, and the assignment of index partitions to data services. A data service encapsulates a journal and zero or more index partitions assigned to that data service, including the logic to handle overflow.
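As an illustration of key-range partitioning over unsigned byte[] keys, the following is a minimal sketch in plain Java. It is not the bigdata metadata index API: the class name PartitionLocatorSketch, its methods, and the data service names are hypothetical, and it assumes that each partition is identified by its left separator key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: finding the index partition whose key range covers a key. */
class PartitionLocatorSketch {

    /** Orders byte[] keys as unsigned bytes; on a shared prefix the shorter key sorts first. */
    static final Comparator<byte[]> UNSIGNED = new Comparator<byte[]>() {
        public int compare(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int cmp = (a[i] & 0xff) - (b[i] & 0xff);
                if (cmp != 0) return cmp;
            }
            return a.length - b.length;
        }
    };

    /** Maps the left separator key of each partition to the (assumed) data service name. */
    private final TreeMap<byte[], String> partitions = new TreeMap<byte[], String>(UNSIGNED);

    void addPartition(byte[] leftSeparator, String dataService) {
        partitions.put(leftSeparator, dataService);
    }

    /** Returns the data service for the partition with the greatest left separator <= key. */
    String locate(byte[] key) {
        Map.Entry<byte[], String> e = partitions.floorEntry(key);
        if (e == null) throw new IllegalStateException("no partition covers the key");
        return e.getValue();
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        PartitionLocatorSketch locator = new PartitionLocatorSketch();
        locator.addPartition(new byte[0], "ds1"); // the first partition starts at the empty key
        locator.addPartition(utf8("m"), "ds2");
        System.out.println(locator.locate(utf8("apple")));
        System.out.println(locator.locate(utf8("zebra")));
    }
}
```

Because the partitions are ordered by separator key, a lookup is a single floor search, and splitting a partition only requires adding a new separator.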

A deployment is made up of a transaction service, a load balancer service, and a metadata service (with failover redundancy) and many data service instances. bigdata can provide data redundancy internally (by pipelining writes bound for a partition across primary, secondary, ... data service instances for that partition) or it can be deployed over NAS/SAN. bigdata itself is 100% Java and requires JDK 1.6. There are additional dependencies for the installer (un*x) and for collecting performance counters from the OS (sysstat).

Architecture

The bigdata SOA defines three essential services and some additional services. The essential services are the metadata service (provides a locator service for index partitions on data services), the data service (provides read, write, and concurrency control for index partitions), and the transaction service (provides consistent timestamps for commits, facilitates the release of data associated with older commit points, read locks, and transactions). Full transactions are NOT required, so you can use bigdata as a scale-out row store. The load balancer service guides the dynamic redistribution of data across a bigdata federation. There are also client services, which are containers for executing distributed jobs.

While other service fabric architectures are contemplated, bigdata services today use JINI 2. to advertise themselves and do service discovery. This means that you must be running a JINI registrar in order for services to be able to register themselves or discover other services. The JINI integration is bundled and installed automatically by default.

Zookeeper handles master election, configuration management and global synchronous locks. Zookeeper was developed by Yahoo! as a distributed lock and configuration management service and is now an Apache subproject (part of Hadoop). Among other things, it gets master election protocols right. Zookeeper is bundled and installed automatically by default.

The main building blocks for the bigdata architecture are the journal (append-only persistence store), the mutable B+Tree (used to absorb writes), and the read-optimized immutable B+Tree (aka the index segment). Highly efficient bulk index builds are used to transfer data absorbed by a mutable B+Tree on a journal into index segment files. Each index segment contains data for a single partition of a scale-out index. In order to read from an index partition, a consistent view is created by dynamically fusing data for that index partition, including any recent writes on the current journal, any historical writes that are in the process of being transferred onto index segments, and any historical index segments that also contain data for that view. Periodically, index segments are merged together, at which point deleted tuples are purged from the view.
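The fused view and the compacting merge can be sketched as follows. This is an illustration in plain Java under stated assumptions, not the bigdata implementation: the class FusedViewSketch, the DELETED marker, and the use of sorted maps with String keys in place of B+Trees are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch of a fused view over the sources of one index partition. */
class FusedViewSketch {

    /** Marker stored in place of a value when a tuple has been deleted (compared by identity). */
    static final String DELETED = new String("<deleted>");

    /** Sources ordered from the most recent (live journal) to the oldest index segment. */
    private final List<TreeMap<String, String>> sources = new ArrayList<TreeMap<String, String>>();

    /** Appends a source that is older than every source added so far. */
    void addOlderSource(TreeMap<String, String> source) {
        sources.add(source);
    }

    /** The most recent source holding an entry for the key decides the answer. */
    String lookup(String key) {
        for (TreeMap<String, String> source : sources) {
            if (source.containsKey(key)) {
                String value = source.get(key);
                return value == DELETED ? null : value;
            }
        }
        return null;
    }

    /** Compacting merge: the newest entry for each key wins and deleted tuples are purged. */
    TreeMap<String, String> merge() {
        TreeMap<String, String> merged = new TreeMap<String, String>();
        // Apply the oldest source first so that newer entries overwrite older ones.
        for (int i = sources.size() - 1; i >= 0; i--) {
            merged.putAll(sources.get(i));
        }
        merged.values().removeIf(v -> v == DELETED);
        return merged;
    }
}
```

A delete is recorded as a marker rather than removed outright because an older source may still hold a value for the same key; only the merge, which sees every source, can safely drop it.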

Bigdata periodically releases data associated with older commit points, freeing up disk resources. The transaction service is configured with a minimum release age in milliseconds. This can be ZERO (0L) milliseconds, in which case historical views may be released if there are no read locks for that commit point. The minimum release age can also be hours or days if you want to keep historical states around for a while. When a data service overflows, it will consult the transaction service to determine the effective release time and release any old journals or index segments no longer required to maintain views later than that release time.

An immortal or temporal database can be realized by specifying Long#MAX_VALUE for the minimum release age. In this case, the old journals and index segments will all be retained and you can query any historical commit point of the database at any time.
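The interaction between the minimum release age and read locks can be sketched as below. This is only one plausible reading of the description above, not the transaction service's actual algorithm: the class ReleaseTimeSketch, both method names, and the convention that Long.MAX_VALUE means "no read lock is held" are assumptions made for illustration.

```java
/** Hypothetical sketch of deriving an effective release time. */
class ReleaseTimeSketch {

    /**
     * @param now              the current time in milliseconds (assumed non-negative)
     * @param minReleaseAge    the minimum release age in milliseconds, from 0L to Long.MAX_VALUE
     * @param earliestReadLock the commit time pinned by the oldest read lock,
     *                         or Long.MAX_VALUE when no read lock is held (assumed convention)
     * @return the release time; views later than this time must remain readable
     */
    static long effectiveReleaseTime(long now, long minReleaseAge, long earliestReadLock) {
        // Cannot overflow because both operands are non-negative.
        long byAge = now - minReleaseAge;
        // A pinned commit point must itself stay readable, so release stops just before it.
        long byLocks = earliestReadLock == Long.MAX_VALUE ? Long.MAX_VALUE : earliestReadLock - 1;
        return Math.max(0L, Math.min(byAge, byLocks));
    }

    /** True when a commit point is later than the release time and must be retained. */
    static boolean isRetained(long commitTime, long releaseTime) {
        return commitTime > releaseTime;
    }
}
```

With a minimum release age of Long.MAX_VALUE the release time stays at zero, so every commit point is retained, which matches the immortal database case. The real bookkeeping for which journals and index segments back a given view is more involved than this predicate.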

Sparse row store

The SparseRowStore provides a column flexible row store similar to Google's bigtable or HBase, including very high read/write concurrency and ACID operations on logical "rows". Internally, a "global row store" instance is used to maintain metadata on relations declared within a bigdata federation. You can use this instance to store your own data, or you can create your own named row store instances. However, there is no REST API for the row store at this time (trivial to be sure, but not something that we have gotten around to yet).

In fact, it is trivial to realize bigtable semantics with bigdata - all you need to do is exercise a specific protocol when forming the keys for your scale-out indices and then you simply choose to NOT use transactions. A bigtable style key-value is formed as:

  
      [columnFamily][primaryKey][columnName][timestamp] : [value]
  
  

By placing the column family identifier up front, all data in the same column family will be clustered together by the index. The next component is the "row" identifier, what you would think of as the primary key in a relational table. The column name comes next - only column names for non-null columns are written into the index. Finally, there is a timestamp column that is used either to record a timestamp specified by the application or a datum write time. The value associated with the key is simply the datum for that column in that row. The use of nul byte separators makes it possible to parse the key, which is required for various operations including index partition splits and filtering key scans based on column names or timestamps. See the KeyBuilder class in com.bigdata.btree.keys for utilities that may be used to construct keys for a variety of data types.
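The key layout can be illustrated with a small sketch in plain Java. It deliberately does not use the KeyBuilder class, and the details are assumptions: the class RowStoreKeySketch is hypothetical, the text components are encoded as UTF-8 and assumed not to contain a nul character, and the timestamp is written as 8 big-endian bytes with the sign bit flipped so that it sorts correctly as unsigned bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of forming a bigtable style key as an unsigned byte[]. */
class RowStoreKeySketch {

    /** Builds [columnFamily] nul [primaryKey] nul [columnName] nul [timestamp]. */
    static byte[] key(String columnFamily, String primaryKey, String columnName, long timestamp) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        appendField(out, columnFamily);
        appendField(out, primaryKey);
        appendField(out, columnName);
        // Flip the sign bit so that signed longs order correctly when compared as unsigned bytes.
        byte[] ts = ByteBuffer.allocate(8).putLong(timestamp ^ Long.MIN_VALUE).array();
        out.write(ts, 0, ts.length);
        return out.toByteArray();
    }

    /** Writes the UTF-8 bytes of a component followed by a nul byte separator. */
    private static void appendField(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        out.write(b, 0, b.length);
        out.write(0);
    }

    /** Compares two keys as unsigned byte[]s, which is how the index orders them. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] older = key("cf1", "row1", "name", 5L);
        byte[] newer = key("cf1", "row1", "name", 6L);
        System.out.println(compareUnsigned(older, newer) < 0);
    }
}
```

Since the column family comes first, every key for one column family forms a contiguous run in the index, and within a row and column the versions are ordered by timestamp.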

Map/reduce, Asynchronous Write API, and Query

Google's map/reduce architecture has received a lot of attention, along with its bigtable architecture. Map/reduce provides a means to transparently decompose processing across a cluster. The "map" process examines a series of key-value pairs, emitting a set of intermediate key-value pairs for each input. Those intermediate key-values are then hashed (modulo R) onto R reduce processes. The inputs for the reduce processes are pre-sorted. The reduce process then runs some arbitrary operation on the sorted data, such as computing an inverted index file or loading the data into a scale-out index.
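The hashing of intermediate keys onto R reduce processes can be sketched as follows. This is a generic illustration in plain Java, not the com.bigdata.service.mapred API: the class ReducePartitionSketch and its methods are hypothetical, and String.hashCode() stands in for whatever hash function a real job would use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of assigning intermediate key-value pairs to R reduce processes. */
class ReducePartitionSketch {

    /** Hashes the key modulo R; floorMod keeps the result in [0, r) for negative hash codes. */
    static int reducerFor(String key, int r) {
        return Math.floorMod(key.hashCode(), r);
    }

    /** Groups the pairs by reducer; each reducer's input is kept sorted by key. */
    static List<TreeMap<String, List<Integer>>> shuffle(List<Map.Entry<String, Integer>> pairs, int r) {
        List<TreeMap<String, List<Integer>>> inputs = new ArrayList<TreeMap<String, List<Integer>>>();
        for (int i = 0; i < r; i++) {
            inputs.add(new TreeMap<String, List<Integer>>());
        }
        for (Map.Entry<String, Integer> pair : pairs) {
            inputs.get(reducerFor(pair.getKey(), r))
                  .computeIfAbsent(pair.getKey(), k -> new ArrayList<Integer>())
                  .add(pair.getValue());
        }
        return inputs;
    }
}
```

All values for a given key land on the same reducer, which is what lets a reduce process compute a per-key result without coordinating with the others.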

bigdata® supports an asynchronous index write API, which delivers extremely high throughput for scattered writes. While map/reduce is tremendously effective when there is good locality in the data, it is not the right tool for processing ordered data. Instead, you execute a master job, which spawns client(s) running in client service(s) associated with the bigdata federation. Those clients process data, writing onto blocking buffers. The writes are automatically split and buffered for each key-range shard. This maximizes the chunk size for ordered writes and provides a tremendous throughput boost. Bigdata can work well in combination with map/reduce. The basic paradigm is that you use map/reduce jobs to generate data, which is then bulk loaded into a bigdata federation using bigdata jobs and the asynchronous write API.
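The splitting and buffering of scattered writes by key-range shard can be sketched as below. This shows only the split-and-buffer step, synchronously and in a single thread; it is not the asynchronous write API. The class ShardWriteBufferSketch, the shard names, the String keys, and the chunk size parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: buffering scattered writes per key-range shard. */
class ShardWriteBufferSketch {

    /** Left separator key of each shard mapped to the shard name. */
    private final TreeMap<String, String> shardBySeparator = new TreeMap<String, String>();
    /** Keys buffered for each shard and not yet written. */
    private final Map<String, List<String>> pending = new TreeMap<String, List<String>>();
    private final int chunkSize;
    /** Record of the chunks that would have been sent, for demonstration only. */
    final List<String> flushed = new ArrayList<String>();

    ShardWriteBufferSketch(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    void addShard(String leftSeparator, String shard) {
        shardBySeparator.put(leftSeparator, shard);
    }

    /** Routes the key to its shard's buffer and flushes that buffer once it holds a full chunk. */
    void write(String key) {
        Map.Entry<String, String> e = shardBySeparator.floorEntry(key);
        if (e == null) throw new IllegalStateException("no shard covers the key");
        List<String> buffer = pending.computeIfAbsent(e.getValue(), s -> new ArrayList<String>());
        buffer.add(key);
        if (buffer.size() >= chunkSize) flush(e.getValue());
    }

    /** Sorts the buffered keys so the shard receives one ordered chunk. */
    void flush(String shard) {
        List<String> buffer = pending.remove(shard);
        if (buffer == null || buffer.isEmpty()) return;
        Collections.sort(buffer);
        flushed.add(shard + ":" + buffer);
    }

    void flushAll() {
        for (String shard : new ArrayList<String>(pending.keySet())) flush(shard);
    }
}
```

Larger chunks mean that each shard sees fewer, better ordered writes, which is the effect the asynchronous write API is described as exploiting.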

bigdata® has built in support for distributed rule execution. This can be used for high-level query or for materializing derived views, including maintaining the RDFS+ closure of a semantic web database. The implementation is highly efficient and propagates binding sets to the data service for each key-range shard touched by the query, so the JOINs happen right up against the data. Unlike using map/reduce for join processing, bigdata query processing is very low latency. Distributed query execution can be substantially faster than local query execution, even for low-latency queries.

HA Architecture

There are two alternatives here. The data service (DS) is the container for the index partitions (key-range shards). There are logical data services and physical data services. Clients always write on the master DS for a given logical DS. The number of physical DS per logical DS is k. For HA, k is greater than ONE (1).

Alternative 1. Clients read from any physical DS for a given logical DS. Storage is on either local disk or SAN/NAS. Local disk is acceptable for this alternative because the data are replicated across multiple machines, which provides built-in media redundancy. For each logical DS, the master pipelines writes to a failover chain of k secondary DS. That pipeline is flushed during the commit protocol by the master. The commit succeeds once the writes are on stable storage on the master and the secondaries; otherwise it fails and is rolled back. If the master fails, then the 1st secondary is elected as the new master. Master election is handled by Zookeeper.

Alternative 2. Storage is a shared volume (SAN/NAS). The secondary DS are registered but inactive until the master fails, at which point the 1st secondary in the failover chain re-opens the same persistence store from the service directory on the SAN/NAS.
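The all-or-nothing commit rule of Alternative 1 can be sketched in plain Java. This is a single-process simulation under stated assumptions, not the data service protocol: the classes PipelineCommitSketch and Replica are hypothetical, a boolean flag stands in for whether a replica can reach stable storage, and failover is reduced to promoting the next entry in the chain, whereas the text assigns master election to Zookeeper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of committing writes along a failover chain. */
class PipelineCommitSketch {

    /** Stand-in for one physical data service. */
    static class Replica {
        final String name;
        final boolean healthy;
        final List<String> stable = new ArrayList<String>();
        private final List<String> unflushed = new ArrayList<String>();

        Replica(String name, boolean healthy) {
            this.name = name;
            this.healthy = healthy;
        }

        void buffer(String write) { unflushed.add(write); }

        /** Reports whether the buffered writes could be made stable. */
        boolean flush() { return healthy; }

        void commit() { stable.addAll(unflushed); unflushed.clear(); }

        void rollback() { unflushed.clear(); }
    }

    /** chain.get(0) is the master; the commit succeeds only if every replica flushes. */
    static boolean commit(List<Replica> chain, List<String> writes) {
        for (Replica r : chain) {
            for (String w : writes) r.buffer(w);
        }
        boolean ok = true;
        for (Replica r : chain) ok &= r.flush();
        for (Replica r : chain) {
            if (ok) r.commit(); else r.rollback();
        }
        return ok;
    }

    /** Drops the failed master and returns the 1st secondary as the new master. */
    static Replica failover(List<Replica> chain) {
        chain.remove(0);
        return chain.get(0);
    }
}
```

Because a commit is acknowledged only after every replica has the writes, the 1st secondary can take over without losing committed data.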

Standalone Journals and Embedded Federations

While bigdata® is targeted at scale-out federations, it can also be deployed as a simple persistence store using just the com.bigdata.journal.Journal API or as an embedded federation.

The journal can scale up to 4TB, but it is a WORM (Write Once Read Many) store, also known as an immortal database or a log-structured store. This works great for some applications, and it is used for the data service write buffers. We will probably do a read-write (RW) version of the journal at some point which is capable of reclaiming storage, but that is not necessary for the scale-out architecture, which uses index segment builds and compacting merges to clear the old write buffers.
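The write-once behavior can be illustrated with a small in-memory sketch. It is not the com.bigdata.journal.Journal API: the class WormStoreSketch is hypothetical, and the address format (byte offset in the high 32 bits, record length in the low 32 bits) is an assumption chosen for the example.

```java
import java.util.Arrays;

/** Hypothetical sketch of an append-only (WORM) store. */
class WormStoreSketch {

    private byte[] buf = new byte[16];
    private int nextOffset = 0;

    /** Appends a record and returns its address; existing bytes are never overwritten. */
    long write(byte[] data) {
        if (nextOffset + data.length > buf.length) {
            buf = Arrays.copyOf(buf, Math.max(buf.length * 2, nextOffset + data.length));
        }
        System.arraycopy(data, 0, buf, nextOffset, data.length);
        long addr = ((long) nextOffset << 32) | data.length;
        nextOffset += data.length;
        return addr;
    }

    /** Reads back the record stored at the address. */
    byte[] read(long addr) {
        int offset = (int) (addr >>> 32);
        int length = (int) addr;
        return Arrays.copyOfRange(buf, offset, offset + length);
    }

    /** Number of bytes written so far; this only ever grows. */
    int size() {
        return nextOffset;
    }
}
```

An update is expressed as a new write that returns a new address, so the old record stays readable and the store grows until the data are moved elsewhere.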

The embedded federation can be used to test things out on a single machine. It will start instances of each of the services, and also start N data services. We use this to write unit tests for the federation, and it can make testing your application easier as well.

Status

bigdata® has been tested on clusters of up to 15 nodes. We have loaded data sets of 10B+ rows, at rates of over 300,000 rows per second. Overall, 100s of billions of rows have been put down safe on disk. People should be able to reach petabyte scale today, and exabyte scale once we introduce a partitioned metadata service (data service locators). In theory, the architecture can scale to ~ 400 exabytes per scale-out index.

The following features are not finished yet:

Getting Started

See the wiki for Getting Started and our blog for what's new. The javadoc is online and you can also build it with the ant script. If you have a question, please post it on the blog or the forum.

Getting Involved

bigdata® is an open source project. Contributors and contributions are welcome. Like most open source projects, contributions must be submitted under a contributor agreement, which must be signed by someone with the appropriate authority. This is necessary to ensure that the code base remains open.

If you want to help out, please check out what is going on at our blog and on the main project site. Post your questions and we will help you figure out where you can contribute or how to create that new feature that you need.

Licenses and Services

bigdata® is distributed under GPL(v2). SYSTAP, LLC offers commercial licenses for customers who want the value add (warranty, technical support, additional regression testing), who want to redistribute bigdata with their own commercial products, or who are not "comfortable" with the GPL license. For inquiries or further information, please write to licenses@bigdata.com.

Please let us know if you need specific feature development or help in applying bigdata® to your problem. We are especially interested in working directly with people who are trying to handle massive data, especially for the semantic web. Please contact us directly.

Related links

CouchDB
http://couchdb.org/CouchDB/CouchDBWeb.nsf/Home?OpenForm
bigtable
http://labs.google.com/papers/bigtable.html, http://www.techcrunch.com/2008/04/04/source-google-to-launch-bigtable-as-web-service/
map/reduce
http://labs.google.com/papers/mapreduce.html
Hadoop
http://lucene.apache.org/hadoop/
Zookeeper
http://hadoop.apache.org/zookeeper/
Jini/River
http://www.jini.org/wiki/Main_Page, http://incubator.apache.org/river/RIVER/index.html
Pig
http://research.yahoo.com/node/90
Sawzall
http://labs.google.com/papers/sawzall.html
Boxwood
http://research.microsoft.com/research/sv/Boxwood/
Blue Cloud
http://www.techcrunch.com/2007/11/15/ibms-blue-cloud-is-web-computng-by-another-name/
SimpleDB
http://www.techcrunch.com/2007/12/14/amazon-takes-on-oracle-and-ibm-with-simple-db-beta/
mg4j
http://mg4j.dsi.unimi.it/



Copyright © 2006-2009 SYSTAP, LLC. All Rights Reserved.