Online query workloads (see my previous post) can be quickly answered using an index. Such queries touch relatively little data on the disk. Using an index, it is possible to rapidly read only the data on the disk necessary to answer the query. In contrast, analytic queries touch large volumes of the disk. If you attempt to answer an analytic query using a B+Tree, the disk utilization will spike to nearly 100%. However, this is not good news. The disk is being saturated with seeks, which are relatively slow operations, and the data transfer rate will not approach the capabilities of the device. SSD offers one path around this bottleneck, by reducing the latency of a disk seek operation, but SSD is much more expensive than traditional disk — especially at the scale of a data warehouse. Instead, analytic queries are answered by sustained IO operations which read more data than you really need, but the data is transferred at or near the bandwidth capabilities of the device. Rather than selecting just the relevant B+Tree leaves using an index, irrelevant data are filtered out once they reach main memory.
Column stores (MonetDB, C-Store) offer a different approach to analytic queries – one designed to maximize the disk transfer rate. Rather than using a B+Tree index and suffering disk access saturation, the physical organization of the data on the disk is a projection of a single column from the original table. By design, these projections are narrow (often a single integer or floating point field) and dense (the index of the tuple in the projection is its insert order into the corresponding table). With this layout on the disk, a column store can rapidly read all tuples for some column in order to select some interesting subset or aggregate some attribute. Once the data transfer rate approaches the maximum transfer rate for the disk, the new bottleneck becomes the CPU cache (for example, see Database Architecture Optimized for the new Bottleneck: Memory Access). As a result, column stores tend to use cache aware algorithms, such as the radix sort. Cache aware algorithms are designed to minimize the likelihood of a CPU cache miss, and thereby maximize the ability of the CPU to process the data coming off the disk. While column stores excel at analytic workloads, they are not well adapted for either online query (selective queries are much easier to answer with an index) or online updates (this poses challenges for the maintenance of the column projections). Ongoing research seeks to address these concerns.
The history of database architectures evolved from the initial debates between advocates for the relational (Cobbs) and network (CODASYL) database models. While there were many arguments concerning the merits of these approaches, including whether they support declarative query languages and decoupling of the logical schema from the physical schema, the issue was settled more by market forces than the merits of these arguments and the relational model became the dominant paradigm. Since that time, object-relational and XML databases have appeared and have been incorporated by various database vendors into their products.
During the last decade, XML databases, column stores, map/reduce, distributed file systems, cloud computing patterns, distributed cache fabrics, key-value stores, commodity hardware, GPS, GPUs, SSD, the semantic web, linked data, social networking, and a slew of other hardware, software and social developments have emerged and created a game changing landscape. The database is in the process of being deconstructed as a technology. This is a tremendous opportunity.
RDF (or graph) databases are relatively new and, like XML databases, pose new technical challenges. YARS modeled RDF using covering indices for the different access patterns. For triples, this is SPO, OSP, and POS. For quads, there are six such indices. This design works well for online query workloads. But, as I pointed out above, analytic workloads will saturate the disk access time when using a B+Tree index. However, there are different ways to get around this problem. YARS2 uses a combination of bulk data load, bulk sort, and ISAM files to allow fast sequential access to the data on the disk. MonetDB is examining a variety of solutions based on column projections. Map/reduce typically uses unindexed data and just brute forces its way, reading all data and pushing the bits selected by the query onto the reduce nodes. I outline the approach bigdata uses below.
Before jumping into the bigdata approach to unselective queries, let me point out that while RDF “tuples” are “narrow”, the physical schema for RDF data with covering indices still interleaves all tuples for the same logical row. Consider the SPO index: all predicate-object combinations are clustered together. This is not what a column projection does. One way to model RDF data in a column store is to use a physical schema with one column per predicate. Running down a column projection of a predicate would provide rapid access to all distinct object bindings for that predicate. However, this physical design can lead to significant problems for column store query optimizers. For example, see ColumnStore Support for RDF Data Management: not all swans are white. However, note that column store publications tend to study unselective queries and the existing studies do not compare the strengths and weaknesses of “native” triple stores when compared to RDBS and column stores for both unselective and selective queries.
Before we look at how bigdata handles unselective queries, let me provide some background on how bigdata manages its indices. When running on a single Journal, the backing persistence store for bigdata can be huge — terabytes of data can be stored in a single file. However, the scale-out architecture uses the Journal as an ACID write buffer, storing only the last 200MB of writes accepted by a given node. Once the current journal files up, a fast synchronous overflow operation atomically closes out that journal for writes and opens a new journal. Asynchronous overflow processing then transfers the data from the old journal onto index segment files using batch B+Tree builds. Each shard consists of the recently buffered writes on the life journal and zero or more index segment files from batch builds on old journals. When the shard view gets complex, a compacting merge will replace the view with a single index segment file on the disk. When that file gets to be 200MB on the disk, the shard will be split. In the scale-out architecture, we never have files larger than 200MB.
Index segment files are laid out optimally for either local or remote IO. The B+Tree nodes are in one region in total key order and are typically read into memory with a single IO when the index segment file is opened. Likewise, the B+Tree leaves are in another region of the file in total key order and are arranged into a double-linked list for sequential traversal. However, sequential traversal of the leaves using the double-linked list is significantly slower than a sustained IO designed to read the leaves directly into memory.
Bigdata handles unselective queries by reading the leaves for each shard view using a single IO per index segment in that shard view. That will range from 1 to 4 IOs per shard view, depending on how long it has been since the last time a compacting merge was done for that view, but no more than 200MB of data in any case for a shard. All told, this is just a few seconds of IO. An ordered B+Tree read over the same shard is at least an order of magnitude slower, and it is at least another order of magnitude if you are doing random B+Tree reads.
If the query is selective, then we navigate down the B+Tree and read just the leaves that we need. If the query is unselective, then we maximize the disk transfer rate. By using sustained IOs, this approach avoids the disk access time bottleneck and let’s us step through the data at the disk transfer rate. However, since the data access pattern is now radically different, we need to use different join operators as well. For example, many column stores use operator at a time rather than vectored pipelined joins. A few, such as VectorWise, use vectored joins. Look for some new join operators in bigdata over the next few months as we move to support analytic query workloads for RDF along with support for SPARQL 1.1, which includes aggregation operators.