Queries in an online workload touch a relatively small region of the data on the disk. Such queries are said to be selective. Selective queries can be answered very quickly if an appropriate index is available. The access time of a B+Tree grows with its depth, which in turn grows logarithmically with the size of the data. Each new layer of the B+Tree adds another disk seek, so disk access latency bounds the performance of the index. A cache avoids some portion of those disk hits, retaining recently used records (LRU) or frequently used records (LIRS) depending on the cache policy. Bigdata now has a non-blocking LRU cache implementation, and a non-blocking LIRS cache is in the works. A non-blocking cache may be used to buffer the B+Tree nodes and leaves on the disk, to cache frequently used RDF Values, to cache query results, etc.
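To make the record-caching idea concrete, here is a minimal LRU cache sketch keyed by record address. It uses `java.util.LinkedHashMap` in access order; it is an illustration only, not the non-blocking implementation described above, which would need lock-free data structures:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * A minimal LRU cache sketch for B+Tree node/leaf records, assuming records
 * are keyed by their address on disk. Illustration only: a production
 * non-blocking cache would not rely on a synchronized or unsynchronized
 * LinkedHashMap.
 */
public class LRURecordCache<K, V> extends LinkedHashMap<K, V> {

    private final int capacity;

    public LRURecordCache(int capacity) {
        // accessOrder = true makes iteration order least-recently-used first.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used record once capacity is exceeded.
        return size() > capacity;
    }
}
```

A LIRS policy would replace the eviction rule above with one based on inter-reference recency, which is what makes it robust against index scans.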
Once an online query workload becomes sufficiently varied that the read set of the database exceeds the cache, index reads begin to read through to the disk. At that point, CPU utilization falls off dramatically as cores sit idle waiting for records to be read off the disk. Once this occurs, additional CPU activities, such as block compression, block encryption, etc., become basically “free” — their cost disappears into the disk latency of a concurrent online query mix.
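As an example of such "free" CPU work, here is a sketch of compressing a record block before it goes to disk, using the standard `java.util.zip` classes. Whether bigdata uses these particular classes is an assumption for illustration:

```java
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: compressing a record block before it is written to the disk.
// The CPU cost of this step is hidden inside the disk latency once the
// workload is IO bound.
public class BlockCompression {

    public static byte[] compress(byte[] block) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(block);
        deflater.finish();
        // Worst case: incompressible data plus a small deflate header.
        byte[] buf = new byte[block.length + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    public static byte[] uncompress(byte[] compressed, int originalLength)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLength];
        inflater.inflate(out);
        inflater.end();
        return out;
    }
}
```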
NIO made it possible to write Java applications that handle 10,000 concurrent network connections using asynchronous network IO patterns. NIO2 will do for the disk what NIO did for the network. Using asynchronous IO patterns (based on either Futures or callbacks), a single thread will be able to service the disk, dramatically reducing the resource demand of heavy online query mixes. NIO2 is coming with Java 7 and should offer tremendous benefits for the RWStore, which uses scatter/gather IO, and for the vectored pipeline join algorithm.
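Here is a minimal sketch of the Future-based pattern using `java.nio.channels.AsynchronousFileChannel` from NIO2. The file handling is illustrative only; the point is that the caller issues the read without blocking and redeems the `Future` only when the data is actually needed:

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

// Sketch of NIO2's Future-based asynchronous disk read (Java 7).
public class AsyncReadDemo {

    /** Read the whole file through the asynchronous channel. */
    public static byte[] readFully(Path path) throws Exception {
        try (AsynchronousFileChannel ch =
                 AsynchronousFileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            // The read is issued without blocking the caller.
            Future<Integer> pending = ch.read(buf, 0L /* file position */);
            int n = pending.get(); // block only when the result is needed
            byte[] data = new byte[n];
            buf.flip();
            buf.get(data);
            return data;
        }
    }

    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("nio2-demo", ".dat");
        Files.write(path, "hello, NIO2".getBytes("US-ASCII"));
        System.out.println(new String(readFully(path), "US-ASCII"));
        Files.deleteIfExists(path);
    }
}
```

With the callback (`CompletionHandler`) variant of the same API, the servicing thread never blocks at all, which is what lets one thread drive the disk for many concurrent queries.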
When managing a large heap (4G+), the choice of the garbage collection policy becomes critical. Both the parallel old generation and the experimental G1 garbage collectors work well for bigdata with large heaps (G1 has better throughput, but is still experimental and crashes periodically, so it cannot be reliably deployed yet). However, if you choose the wrong garbage collector, garbage collection can wind up consuming 80% or more of your CPU time! Right now, the safest choice is the parallel old generation garbage collector, which is enabled using
-XX:+UseParallelOldGC on the Java command line.
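For concreteness, a launch line might look like the following; the heap sizes and jar name are illustrative only:

```sh
# Illustrative JVM launch: a fixed large heap plus the parallel old
# generation collector. Heap sizes and jar name are examples only.
java -server -Xms8g -Xmx8g \
     -XX:+UseParallelOldGC \
     -jar bigdata.jar
```

Pinning `-Xms` to `-Xmx` avoids heap-resizing pauses on top of the collection pauses themselves.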
As main memory heaps expand, this interaction between garbage collectors and application memory access patterns has significant implications for the ability of a Java application to utilize very large heaps without running into significant GC pauses. While the G1 collector promises to address this issue for many applications, another alternative is to explicitly manage the cache on the native heap, using the same record management algorithms we use in the RWStore to manage allocation slots on the disk. However, accessing data on the native heap in a direct ByteBuffer is slower than accessing the data in a Java byte[], so the only reason to pursue this approach for the B+Tree node and leaf cache is to deploy a smarter cache, such as LIRS, which is insensitive to index scans and other access patterns that cause problems for an LRU algorithm. To derive the most benefit from the LIRS cache, we then have to turn off the file system cache for that disk or file. Doing this is, of course, tremendously platform dependent.
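The native-heap idea can be sketched as follows. A direct `ByteBuffer` holds the cached record outside the reach of the garbage collector; the slot size and record contents are illustrative, but the extra copy on access is exactly why this path is slower than an on-heap `byte[]`:

```java
import java.nio.ByteBuffer;

// Sketch: caching a record in native (off-JVM-heap) memory via a direct
// ByteBuffer so the cached data is invisible to the garbage collector.
// The 1KB slot size is illustrative; the RWStore's allocator manages
// fixed-size allocation slots with the same style of bookkeeping.
public class NativeHeapSlot {

    private final ByteBuffer slot;

    public NativeHeapSlot(int slotSize) {
        // Allocated on the native heap, outside the reach of the GC.
        this.slot = ByteBuffer.allocateDirect(slotSize);
    }

    /** Copy a serialized record into the native-heap slot. */
    public void write(byte[] record) {
        slot.clear();
        slot.put(record);
        slot.flip();
    }

    /**
     * Copy the record back into a Java byte[]. This extra copy is part of
     * why direct buffers are slower to access than on-heap byte[] data.
     */
    public byte[] read() {
        byte[] copy = new byte[slot.remaining()];
        slot.duplicate().get(copy);
        return copy;
    }

    public static void main(String[] args) {
        NativeHeapSlot s = new NativeHeapSlot(1024);
        s.write("serialized B+Tree leaf".getBytes());
        System.out.println(new String(s.read()));
    }
}
```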
SSD reduces the disk access time by as much as an order of magnitude when compared to traditional disk. When running on SSD with an online query workload, performance drops off much more slowly as the B+Tree begins to read through the cache to the disk. This translates directly into increased database performance and higher query throughput. Enterprise grade SSD is still relatively expensive, but it is cheaper than RAM and available in capacities of 1TB or more. (These arguments apply to the bigdata Journal — SSD has a different role in the bigdata federation, which I will cover in another article.)
In contrast to online query workloads, analytic query workloads are relatively unselective — that is, they have very large working sets on the disk. For such unselective queries, we are often better off reading all the data off the disk using sustained disk transfers that maximize the disk transfer rate rather than saturating the disk with random access reads. For my next subject, I will cover analytic query workloads and show how the bigdata federation handles both online query and analytic workloads in a single architecture.