Starting with our 1.1 release, bigdata includes an optional “analytic query mode”. Enabling analytic query turns on support for the MemoryManager and the HTree and allows bigdata to scale to 4TB of data on the native process heap with zero GC overhead. In the future it will also turn on the runtime query optimizer (RTO). You can get a preview of this release by checking out the TERMS_REFACTOR_BRANCH from SVN.
Someone recently wrote about a query that was doing a SELECT DISTINCT over a large number of solutions. They thought there might be a memory leak, since the JVM eventually crashed with an OutOfMemoryError. However, this is not really a memory leak. Here is what is going on:
The problem is the Java architecture for managed memory. You can read about this, and about how we fix it, here. What you need to do for this query (and others like it) is turn on the “analytic” mode for bigdata. The easiest way to do this is to check the analytic option on the NanoSparqlServer’s SPARQL query form page. If you are using the NanoSparqlServer, you can also specify the URL query parameter ...&analytic=true. Finally, you can enable this with a magic triple directly in the SPARQL query:
hint:Query hint:analytic "true" .
Just put that triple somewhere in the WHERE clause and the query will run with the “analytic” options enabled. You do not need to declare the “hint:” prefix, but if you do, the namespace should be “http://www.bigdata.com/queryHints#”.
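For example, a DISTINCT query with the hint in place might look like this (the triple pattern itself is hypothetical; only the hint:Query triple is prescribed above):

```sparql
PREFIX hint: <http://www.bigdata.com/queryHints#>

SELECT DISTINCT ?s
WHERE {
  ?s ?p ?o .
  hint:Query hint:analytic "true" .
}
```

The hint triple is consumed by the query engine and does not match against your data.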
What the analytic query mode will do for you is buffer the data on the native process heap rather than on the JVM object heap. This will reduce the GC overhead associated with the query to basically zero. It performs this feat entirely within Java by leveraging the java.nio package.
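The kind of off-heap buffering java.nio makes possible can be illustrated with a direct ByteBuffer, which is allocated on the native process heap rather than the JVM object heap. This is only a minimal sketch of the idea, not bigdata’s MemoryManager itself:

```java
import java.nio.ByteBuffer;

public class DirectBufferSketch {
    public static void main(String[] args) {
        // A direct buffer lives on the native process heap, outside the
        // JVM object heap, so the garbage collector never scans its contents.
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024); // 1 MB

        buf.putLong(42L);       // write at position 0
        buf.flip();             // switch from writing to reading
        long v = buf.getLong(); // read the value back

        System.out.println(buf.isDirect()); // true
        System.out.println(v);              // 42
    }
}
```

Because the data sits outside the managed heap, holding gigabytes in such buffers adds nothing to GC pause times.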
There are analytic and non-analytic versions of all the join, distinct, etc. operators. The analytic versions use the MemoryManager and the HTree. The non-analytic versions use the Java collection classes, which are somewhat faster as long as you are not materializing a lot of data on the Java object heap. For example, for the BSBM “explore” use case the Java operators are about 10% faster overall. DISTINCT is a special case. The Java version of that operator uses a ConcurrentHashMap under the covers and can give you much higher concurrency in the query. But if you are trying to DISTINCT a large number of solutions, you are going to run into trouble with the garbage collector.
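The JVM-heap DISTINCT described above can be sketched with a concurrent set backed by a ConcurrentHashMap. This standalone sketch is hypothetical, not bigdata’s actual operator, but it shows why the approach is fast yet heap-bound:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DistinctSketch {
    public static void main(String[] args) {
        // A set view over a ConcurrentHashMap: many threads can offer
        // solutions concurrently, but every distinct solution is retained
        // on the JVM object heap until the query completes.
        Set<String> distinct = ConcurrentHashMap.newKeySet();

        String[] solutions = {"s1", "s2", "s1", "s3", "s2"};
        for (String s : solutions) {
            distinct.add(s); // add() returns false for duplicates
        }

        System.out.println(distinct.size()); // 3 distinct solutions
    }
}
```

The concurrency is excellent, but each retained solution is a live object the garbage collector must track, which is exactly what hurts at large scale.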
Bottom line: if you are going to be materializing a LOT of data then you need to use the analytic operators. Those operators will scale to 4TB of RAM. If you try to materialize a lot of data on the Java object heap, you will run into heavy GC overhead and the application will slow down to a crawl and then die with “java.lang.OutOfMemoryError: GC overhead limit exceeded”.