I am going to lead up to how we get that outrageous score, but first let me provide some background about the most recent official BSBM run and then I will go into how bigdata performed in the official runs and why we are able to report a much higher result here.
We were invited to participate in the recently concluded BSBM V3.0 benchmark. BSBM V3.0 includes a new data set (statistically similar, but different data) and new use cases for the SPARQL 1.1 update and aggregation extensions. Bigdata does not yet support the SPARQL 1.1 features, so we were not able to participate in those aspects of the benchmark. The results of the benchmark are now online and the report to the lod2 project is here. Many thanks to Chris, Andreas, and the lod2 project for sponsoring this benchmark trial.
The official benchmark reports reasonable results for bigdata. However, those results are not the entire picture and substantially under represent how fast bigdata query performance really is. Let me explain.
First, we are a bit slow on the data load. This is primarily due to the presence of large literals (kilobytes each) in the BSBM data which wind up clogging the B+Tree leaves in which they are stored. To address this, we are changing how we store RDF literals in order to reduce the IO churn. This is something that we will be fixing over the next few weeks, so stay tuned for an update on that.
In previous versions of BSBM, the benchmark included a reduced query mix. This query mix excluded two queries, Q5 and Q6, which were problematic for the triples stores under test. In BSBM V3.0, Q6 has been dropped completely from the query mix (it requires an free text index which supports regular expressions to do well on that query). However, this time around results were not published for a “reduced query mix” — that is, without Q5.
Q5 makes a tremendous difference in query performance for bigdata. We are continuing to work on query optimization for Q5, which requires reasoning about the join order in the presence of SPARQL filters and attaching constraints at the earliest possible point at which they can run. With Q5, we scored 4286 QMpH for 8 clients against the 100M data set (this is the value in the published report). However, without Q5, bigdata delivers a whalloping 36608 under the same conditions. That is on par with the best result reported in the benchmark, which was 36269. The detailed results are below.
This is not a perfect apples to apples comparison since (a) the results below were obtained on a different machine (less RAM, but faster disks); and (b) the “reduced query mix” results were not reported for the other databases so we can not say how their performance would change without Q5. However, it does show the high performance you should expect from bigdata in our next release. In the meanwhile, we are going to keep working on query optimization for Q5. We should have it licked soon.
Scale factor: 284826
Number of warmup runs: 50
Number of clients: 8
Seed: 9175932
Number of query mix runs (without warmups): 500 times min/max Querymix runtime: 0.5236s / 1.2590s
Total runtime (sum): 384.261 seconds
Total actual runtime: 49.168 seconds
QMpH: 36608.83 query mixes per hour
CQET: 0.76852 seconds average runtime of query mix
CQET (geom.): 0.75804 seconds geometric mean runtime of query mix
Metrics for Query: 1
Count: 500 times executed in whole run
AQET: 0.023844 seconds (arithmetic mean)
AQET(geom.): 0.021225 seconds (geometric mean)
QPS: 327.77 Queries per second
minQET/maxQET: 0.00430900s / 0.19842800s
Average result count: 7.88
min/max result count: 0 / 10
Number of timeouts: 0
Metrics for Query: 2
Count: 3000 times executed in whole run
AQET: 0.027569 seconds (arithmetic mean)
AQET(geom.): 0.024954 seconds (geometric mean)
QPS: 283.48 Queries per second
minQET/maxQET: 0.00580500s / 0.22326000s
Average result count: 19.36
min/max result count: 7 / 37
Number of timeouts: 0
Metrics for Query: 3
Count: 500 times executed in whole run
AQET: 0.109546 seconds (arithmetic mean)
AQET(geom.): 0.085314 seconds (geometric mean)
QPS: 71.34 Queries per second
minQET/maxQET: 0.00909600s / 0.50606200s
Average result count: 5.57
min/max result count: 0 / 10
Number of timeouts: 0
Metrics for Query: 4
Count: 500 times executed in whole run
AQET: 0.035182 seconds (arithmetic mean)
AQET(geom.): 0.032202 seconds (geometric mean)
QPS: 222.14 Queries per second
minQET/maxQET: 0.00587100s / 0.22067800s
Average result count: 7.56
min/max result count: 0 / 10
Number of timeouts: 0
Metrics for Query: 5
Count: 0 times executed in whole run
AQET: 0.000000 seconds (arithmetic mean)
AQET(geom.): NaN seconds (geometric mean)
QPS: Infinity Queries per second
minQET/maxQET: 179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.00000000s / 0.00000000s
Average result count: 0.00
min/max result count: 2147483647 / -2147483648
Number of timeouts: 0
Metrics for Query: 7
Count: 2000 times executed in whole run
AQET: 0.039152 seconds (arithmetic mean)
AQET(geom.): 0.036275 seconds (geometric mean)
QPS: 199.61 Queries per second
minQET/maxQET: 0.00948900s / 0.22935300s
Average result count: 12.31
min/max result count: 1 / 100
Number of timeouts: 0
Metrics for Query: 8
Count: 1000 times executed in whole run
AQET: 0.028075 seconds (arithmetic mean)
AQET(geom.): 0.025042 seconds (geometric mean)
QPS: 278.37 Queries per second
minQET/maxQET: 0.00416200s / 0.21368300s
Average result count: 4.80
min/max result count: 0 / 19
Number of timeouts: 0
Metrics for Query: 9
Count: 2000 times executed in whole run
AQET: 0.028693 seconds (arithmetic mean)
AQET(geom.): 0.025425 seconds (geometric mean)
QPS: 272.37 Queries per second
minQET/maxQET: 0.00225400s / 0.22867700s
Average result (Bytes): 6762.83
min/max result (Bytes): 1528 / 12831
Number of timeouts: 0
Metrics for Query: 10
Count: 1000 times executed in whole run
AQET: 0.025270 seconds (arithmetic mean)
AQET(geom.): 0.022319 seconds (geometric mean)
QPS: 309.27 Queries per second
minQET/maxQET: 0.00443900s / 0.21323000s
Average result count: 1.87
min/max result count: 0 / 9
Number of timeouts: 0
Metrics for Query: 11
Count: 500 times executed in whole run
AQET: 0.034836 seconds (arithmetic mean)
AQET(geom.): 0.032530 seconds (geometric mean)
QPS: 224.34 Queries per second
minQET/maxQET: 0.01361800s / 0.22590800s
Average result count: 10.00
min/max result count: 10 / 10
Number of timeouts: 0
Metrics for Query: 12
Count: 500 times executed in whole run
AQET: 0.021630 seconds (arithmetic mean)
AQET(geom.): 0.018793 seconds (geometric mean)
QPS: 361.32 Queries per second
minQET/maxQET: 0.00390600s / 0.20190300s
Average result (Bytes): 1470.73
min/max result (Bytes): 1433 / 1507
Number of timeouts: 0
This result is quoted using the same seed as the official benchmark run and 8 clients using 50 warmup trials and 500 presentations of the query mixes. The database was the bigdata RWStore running on a single machine. These results were obtained against the QUADS_QUERY_BRANCH from SVN r4241. The machine is a quad core AMD Phenom II X4 with 8MB Cache @ 3Ghz running Centos with 16G of RAM and a striped RAID array with 6x SAS disks with 15k spindles (Seagate Cheetah with 16MV Cache, 3.5″). IO utilization approximately 50%. CPU utilization was 50% during the run. The JVM was Oracle Java 1.6.0_23 using “-server -Xmx4g -XX:+UseParallelOldGC”. The Java process size was approximately 3.6G during the benchmark run.
The machine used by the BSBM V3.0 benchmark has more RAM and slower disks. However, the benchmark protocol has a ramp up procedure and much of the data winds up cached in RAM. As a result, BSBM preferentially favors machines with more RAM over machines with faster disks.
We’ve been looking at possible ways to “port” bigdata federations onto Amazon Web Services (AWS [5]). To my mind, the big questions are how to best provide for:
1. low latency durable writes
2. low latency reads
3. long term durable storage
4. blob storage (assuming the proposed lexicon refactor [1])
Unfortunately, things like EBS (Amazon’s Elastic Block Storage [2]) are not great matches since they are linked to a single EC2[4] instance at a time. While we could work around the single instance limit by exposing the EBS volume as an NFS mount, there would be significant latency if the instance exposing that volume were to go down.
However, I think that we can map these requirements onto AWS as follows:
For low latency durable writes, I propose that we:
- Assign a quorum (3 or 5) of compute nodes the responsibility for absorbing writes bound for a given shard.
- Replicate writes across the quorum for durability. This will guarantee that the peers joined with the quorum have the same binary image for the current journal.
- When the current mutable journal fills up (~200M), close it against further writes, open a new mutable journal to continue absorbing writes, and PUT the old journal to S3 for long term durability.
- When we build an index segment, PUT the generated index segment to S3 for long term durability.
- This approach can accept the failure of individual nodes in the quorum for a shard as along as the quorum does not break (e.g., 2 nodes are still available out of a quorum of 3; or 3 nodes are still available out of a quorum of 5).
- Delete behind old shard views from S3 once their retention period expires.
For low latency reads, I propose:
- Use “affinity” to assign views of shards to compute nodes.
- Cache data for those shard views locally on the compute nodes (GET journals and index segments from S3).
In both cases, we are providing long term durable storage via S3. However, newly absorbed writes do not immediately propagate to S3 in order to avoid an unforgivable latency. Instead, they are incorporated into the mutable journal(s) servicing the shard(s) for which those writes are destined. Journals are periodically PUT to S3 as writes are buffered by bigdata. A checkpoint could be created in S3 by instructing all compute nodes (really, all quorum leaders) to undergo synchronous overflow, which would trigger the PUT of the old mutable journal to S3.
An alternative for handling low latency writes would be to use an EBS[2] volume for each data service to absorb writes. If the data service fails, the same EBS volume would be mounted by which ever compute node was designated to handle the failover event. However, I think that this approach would be more complex to administer and subject to more latency when handling failover for a data server leader.
Concerning large literals and URIs, the proposal in [1] could be adapted to store objects less than ~1k in the shard and greater than that in S3. The TERM2ID index would associate either a local record address with the RDF Value (within shard storage) or information sufficient to address the corresponding S3 resource.
I think that this approach maps bigdata’s architecture onto AWS such that we get the efficiency of local IOs, the “flex” of a shared disk deployment (we can add or remove compute nodes at will), and the durability and cost efficiency of S3.
[1] http://sourceforge.net/apps/trac/bigdata/ticket/109
[2] http://aws.amazon.com/ebs/
[3] http://aws.amazon.com/s3/
[4] http://aws.amazon.com/ec2/
[5] http://aws.amazon.com/