Saturday, May 2, 2009

100,000 triples-per-second to 3B triples

At this point, we have loaded 3B triples on a cluster, with a net throughput of more than 100,000 triples per second across the cluster. Per-machine throughput is now roughly 10,000 triples per second. We have also fixed a high memory demand issue in the data services that was leading to premature RAM exhaustion.
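As a rough back-of-the-envelope check on those figures, the sketch below works out what they imply if the rates are taken at face value. The post does not state the number of machines or the elapsed load time, so both derived values are assumptions that follow only from treating "more than 100,000" and "~10,000" as exact, sustained rates; the function names are mine.

```python
def load_time_hours(total_triples, triples_per_second):
    """Hours needed to load total_triples at a constant sustained rate."""
    return total_triples / triples_per_second / 3600.0


def implied_hosts(cluster_rate, per_host_rate):
    """Number of hosts implied if every host contributes the same rate."""
    return cluster_rate / per_host_rate


# Figures quoted in the post, treated as exact for the sake of the estimate.
TOTAL_TRIPLES = 3_000_000_000
CLUSTER_RATE = 100_000   # triples per second, whole cluster
PER_HOST_RATE = 10_000   # triples per second, one machine

hours = load_time_hours(TOTAL_TRIPLES, CLUSTER_RATE)
hosts = implied_hosts(CLUSTER_RATE, PER_HOST_RATE)

print(f"implied load time: about {hours:.1f} hours")
print(f"implied cluster size: about {hosts:.0f} hosts")
```

Since the cluster rate is quoted as a lower bound, the load time here is best read as an upper bound, and the host count as approximate.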

We are currently looking into memory demand on the clients, which increases in proportion to the number of index partitions. I think that we will solve this by adding compression for the RDF Values in the ID2TERM index, leading to fewer splits of that index and hence less RAM demand on the clients.
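To make the reasoning concrete, here is a toy model of that chain: a smaller index splits less often, so there are fewer partitions, so the clients need less RAM. Every number in it (the split threshold, the per-partition client cost, the index size, and the compression ratio) is a hypothetical placeholder chosen for illustration, not a measurement from our cluster, and the linear cost model is an assumption.

```python
import math

MB = 1024 * 1024


def partition_count(index_bytes, split_threshold_bytes):
    """Partitions needed if an index splits whenever it exceeds the threshold."""
    return max(1, math.ceil(index_bytes / split_threshold_bytes))


def client_ram_bytes(partitions, per_partition_bytes):
    """Client RAM under the assumption that cost is linear in partition count."""
    return partitions * per_partition_bytes


# All values below are hypothetical placeholders.
SPLIT_THRESHOLD = 200 * MB      # size at which a partition is split
PER_PARTITION_COST = 1 * MB     # client-side RAM held per partition
ID2TERM_SIZE = 100_000 * MB     # uncompressed size of the ID2TERM index
COMPRESSION_RATIO = 0.5         # compressed size / uncompressed size

before = partition_count(ID2TERM_SIZE, SPLIT_THRESHOLD)
after = partition_count(int(ID2TERM_SIZE * COMPRESSION_RATIO), SPLIT_THRESHOLD)

ram_before = client_ram_bytes(before, PER_PARTITION_COST)
ram_after = client_ram_bytes(after, PER_PARTITION_COST)

print(f"partitions: {before} -> {after}")
print(f"client RAM: {ram_before // MB} MB -> {ram_after // MB} MB")
```

Under this model the saving on the clients tracks the compression ratio directly, which is why compressing the RDF Values looks like the cheapest lever to pull first.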

While the throughput is now reasonable at 10,000 triples per second per host, I am hopeful that we can improve on this substantially by introducing asynchronous writes for the TERM2ID index.
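For readers unfamiliar with the idea, the following is a minimal sketch of what asynchronous, batched writes look like in general: the caller queues terms and keeps going, while a background thread flushes them to the index in batches. It is not the bigdata design or API. The `AsyncIndexWriter` class, the `assign_ids` callback, and the in-memory `term2id` dictionary are all hypothetical stand-ins, and a real loader would also need a way to get the assigned identifiers back to the caller.

```python
import queue
import threading


class AsyncIndexWriter:
    """Hypothetical sketch: queue writes and flush them in batches off-thread."""

    _STOP = object()  # sentinel that tells the worker to drain and exit

    def __init__(self, flush_batch, batch_size=1000):
        self._flush_batch = flush_batch
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def write(self, term):
        """Enqueue a term and return immediately, without waiting on the index."""
        self._queue.put(term)

    def close(self):
        """Flush whatever is still buffered and wait for the worker to finish."""
        self._queue.put(self._STOP)
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._flush_batch(batch)
                batch = []
        if batch:  # flush the final partial batch
            self._flush_batch(batch)


# Stand-in for the TERM2ID index: assigns the next identifier to unseen terms.
term2id = {}


def assign_ids(batch):
    for term in batch:
        if term not in term2id:
            term2id[term] = len(term2id) + 1


writer = AsyncIndexWriter(assign_ids, batch_size=2)
for term in ["a", "b", "a", "c", "d"]:
    writer.write(term)
writer.close()
print(term2id)
```

The point of the pattern is that the thread parsing RDF no longer stalls on each index round trip, and batching amortizes the per-request cost across many terms.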
