Scaling limits? No where in sight.
Some people have been asking about scaling limits for bigdata. We have run out to 10B triples in a single semantic web database instance, and we are producer bound for most of that. By the time we are nearing 10B, the producers have been at maximum CPU utilization for hours while the database servers are at perhaps 35-40% utilization.
So, what is limiting us right now is the single machine capacity for the producer. As the number of index partitions grows over time, the producer needs to allocate the data to be written onto the index into more and more buffers (one per index partition). As we get into 100s of index partitions, the RDF/XML parsers are running full blast into those buffers without putting an appreciable load onto the database.
To get around the single machine limit for the producers, we are going to refactor the clients so that they have an aggregation stage, similar to, but somewhat different from, a reduce phase. That will allow us to run enough RDF/XML parser clients to feed the system and sustain high throughput well past 10B triples.
Since we can scale by adding hardware, even after a bigdata federation has been deployed, the practical scaling limit for bigdata is going to be at least another order of magnitude (100B).
Update: We have since resolved the bottleneck mentioned in the original post without the introduction of an aggregator phase. The problem was traced to some POS index queues in the clients which were being filled with small chunks due to a systematic presentation of specific predicates once per document. Those chunks are now automatically combined on insertion into the queue, which solved the problem -- at least at this scale!
So, what is limiting us right now is the single machine capacity for the producer. As the number of index partitions grows over time, the producer needs to allocate the data to be written onto the index into more and more buffers (one per index partition). As we get into 100s of index partitions, the RDF/XML parsers are running full blast into those buffers without putting an appreciable load onto the database.
To get around the single machine limit for the producers, we are going to refactor the clients so that they have an aggregation stage, similar to, but somewhat different from, a reduce phase. That will allow us to run enough RDF/XML parser clients to feed the system and sustain high throughput well past 10B triples.
Since we can scale by adding hardware, even after a bigdata federation has been deployed, the practical scaling limit for bigdata is going to be at least another order of magnitude (100B).
Update: We have since resolved the bottleneck mentioned in the original post without the introduction of an aggregator phase. The problem was traced to some POS index queues in the clients which were being filled with small chunks due to a systematic presentation of specific predicates once per document. Those chunks are now automatically combined on insertion into the queue, which solved the problem -- at least at this scale!

5 Comments:
This post has been removed by the author.
Just a clarification, we are running 5 producer hosts (hosts running a client service on which we are executing the RDF/XML parser task) and 9 database hosts (hosts running a data service). The #of database hosts is dictated by the amount of local disk on those hosts (the data services are writing on local disk) and the scale that we are trying to hit in terms of the data on disk (the on disk storage requirement per triple is ~60 bytes).
Hi Bryan... Are you using map-reduce or something else over HDFS? Any GOM code still around? :>
Hi Kendall,
Bigdata is a from scratch implementation of key-range partitioned B+Trees using dynamic index partitioning. Internally it uses a log-structured store (to buffer writes) combined with read-optimized index segments. When the log-structured store overflows, a new one is opened and index writes are migrated asynchronously onto the read-optimized index segments. Index partitions are moved from node to node to load balance writers.
Bigdata does not use map/reduce processing to load or query RDF data. Instead, it uses its distributed B+Tree architecture and distributed JOIN algorithms.
Deployment can be over local disk, over NSF, over NAS, or, presumably over HDFS using FUSE to mount HDFS as a file system, but I have not tried the latter myself.
We do have plans to fit GOM over bigdata to provide a scale-out schema-flexible object database. The main piece missing at this time is a distributed object cache.
-bryan
Hi Kendall! That is so funny that you would post that comment. We just took a long meeting on exactly that topic today. Seems to be on a lot of people's minds. Bryan has a nice write-up along similar lines, specifically how bigdata compares and contrasts to HBase, which I hope he will post.
No GOM code still around though, this is a brand new animal.
See you at SemTech!
Post a Comment
<< Home