Tuesday, June 16, 2009

Scale-up or scale-out?

Scale-up or scale-out? You can scale-up to larger and larger problems by buying a machine with more RAM, or by buying an even fancier machine with a dozens of CPUs and a huge amount of RAM. However, you swifly reach the limits of what is practical (the former) or cost-effective (the latter). Commodity hardware is cheap and scale-out approaches let you make the most of it.

Bigdata loads 300,000 triples per second on a commodity cluster. 1,000,000,000 triples loaded in under an hour. You can not touch performance like that on a single machine for anything near the same cost. And it is not just the CPU and the increased RAM, but the parallel DISK IO that comes along with the architecture. And when one machine fails, you just failover. If you have sunk all those resources into a single (very) high end server and it goes down, well, you are just out of luck. Bigdata is a fully persistent architecture. Service restart time is well under a second and your data is available immediately, not re-loaded from a log file.

Bigdata is architected to be highly concurrent. Most databases limit you to one writer, or to one writer on an index. In bigdata, we transparently and dynamically break down scale-out indices into key-ranges called index partitions (shards, really) and each writes on each index partition run concurrently. This means that the potential concurrency of the application grows with the data scale. And since we dynamically partition the data, you can always add more hardware to keep pace with your data size or your query demands without having to reload all your data.

We are still working on query performance tuning, but we have already seen query response times which are better than the best scale-up triple stores. Bigdata distributes the query processing across the hardware, doing JOINs right at the data and it uses MVCC (Multi-Version Concurrency Control), so we run concurrent read-consistent queries without blocking. It is able to put more CPU, more RAM and more DISK bandwidth on any given problem when compared to any single machine.

Oh, it runs on a single machine too.

0 Comments:

Post a Comment

<< Home