Incremental data load and query performance
We have been getting some questions about incremental data load and query performance. Let me tackle both issues here.
With RDFS+ databases, there is always a trade off in when you materialize the statements entailed by the combination of the data and the ontology. This can be done eagerly, when the data are loaded into the database, or on demand, when a query is issued. Eager materialization (also known as eager closure) has advantages for some situations, but implies more latency between when you data are stable on the database and when the database can answer queries based on the new data and also takes up more space on the disk. The alternative is query-time inference, but the downside is that you must do more work to answer the query. In practice, bigdata, like most RDF databases, does a little bit of both -- materializing some statements during data load and others dynamically during query.
One of the next features that we will introduce is a magic sets integration. This will allow us to compute more of the inferences at query time using a minimum cost "program". The magic sets rewrite of the query is basically the original query, plus the entailment rules which were not materialized, plus a bunch of "gates" which prevent those rules from firing unless they are specifically required to answer the query, and even then they are fired with bindings which constrain their scope to exactly the necessary data. This integration will let us offer more flexible inference mechanisms and will give us another alternative for handling truth maintenance (when statements are retracted from the database).
Bigdata is an Multi-Version Concurrency Control (MVCC) architecture. This means that readers never block. You can use a read-only transaction to place a read-lock on a pre-materialized view of the data plus any entailment rules which are eagerly materialized. Clients can then read from that view while you add (or remove) statements to (from) the database. Once the closure of the database has been updated, you can simply update the view from which the clients are reading and they will immediately begin to read against the new state. These updates can be small and fast, or they can be massive. Bigdata simply does not care. However, if you are using eager closure then small incremental data loads will be quite fast while larger loads will take more time to update the materialized statements, but are generally more efficient. In practice, if you have a lot of small updates, it may be better to multiplex them together for more throughput -- but this depends on your application. However, clients are still reading from your last consistent view so the update is atomic from their perspective.
"Historical states" are views which have been superseded by subsequent commits. Historical state is retained automatically for open transactions. In fact, bigdata can be configured as an immortal database, where all history is preserved, if you have the disk and the requirement for that capability. More commonly, you will configure the minimum age that historical state must be retained, say 1 day, and older data is gradually purged to reclaim disk space. This all happens transparently during what we call "overflow" processing -- you never have to "vacuum" the database. When we cite high throughput like 300k triples per second or 1B triples in an hour, that is with ongoing overflow processing and dynamic index partitions partitions.
Query performance is quite good -- and we are looking forward to giving everyone some hard numbers real soon now. Query performance on a single machine is comparable to the best commercial triples stores. Query performance on a cluster varies between 2x faster and roughly equal to the best commercial triple stores running on a single machine. So, we are not loosing any performance by running a distributed database, not even for small queries. We are going to wrap up soon with the work we have been doing on data load throughput and then turn back to query performance optimization. While query performance on a cluster right now is good, we made some changes in how data is distributed across a cluster in order to achieve higher write rates and now we need re-optimize distributed query performance. In fact, I expect query performance will improve substantially when we do this, which is why we are not quoting numbers at this time.
Bigdata has a lot of advantages for query processing. For example, readers are non-blocking (MVCC) and can run concurrently, bloom filters are available at each index partition (aka key-range shard), so they can be applied to very large data sets efficiently, and we incrementally optimize the data sets by migrating buffered writes from a log-structured store onto read-optimized B+Tree segments. Overall, the scale-out architecture allows us to apply vastly more resources to query processing when compared with any single host solution.
We are working on a whitepaper in which we will publish on the scale-out architecture, data load throughput, and query performance on a cluster. We are using synthetic data sets for this, but if you have a lot of data and queries, contact us -- we'd be happy to run the numbers on your data!
With RDFS+ databases, there is always a trade off in when you materialize the statements entailed by the combination of the data and the ontology. This can be done eagerly, when the data are loaded into the database, or on demand, when a query is issued. Eager materialization (also known as eager closure) has advantages for some situations, but implies more latency between when you data are stable on the database and when the database can answer queries based on the new data and also takes up more space on the disk. The alternative is query-time inference, but the downside is that you must do more work to answer the query. In practice, bigdata, like most RDF databases, does a little bit of both -- materializing some statements during data load and others dynamically during query.
One of the next features that we will introduce is a magic sets integration. This will allow us to compute more of the inferences at query time using a minimum cost "program". The magic sets rewrite of the query is basically the original query, plus the entailment rules which were not materialized, plus a bunch of "gates" which prevent those rules from firing unless they are specifically required to answer the query, and even then they are fired with bindings which constrain their scope to exactly the necessary data. This integration will let us offer more flexible inference mechanisms and will give us another alternative for handling truth maintenance (when statements are retracted from the database).
Bigdata is an Multi-Version Concurrency Control (MVCC) architecture. This means that readers never block. You can use a read-only transaction to place a read-lock on a pre-materialized view of the data plus any entailment rules which are eagerly materialized. Clients can then read from that view while you add (or remove) statements to (from) the database. Once the closure of the database has been updated, you can simply update the view from which the clients are reading and they will immediately begin to read against the new state. These updates can be small and fast, or they can be massive. Bigdata simply does not care. However, if you are using eager closure then small incremental data loads will be quite fast while larger loads will take more time to update the materialized statements, but are generally more efficient. In practice, if you have a lot of small updates, it may be better to multiplex them together for more throughput -- but this depends on your application. However, clients are still reading from your last consistent view so the update is atomic from their perspective.
"Historical states" are views which have been superseded by subsequent commits. Historical state is retained automatically for open transactions. In fact, bigdata can be configured as an immortal database, where all history is preserved, if you have the disk and the requirement for that capability. More commonly, you will configure the minimum age that historical state must be retained, say 1 day, and older data is gradually purged to reclaim disk space. This all happens transparently during what we call "overflow" processing -- you never have to "vacuum" the database. When we cite high throughput like 300k triples per second or 1B triples in an hour, that is with ongoing overflow processing and dynamic index partitions partitions.
Query performance is quite good -- and we are looking forward to giving everyone some hard numbers real soon now. Query performance on a single machine is comparable to the best commercial triples stores. Query performance on a cluster varies between 2x faster and roughly equal to the best commercial triple stores running on a single machine. So, we are not loosing any performance by running a distributed database, not even for small queries. We are going to wrap up soon with the work we have been doing on data load throughput and then turn back to query performance optimization. While query performance on a cluster right now is good, we made some changes in how data is distributed across a cluster in order to achieve higher write rates and now we need re-optimize distributed query performance. In fact, I expect query performance will improve substantially when we do this, which is why we are not quoting numbers at this time.
Bigdata has a lot of advantages for query processing. For example, readers are non-blocking (MVCC) and can run concurrently, bloom filters are available at each index partition (aka key-range shard), so they can be applied to very large data sets efficiently, and we incrementally optimize the data sets by migrating buffered writes from a log-structured store onto read-optimized B+Tree segments. Overall, the scale-out architecture allows us to apply vastly more resources to query processing when compared with any single host solution.
We are working on a whitepaper in which we will publish on the scale-out architecture, data load throughput, and query performance on a cluster. We are using synthetic data sets for this, but if you have a lot of data and queries, contact us -- we'd be happy to run the numbers on your data!

0 Comments:
Post a Comment
<< Home