I have finished an implementation of a stream-based approach for writing on the scale-out indices. It does the right things: it defers RMI (the index partition writes) until it has a good-sized chunk of data, and it allows only a single outstanding RMI per client per index partition. I need to test this, integrate it into the RDF bulk data loader in a few places, and then I can start collecting performance data on it.
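
Here is a minimal sketch of that pattern in Java. Everything in it (PartitionSink, RemoteWriter, the chunk size) is a hypothetical stand-in rather than the actual bigdata API; the point is just the shape: a queue per index partition, a drain loop that gathers a chunk, and a single-threaded writer so that at most one RMI is outstanding at a time.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical per-partition sink; not the actual bigdata class.
    class PartitionSink implements Runnable {

        // Hypothetical stand-in for the remote (RMI) write on the partition.
        interface RemoteWriter {
            void write(List<byte[]> chunk);
        }

        private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
        private final int chunkSize;
        private final RemoteWriter writer;

        PartitionSink(int chunkSize, RemoteWriter writer) {
            this.chunkSize = chunkSize;
            this.writer = writer;
        }

        // Client side: enqueue a tuple and return immediately.
        void add(byte[] tuple) throws InterruptedException {
            queue.put(tuple);
        }

        // Drain loop: block for the first tuple, gather up to chunkSize
        // tuples, then issue one remote write. Because this loop is the
        // only writer for the partition, only one RMI is ever in flight.
        @Override
        public void run() {
            final List<byte[]> chunk = new ArrayList<>(chunkSize);
            try {
                while (true) {
                    chunk.add(queue.take());
                    queue.drainTo(chunk, chunkSize - 1);
                    writer.write(chunk);
                    chunk.clear();
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // shutdown requested
            }
        }
    }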

I think that the TERM2ID index will still need to use the synchronous RPC index writes, since we need to have the assigned term identifiers before we can write on the rest of the indices. Likewise, if statement identifiers are enabled, then there will be another synchronous RPC (also on the TERM2ID index) to obtain the statement identifiers. Once we are done with the synchronous RPC on the TERM2ID index, the terms and statements can just be written onto asynchronous sinks for the ID2TERM, TEXT (full text lookup), SPO, POS, and OSP indices. When the bulk load finishes, it will simply await the Futures for those asynchronous sinks.
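
A rough sketch of that ordering, again with hypothetical interfaces standing in for the real API (TermIndex models the synchronous TERM2ID RPC, Sink models an asynchronous index sink; the optional statement-identifier RPC is omitted):

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;

    class BulkLoadSketch {

        // Hypothetical synchronous TERM2ID RPC: assigns term identifiers.
        interface TermIndex {
            Map<String, Long> addTerms(List<String> terms);
        }

        // Hypothetical asynchronous sink on one scale-out index.
        interface Sink {
            void add(Object... tuple);
            Future<Void> flush(); // completes when buffered writes are durable
        }

        void load(List<String[]> statements, TermIndex term2id, Sink id2term,
                  Sink text, Sink spo, Sink pos, Sink osp)
                throws InterruptedException, ExecutionException {

            for (String[] stmt : statements) {
                // 1. Synchronous RPC on TERM2ID: the term identifiers must
                //    be assigned before anything else can be written.
                Map<String, Long> ids = term2id.addTerms(Arrays.asList(stmt));

                long s = ids.get(stmt[0]), p = ids.get(stmt[1]), o = ids.get(stmt[2]);

                // 2. Everything else goes onto the asynchronous sinks.
                for (String t : stmt) {
                    id2term.add(ids.get(t), t); // reverse lookup
                    text.add(t, ids.get(t));    // full text index
                }
                spo.add(s, p, o);
                pos.add(p, o, s);
                osp.add(o, s, p);
            }

            // 3. Await the Futures for the asynchronous sinks.
            for (Sink sink : Arrays.asList(id2term, text, spo, pos, osp)) {
                sink.flush().get();
            }
        }
    }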

More when I know more.

One Response to “stream-based API for scale-out index writes”

  1. koltso7 says:

    Hi. I searched for an RDF data store. There were Jena, Sesame, Virtuoso… But I think your project, a distributed RDF data store, is the most promising.

    My friends and I want to build a fully semantic social network, where users will have the opportunity to dynamically create their own datatypes and models, and to write their own queries in SPARQL.

    We have a great desire to create this project, and we want at first to build something very simple on Tomcat and JSP, with SPARQL in place of SQL and your Bigdata in place of MySQL or Postgres. Hello-world style.

    Do you have any docs on this case? I would be glad if you could share some ideas, warnings, and advice about building such a web project with Bigdata.

    Does Bigdata support many data inputs from web users, like new posts and comments? Can we use SPARQL to make updates and inserts?

    What tools does it have for creating and editing models? What tools for semantic computation does Bigdata have, or do we need to use other frameworks like Jena or Sesame?

    What docs can you recommend for our approach? Maybe some examples of Bigdata use?