Bryan Thompson

Come see us at the Semantic Technologies Conference in San Francisco. We will be talking about the highly available replication cluster support in bigdata.

This session will be a detailed technical presentation on the bigdata® architecture and what’s new in bigdata®, including support for a new high-availability architecture based on shared-nothing replication clusters. Bigdata® is a standards-based, high-performance, scalable, open-source RDF database scaling to 50 billion triples or quads on a single node. Written entirely in Java, the platform supports RDFS+ inference and the SPARQL 1.1 family of specifications, including Query, Update, Basic Federated Query, and Service Description. Bigdata® supports novel extensions for durable named solution sets, efficient storage and querying of statement level provenance without reification, and scalable graph analytics. The database supports multi-tenancy and can be deployed as an embedded database, a standalone server, a highly available replication cluster, and as a horizontally-scaled and dynamically-sharded federation similar to Google’s Bigtable, Apache Accumulo, or Cassandra. The bigdata® open-source platform has been under continuous development since 2006. It is available under a dual licensing model (GPLv2 and commercial licensing) and a number of well-known companies OEM, resell, or embed bigdata® in their applications. SYSTAP, LLC leads the development of the open-source project and offers support subscriptions for both commercial and open-source users.

We are nearing completion of the HA Replication Cluster. There is a new server process (HAJournalServer) that embeds a Journal and exposes an RMI interface (HAGlue) supporting the internal HA protocols and a variety of administrative tasks. The Journal provides a high performance single machine graph database and SPARQL end point for up to 50 billion edges. The HAJournalServer extends this to support a replication cluster with automated full and incremental backups and offers linear scaling for query in the size of the replication cluster. Linear scaling is achieved because each query is answered 100% locally by the HAJournalServer process to which it is directed. The HAJournalServer has exactly the same high performance, low latency behavior as the embedded Journal.

High Availability is based on a quorum model. The HA replication cluster is up and available for reads and writes as long as a majority of the services are up. The quorum size must be an odd number of machines greater than one (3, 5, 7, etc.). Apache River is used for service discovery. Apache Zookeeper is used to manage the state of the quorum and elect the leader. Writes must be directed to the leader, but you can read on any service that is joined with the met quorum. Writes are replicated using a low level protocol along a pipeline to each of the followers. The /status page of the NanoSparqlServer provides extended status information about the HA replication cluster. You can query either the REST API or the RMI API to discover the status of each running server process.
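To make the quorum rule concrete, here is a minimal sketch of the majority arithmetic and of the read/write routing rule described above. The function names are hypothetical and are not part of the HAGlue or REST APIs; the sketch only illustrates the rule and does not show how the HAJournalServer implements it.

```python
def majority(replication_factor: int) -> int:
    """Smallest number of joined services that forms a majority."""
    if replication_factor < 3 or replication_factor % 2 == 0:
        raise ValueError("replication factor must be odd and greater than one")
    return replication_factor // 2 + 1


def quorum_is_met(replication_factor: int, joined: int) -> bool:
    """True when enough services are joined to accept reads and writes."""
    return joined >= majority(replication_factor)


def eligible_services(leader: str, joined: list[str], is_write: bool) -> list[str]:
    """Writes go to the leader only; reads may go to any joined service."""
    return [leader] if is_write else list(joined)


if __name__ == "__main__":
    for k in (3, 5, 7):
        print(k, "services tolerate", k - majority(k), "failure(s)")
```

Under this rule an HA3 cluster stays available with one service down, and an HA5 cluster with two.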

The chart below shows linear scaling on an HA replication cluster consisting of 3 Intel Mac Minis against the BSBM 100M “Explore” reduced query mix (that is, w/o query 5). This is an aggregated throughput of nearly 140,000 QMpH on a highly available cluster built on commodity hardware.

This work is being performed in the READ_CACHE branch. See the src/resources/HAJournal directory if you want a preview of this capability. There is a README, sample configuration file, and some scripts that are designed to execute from an SVN build. At this point, we are green on the test suite for an HA3 cluster. There will be some additional QA before we bring this feature back into the development branch for a release.

For people who want to track what we’ve been up to on the XDATA project, there are three survey articles that we’ve produced:

- Literature Review of Graph Databases (Bryan Thompson, SYSTAP)
- Large Scale Graph Algorithms on the GPU (Yangzihao Wang and John Owens, UC Davis)
- Graph Pattern Matching, Search, and OLAP (Dr. Xifeng Yan, UCSB)

See also these posts by Patrick Durusau and Danny Bickson.

We are currently working on:
- New work-efficient graph algorithms on the GPU;
- A vertex-centric API for graph processing on GPUs that extends Duane Merrill’s excellent work on BFS on GPUs; and
- Extensions to the Parallel Sliding Window (PSW) algorithm used by GraphChi for attributed graphs, GPUs, and compute clusters.

We plan to have something people can play with in Q2.

We are very excited about the XDATA program. As a number of you know, we have been interested in hybrid GPU/CPU computation platforms for several years. Finally, we are getting the opportunity to explore this issue in depth with some world class researchers. GPU accelerated capabilities are being developed under a liberal open source license and will be freely available for integration into any platform. Of course, we plan to integrate these capabilities into our own platform as well!

Over the next few months, we plan to wrap up property paths, finish the HAJournalServer, and introduce an IO efficient out-of-core graph mining capability into the bigdata platform. We are also looking at hybrid column compression techniques that will improve the on-disk footprint of the database and support direct compute by GPUs against our index structures. It is going to be an exciting year.

SYSTAP, LLC, developer of the leading edge bigdata® RDF graph database platform, today announced receiving research funding totaling $2M from the Defense Advanced Research Projects Agency (DARPA) to develop an open source platform for accelerated and real-time graph analytics on GPU compute clusters. This contract is part of DARPA’s XDATA program, a 4-year research effort to develop new computational techniques and open source software tools for processing and analyzing data, motivated by defense needs. SYSTAP, LLC has been selected by DARPA as a performer in the technical area of scalable analytics and data processing technology. The contract is administered by the Air Force Research Laboratory, Rome, NY.

Graphs are everywhere in today’s society: in social networks, business transactions, the structure of the Internet, the structure of ecosystems and markets, and the nature of the genome and cell processes. Graphs take us beyond tables and relational models, making it easy for people to mash up data from different sources. Graph processing algorithms help people turn complex, uncertain, and messy data into actionable information. By combining graphs, GPUs (Graphics Processing Units), and techniques from High Performance Computing (HPC), we hope to achieve a new capability for scientific discovery and real time business analytics.

GPUs and graphs are the tipping point. Together, they will let us address the largest problems in science and business, and the commodity price point of GPUs means that even the smallest business can have the compute power of a 30-node cluster in a $10k workstation. The challenge is to give people the software tools that make it easy to apply this compute power to their data. By combining open data and linked data with high performance graph processing, people will be able to create valuable new services from existing data.

SYSTAP, LLC develops innovative, scalable open source platforms for graph databases and graph data mining. SYSTAP, LLC leads the development of the bigdata® RDF graph database platform and offers support subscriptions, training, and custom services for the platform. The bigdata® platform is also available under commercial licenses from a number of OEMs and VARs.
Contact: Bryan Thompson

University of California, Davis – Dr. John Owens. Dr. Owens will lead the development of an open-source, scalable, multi-node, out-of-core graph library on GPUs. Dr. Owens is an Associate Professor of Electrical and Computer Engineering. He leads a research group pursuing problems in GPU computing in both GPU fundamentals (data structures, algorithms, and multi-GPU computing) and applications (including computer vision, GPU-based embedded systems, real-time and offline graphics, medical imaging, speech recognition, protein folding, computational fluid dynamics, and visualization).
Contact: John Owens

University of California, Santa Barbara – Dr. Xifeng Yan. Dr. Yan will provide consultation on basic primitives behind graph data mining, graph query/search, and graph OLAP. He is an Associate Professor and holds the Venkatesh Narayanamurti Chair in Computer Science where he works on modeling, managing, and mining graphs in bioinformatics, social networks, information networks, and computer systems.
Contact: Xifeng Yan

smartRealm specializes in geo-social analytics, linked data, knowledge representation, semantic reasoning and data-knowledge integration. Our flagship product, knowledgeSmarts® (kS), is an award-winning Knowledge-as-a-Service™ (KaaS) platform, and features kS Workbench to knowledge-enable enterprise environments.
Contact: Stephane Fellah, CTO and Harry Niedzwiadek, General Manager. 703-669-5514.

For more information on DARPA and the XDATA program, visit www.darpa.mil

The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

Approved for Public Release, Distribution Unlimited.




Big Graph Data Panel

Last month I had the opportunity to participate in the Big Graph Data panel at ISWC 2012 with Tim Berners-Lee, Mike Stonebraker, and John Giannandrea, with Frank van Harmelen as the moderator. If you missed the panel, you can catch it online at videolectures.net.

A few interesting positions emerged during the panel:

- Relational databases : Stonebraker points out that the big relational database platforms are old technology. This is not a critique of the relational model, but a critique of the implementations. He points to new platforms with very high transactions per second based on main memory. Well and good, but not really about graphs. My own spin on relational databases is that they *could* be doing better. The bigdata(R) platform itself is based on the same kinds of technology that are used by relational databases – B+Trees, Hash indices, Multi-Version Concurrency Control, etc. There are some critical challenges for relational databases to play as high performance graph databases, including: (a) SPARQL queries with 50-100 joins are common, but cost-based query optimizers blow up on the state space for these query plans; (b) schema fluidity (any property may be ANY value); and (c) property cardinality (multiple values for the same cell). Beyond that, the Big Problem is getting the physical schema for the database right.

- Big graphs : There is a consensus that this is a basic research problem. Google’s Knowledge Graph (John Giannandrea) is actually “little data”. The knowledge graph is a relatively small set of highly curated relationships. It fits nicely in main memory and that is how it is served out. There are hard issues with large graphs that the research community is still addressing. Key problems are how to organize the information for locality, how to partition the information among the nodes of a cluster, and how to support both structured graph query (SPARQL) and efficient graph mining algorithms.

- Schema Alignment : This is not just a hard problem; it is an intractable problem. Even in a single data set there is always a schema alignment problem. I like to point to the UMLS as applied by the National Library of Medicine. This gold standard for taxonomies has an inter-rater reliability of only 50%. That means that different trained reviewers are likely to assign different tags to the same article. We need to begin looking at schema alignment as a problem with noisy data and apply techniques that are robust for noisy data. The data mining community is doing this right now, but without an explicit focus on schema alignment.

- Map/Reduce and Key-Value Stores : These are the wrong technologies for graph processing. A lot of noise and effort is going into trying to cram large graphs into these platforms. Map/Reduce has tremendous latencies and it is just painful watching people trying to architect around the latency of running a map/reduce job. Distributed computing can be fast, scalable, and low latency; people have been doing this for years. Key-Value stores are great for storing property sets, but they cannot handle the cardinality associated with the link sets of high cardinality vertices. If the logical row model could be extended to support paged access to the link sets then this technology could be applied to a highly scalable linked data cache (à la the diplodocus sub-graph clusters), but you will never be able to implement an efficient graph query language against a key-value store alone.

- There is no “graph database community” – rather we have distinct communities for graph theory, graph mining (Apache Giraph, Spark’s Bagel, Signal/Collect, graphlab), object-based graph databases (blueprints et al), and semantic web graph databases. We need to start talking. Object-based graph databases are optimized for property set retrieval – which is basically the same goal as a key-value store. That’s why things like Titan can be layered over key-value stores. However, they are NOT optimized for queries that cut across graphs – which is the sweet spot for semantic web graph databases. Graph databases should optimize for both complex SPARQL query processing and the look up of the property set + link set for a vertex (after all, this is the Linked Data “GET URI” method). The LDBC has members from at least three of these communities (graph mining, object-based graph databases, and semantic-web graph databases). Hopefully it will drive standards and benchmarks for linked data access, declarative query, and high level APIs for graph mining.

- High level APIs : I see the success of blueprints as a disaster for everyone. Blueprints reduces to pointer chasing against the disk (if the data are durable) and locks in the messaging model for graph processing (making efficient distributed graph mining impossible). APIs that force programmers to write procedural code for each data access or query are a great way to keep programmers employed, but a miserable way to develop applications. High level declarative query lets the database query planner optimize the access and hides the physical schema permitting the database to evolve over time. Yes, we need APIs for writing graph-mining kernels but they need to allow the underlying implementation to maximize the possible parallelism and abstract away from the physical schema. Rather than hard-coding to APIs, think standard abstractions, LLVM and cross-compilation.
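To put a number on the state-space point made under relational databases above, here is a small sketch that counts left-deep join orderings, which is n! for n patterns. This is a simplification chosen for illustration: it treats each triple pattern as one input to order, ignores any pruning a real optimizer applies, and the function name is hypothetical. The space that also includes bushy plans is larger still.

```python
import math


def left_deep_join_orders(n_patterns: int) -> int:
    """Number of left-deep join orderings for n patterns, which is n!."""
    if n_patterns < 1:
        raise ValueError("need at least one pattern")
    return math.factorial(n_patterns)


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        orders = left_deep_join_orders(n)
        # The digit count gives the order of magnitude without float overflow.
        print(f"{n:>3} patterns: about 10^{len(str(orders)) - 1} orderings")
```

Even at 20 patterns, exhaustive enumeration is out of reach, which is why queries with 50-100 joins call for a different planning strategy.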

SYSTAP is leading a DARPA XDATA team for accelerated graph processing on GPU clusters where we are looking at some of these issues – I will post on this soon.

The REST API supports ACID DELETE+INSERT operations. There are three ways of doing this:

- SPARQL UPDATE: DELETE/INSERT/WHERE. This is most flexible and least scalable. This approach should really be reserved for when you need to discover the triples to be removed or the triples to be inserted based on the graph pattern in the WHERE clause.

- SPARQL UPDATE: DELETE DATA and INSERT DATA. Both operations can be combined in a single UPDATE request and that request will be ACID. This is the preferred standards-based option if you are removing and adding specific statements.

- POST with multi-part request body – This is the most scalable option, but the performance difference will only be noticeable for large updates (100k statements or more).

We’ve recently seen a few people run into trouble using DELETE/INSERT/WHERE when they should have been using one of the other approaches.
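As an illustration of the second option, the sketch below builds a single SPARQL 1.1 UPDATE request that combines DELETE DATA and INSERT DATA, using the standard form-encoded `update` parameter from the SPARQL 1.1 Protocol. The endpoint URL, the helper name, and the example triples are assumptions made for this sketch; check your own deployment for the actual NanoSparqlServer URL. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint for illustration only; adjust to your deployment.
ENDPOINT = "http://localhost:8080/bigdata/sparql"


def build_update_request(endpoint, delete_triples, insert_triples):
    """Build one POST that removes and adds specific statements together."""
    update = (
        "DELETE DATA {\n" + "\n".join(delete_triples) + "\n};\n"
        "INSERT DATA {\n" + "\n".join(insert_triples) + "\n}"
    )
    body = urlencode({"update": update}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_update_request(
        ENDPOINT,
        ['<http://example.org/s> <http://example.org/p> "old" .'],
        ['<http://example.org/s> <http://example.org/p> "new" .'],
    )
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req)
```

Because both operations travel in one UPDATE request, no WHERE clause needs to be evaluated, which is the reason to prefer this form when the statements are already known.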

This is a critical maintenance release of bigdata(R). Users of version 1.2.1 are strongly encouraged to upgrade to this release.

Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF capable of loading 1B triples in under one hour on a 15 node cluster. Bigdata operates in both a single machine mode (Journal) and a cluster mode (Federation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion triples / quads. The federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead associated with operating a cluster is an acceptable tradeoff to have essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single machine RDF database. For custom development and cluster installations we recommend checking out the code from SVN using the tag for this release. The code will build automatically under Eclipse. You can also build the code using the Ant script. The cluster installer requires the use of the Ant script.

You can download the WAR from:

http://sourceforge.net/projects/bigdata/

You can check out this release from:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/BIGDATA_RELEASE_1_2_2

New features:

- SPARQL 1.1 UPDATE
- SPARQL 1.1 Service Description
- SPARQL 1.1 Basic Federated Query
- New integration point for custom services (ServiceRegistry).
- Remote Java client for NanoSparqlServer
- Sesame 2.6.3
- Ganglia integration (cluster)
- Performance improvements (cluster)
- MemoryManager mode for the Journal (native memory Journal)

Feature summary:

- Single machine data storage to ~50B triples/quads (RWStore);
- Clustered data storage is essentially unlimited;
- Simple embedded and/or webapp deployment (NanoSparqlServer);
- Triples, quads, or triples with provenance (SIDs);
- Fast RDFS+ inference and truth maintenance;
- Fast 100% native SPARQL 1.1 evaluation;
- Integrated “analytic” query package;
- 100% Java memory manager leverages the JVM native heap (no GC);

Road map [3]:

- SPARQL 1.1 property paths (last missing feature for SPARQL 1.1);
- Runtime Query Optimizer for Analytic Query mode;
- Simplified deployment, configuration, and administration for clusters; and
- High availability for the journal and the cluster.

Change log:

Note: Versions with (*) MAY require data migration. For details, see [9].

1.2.2:

- http://sourceforge.net/apps/trac/bigdata/ticket/586 (RWStore immedateFree() not removing Checkpoint addresses from the historical index cache.)
- http://sourceforge.net/apps/trac/bigdata/ticket/602 (RWStore does not discard logged deletes on reset())
- http://sourceforge.net/apps/trac/bigdata/ticket/603 (Prepare critical maintenance release as branch of 1.2.1)

1.2.1:

- http://sourceforge.net/apps/trac/bigdata/ticket/533 (Review materialization for inline IVs)
- http://sourceforge.net/apps/trac/bigdata/ticket/539 (NotMaterializedException with REGEX and Vocab)
- http://sourceforge.net/apps/trac/bigdata/ticket/540 (SPARQL UPDATE using NSS via index.html)
- http://sourceforge.net/apps/trac/bigdata/ticket/541 (MemoryManaged backed Journal mode)
- http://sourceforge.net/apps/trac/bigdata/ticket/546 (Index cache for Journal)
- http://sourceforge.net/apps/trac/bigdata/ticket/549 (BTree can not be cast to Name2Addr (MemStore recycler))
- http://sourceforge.net/apps/trac/bigdata/ticket/550 (NPE in Leaf.getKey() : root cause was user error)
- http://sourceforge.net/apps/trac/bigdata/ticket/558 (SPARQL INSERT not working in same request after INSERT DATA)
- http://sourceforge.net/apps/trac/bigdata/ticket/562 (Sub-select in INSERT cause NPE in UpdateExprBuilder)
- http://sourceforge.net/apps/trac/bigdata/ticket/563 (DISTINCT ORDER BY)
- http://sourceforge.net/apps/trac/bigdata/ticket/567 (Failure to set cached value on IV results in incorrect behavior for complex UPDATE operation)
- http://sourceforge.net/apps/trac/bigdata/ticket/568 (DELETE WHERE fails with Java AssertionError)
- http://sourceforge.net/apps/trac/bigdata/ticket/569 (LOAD-CREATE-LOAD using virgin journal fails with “Graph exists” exception)
- http://sourceforge.net/apps/trac/bigdata/ticket/571 (DELETE/INSERT WHERE handling of blank nodes)
- http://sourceforge.net/apps/trac/bigdata/ticket/573 (NullPointerException when attempting to INSERT DATA containing a blank node)

1.2.0: (*)

- http://sourceforge.net/apps/trac/bigdata/ticket/92 (Monitoring webapp)
- http://sourceforge.net/apps/trac/bigdata/ticket/267 (Support evaluation of 3rd party operators)
- http://sourceforge.net/apps/trac/bigdata/ticket/337 (Compact and efficient movement of binding sets between nodes.)
- http://sourceforge.net/apps/trac/bigdata/ticket/433 (Cluster leaks threads under read-only index operations: DGC thread leak)
- http://sourceforge.net/apps/trac/bigdata/ticket/437 (Thread-local cache combined with unbounded thread pools causes effective memory leak: termCache memory leak & thread-local buffers)
- http://sourceforge.net/apps/trac/bigdata/ticket/438 (KeyBeforePartitionException on cluster)
- http://sourceforge.net/apps/trac/bigdata/ticket/439 (Class loader problem)
- http://sourceforge.net/apps/trac/bigdata/ticket/441 (Ganglia integration)
- http://sourceforge.net/apps/trac/bigdata/ticket/443 (Logger for RWStore transaction service and recycler)
- http://sourceforge.net/apps/trac/bigdata/ticket/444 (SPARQL query can fail to notice when IRunningQuery.isDone() on cluster)
- http://sourceforge.net/apps/trac/bigdata/ticket/445 (RWStore does not track tx release correctly)
- http://sourceforge.net/apps/trac/bigdata/ticket/446 (HTTP Repostory broken with bigdata 1.1.0)
- http://sourceforge.net/apps/trac/bigdata/ticket/448 (SPARQL 1.1 UPDATE)
- http://sourceforge.net/apps/trac/bigdata/ticket/449 (SPARQL 1.1 Federation extension)
- http://sourceforge.net/apps/trac/bigdata/ticket/451 (Serialization error in SIDs mode on cluster)
- http://sourceforge.net/apps/trac/bigdata/ticket/454 (Global Row Store Read on Cluster uses Tx)
- http://sourceforge.net/apps/trac/bigdata/ticket/456 (IExtension implementations do point lookups on lexicon)
- http://sourceforge.net/apps/trac/bigdata/ticket/457 (“No such index” on cluster under concurrent query workload)
- http://sourceforge.net/apps/trac/bigdata/ticket/458 (Java level deadlock in DS)
- http://sourceforge.net/apps/trac/bigdata/ticket/460 (Uncaught interrupt resolving RDF terms)
- http://sourceforge.net/apps/trac/bigdata/ticket/461 (KeyAfterPartitionException / KeyBeforePartitionException on cluster)
- http://sourceforge.net/apps/trac/bigdata/ticket/463 (NoSuchVocabularyItem with LUBMVocabulary for DerivedNumericsExtension)
- http://sourceforge.net/apps/trac/bigdata/ticket/464 (Query statistics do not update correctly on cluster)
- http://sourceforge.net/apps/trac/bigdata/ticket/465 (Too many GRS reads on cluster)
- http://sourceforge.net/apps/trac/bigdata/ticket/469 (Sail does not flush assertion buffers before query)
- http://sourceforge.net/apps/trac/bigdata/ticket/472 (acceptTaskService pool size on cluster)
- http://sourceforge.net/apps/trac/bigdata/ticket/475 (Optimize serialization for query messages on cluster)
- http://sourceforge.net/apps/trac/bigdata/ticket/476 (Test suite for writeCheckpoint() and recycling for BTree/HTree)
- http://sourceforge.net/apps/trac/bigdata/ticket/478 (Cluster does not map input solution(s) across shards)
- http://sourceforge.net/apps/trac/bigdata/ticket/480 (Error releasing deferred frees using 1.0.6 against a 1.0.4 journal)
- http://sourceforge.net/apps/trac/bigdata/ticket/481 (PhysicalAddressResolutionException against 1.0.6)
- http://sourceforge.net/apps/trac/bigdata/ticket/482 (RWStore reset() should be thread-safe for concurrent readers)
- http://sourceforge.net/apps/trac/bigdata/ticket/484 (Java API for NanoSparqlServer REST API)
- http://sourceforge.net/apps/trac/bigdata/ticket/491 (AbstractTripleStore.destroy() does not clear the locator cache)
- http://sourceforge.net/apps/trac/bigdata/ticket/492 (Empty chunk in ThickChunkMessage (cluster))
- http://sourceforge.net/apps/trac/bigdata/ticket/493 (Virtual Graphs)
- http://sourceforge.net/apps/trac/bigdata/ticket/496 (Sesame 2.6.3)
- http://sourceforge.net/apps/trac/bigdata/ticket/497 (Implement STRBEFORE, STRAFTER, and REPLACE)
- http://sourceforge.net/apps/trac/bigdata/ticket/498 (Bring bigdata RDF/XML parser up to openrdf 2.6.3.)
- http://sourceforge.net/apps/trac/bigdata/ticket/500 (SPARQL 1.1 Service Description)
- http://www.openrdf.org/issues/browse/SES-884 (Aggregation with an solution set as input should produce an empty solution as output)
- http://www.openrdf.org/issues/browse/SES-862 (Incorrect error handling for SPARQL aggregation; fix in 2.6.1)
- http://www.openrdf.org/issues/browse/SES-873 (Order the same Blank Nodes together in ORDER BY)
- http://sourceforge.net/apps/trac/bigdata/ticket/501 (SPARQL 1.1 BINDINGS are ignored)
- http://sourceforge.net/apps/trac/bigdata/ticket/503 (Bigdata2Sesame2BindingSetIterator throws QueryEvaluationException were it should throw NoSuchElementException)
- http://sourceforge.net/apps/trac/bigdata/ticket/504 (UNION with Empty Group Pattern)
- http://sourceforge.net/apps/trac/bigdata/ticket/505 (Exception when using SPARQL sort & statement identifiers)
- http://sourceforge.net/apps/trac/bigdata/ticket/506 (Load, closure and query performance in 1.1.x versus 1.0.x)
- http://sourceforge.net/apps/trac/bigdata/ticket/508 (LIMIT causes hash join utility to log errors)
- http://sourceforge.net/apps/trac/bigdata/ticket/513 (Expose the LexiconConfiguration to Function BOPs)
- http://sourceforge.net/apps/trac/bigdata/ticket/515 (Query with two “FILTER NOT EXISTS” expressions returns no results)
- http://sourceforge.net/apps/trac/bigdata/ticket/516 (REGEXBOp should cache the Pattern when it is a constant)
- http://sourceforge.net/apps/trac/bigdata/ticket/517 (Java 7 Compiler Compatibility)
- http://sourceforge.net/apps/trac/bigdata/ticket/518 (Review function bop subclass hierarchy, optimize datatype bop, etc.)
- http://sourceforge.net/apps/trac/bigdata/ticket/520 (CONSTRUCT WHERE shortcut)
- http://sourceforge.net/apps/trac/bigdata/ticket/521 (Incremental materialization of Tuple and Graph query results)
- http://sourceforge.net/apps/trac/bigdata/ticket/525 (Modify the IChangeLog interface to support multiple agents)
- http://sourceforge.net/apps/trac/bigdata/ticket/527 (Expose timestamp of LexiconRelation to function bops)
- http://sourceforge.net/apps/trac/bigdata/ticket/532 (ClassCastException during hash join (can not be cast to TermId))
- http://sourceforge.net/apps/trac/bigdata/ticket/533 (Review materialization for inline IVs)
- http://sourceforge.net/apps/trac/bigdata/ticket/534 (BSBM BI Q5 error using MERGE JOIN)

1.1.0 (*)

– http://sourceforge.net/apps/trac/bigdata/ticket/23 (Lexicon joins)
– http://sourceforge.net/apps/trac/bigdata/ticket/109 (Store large literals as “blobs”)
– http://sourceforge.net/apps/trac/bigdata/ticket/181 (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
– http://sourceforge.net/apps/trac/bigdata/ticket/203 (Implement an persistence capable hash table to support analytic query)
– http://sourceforge.net/apps/trac/bigdata/ticket/209 (AccessPath should visit binding sets rather than elements for high level query.)
– http://sourceforge.net/apps/trac/bigdata/ticket/227 (SliceOp appears to be necessary when operator plan should suffice without)
– http://sourceforge.net/apps/trac/bigdata/ticket/232 (Bottom-up evaluation semantics).
– http://sourceforge.net/apps/trac/bigdata/ticket/246 (Derived xsd numeric data types must be inlined as extension types.)
– http://sourceforge.net/apps/trac/bigdata/ticket/254 (Revisit pruning of intermediate variable bindings during query execution)
– http://sourceforge.net/apps/trac/bigdata/ticket/261 (Lift conditions out of subqueries.)
– http://sourceforge.net/apps/trac/bigdata/ticket/300 (Native ORDER BY)
– http://sourceforge.net/apps/trac/bigdata/ticket/324 (Inline predeclared URIs and namespaces in 2-3 bytes)
– http://sourceforge.net/apps/trac/bigdata/ticket/330 (NanoSparqlServer does not locate “html” resources when run from jar)
– http://sourceforge.net/apps/trac/bigdata/ticket/334 (Support inlining of unicode data in the statement indices.)
– http://sourceforge.net/apps/trac/bigdata/ticket/364 (Scalable default graph evaluation)
– http://sourceforge.net/apps/trac/bigdata/ticket/368 (Prune variable bindings during query evaluation)
– http://sourceforge.net/apps/trac/bigdata/ticket/370 (Direct translation of openrdf AST to bigdata AST)
– http://sourceforge.net/apps/trac/bigdata/ticket/373 (Fix StrBOp and other IValueExpressions)
– http://sourceforge.net/apps/trac/bigdata/ticket/377 (Optimize OPTIONALs with multiple statement patterns.)
– http://sourceforge.net/apps/trac/bigdata/ticket/380 (Native SPARQL evaluation on cluster)
– http://sourceforge.net/apps/trac/bigdata/ticket/387 (Cluster does not compute closure)
– http://sourceforge.net/apps/trac/bigdata/ticket/395 (HTree hash join performance)
– http://sourceforge.net/apps/trac/bigdata/ticket/401 (inline xsd:unsigned datatypes)
– http://sourceforge.net/apps/trac/bigdata/ticket/408 (xsd:string cast fails for non-numeric data)
– http://sourceforge.net/apps/trac/bigdata/ticket/421 (New query hints model.)
– http://sourceforge.net/apps/trac/bigdata/ticket/431 (Use of read-only tx per query defeats cache on cluster)

1.0.3

– http://sourceforge.net/apps/trac/bigdata/ticket/217 (BTreeCounters does not track bytes released)
– http://sourceforge.net/apps/trac/bigdata/ticket/269 (Refactor performance counters using accessor interface)
– http://sourceforge.net/apps/trac/bigdata/ticket/329 (B+Tree should delete bloom filter when it is disabled.)
– http://sourceforge.net/apps/trac/bigdata/ticket/372 (RWStore does not prune the CommitRecordIndex)
– http://sourceforge.net/apps/trac/bigdata/ticket/375 (Persistent memory leaks (RWStore/DISK))
– http://sourceforge.net/apps/trac/bigdata/ticket/385 (FastRDFValueCoder2: ArrayIndexOutOfBoundsException)
– http://sourceforge.net/apps/trac/bigdata/ticket/391 (Release age advanced on WORM mode journal)
– http://sourceforge.net/apps/trac/bigdata/ticket/392 (Add a DELETE by access path method to the NanoSparqlServer)
– http://sourceforge.net/apps/trac/bigdata/ticket/393 (Add “context-uri” request parameter to specify the default context for INSERT in the REST API)
– http://sourceforge.net/apps/trac/bigdata/ticket/394 (log4j configuration error message in WAR deployment)
– http://sourceforge.net/apps/trac/bigdata/ticket/399 (Add a fast range count method to the REST API)
– http://sourceforge.net/apps/trac/bigdata/ticket/422 (Support temp triple store wrapped by a BigdataSail)
– http://sourceforge.net/apps/trac/bigdata/ticket/424 (NQuads support for NanoSparqlServer)
– http://sourceforge.net/apps/trac/bigdata/ticket/425 (Bug fix to DEFAULT_RDF_FORMAT for bulk data loader in scale-out)
– http://sourceforge.net/apps/trac/bigdata/ticket/426 (Support either lockfile (procmail) and dotlockfile (liblockfile1) in scale-out)
– http://sourceforge.net/apps/trac/bigdata/ticket/427 (BigdataSail#getReadOnlyConnection() race condition with concurrent commit)
– http://sourceforge.net/apps/trac/bigdata/ticket/435 (Address is 0L)
– http://sourceforge.net/apps/trac/bigdata/ticket/436 (TestMROWTransactions failure in CI)

1.0.2

– http://sourceforge.net/apps/trac/bigdata/ticket/32 (Query time expansion of (foo rdf:type rdfs:Resource) drags in SPORelation for scale-out.)
– http://sourceforge.net/apps/trac/bigdata/ticket/181 (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
– http://sourceforge.net/apps/trac/bigdata/ticket/356 (Query not terminated by error.)
– http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– http://sourceforge.net/apps/trac/bigdata/ticket/361 (IRunningQuery not closed promptly.)
– http://sourceforge.net/apps/trac/bigdata/ticket/371 (DataLoader fails to load resources available from the classpath.)
– http://sourceforge.net/apps/trac/bigdata/ticket/376 (Support for the streaming of bigdata IBindingSets into a sparql query.)
– http://sourceforge.net/apps/trac/bigdata/ticket/378 (ClosedByInterruptException during heavy query mix.)
– http://sourceforge.net/apps/trac/bigdata/ticket/379 (NotSerializableException for SPOAccessPath.)
– http://sourceforge.net/apps/trac/bigdata/ticket/382 (Change dependencies to Apache River 2.2.0)

1.0.1 (*)

– http://sourceforge.net/apps/trac/bigdata/ticket/107 (Unicode clean schema names in the sparse row store).
– http://sourceforge.net/apps/trac/bigdata/ticket/124 (TermIdEncoder should use more bits for scale-out).
– http://sourceforge.net/apps/trac/bigdata/ticket/225 (OSX requires specialized performance counter collection classes).
– http://sourceforge.net/apps/trac/bigdata/ticket/348 (BigdataValueFactory.asValue() must return new instance when DummyIV is used).
– http://sourceforge.net/apps/trac/bigdata/ticket/349 (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance).
– http://sourceforge.net/apps/trac/bigdata/ticket/351 (SPO not Serializable exception in SIDS mode (scale-out)).
– http://sourceforge.net/apps/trac/bigdata/ticket/352 (ClassCastException when querying with binding-values that are not known to the database).
– http://sourceforge.net/apps/trac/bigdata/ticket/353 (UnsupportedOperatorException for some SPARQL queries).
– http://sourceforge.net/apps/trac/bigdata/ticket/355 (Query failure when comparing with non materialized value).
– http://sourceforge.net/apps/trac/bigdata/ticket/357 (RWStore reports “FixedAllocator returning null address, with freeBits”.)
– http://sourceforge.net/apps/trac/bigdata/ticket/359 (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– http://sourceforge.net/apps/trac/bigdata/ticket/362 (log4j – slf4j bridge.)

For more information about bigdata(R), please see the following links:

[1] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page
[2] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted
[3] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Roadmap
[4] http://www.bigdata.com/bigdata/docs/api/
[5] http://sourceforge.net/projects/bigdata/
[6] http://www.bigdata.com/blog
[7] http://www.systap.com/bigdata.htm
[8] http://sourceforge.net/projects/bigdata/files/bigdata/
[9] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=DataMigration

About bigdata:

Bigdata(R) is a horizontally-scaled, general purpose storage and computing fabric for ordered data (B+Trees), designed to operate on either a single server or a cluster of commodity hardware. Bigdata(R) uses dynamically partitioned key-range shards in order to remove any realistic scaling limits – in principle, bigdata(R) may be deployed on 10s, 100s, or even thousands of machines and new capacity may be added incrementally without requiring the full reload of all data. The bigdata(R) RDF database supports RDFS and OWL Lite reasoning, high-level query (SPARQL), and datum level provenance.

I would like to thank everyone who came out for the meetup in Research Triangle Park at Bull City Coworking, and a big thanks to our hosts. We had a good showing (38 people attending) and a great discussion both during and after the talk. The slides from the talk are available here.

Thanks,
Bryan

Linked data depends on simple access patterns in which a GET of a resource will return a (machine readable) representation of that resource as RDF. In our next release, we plan to add several features to support linked data applications, including:

  • VoID support [2].
  • Several new DESCRIBE algorithms
  • A linked data cache to accelerate DESCRIBE queries

As part of our commitment to linked data, we have added support for several DESCRIBE algorithms in the development branch. This feature will be available in 1.2.3 (was 1.2.2).  Those algorithms include:

  • ForwardOneStep – The DESCRIBE is just the attributes and forward links.
  • SymmetricOneStep – The DESCRIBE is the attributes, the forward links, and the reverse links.  This is the historical behavior for bigdata.
  • CBD – The DESCRIBE is the Concise Bounded Description, as defined by [1].
  • SCBD – The DESCRIBE is the Symmetric Concise Bounded Description, as defined by [1].
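To make the difference between these modes concrete, here is a minimal sketch over an in-memory set of triples. It is not bigdata's implementation. The tuple representation, the `_:` prefix convention for blank nodes, and the function names are assumptions made for illustration, and the reification handling that is part of the CBD definition in [1] is left out.

```python
# Illustrative only: triples are (subject, predicate, object) tuples and
# blank nodes are strings starting with "_:" (an assumption of this sketch).

def is_blank(term):
    return isinstance(term, str) and term.startswith("_:")

def forward_one_step(graph, node):
    """Statements whose subject is the node (attributes and forward links)."""
    return {t for t in graph if t[0] == node}

def symmetric_one_step(graph, node):
    """Forward statements plus statements whose object is the node."""
    return {t for t in graph if t[0] == node or t[2] == node}

def cbd(graph, node):
    """Forward statements, expanded recursively through blank node objects.

    Reification handling from the CBD definition is omitted in this sketch.
    """
    result = set()
    pending = [node]
    seen = set()
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        for t in graph:
            if t[0] == current:
                result.add(t)
                # Only blank node objects are followed; IRIs are not expanded.
                if is_blank(t[2]):
                    pending.append(t[2])
    return result

# A tiny hypothetical data set: a book whose author is a blank node.
GRAPH = {
    ("ex:book", "ex:title", '"A Really Great Book"'),
    ("ex:book", "ex:author", "_:a1"),
    ("_:a1", "ex:name", '"Jane Doe"'),
    ("ex:review", "ex:about", "ex:book"),
}
```

With this data, `symmetric_one_step(GRAPH, "ex:book")` picks up the reverse link from `ex:review` but stops at the blank node `_:a1`, while `cbd(GRAPH, "ex:book")` follows `_:a1` and includes the author's name.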

The default behavior is SymmetricOneStep, which is what bigdata has always implemented.  You can now override this default when the KB is configured.  For example, the following property will make Symmetric Concise Bounded Description the default DESCRIBE algorithm for a KB:

com.bigdata.rdf.sail.describeMode=SCBD

You can also specify the DESCRIBE algorithm as a query hint:

DESCRIBE <http://example.com/aReallyGreatBook>
{
   hint:Query hint:describeMode "SCBD"
}
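As a rough illustration of issuing such a query from a client, the sketch below builds a standard SPARQL protocol POST request carrying the query above. The endpoint URL, the function names, and the `Accept` type are assumptions for this example and should be replaced with the values for your own deployment. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; substitute the SPARQL end point of your server.
ENDPOINT = "http://localhost:8080/bigdata/sparql"

def build_describe_query(resource_iri, describe_mode):
    """Return the DESCRIBE query text with the describeMode query hint."""
    return (
        "DESCRIBE <%s>\n"
        "{\n"
        '   hint:Query hint:describeMode "%s"\n'
        "}" % (resource_iri, describe_mode)
    )

def describe_request(resource_iri, describe_mode, endpoint=ENDPOINT):
    """Build (but do not send) a form-encoded SPARQL protocol POST."""
    body = urlencode(
        {"query": build_describe_query(resource_iri, describe_mode)}
    ).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        # Assumed response format; adjust to what your server offers.
        "Accept": "application/rdf+xml",
    }
    return Request(endpoint, data=body, headers=headers, method="POST")
```

Against a running server, the request could then be sent with `urllib.request.urlopen(describe_request("http://example.com/aReallyGreatBook", "SCBD"))`.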

The advantage of Symmetric Concise Bounded Description over SymmetricOneStep is that blank nodes are always expanded, so the statements describing each blank node are included in the result. This is an important advantage in a Linked Data world because you cannot address a blank node directly in a follow-up SPARQL query.

We’ll post again once we have the DESCRIBE cache running. This will be a tremendous performance boost for linked data applications. The cache will maintain descriptions so you only pay the cost for the DESCRIBE query once, and it is integrated with the change log listener so cache entries are invalidated automatically as the data set is updated.
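The general idea of a cache that is invalidated by change notifications can be sketched as follows. This is only an illustration of the pattern, not the design of the bigdata DESCRIBE cache. The class name, the `on_change` callback, and the choice to invalidate both the subject and the object of a changed statement are assumptions of this sketch.

```python
class DescribeCache:
    """Caches descriptions and drops entries when related data changes."""

    def __init__(self, describe):
        # describe: callable mapping a resource to its description,
        # standing in for the cost of running a DESCRIBE query.
        self._describe = describe
        self._entries = {}

    def get(self, resource):
        # Pay the cost of the DESCRIBE only on a cache miss.
        if resource not in self._entries:
            self._entries[resource] = self._describe(resource)
        return self._entries[resource]

    def on_change(self, statement):
        # Hypothetical change-log callback: a statement was added or removed.
        # Both ends are invalidated because a symmetric description of the
        # object also includes statements that point at it.
        subject, _predicate, obj = statement
        self._entries.pop(subject, None)
        self._entries.pop(obj, None)
```

Invalidating only the subject and object is enough for one-step descriptions. A description that expands through blank nodes, such as CBD, would need broader invalidation, which this sketch does not attempt.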

A LinkedData wiki page has been added to document the emerging linked data features.

[1] Concise Bounded Description
[2] Describing Linked Datasets with the VoID Vocabulary

© 2006-2010 by SYSTAP, LLC. bigdata® is a registered trademark of SYSTAP, LLC.