Bigdata is a high-performance RDF database that uses B+tree indices to index RDF statements and terms. But did you know that bigdata also uses these same B+tree indices to provide built-in support for Lucene-style free text search? If the text index is enabled (via a property when the database is created), then each literal added to the database (by appearing in the “O” position of a statement) is also added to the text index. This index can then be accessed directly, via the bigdata API, or indirectly, via high-level query. To accomplish this integration of free text search with high-level query, bigdata defines several magic predicates that are given special meaning, and when encountered in a SPARQL query are interpreted as service calls to the text index.
Before you get started, make sure you have enabled the free text index in your properties file:
com.bigdata.rdf.store.AbstractTripleStore.textIndex=true.
The full list of magic predicates related to free text search is defined and documented in the class com.bigdata.rdf.store.BD. The simplest way to integrate free text search into a SPARQL query in bigdata is to use the magic predicate “bd:search” inside of a SPARQL join group. The predicate bd:search is used to search the full text index using the pattern in the “O” position of the search and to bind the hits (Literals) to the variable defined in the “S” position of the search. For example:
?lit bd:search “mike” .
will search the full text index for literals that contain the token “mike” and bind those literals onto the ?lit variable for use in subsequent joins. To find statements that use literals that contain the token “mike”, the SPARQL query would look as follows:
prefix bd: <http://www.bigdata.com/rdf#search>
select ?s, ?p, ?o
where {
?o bd:search “mike” .
?s ?p ?o .
}
In addition to simple search, additional metadata about the search can be defined inside the SPARQL query using other magic predicates (also defined in the BD class). These predicates, when attached to the same variable as the search, will help narrow the search or bind additional metadata about search hits to other variables. We could expand the SPARQL query as follows:
prefix bd: <http://www.bigdata.com/rdf#search>
select ?s, ?p, ?o, ?score, ?rank
where {
?o bd:search “mike personick” .
?o bd:matchAllTerms “true” .
?o bd:minRelevance “0.25” .
?o bd:relevance ?score .
?o bd:maxRank “1000” .
?o bd:rank ?rank .
?s ?p ?o .
}
The magic predicate bd:matchAllTerms indicates that only literals that contain all of the specified search terms should be considered. Similarly, literals can be constrained by min and max relevance (a 0 to 1 score signifying how closely the literal matches the search terms) and by min and max rank (hits are ordered by relevance and the rank describes where the literal appears in that ordered list). If the relevance or rank is relevant to the application, those pieces of metadata can be bound to variables in the search results using the predicates bd:relevance and bd:rank.
I’ve been working recently with a customer that is making extensive use of the free text search feature within its application. A number of interesting challenges arise when working with a large database inside an application designed to answer user queries with very low latency. The number of hits bound to a free text search can be very large (unconstrained search) or very small (highly constrained search), and when joined with the statement indices inside a SPARQL query, these hits can produce either huge numbers of relevant statements (a condition we’ve been referring to as “overflow”) or hardly any statements at all (“underflow”). By playing tricks with the SPARQL query we have been cutting the hits into “rank slices” by specifying the min and max ranks to be considered for any particular query, and then running those rank slices through the rest of the joins in the query until we find just enough results to paint the first page of search results, but no more. By starting with very small rank slices, we try to ensure we don’t overload and stall out the application while the user waits. This trick of rank slicing is made possible by caching the ordered list of free text hits temporarily inside the query engine, so that subsequent calls to the text index with the exact same parameters (search and metadata) will be costless.