Lambdacurry

Mathematical differences between a database and a search system


I get asked this question often - on what is the actual difference between a search system (like Solr) and a traditional RDBMS… or even a NoSQL datastore like MongoDB. It almost seems like a rule of thumb that every webapp needs a database for its store and a search solution to do free text search. But why ?

What is the inherent differences between the two problems - retrieving data and retrieving search results - that require two different software systems. Can we not model a decent search stack using just the database ?

The answer is in the underlying mathematics. Databases work in terms of sets, where each element is a tuple consisting of various fields

Suggestion for just the word "industry"

The figure on the left illustrates a join between two tables - an intersection of two sets.

Theoretically, you could progressively intersect sets to get a result set of the results you were looking for. So what really is Lucene good for ?

Lucene/Solr/Elasticsearch operate on ranked sets. Lets look at how that works:

The most fundamental data structure of Lucene is an inverted index - think of it as a list of terms mapped to the list of document ids (also known as postings list) which contains the terms - in reality, this is implemented as a skip list.

Basically, when you perform a search query, what you get as output is a bitvector which tells you which documents matched the search (by toggling between 1 and 0). Facets by themselves are also bitvectors. So, if you want to know how many results fall in a certain facet, all you need to do is a bitwise AND operation.

Now, there is a particular feature that many claim is extremely highly optimized in case of search technologies. I am not sure if this is the case, since RDBMS could be emulating a lot of these as well. However, there might be differences in caching behavior leading to massive performance differences.

The crux of the difference between relational mathematics and search based is this - The real optimizations of Lucene come from the fact that you never really want all documents which match the query, but only the top K. And this is what affects the difference in performance as well.

There is no boolean in search

One of the frequent misconceptions is to model a search query in terms of boolean logic - AND, OR & NOT. However, again the same concept comes into play - top-K only. Which is why, Lucene models it in terms of MUST, SHOULD, NOT.

Any clause you add with MUST signifies that, any documents returned MUST have that term. The SHOULD part of the query simply bumps the scores of documents with that clause. Remember that this gets more and more complex if you are using the Dismax or Edismax parser of Solr. Which means that when there is a complex query with multiple boolean subqueries, you need to start diving deeper (e.g. dont mix should and must in the same BooleanQuery if strict boolean matching is needed. It will match all docs, containing the MUST clause and “boost” all SHOULD term).

 

So now, you can point this article if management questions you on why you cant run search through database queries.


Lambdacurry

Mathematical differences between a database and a search system

Published

July 17, 2013

Find me on Twitter @sandeepssrin

Did i make any mistake? Please consider sending a pull request.