Monday, 2 November 2015

Using a Key-Value Store for Full Text Indexing and Search

Couchbase Server is a multi-purpose Database System. One of the purposes is to use it as a simple key-value store. A key-value store allows you to store/retrieve any value by its key. Such a value can be a JSON document (Couchbase allows you to index and query based on such JSON documents and so another purpose is the one as document database.), a small binary or a full text index entry. This article explains why such a key-value store can be also used for full text indexing purposes.

Let's explain how full text indexing works in general. A full text index is a so called inverted index. The table below shows how the following sentences would be indexed: 'Tim is sitting next to Bob' and 'Jim is sitting next to Bob'. The word 'Tim' is only existing in the first sentence and there is exactly one occurrence of it.

Term | Count | Reference
Tim | 1 | #1
is | 2 | #1, #2
sitting | 2 | #1, #2
next | 2 | #1, #2
to | 2 | #1, #2
Bob | 2 | #1, #2
Jim | 1 | #2

There are a bunch of specialized systems out there for full text indexing. Couchbase has for instance a very good integration with Elasticsearch. In the future Couchbase will also have it's own full text service which is called 'cbft'. However, this article is not about Elasticsearch and also not about 'cbft'. We want to use Couchbase's key-value store features for full text indexing here.

Let's define the data model first:

"fts::$field::$term" : { "count" : $numOf, "refs" : [ ...] }

It is actually quite simple. A full text search index entry does point back for a term to the original key-value pairs those are identified by their keys. The refs array contains these keys. The field is just the field on which we want to search. This could be for instance 'address' or 'message'. Let's say that the default field is called '_all'. So if no specific field is used then the '_all' field is used as the fallback field. 

In order to index based on a provided text, we can do the following:
  • Tokenize the text on which should be indexed. This means basically to break the text up into several words (terms). In our case we assume that our text contains the word 'fox'.
  • Check if the e.g. the key "fts::_all::fox" is already existing. If not then create the document by referencing back to the document id which contained the word 'fox'.
  • If the full text index entry was existent then check if the reference list does already contain the reference to the document which contains the text on which should be indexed.
  • If the reference list does not yet contain the key of this document then extend the reference list by adding the key of the document.
Now in order to search for the specific word 'fox', we just have to do the following:
  • Get "fts::_all::fox"
  • Perform a multi-get based on the keys in the array 'refs'
Some example source code (Node.js) can be found here: . The service is implemented here: . This application was created by using CEAN stack tools (Couchbase + Express + Angular + Node.js ). They are available here: .

Given that I already wrote this little demo application, it makes sense to try it out :-) . First let's add 2 sentences:
  • the_fox: The quick brown fox jumps over the lazy dog
  • the_cat: The quick brown fox jumps over the lazy cat

Now in the next step let's perform some searches. I implemented the word search in a way that you can enter any sequence of words (separated by white spaces). The following searches for 'cat':

As we can see, only the sentence with the id 'the_cat' contains the word 'cat'. Next lets's search for the word 'fox':

Both sentences contain the word 'fox'. Last but not least let's search for multiple words:

I think you get it ... :-) . The data which is stored in Couchbase looks as the following one:

This article explained how you can use Couchbase to store a full text search index. Such an index can be used for simple and basic text searches, which might be sufficient for some of your development projects. If you need more sophisticated text search or text analysis then a dedicated full text search service might be the better option.

No comments:

Post a Comment