Friday, 13 June 2014

Understanding Couchbase's Elasticsearch configuration

Preamble 

Couchbase Server can be used together with Elasticsearch. This makes sense because full-text search is not one of the main purposes of a key-value store. Instead, an external (and likewise scalable) search engine should be used to tokenize and index the data for full-text search. With Couchbase, this works by installing a transport plug-in into your Elasticsearch instance. You can then configure Couchbase's XDCR (Cross Data Centre Replication) feature to replicate the JSON documents to Elasticsearch and have them indexed there. After that, you can run full-text searches using the Lucene query language and get back the resulting keys of the documents that are stored in Couchbase. Further details can be found at http://docs.couchbase.com/couchbase-elastic-search/.

Elasticsearch Basics

Let's talk a bit about the Elasticsearch terminology before we shed light on Couchbase's default Elasticsearch configuration.

Document Index

You can have multiple indexes, and Elasticsearch can store multiple types in a single index. An index is called an inverted index because it works the other way round compared with, e.g., relational indexes. In a non-inverted index, we index one or more key values in order to find the address of a specific data item. An inverted index instead maps each term to the documents that contain it. So you can picture such an index as a two-column table (term and documents), built in this case from the two sentences "Tim is sitting next to Bob" and "Bob is sitting next to Jim".
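
To make the two-column table concrete, here is a minimal Python sketch that builds an inverted index from the two sentences above. The tokenization (lower-casing and splitting on whitespace) is a simplifying assumption; a real Elasticsearch analyzer does considerably more.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenizer (assumption): lower-case, split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# The two example sentences, with hypothetical document ids 1 and 2.
documents = {
    1: "Tim is sitting next to Bob",
    2: "Bob is sitting next to Jim",
}

inverted_index = build_inverted_index(documents)
for term, doc_ids in inverted_index.items():
    print(f"{term:8} -> {doc_ids}")
```

Each printed row is one line of the table: the left column is the term, the right column lists the documents in which it occurs. A search for "bob" is then a single lookup that yields both documents.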

Document Type

A type is just a logical container that holds multiple fields. It is important that two fields with the same name always have the same primitive (core) data type if both fields belong to the same type. A JSON document is added to the index with a REST call that names the index and the type. The ${id} parameter is optional; if it is not given, Elasticsearch generates an identifier automatically.
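
As a rough sketch of such a call, the following Python snippet builds (but does not send) the HTTP request for indexing one document. The host localhost:9200, the index name "people" and the type name "person" are assumptions for illustration. It relies on the Elasticsearch convention of that era that a PUT to /index/type/id uses the given identifier, while a POST to /index/type lets Elasticsearch generate one.

```python
import json
import urllib.parse
import urllib.request


def build_index_request(base_url, index, doc_type, document, doc_id=None):
    """Build the request that adds one JSON document to an index (nothing is sent)."""
    url = f"{base_url}/{index}/{doc_type}"
    if doc_id is not None:
        # Explicit identifier: PUT to /index/type/id
        url += "/" + urllib.parse.quote(str(doc_id), safe="")
        method = "PUT"
    else:
        # No identifier: POST to /index/type, Elasticsearch generates the id
        method = "POST"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host, index, type and document.
person = {"firstName": "Tim", "lastName": "Smith"}
with_id = build_index_request("http://localhost:9200", "people", "person", person, doc_id=1)
without_id = build_index_request("http://localhost:9200", "people", "person", person)

print(with_id.get_method(), with_id.full_url)
print(without_id.get_method(), without_id.full_url)
```

Passing either request to urllib.request.urlopen would actually send it to a running Elasticsearch node.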

Document Field

A JSON document has multiple fields, for example firstName and lastName.
The field type is determined automatically, e.g. 1 is handled as a number but "1" as a string. However, an explicit mapping is recommended.

A mapping file defines, per type, the fields and their options.
Basic options are:
  • type: The core type, e.g. long, string, boolean; it tells the search engine how to analyze the field
  • store: Whether the field value is stored. This is a bit confusing. As far as I understand, a field can be searched (via the inverted index) if it is indexed. If you set 'store' to 'no' and 'index' to 'yes', you can search on the field but not show its contents. BUT if the whole document is kept as '_source', you can still access its contents
  • index: The index mode: 'analyzed', 'no' or 'not_analyzed'. The value 'no' means that the field is not searchable. The difference between 'no' and 'not_analyzed' is that with the latter the field is indexed but not tokenized, so only exact-match queries are possible on the field
  • null_value: The default value to use if the field value is null (not available)
  • include_in_all: The _all field is a special field in which all other fields are included automatically. If set to 'no', the field is not included in the _all field
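
To illustrate how these options fit together, here is a sketch of an explicit mapping written as a Python dictionary and serialized to JSON. The type name "person" and the chosen option values are assumptions; the field names are the firstName and lastName fields from above. The syntax follows the Elasticsearch versions of the time of writing, and later versions changed the core types and several of these options.

```python
import json

# Hypothetical explicit mapping for a type called "person".
explicit_mapping = {
    "person": {
        "properties": {
            "firstName": {
                "type": "string",
                "index": "analyzed",      # tokenized, full-text searchable
                "store": "no",            # content only available via _source
                "include_in_all": True,   # also searchable through _all
            },
            "lastName": {
                "type": "string",
                "index": "not_analyzed",  # indexed as one term, exact match only
                "store": "yes",
                "null_value": "unknown",  # used when the field is null
                "include_in_all": False,
            },
        }
    }
}

mapping_json = json.dumps(explicit_mapping, indent=2)
print(mapping_json)
```

The resulting JSON is the kind of body you would send to the mapping endpoint of an index.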
Built-in fields are:
  • _uid: Unique identifier composed of the document's type and _id
  • _id: The identifier of the document
  • _type: The type of the document (as indexed by type)
  • _all: Stores the data of all other fields in a single field in order to simplify searching. By default every field is added to the _all field
  • _source: Used to store the original source document. This is enabled by default. To avoid storage overhead, the 'includes' and 'excludes' options can be used to override the default.
Other fields are:
  • _index
  • _size
  • _timestamp
  • _ttl

 

The Default Mapping

As mentioned before, Elasticsearch maps fields automatically by default. Let's call this 'implicit mapping', whereas the other kind is an 'explicit' one. So if a new type is encountered (because you put data into it) and you do not have an explicit mapping for it, then the _default_ mapping is used.
You can override the default mapping (globally or per index). Everything you add simply overrides the old default value and becomes the new default.

Dynamic Templates

These give you finer control (if necessary): you can apply mappings based on the field name by using patterns. An example is to use a different analyzer for a field whose name ends with '_es'. For this the match parameter is used (it matches a field based on a pattern, e.g. "match" : "*_es"). Additionally, you can match on the path of a property in a JSON document (e.g. address.*.name).
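
The '_es' example can be sketched as follows. The template name "spanish_fields" and the choice of the built-in "spanish" analyzer are assumptions. The matching itself is only approximated here with Python's fnmatch module, which behaves like the simple '*' wildcard for patterns such as these; it is not Elasticsearch's own matcher.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamic template: fields ending in "_es" get a Spanish analyzer.
dynamic_templates = [
    {
        "spanish_fields": {
            "match": "*_es",
            "mapping": {"type": "string", "analyzer": "spanish"},
        }
    },
]


def first_matching_template(field_name, templates):
    """Return the name of the first template whose 'match' pattern fits the field name."""
    for entry in templates:
        for name, template in entry.items():
            if fnmatchcase(field_name, template["match"]):
                return name
    return None


print(first_matching_template("title_es", dynamic_templates))
print(first_matching_template("title_en", dynamic_templates))
```

With this template, a new field called title_es would pick up the template's mapping, whereas title_en would fall back to the regular defaults.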

The Couchbase Template

Elasticsearch allows you to perform the configuration (e.g. the mapping) via its RESTful interface.
The Couchbase mapping file is such a configuration, namely an index template.
The setting '"template" : "*"' means that the template applies to every index. A value of e.g. 'cou*' would mean that the template is only applied to indexes whose names start with 'cou'.

Multiple templates can potentially match one index. In this case the mappings are merged, and the "order" setting specifies the order of this merge operation: templates with lower numbers are merged earlier.
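
The effect of "order" can be illustrated with a small Python sketch. It is not Elasticsearch's actual merge code, just a plain recursive dictionary merge under the assumption that templates are applied from the lowest to the highest order, so that later ones override earlier ones. Both templates are hypothetical.

```python
import copy


def deep_merge(base, override):
    """Return a copy of 'base' with 'override' merged in recursively."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def merge_templates(templates):
    """Merge the mappings of all templates, lowest 'order' first."""
    merged = {}
    for template in sorted(templates, key=lambda t: t.get("order", 0)):
        merged = deep_merge(merged, template["mappings"])
    return merged


# Two hypothetical templates that both match the same index.
base_template = {
    "template": "*",
    "order": 0,
    "mappings": {"_default_": {"properties": {"meta": {"type": "object", "include_in_all": True}}}},
}
specific_template = {
    "template": "cou*",
    "order": 10,
    "mappings": {"_default_": {"properties": {"meta": {"include_in_all": False}}}},
}

merged = merge_templates([specific_template, base_template])
print(merged)
```

The template with order 0 is merged first, so its 'type' setting survives, while the template with order 10 is merged later and wins for 'include_in_all'.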

The type 'couchbaseCheckpoint' first has a '_source' setting, which means that only the 'doc' part (and not the 'meta' part) of documents of this type is accessible as content. Furthermore, a dynamic template named 'store_no_index' is created. This template matches every field of documents of the type 'couchbaseCheckpoint' and causes it to be neither stored, nor indexed, nor made accessible via the '_all' field. BUT keep in mind that the content of the 'doc' part of such a JSON document is still accessible as the 'source'.

The '_default_' mapping overrides the global one in this case. Its '_source' setting means that by default just the content of the 'meta' part of the incoming JSON document is stored as the 'source'.

The 'meta' property of a document is mapped to the type 'object' and is not included in the '_all' field. Because the other defaults are not overridden, it is indexed and stored. This also means that all other fields are indexed and stored.

What may be interesting is that Couchbase transfers objects by having them indexed as the 'couchbaseDocument' type. So the '_default_' section in particular configures how Couchbase documents are handled.
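
Putting the description above together, the template can be sketched as a Python dictionary. This is a reconstruction from the explanation in this section, not a verbatim copy of the file shipped with the plug-in. In particular the "order" value and the exact 'includes' patterns ("doc.*", "meta.*") are assumptions.

```python
import json
from fnmatch import fnmatchcase

# Reconstructed from the description above; "order" and the include
# patterns are assumptions, not copied from the plug-in's file.
couchbase_template = {
    "template": "*",  # applies to every index
    "order": 10,      # assumption: merged after templates with lower order
    "mappings": {
        "couchbaseCheckpoint": {
            "_source": {"includes": ["doc.*"]},  # only the 'doc' part is kept as source
            "dynamic_templates": [
                {
                    "store_no_index": {
                        "match": "*",  # every field of this type
                        "mapping": {
                            "store": "no",
                            "index": "no",
                            "include_in_all": False,
                        },
                    }
                }
            ],
        },
        "_default_": {
            "_source": {"includes": ["meta.*"]},  # only the 'meta' part is kept as source
            "properties": {
                "meta": {"type": "object", "include_in_all": False},
            },
        },
    },
}


def template_applies(template, index_name):
    """Approximate check whether the template's index pattern matches an index name."""
    return fnmatchcase(index_name, template["template"])


print(json.dumps(couchbase_template, indent=2))
print(template_applies(couchbase_template, "beer-sample"))
```

Since 'couchbaseDocument' has no entry of its own, documents of that type fall under the '_default_' section, which is why that section effectively describes how Couchbase documents are handled. The index name "beer-sample" is only an illustrative value.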

4 comments:

  1. Let's say that I have different document types in Couchbase, and I want to store them as different types in ES and not just as couchbaseDocument.
    How do I do it?
    Or what if I have some docs that I don't want to sync at all, should I create a different bucket for them?
    Or can I filter for it?
    Besides, what's the best practice?

    Replies
    1. My understanding is that the Couchbase transport plug-in puts the data into this specific type. So every indexed document has the type 'couchbaseDocument' or 'couchbaseCheckpoint'. If your documents have a type property and the type value is indexed in Elasticsearch (it is by default), then you can simply search Elasticsearch for documents of this type. You can also build queries by combining this 'type search term' with other search terms via AND. Keep in mind that the main purpose of an Elasticsearch type is to be a kind of logical container that allows you to define mappings based on it. However, by using dynamic templates you can define a mapping based on patterns within the type. So if you want to exclude documents from the index, you can create a dynamic template like the 'store_no_index' one. To control this, you could add a boolean property to the JSON documents stored in Couchbase (e.g. 'storedNotInES') and then match on this property. Every document which matches 'doc.storedNotInES' will match your dynamic template and so will be handled the 'store_no_index' way. Hope this helps.

  2. Hi David,
    Thanks for your explanations. Actually, I need to index but not analyze some of my documents' fields. It seems all fields are analyzed by default. I guess I should add a couchbaseDocument part and alter the mapping characteristics, but I'm not sure exactly. Would you help me?

    Merry Christmas,
    Afshin
