Friday, 20 November 2015

Document Modeling Basics

An often asked question of developers those are new to NoSQL is how to start with the document modeling. This article does not aim to give you answers to all document modeling related questions. It is more a starting point.

Flexible Schema

I am personally not a big fan of the word 'schema free'. My personal opinion is that if we talk about structured data, then we also talk about how to structure the data. (BTW: Couchbase also allows to store unstructured data as binaries. Also semi-structured is supported by e.g. embedding base64 encoded strings into JSON documents.) Couchbase Server does not enforce (on the database side) to follow a specific schema. This brings you more flexibility. Some documents might have a specific property, others might not have it. You don't have to specify upfront that a property might be there and then set it to a NULL value if it is not. So what you have is a flexible schema (or better data model), whereby the application is implicitly providing it. If your application/service is managing user profiles then you will find user documents whereby a user has a first name, last name ... and so on. So 'flexible data model' would be the better term.

Key Patterns

Best practice is to use meaningful key patterns. This helps you directly access a document based on it's context. A key could be a combination of a type and a unique attribute value or it can be also an artificial number. If possible, don’t use artificial numbers. This is indeed not every time possible. The following example shows the key of a user with the email address “mmustermann@domain.com”:

“user::mmustermann@domain.com”
Pattern is: $type::$email

If you know that users are accessible via their email address then you can directly get the user without the need to perform a more complex query. (Whereby querying is e.g. supported via SQL like query language - N1QL - in Couchbase). 

A more interesting pattern would reflect a hierarchy. If you would assume that one employee belongs only to one company (but a company can have multiple employees) then you can reflect this ‘belongs to’ already in the key. The following shows an example of the key of a user who belongs to the company ‘Foo’ which has the domain ‘foo.org’.

“user::foo.org::12345”
Pattern is: $type::$domain::$id

Types

We have already seen that the key pattern often has a type prefix. It is also best practice to store an extra type attribute. This allows you later to filter more specifically based on this type (e.g. to ask for all users). Here an example of a user:

“user::mmustermann@domain.com” : {
  “type” : “user”
  “first_name” : “Max”,
  “last_name” : “Mustermann”,
  “email” : “mmustermann@domain.com”
}

1:1 Relationships

In this case one entity X has a relationship to another one and vice versa. A one to one relationship can be modeled by embedding or by referencing documents. My recommendation would be to model such a relationship in the first step as a key reference and then embed if there are atomicity requirements, which means if there is a requirement to access the two entities most of the times together. The following shows an example of an user who has a session.

“user::mmustermann@domain.com” : {
  “type” : “user”,
  “first_name” : “Max”,
  “last_name” : “Mustermann”,
  “email” : “mmustermann@domain.com”,
  “session” : {
    “source” : “web”,
    “token” : “123456”
  }
}
Embedded Document

The same example by expressing it now as a key reference:

“user::mmustermann@domain.com” : {
  “type” : “user”,
  “first_name” : “Max”,
  “last_name” : “Mustermann”,
  “email” : “mmustermann@domain.com”,
  “session” : “session::mmustermann@domain.com”
}

“session::mmustermann@domain.com” : {
    “type” : “session”,
    “source” : “web”,
    “token” : “123456”,
    “user” : “user::mmustermann@domain.com”
}
Explicitly Referenced Document

It’s easy to see that there is a direct relationship via the id of the user, which is the email address in this case. Because the two documents are anyway correlated via their keys we can in this case simplify it to:

“user::mmustermann@domain.com” : {
  “type” : “user”,
  “first_name” : “Max”,
  “last_name” : “Mustermann”,
  “email” : “mmustermann@domain.com”
}

“session::mmustermann@domain.com” : {
    “type” : “session”,
    “source” : “web”,
    “token” : “123456”
}
Documents those are implicitly referencing each other

When to embed or to reference is not like black or white. It's more transitioning with some grey values in between. Dependent on the access patterns it is indeed also possible to embed a part of the document and reference to another part.


1:Many Relationship

The one to many relationship means that one document references multiple other ones (1 up to n). A back reference from the referenced one might be suitable. Again, I would by default reference and would embed as an optimization step if there are any atomicity requirements. This is only my personal preference, you could also start with embedding documents and then externalize by optimize regarding cardinalities and data duplication. Here a company which has multiple employees:

“company::domain.com” : {
  “type” : “company”,
  “name” : “Some name”,
  “address” : “Some address”,
  “employees” : [ “user::bart.simpson@domain.com”, “user::moe@domain.com”]
}

“user::bart.simpson@domain.com” : {
  “type” : “user”,
  “first_name” : “Bart”,
  “last_name” : “Simpson”,
  “email” : “bart.simpson@domain.com”,
  “company” : “company::domain.com”
}

“user::moe@domain.com” : {
  “type” : “user”,
  “first_name” : “Moe”,
  “email” : “moe@domain.com”,
  “company” : “company::domain.com”
}
1-to-many via key references

Let’s now assume that we embed the users as subdocuments. Another alternative would be to embed them in an array. An array is better if you need only to iterate over the list of embedded documents. If you need to access specific sub-documents  by their id then embedding as nested documents would be preferred. It's identical to the question when you, as a developer, use a List or a Map in order to reflect the associations between your classes.

“company::domain.com” : {
  “type” : “company”,
  “name” : “Some name”,
  “address” : “Some address”,
  “employees” : {
        “user::bart.simpson@domain.com” : {
            “type” : “user”,
            “first_name” : “Bart”,
            “last_name” : “Simpson”,
            “email” : “bart.simpson@domain.com
         },
        “user::moe@domain.com” : {
            “type” : “user”,
            “first_name” : “Moe”,
            “email” : “moe@domain.com
        }
   }
}
1-to-many as embedded document

What happens now if one employee works for multiple companies in the embedded case? Right, you get data duplicates because one user needs now to be fully embedded into 2 company documents. 

At the end it is a question of normalization. A completely normalized schema would contain a lot of relations whereby a de-normalized schema would in the worst case have everything in one table. As in relational databases, the truth is something in the middle. You would not embed everything into one document and you would normally also not express every property as an extra document and then use key references to glue them together. What works best depends on the actual requirements and how you need to access the data.

So you reference in order to avoid duplicates but you embed in order to have atomic access. Your documents are usually in average smaller if you reference but you then might have to perform client side joins (or server side ones via N1QL since Couchbase 4.0). 

The Many-Many relationship (via references) is quite similar to the 1-Many one. It just means that you have reference arrays (arrays of keys to express the relationships) on both sides.

Lookup Documents

A lookup document is a document which has only the purpose to provide you a direct reference to one or multiple other documents. Lookup documents are quite useful to maintain own indexes (alternative access paths) in Couchbase’s cache. Let’s assume you want to access a user profile via a customer number:

“customer_ref::12345” : {
  “ref” : “mmustermann@domain.com”
}

“user::mmustermann@domain.com” : {
  “type” : “user”,
  “first_name” : “Max”,
  “last_name” : “Mustermann”,
  “email” : “mmustermann@domain.com”
}

In order to get an user by his customer id, you can now perform 2 get operations. First you get the lookup document based on the customer id, then you read the ‘ref’ attribute which gives you the key of the associated user document. In the next step you can then access the user directly. This way of access is often more efficient than an exact match query because the index lookup is in this case independent from the number of documents in the bucket or entries in the index which is scanned as part of the query processing.

Atomic Counters

Couchbase allows you to increment counter values. This is a useful feature which helps you for instance to generate Id-s. So it’s similar to sequences in the relational world. The following shows some pseudo-code how to increment the counter value and then reuse it for the id generation.

//“count::user” : “0”
id = client.incr(“count::user”);
client.add(“user::” + id, doc);

A typical pattern would be to perform a multi-get based on a range (e.g. 0...count::user) by taking the counter value into account. You could then skip every non existent document by ignoring the ‘DocNotFound’ error messages. This is indeed prefered if you have evolving data with only a low amount of deletes.

We saw in the section ‘Key patterns’ that keys can reflect hierarchies. So you could easily reflect a 1:Many relationship this way by not using explicit references. A user document belongs to a company document if the corresponding key contains the company prefix. We can get all users of a company by knowing the number of users of the company. Here some pseudo code:

count = client.get(“count::foo.org::user”); //e.g. “563”

for ( i=0; i < count; i++ ) {
    
    doc = client.get(“user::foo.org::” + i);

    if (! doc.err ) {
       
         result.add(doc);
    }
}

Conclusion

Basic document modeling techniques were presented here. Couchbase allows you a flexible data model. As mentioned, the way how to model your data is not always black or white. My personal preference is to:
  •  Start with the logical data model (e.g. derived from Object Oriented Analysis)
  •  Create a stupid and simple initial model (e.g. by using key references for 1:Many relationships all the time)
  •  Evolve and optimize it step by step regarding the requirements (unnecessary reference lists because the reference is clear via the key pattern, access patterns, atomicity, duplicates, ...). 
Here some useful rules/thoughts:
  • Use meaningful keys and speaking key patterns if possible!
  • Use counters for the key generation if there is a need to use artificial ids!
  • Maintain a type attribute!
  • Embed documents into others in order to allow to write/get them all together with the parent document. (Atomic access)
  • If not embedding and if possible (e.g. by using the counter value as part of the key, or by having correlated keys) then express the relationship via the key directly without having the need to reference via key arrays.
  • Reference in order to avoid data duplicates and in order to keep the average document size smaller. Often we just live with duplicates by having other advantages (atomic access, no client side joins). But on the other hand's side we might have such a high amount of data and such a high degree of connectivity that we can't duplicate all the time.
  • If the cardinalities (number of related documents) are too high and there is no requirement for atomicity then referencing would be preferred over embedding.
  • Externalize groups of properties from a document (by adding a 1:1 relationship) if you access this group of properties all the the time together and if the overhead of transferring all the other properties of the document the same time would be too high.
  • It might make sense to externalize reference arrays from a document if the number of references is very high and so you would like to avoid the overhead of transferring these arrays if you usually only access a few properties of the document. 

Monday, 2 November 2015

Using a Key-Value Store for Full Text Indexing and Search

Couchbase Server is a multi-purpose Database System. One of the purposes is to use it as a simple key-value store. A key-value store allows you to store/retrieve any value by its key. Such a value can be a JSON document (Couchbase allows you to index and query based on such JSON documents and so another purpose is the one as document database.), a small binary or a full text index entry. This article explains why such a key-value store can be also used for full text indexing purposes.

Let's explain how full text indexing works in general. A full text index is a so called inverted index. The table below shows how the following sentences would be indexed: 'Tim is sitting next to Bob' and 'Jim is sitting next to Bob'. The word 'Tim' is only existing in the first sentence and there is exactly one occurrence of it.

Term | Count | Reference
------------------------
Tim | 1 | #1
is | 2 | #1, #2
sitting | 2 | #1, #2
next | 2 | #1, #2
to | 2 | #1, #2
Bob | 2 | #1, #2
Jim | 1 | #2


There are a bunch of specialized systems out there for full text indexing. Couchbase has for instance a very good integration with Elasticsearch. In the future Couchbase will also have it's own full text service which is called 'cbft'. However, this article is not about Elasticsearch and also not about 'cbft'. We want to use Couchbase's key-value store features for full text indexing here.

Let's define the data model first:


"fts::$field::$term" : { "count" : $numOf, "refs" : [ ...] }


It is actually quite simple. A full text search index entry does point back for a term to the original key-value pairs those are identified by their keys. The refs array contains these keys. The field is just the field on which we want to search. This could be for instance 'address' or 'message'. Let's say that the default field is called '_all'. So if no specific field is used then the '_all' field is used as the fallback field. 

In order to index based on a provided text, we can do the following:
  • Tokenize the text on which should be indexed. This means basically to break the text up into several words (terms). In our case we assume that our text contains the word 'fox'.
  • Check if the e.g. the key "fts::_all::fox" is already existing. If not then create the document by referencing back to the document id which contained the word 'fox'.
  • If the full text index entry was existent then check if the reference list does already contain the reference to the document which contains the text on which should be indexed.
  • If the reference list does not yet contain the key of this document then extend the reference list by adding the key of the document.
Now in order to search for the specific word 'fox', we just have to do the following:
  • Get "fts::_all::fox"
  • Perform a multi-get based on the keys in the array 'refs'
Some example source code (Node.js) can be found here: https://github.com/dmaier-couchbase/cb-node-fts . The service is implemented here: https://github.com/dmaier-couchbase/cb-node-fts/blob/master/routes/fts.js . This application was created by using CEAN stack tools (Couchbase + Express + Angular + Node.js ). They are available here: http://www.ceanjs.org .

Given that I already wrote this little demo application, it makes sense to try it out :-) . First let's add 2 sentences:
  • the_fox: The quick brown fox jumps over the lazy dog
  • the_cat: The quick brown fox jumps over the lazy cat


Now in the next step let's perform some searches. I implemented the word search in a way that you can enter any sequence of words (separated by white spaces). The following searches for 'cat':



As we can see, only the sentence with the id 'the_cat' contains the word 'cat'. Next lets's search for the word 'fox':



Both sentences contain the word 'fox'. Last but not least let's search for multiple words:


I think you get it ... :-) . The data which is stored in Couchbase looks as the following one:


This article explained how you can use Couchbase to store a full text search index. Such an index can be used for simple and basic text searches, which might be sufficient for some of your development projects. If you need more sophisticated text search or text analysis then a dedicated full text search service might be the better option.