Document Modeling Basics

An often asked question of developers those are new to NoSQL is how to start with the document modeling. This article does not aim to give you answers to all document modeling related questions. It is more a starting point.

Flexible Schema

I am personally not a big fan of the word 'schema free'. My personal opinion is that if we talk about structured data, then we also talk about how to structure the data. (BTW: Couchbase also allows to store unstructured data as binaries. Also semi-structured is supported by e.g. embedding base64 encoded strings into JSON documents.) Couchbase Server does not enforce (on the database side) to follow a specific schema. This brings you more flexibility. Some documents might have a specific property, others might not have it. You don't have to specify upfront that a property might be there and then set it to a NULL value if it is not. So what you have is a flexible schema (or better data model), whereby the application is implicitly providing it. If your application/service is managing user profiles then you will find user documents whereby a user has a first name, last name ... and so on. So 'flexible data model' would be the better term.

Key Patterns

Best practice is to use meaningful key patterns. This helps you directly access a document based on it's context. A key could be a combination of a type and a unique attribute value or it can be also an artificial number. If possible, don’t use artificial numbers. This is indeed not every time possible. The following example shows the key of a user with the email address “mmustermann@domain.com”:

“user::mmustermann@domain.com”

Pattern is: $type::$email

If you know that users are accessible via their email address then you can directly get the user without the need to perform a more complex query. (Whereby querying is e.g. supported via SQL like query language - N1QL - in Couchbase).

A more interesting pattern would reflect a hierarchy. If you would assume that one employee belongs only to one company (but a company can have multiple employees) then you can reflect this ‘belongs to’ already in the key. The following shows an example of the key of a user who belongs to the company ‘Foo’ which has the domain ‘foo.org’.

“user::foo.org::12345”

Pattern is: $type::$domain::$id

Types

We have already seen that the key pattern often has a type prefix. It is also best practice to store an extra type attribute. This allows you later to filter more specifically based on this type (e.g. to ask for all users). Here an example of a user:

“user::mmustermann@domain.com” : {

“type” : “user”

“first_name” : “Max”,

“last_name” : “Mustermann”,

“email” : “mmustermann@domain.com”

}

1:1 Relationships

In this case one entity X has a relationship to another one and vice versa. A one to one relationship can be modeled by embedding or by referencing documents. My recommendation would be to model such a relationship in the first step as a key reference and then embed if there are atomicity requirements, which means if there is a requirement to access the two entities most of the times together. The following shows an example of an user who has a session.

“user::mmustermann@domain.com” : {

“type” : “user”,

“first_name” : “Max”,

“last_name” : “Mustermann”,

“email” : “mmustermann@domain.com”,

“session” : {

“source” : “web”,

“token” : “123456”

}

Embedded Document

The same example by expressing it now as a key reference:

“user::mmustermann@domain.com” : {

“type” : “user”,

“first_name” : “Max”,

“last_name” : “Mustermann”,

“email” : “mmustermann@domain.com”,

“session” : “session::mmustermann@domain.com”

}

“session::mmustermann@domain.com” : {

“type” : “session”,

“source” : “web”,

“token” : “123456”,

“user” : “user::mmustermann@domain.com”

}

Explicitly Referenced Document

It’s easy to see that there is a direct relationship via the id of the user, which is the email address in this case. Because the two documents are anyway correlated via their keys we can in this case simplify it to:

“user::mmustermann@domain.com” : {

“type” : “user”,

“first_name” : “Max”,

“last_name” : “Mustermann”,

“email” : “mmustermann@domain.com”

}

“session::mmustermann@domain.com” : {

“type” : “session”,

“source” : “web”,

“token” : “123456”

}

Documents those are implicitly referencing each other

When to embed or to reference is not like black or white. It's more transitioning with some grey values in between. Dependent on the access patterns it is indeed also possible to embed a part of the document and reference to another part.

1:Many Relationship

The one to many relationship means that one document references multiple other ones (1 up to n). A back reference from the referenced one might be suitable. Again, I would by default reference and would embed as an optimization step if there are any atomicity requirements. This is only my personal preference, you could also start with embedding documents and then externalize by optimize regarding cardinalities and data duplication. Here a company which has multiple employees:

“company::domain.com” : {

“type” : “company”,

“name” : “Some name”,

“address” : “Some address”,

“employees” : [ “user::bart.simpson@domain.com”, “user::moe@domain.com”]

}

“user::bart.simpson@domain.com” : {

“type” : “user”,

“first_name” : “Bart”,

“last_name” : “Simpson”,

“email” : “bart.simpson@domain.com”,

“company” : “company::domain.com”

}

“user::moe@domain.com” : {

“type” : “user”,

“first_name” : “Moe”,

“email” : “moe@domain.com”,

“company” : “company::domain.com”

}

1-to-many via key references

Let’s now assume that we embed the users as subdocuments. Another alternative would be to embed them in an array. An array is better if you need only to iterate over the list of embedded documents. If you need to access specific sub-documents by their id then embedding as nested documents would be preferred. It's identical to the question when you, as a developer, use a List or a Map in order to reflect the associations between your classes.

“company::domain.com” : {

“type” : “company”,

“name” : “Some name”,

“address” : “Some address”,

“employees” : {

“user::bart.simpson@domain.com” : {

“type” : “user”,

“first_name” : “Bart”,

“last_name” : “Simpson”,

“email” : “bart.simpson@domain.com”

“user::moe@domain.com” : {

“type” : “user”,

“first_name” : “Moe”,

“email” : “moe@domain.com”

}

1-to-many as embedded document

What happens now if one employee works for multiple companies in the embedded case? Right, you get data duplicates because one user needs now to be fully embedded into 2 company documents.

At the end it is a question of normalization. A completely normalized schema would contain a lot of relations whereby a de-normalized schema would in the worst case have everything in one table. As in relational databases, the truth is something in the middle. You would not embed everything into one document and you would normally also not express every property as an extra document and then use key references to glue them together. What works best depends on the actual requirements and how you need to access the data.

So you reference in order to avoid duplicates but you embed in order to have atomic access. Your documents are usually in average smaller if you reference but you then might have to perform client side joins (or server side ones via N1QL since Couchbase 4.0).

The Many-Many relationship (via references) is quite similar to the 1-Many one. It just means that you have reference arrays (arrays of keys to express the relationships) on both sides.

Lookup Documents

A lookup document is a document which has only the purpose to provide you a direct reference to one or multiple other documents. Lookup documents are quite useful to maintain own indexes (alternative access paths) in Couchbase’s cache. Let’s assume you want to access a user profile via a customer number:

“customer_ref::12345” : {

“ref” : “mmustermann@domain.com”

}

“user::mmustermann@domain.com” : {

“type” : “user”,

“first_name” : “Max”,

“last_name” : “Mustermann”,

“email” : “mmustermann@domain.com”

}

In order to get an user by his customer id, you can now perform 2 get operations. First you get the lookup document based on the customer id, then you read the ‘ref’ attribute which gives you the key of the associated user document. In the next step you can then access the user directly. This way of access is often more efficient than an exact match query because the index lookup is in this case independent from the number of documents in the bucket or entries in the index which is scanned as part of the query processing.

Atomic Counters

Couchbase allows you to increment counter values. This is a useful feature which helps you for instance to generate Id-s. So it’s similar to sequences in the relational world. The following shows some pseudo-code how to increment the counter value and then reuse it for the id generation.

//“count::user” : “0”

id = client.incr(“count::user”);

client.add(“user::” + id, doc);

A typical pattern would be to perform a multi-get based on a range (e.g. 0...count::user) by taking the counter value into account. You could then skip every non existent document by ignoring the ‘DocNotFound’ error messages. This is indeed prefered if you have evolving data with only a low amount of deletes.

We saw in the section ‘Key patterns’ that keys can reflect hierarchies. So you could easily reflect a 1:Many relationship this way by not using explicit references. A user document belongs to a company document if the corresponding key contains the company prefix. We can get all users of a company by knowing the number of users of the company. Here some pseudo code:

count = client.get(“count::foo.org::user”); //e.g. “563”

for ( i=0; i < count; i++ ) {

doc = client.get(“user::foo.org::” + i);

if (! doc.err ) {

result.add(doc);

}

Conclusion

Basic document modeling techniques were presented here. Couchbase allows you a flexible data model. As mentioned, the way how to model your data is not always black or white. My personal preference is to:

Start with the logical data model (e.g. derived from Object Oriented Analysis)
Create a stupid and simple initial model (e.g. by using key references for 1:Many relationships all the time)
Evolve and optimize it step by step regarding the requirements (unnecessary reference lists because the reference is clear via the key pattern, access patterns, atomicity, duplicates, ...).

Here some useful rules/thoughts:

Use meaningful keys and speaking key patterns if possible!
Use counters for the key generation if there is a need to use artificial ids!
Maintain a type attribute!
Embed documents into others in order to allow to write/get them all together with the parent document. (Atomic access)
If not embedding and if possible (e.g. by using the counter value as part of the key, or by having correlated keys) then express the relationship via the key directly without having the need to reference via key arrays.
Reference in order to avoid data duplicates and in order to keep the average document size smaller. Often we just live with duplicates by having other advantages (atomic access, no client side joins). But on the other hand's side we might have such a high amount of data and such a high degree of connectivity that we can't duplicate all the time.
If the cardinalities (number of related documents) are too high and there is no requirement for atomicity then referencing would be preferred over embedding.
Externalize groups of properties from a document (by adding a 1:1 relationship) if you access this group of properties all the the time together and if the overhead of transferring all the other properties of the document the same time would be too high.
It might make sense to externalize reference arrays from a document if the number of references is very high and so you would like to avoid the overhead of transferring these arrays if you usually only access a few properties of the document.

Search This Blog

nosqlgeek.org