nosqlgeek.org

How to use Redis as a Vector Database for Recommendations

2023-05-19T17:00:00.010+02:00

Introduction

This blog post is the result of some preparation work for a recent meetup, where I introduced a bunch of recommendation engine algorithms. The idea of using vector similarity search for recommendations is quite simple:

The interests of a user are expressed as a vector. Each component of the vector is associated with one category of interest.
If we know the interests of a specific user, we can search for the K nearest neighbours (KNN) to find other users that share the same interests.
We can then inspect the behavior of these users (e.g., the purchase history) to recommend our user specific products.

The following table shows the vector [0.9, 0.7, 0.2]:

*books*	*comics*	*computers*
0.9	0.7	0.2

Let's assume that a user is only interested in a specific category if the interest value is larger than the threshold of 0.4, which means that our user is interested in 'books' and 'comics' but not in 'computers'.
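The threshold rule above can be sketched in a few lines of Python. The category order is taken from the table header; the function name `interests` is only illustrative:

```python
# Categories in the order of the table header above.
CATEGORIES = ["books", "comics", "computers"]
THRESHOLD = 0.4


def interests(vector, categories=CATEGORIES, threshold=THRESHOLD):
    """Return the categories whose interest value is larger than the threshold."""
    return [c for c, value in zip(categories, vector) if value > threshold]


print(interests([0.9, 0.7, 0.2]))  # ['books', 'comics']
```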

How to use Redis

You can use Redis Stack's query and search capabilities to:

Index vectors
Find similar vectors

My source code example wraps the Redis commands by using a VectorDB class. Let's look at some of the methods I implemented for accessing Redis:

init

The Redis connection is established within the constructor of the VectorDB class:

create_index

As the name indicates, this method creates a search index. The default schema has a descr text field, a labels tag field, a numeric field called time, and a vector field named vec. The relevant code within this method is equivalent to the following:
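As a rough sketch (not necessarily what the VectorDB class does internally), the schema described above could be expressed as the arguments of an FT.CREATE command. The index name, key prefix, FLAT algorithm, FLOAT32 type, dimension, and distance metric are all assumptions made for illustration:

```python
def create_index_args(index="idx:users", prefix="user:", dim=3, metric="L2"):
    """Build the argument list of an FT.CREATE command for the schema above.

    All default values are assumptions; only the field names come from the post.
    """
    return [
        "FT.CREATE", index,
        "ON", "HASH",
        "PREFIX", "1", prefix,
        "SCHEMA",
        "descr", "TEXT",
        "labels", "TAG",
        "time", "NUMERIC",
        # "6" is the number of attribute tokens that follow (3 name/value pairs).
        "vec", "VECTOR", "FLAT", "6",
        "TYPE", "FLOAT32", "DIM", str(dim), "DISTANCE_METRIC", metric,
    ]


print(" ".join(create_index_args()))
```

With redis-py, such a list could be sent via `execute_command(*create_index_args())` on an open connection.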

add

This method adds a vector with metadata to the database. I use a Redis hash in this case, but you can also store vectors within JSON with Redis Stack.

The data dictionary contains the fields labels, descr, time, and vec. The vector is stored as binary within the vec field. The library numpy is used to convert a more human-readable float vector (a Python list) to its byte string representation:

Here is an example of such a data dictionary:
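The following sketch shows what such a dictionary could look like. It uses the standard library's `struct` module as a stand-in for numpy so that it runs without extra dependencies, and it assumes 32-bit floats in little-endian byte order; the field values are made up:

```python
import struct
import time


def to_bytes(vector):
    """Pack a list of floats into a little-endian float32 byte string."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_bytes(blob):
    """Inverse of to_bytes: unpack float32 values (4 bytes each)."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Hypothetical item; only the field names come from the post.
data = {
    "labels": "books,comics",
    "descr": "A user who likes books and comics",
    "time": int(time.time()),
    "vec": to_bytes([0.9, 0.7, 0.2]),
}

print(len(data["vec"]))  # 12 bytes: 3 components * 4 bytes
```

On a little-endian machine, numpy's `np.array(vector, dtype=np.float32).tobytes()` should yield the same byte string.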

vector_search

The vector_search method performs the vector similarity search. My implementation only returns the id and vector score. The query string has a few arguments:

Metadata query: The variable meta_data_query is set to the query string that is executed to pre-filter based on the metadata, such as the description (descr) or the labels. The => operator separates the two stages of the query: the expression on its left side is executed first, and the expression on its right side is executed afterwards. So, the metadata query is executed before the vector similarity search is performed.
Number of neighbours: The value of num_neighbours is set to the KNN integer value.
Vector field: This is the vector field that is used for the search. Redis can store multiple vector fields within an item (hash or JSON).

Here is the source code that constructs the vector query string:
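A minimal sketch of such a query string builder, not necessarily identical to the VectorDB implementation. The parameter placeholder `vec_param` and the score alias `vector_score` are assumed names:

```python
def build_knn_query(meta_data_query="*", num_neighbours=2, vector_field="vec"):
    """Combine a metadata pre-filter with a KNN clause.

    The part left of '=>' is the pre-filter, the part right of it is the
    vector similarity search.
    """
    prefilter = "*" if meta_data_query == "*" else f"({meta_data_query})"
    return f"{prefilter}=>[KNN {num_neighbours} @{vector_field} $vec_param AS vector_score]"


print(build_knn_query("@labels:{books}", 2, "vec"))
# (@labels:{books})=>[KNN 2 @vec $vec_param AS vector_score]
```

When the query is executed, the byte string of the search vector is passed as the value of the `vec_param` query parameter, and KNN queries require query dialect 2 or later.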

You can then query the database the following way:

For further details, please look at the vector similarity search reference documentation.

Putting it all together

As explained, I decided to add a thin layer of abstraction by implementing this VectorDB class. The following example shows how to use it:

Create an index
Add some vectors with metadata
Perform a simple query for users that are labeled with specific interests
Execute a vector similarity search for the two nearest neighbours

Here is the source code of the demo application:

The output of this program is:

It's important to understand that a lower score means that a vector is closer to the search vector. In my case, the result is ordered (ascending) by the vector field's score.
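To illustrate why a lower score means "closer", here is a brute-force nearest-neighbour search in plain Python. It assumes Euclidean distance as the metric and uses made-up user vectors; Redis computes the score itself, so this is only a conceptual sketch:

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query, vectors, k):
    """Return the k (key, score) pairs with the smallest distance, ascending."""
    scored = [(key, l2_distance(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical users with [books, comics, computers] interest vectors.
users = {
    "user:1": [0.9, 0.7, 0.2],
    "user:2": [0.8, 0.6, 0.1],
    "user:3": [0.1, 0.2, 0.9],
}

for key, score in knn([0.9, 0.7, 0.2], users, k=2):
    print(key, round(score, 3))
```

The identical vector gets a score of 0, and the score grows as the interests diverge.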

I hope that you enjoyed this blog post. If you didn't find the time to read it entirely, then I also recorded a video walk-through.

Joint project 'AI-supported tool for the simplified cataloguing of museum objects' with the University of Applied Sciences in Augsburg

2023-03-06T12:03:00.006+01:00

We at NoSQL Geeks are pleased about another collaboration with the database group of the Faculty of Computer Science at the University of Applied Sciences in Augsburg. We appreciate the response to our project proposal 'AI-supported tool for the simplified cataloguing of museum objects' and look forward to working with the students. In addition, we were able to win the Mittelschwäbisches Heimatmuseum Krumbach as the party that provides the requirements and as the contact for domain questions about documentation in museums.

Here is a short project description: Museums of every size face the challenge of recording and documenting objects with limited staff. The description of an object comprises its classification (e.g., double-handled vase), material (e.g., porcelain), colour, age, origin, and many other properties. The goal of the project is therefore to build an open-source tool for museum documentation that recognizes and suggests characteristics such as shape, colour, and material. The participating students will have the opportunity to gain practical experience with NoSQL data models, vector embeddings, and vector similarity search.

Further information can also be found on the GitHub project page:

https://github.com/nosqlgeek/ai-meets-museum

An overview of Azure DevOps

2023-03-06T11:33:00.010+01:00

On 17 March 2023 we will meet at 21:00 at the Stückwerk. The speaker, Christian Linke, will give us an overview of 'DevOps with and in the Azure Cloud'.

Further information about the event can be found in our Meetup group here:

https://www.meetup.com/de-DE/nerdkram-mittelschwaben/events/292046891/

Codecamp for Kids

2022-12-02T09:30:00.000+01:00

NoSQL Geeks will organize a code camp for children between 12 and 16 years of age in January. Further details can be found here: https://www.meetup.com/codecamps-by-nosql-geeks/events/290042192/. If you have questions, you can also contact us directly. All contact details can be found at https://www.nosqlgeeks.de.

New meet-up group in Mittelschwaben

2022-12-02T09:20:00.000+01:00

We are happy to share that we will organize and sponsor a meet-up group in Mittelschwaben. Topics include IT, software development, and data management. The idea is to meet regularly in Krumbach (Schwaben). Further details can be found here: https://www.meetup.com/nerdkram-mittelschwaben/.

Do you live or work near Mittelschwaben? Do you want to participate, either as a guest or as a presenter? Then please join our group on meetup.com!

Talk at the University of Applied Sciences in Augsburg about practical use cases of NoSQL

2022-11-21T10:31:00.004+01:00

It was a pleasure visiting the University of Applied Sciences in Augsburg last week. Prof. Dr. Michael Predeschly invited me to give a talk about practical use cases of NoSQL. It was amazing to see all those interested students and to answer their questions about polyglot persistence, the right usage of NoSQL databases, and practical use cases.

Do you want to know more? Then reach out to me. My contact details can be found here: https://www.nosqlgeeks.de.

NoSQL Geeks is part of the Stückwerk Community

2022-11-20T16:03:00.010+01:00

The Stückwerk Community is a project of the "Kult E.V." association in Krumbach (the home town of the company NoSQL Geeks). The idea is to create a place where culture meets social initiatives. There are frequent events such as art exhibitions or intercultural meet-and-greets. NoSQL Geeks will present itself during this year's Christmas market, which takes place in the Stückwerk building in the town center. More about NoSQL Geeks can be found here: https://www.nosqlgeeks.de.

https://vimeo.com/773013927

So what exactly is an Event Loop?

2019-10-25T16:13:00.004+02:00

Introduction

Most of you know that I work a lot with Redis. Some of you might also know that Redis OSS 'standalone' is mainly single-threaded. The reason why it can achieve such high throughput on a single instance is that it uses an event loop. But what the hell is an event loop and how does it work?

First of all, let me tell you the story behind this article. It all started last weekend. For some reason, I found the time to read a Kotlin book. I am not sure why I did; I guess I just had the feeling that I had been disconnected from actual development tasks for too long and wanted to explore one of the comparatively new programming languages. The book was great and I had the impression that Kotlin is actually quite nice. Then I went back to my main task (helping to enable the Technical Field at Redis Labs) and had to work on a slide about the Redis event loop. A look at the source code (and the article https://redis.io/topics/internals-rediseventlib) raised a question for me: Why is there a TimeEvent? The answer might sound simple in the end, but the fact that I dug a bit deeper into the event loop topic gave me the idea to implement a very simple event loop in Kotlin. So I did a bit more research and also found very good explanations of how the Node.js event loop works.

The intention of this article is to share what I learned, using the lightweight event loop that I implemented as an academic example.

What is kEventLib?

kEventLib is a lightweight event loop implementation in Kotlin. As said, this project is of a more academic character. The idea is to illustrate how an event loop works while giving me the chance to play around with Kotlin a bit.

What's an Event?

We define a generic event as something that can happen at a specific time and has a type and a payload:

More specific events are derived from Event:

SimpleEvent: An event without a specific time. It doesn't matter exactly when such an event is executed.
TimedEvent: An event that allows passing a delay, which means that the event should not be executed before this delay has elapsed.

Timed events have a lower priority than non-timed events. So we execute non-timed events first, keeping in mind that timed events are deferred to be executed in the future.
An excellent example of timed events would be 'disk write' events in Redis or async calls in Node.js.

Why do we need an Event Queue and Event Buffer?

I decided to implement two different structures, depending on whether an event is non-timed or timed.

EventQueue: We use the event queue to process the events in the order of their arrival. Node.js uses a stack instead of a queue because calls can be nested, and so a call stack makes more sense.
EventBuffer: This structure is used to buffer timed events. My naive event loop works in a way that timed events are only processed after all non-timed events have been processed. You could certainly think of more sophisticated scheduling approaches.

What does the Event Loop do?

The event loop is a ... loop which runs a function call in a single thread:

Processing an event means first checking whether the event queue is empty. If it is not, we process one of the queued events. If it is empty, we start processing the buffered events. All buffered events whose time lies in the past are processed, whereby the event with the minimum timestamp (the one that became due earliest) is processed first.
It can happen that no event can be processed. Then an EmptyEvent is returned. It can also happen that an error occurs when submitting an event to the loop. This returns an ErrorEvent. Such an error is caused by either the event loop's queue or its buffer being fully utilized. The 'submitter' would then need to implement a back-pressure mechanism.
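The processing rules above can be sketched as follows. This is a simplified Python rendition for illustration, not the Kotlin code of kEventLib; the class layout, the `capacity` parameter, and the choice to hand out one event per call are assumptions:

```python
import time
from collections import deque


class Event:
    """Generic event: a type, a payload and a time (-1 means non-timed)."""

    def __init__(self, type_, payload=None, time_=-1):
        self.type = type_
        self.payload = payload
        self.time = time_


class EventLoop:
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.queue = deque()  # non-timed events, processed first (FIFO)
        self.buffer = []      # timed events, deferred until they are due

    def submit(self, event, delay=None, now=None):
        """Queue a non-timed event or buffer a timed one; full -> ErrorEvent."""
        if delay is None:
            if len(self.queue) >= self.capacity:
                return Event("ErrorEvent", "queue is full")
            self.queue.append(event)
        else:
            if len(self.buffer) >= self.capacity:
                return Event("ErrorEvent", "buffer is full")
            now = time.monotonic() if now is None else now
            event.time = now + delay
            self.buffer.append(event)
        return event

    def process(self, now=None):
        """Return the next event to handle, or an EmptyEvent."""
        if self.queue:
            return self.queue.popleft()
        now = time.monotonic() if now is None else now
        due = [e for e in self.buffer if e.time <= now]
        if not due:
            return Event("EmptyEvent")
        earliest = min(due, key=lambda e: e.time)
        self.buffer.remove(earliest)
        return earliest


loop = EventLoop()
loop.submit(Event("TimedEvent", "write to disk"), delay=0.0)
loop.submit(Event("SimpleEvent", "handle request"))
print(loop.process().payload)  # the non-timed event is handled first
print(loop.process().payload)
```

A real loop would call `process` repeatedly in a single thread and pass each returned event to a handler.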

Show me an Example!

Here is some example code:

The execution output (handling an event just prints some details about it) looks like:

Events that are printed with the prefix '-1' are non-timed events. Otherwise, the prefix is the timestamp to which the event was deferred.

You can see that the non-timed events were executed first. Then we didn't submit non-timed events for a while (due to the sleep after every submission of a timed event), which is the reason why some of the timed events are executed. Then we execute the non-timed events again and finally the deferred timed ones.

I hope you enjoyed reading this post. The full source code can be found here: https://github.com/nosqlgeek/kEventLib.

A simple but special Redis Web Client

2019-05-10T10:42:00.002+02:00

It has been a while since I wrote my last blog post. So let me take the chance to write something about a very simple piece of software which started as a fun project of mine. Let's do it a bit differently this time. I will show you the result first and then I will explain what's special about this application:

This looks very basic, right? What's special about it? So here is the story behind it:

It all started a few weeks ago when I decided to buy an iPad Pro (11 inches). The motivation was certainly not to use it as a development machine. My current role requires me to draw some diagrams and to explain things a bit more visually. So the iPad Pro seemed to be a nice device for such a purpose.
Being a techie, I wondered which kind of development can be done on it and I started to install some tools to experiment with:

Working Copy: A Git client
Pythonista: A Python IDE
StaSh: A shell for Pythonista which allows you to use packages via 'pip'
Blink: A CLI with an SSH client
VNC Viewer: Access the screen of your computer (which is running a VNC server)

Pythonista in particular is a great tool. Here is a screenshot of it:

It all went a bit "crazy" when I went to my barber to get my beard cut. Now my Turkish barber is a very good one, which means that he is very busy (it seems this is a pattern across all industries, be it IT specialists or barbers). So I had to wait for about 2 hours to get a shave. As I already expected some waiting time, I took my iPad Pro with me (e.g., for reading a book). Holding it in my hands, I thought: why not drive this 'How to develop on this device?' idea forward? But which kind of application should I develop? One of the thoughts was to be able to demonstrate Redis (https://redis.io/) itself on the iPad. Wouldn't it be cool to

Just connect a small device with a pen to a big screen (e.g., a projector via USB-C, screen sharing)
Explain some concepts by just drawing like on a whiteboard
Demonstrate some Redis basics by being connected to a Redis Cloud instance (You can get a 30MB database for free here: https://redislabs.com/redis-enterprise/essentials/)

?

Given the fact that I invested some time to write this article, my opinion is clearly: "Yes, it would be cool!". However, if you are still wondering "Why the hell should I want to develop on an iPad Pro?" then I guess the answer is: "Because you can!" ;-) So this is more of a fun project, but it might be a good basic example for:

Understanding some Redis basics
Using a very popular Redis Python client (https://github.com/andymccurdy/redis-py)
Understanding some Jinja2 (http://jinja.pocoo.org/docs/2.10/) HTML templating basics
Learning how to use some Flask basics (http://flask.pocoo.org/) for building a simple web application

So have fun!

Ah, before I forget: If you are searching for a very basic Redis web client or if you are interested in taking a look at the source code, then the source code can be found here for now.

Building a Recommendation Engine with Redis

2019-02-08T15:22:00.002+01:00

When I was asked which topic I would like to present at this year's OOP conference, I was immediately thinking about 'something with Machine Learning' involved. It was years ago at university when I had a secondary focus on 'Artificial Intelligence and Neural Networks', and I think it's fair to say that the topic was not as 'hot' as it is today. The algorithms were the same as today, but the frameworks were not that much of a commodity, and calculations happened either on paper, with MatLab, or with some very specialized software for neural network training. However, the actual discipline stayed fascinating, and even if I would not call myself a Data Scientist (I stuck more with my primary focus, which was Database Implementation Techniques, so I am more of a database guy :-) ), I am really amazed by the adoption and the number of emerging frameworks in the field of Machine Learning and Artificial Intelligence.

Machine Learning or Artificial Intelligence is quite a wide field, so I decided to go with something more specific that has touch points with Machine Learning and AI. The topic I finally went with is:

Redis Modules for Recommender Systems

Most of you might know Redis already, but it may be worth mentioning what Redis actually is:

Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence.

Redis is very popular. Here is a ranking of the most popular database systems (just to highlight how popular Redis is):

https://db-engines.com/en/ranking

Redis is modular, which means that it can be extended by modules (like plug-ins). A list of Redis modules can be found here.

https://redislabs.com/community/redis-modules-hub/

Agenda

As I can already see that this article is becoming a bit longer than a blog article should be, here is an outlook on the topics. I hope this will motivate one or the other to continue reading.

Data Model
Preparations
Content Based Filtering (using Sets)
Collaborative Filtering (using Sets)
Ratings based Collaborative Filtering (using Sorted Sets)
Social Collaborative Filtering (using RedisGraph)
Content Relevance via Full Text Search (using RediSearch)
Probabilistic Data Structures (using the built-in HyperLogLog structure + ReBloom)
Machine Learning for Classifications and Predictions (using Redis-ML and other AI modules)

Data Model

Let's discuss which problem needs to be solved. The following simplified data model helps with that:

User: A real-life person who interacts with a system by showing interest in items. Users can be classified.
Item: The thing users can be interested in. Items can be classified.
Interest Classification: Classifications can be based on item properties, user attributes, the relationship of users to existing items, or the relationship of users to other users and their items. A classification can, for instance, be expressed as a simple 'class membership' or as a number that tells how likely something belongs to a class.
Recommendation: The actual recommendation, i.e., which items could be interesting for a user, is derived from some classifications.

Our example code will basically use the following terms:

Users are just named users
We are looking at specific items, which means Comic books
Several algorithms and approaches are using different kinds of classifications

Preparations

If you want to follow the code samples of this blog post, then you can also find a Jupyter notebook here.

https://github.com/nosqlgeek/rl-recsys/blob/master/notebooks/Redis_for_Recommendations.ipynb

I basically prepared the following Redis instances (one per module):

Redis (r)
Redis + Machine Learning (r_m)
Redis (Bloom Filters) + HyperLogLog (r_b)
Redis + Graph (r_g)
Redis + Full Text Search (r_s)

Python Prep Script

Content Based Filtering

The idea is to look at what a specific user is interested in and then to recommend things that are similar to (e.g., have the same class as) other things the user likes.

Content based filtering can be done based on 'real set' operations. Redis comes with a 'Set' data structure which allows you to perform membership checks, scans, intersections, unions and so on.

Let's look at the following example:

Python Script

The output is:

David could be also interested in: { 'Fantastic Four', 'Wonder Woman', 'Batman', 'Dragon Age', 'Avatar', 'Valerian', 'Spiderman'}
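The set-based idea can be illustrated without a Redis server by using Python's built-in sets, whose union and difference correspond to the SUNION and SDIFF commands. The data below is made up, and excluding already liked items is a design choice of this sketch rather than something taken from the original script:

```python
# category -> comics in that category (hypothetical demo data)
categories = {
    "super_heros": {"Spiderman", "Batman", "Wonder Woman"},
    "fantasy": {"Dragon Age", "Avatar"},
    "science_fiction": {"Valerian"},
}

# user -> comics the user likes (hypothetical demo data)
likes = {"david": {"Spiderman", "Dragon Age"}}


def recommend(user):
    """Recommend comics from every category the user already likes something in."""
    liked = likes[user]
    # A category matches if it shares at least one item with the liked items.
    matching = [items for items in categories.values() if items & liked]
    candidates = set().union(*matching)  # like SUNION over the matching sets
    return candidates - liked            # like SDIFF: drop what is already liked


print(sorted(recommend("david")))  # ['Avatar', 'Batman', 'Wonder Woman']
```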

Collaborative Filtering

The underlying idea is that if person A likes the same things as person B, then person B might also like the other items that are liked by person A. So it's mandatory to have details about many other users collected for a proper classification.

We are again using Redis' Set data structure, especially the union and diff operations.

Let's add some demo data:

Python Demo Data Script

Now let's look at the following example:

Python Script

We can now see which other users B could be interested in the same items as A and then derive a recommendation for the given user A based on the other interests of users B:

Users interested in the same items as David: {'david', 'pieter'}
David is interested in: {'Spiderman', 'Batman'}
David could be also interested in: {'Wonder Woman'}
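The same logic can be sketched with plain Python sets standing in for Redis Sets. The demo data is hypothetical and merely chosen to be consistent with the output shown above:

```python
# user -> liked comics (hypothetical demo data)
likes = {
    "david": {"Spiderman", "Batman"},
    "pieter": {"Batman", "Wonder Woman"},
    "katrin": {"Dragon Age"},
}


def similar_users(user):
    """Users that share at least one liked item with the given user."""
    return {other for other, items in likes.items() if items & likes[user]}


def recommend(user):
    """Union of the similar users' items minus the user's own items."""
    others = similar_users(user) - {user}
    union = set().union(*(likes[other] for other in others))  # like SUNION
    return union - likes[user]                                # like SDIFF


print(sorted(similar_users("david")))  # ['david', 'pieter']
print(recommend("david"))              # {'Wonder Woman'}
```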

Ratings based Collaborative Filtering

We are talking about collaborative filtering again. So the approach is to derive a recommendation from similarities to other users. In addition, we are now interested in 'How much does a user like an item?'. This allows us, for example, to find out whether two or more users like similar things. Items that are liked by user B but not yet liked by user A could also be interesting for user A.

The Redis structure that is used is a 'Sorted Set'. An element in a sorted set has a score. We will use this score as our rating value (1-5, e.g., stars). A very cool feature of sorted sets is that set operations allow you to aggregate scores. By default, the resulting score of an element is the sum of its scores across the considered sorted sets. You can combine aggregations with weights. Weights are multipliers for scores. The weights (1,-1) mean that the second score value is subtracted from the first score value.

We will first find the users B who rated the same items as a specific user A. The idea is then to leverage this aggregation feature in order to calculate the distance vector of ratings between user A and the previously identified users B. We will then use RMS in order to calculate the average distance as a scalar. The root mean square (RMS) value of a set of values is the square root of the arithmetic mean of the squares of the values. Only users with an average rating distance less than or equal to 1 (which means that the users' ratings were very similar) will be considered. Finally, we will recommend items of users B to user A, whereby we only consider items with a score of at least 4.
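The arithmetic behind this approach can be sketched in plain Python, with dictionaries standing in for sorted sets. The ratings are hypothetical, and subtracting the scores of common items mimics an intersection with the weights (1,-1):

```python
import math

# user -> {item: rating} (hypothetical demo data)
ratings = {
    "david": {"batman": 4.0, "superman": 3.0},
    "pieter": {"batman": 3.0, "superman": 4.0, "aqua_man": 5.0},
}


def rating_distance(a, b):
    """Per-item rating difference for the items both users rated."""
    common = ratings[a].keys() & ratings[b].keys()
    return {item: ratings[a][item] - ratings[b][item] for item in common}


def rms(values):
    """Square root of the arithmetic mean of the squared values."""
    values = list(values)
    return math.sqrt(sum(v * v for v in values) / len(values))


def recommend(a, b, max_distance=1.0, min_score=4.0):
    """Items rated at least min_score by b and not yet rated by a."""
    distance = rating_distance(a, b)
    if not distance or rms(distance.values()) > max_distance:
        return []
    return sorted(
        (item, score)
        for item, score in ratings[b].items()
        if item not in ratings[a] and score >= min_score
    )


print(rms(rating_distance("david", "pieter").values()))  # 1.0
print(recommend("david", "pieter"))  # [('aqua_man', 5.0)]
```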

I added some helper functions to the following script. I am not pasting them here again, hoping that their functionality is self-explanatory. The full source code can be found here: https://github.com/nosqlgeek/rl-recsys/blob/master/notebooks/Redis_for_Recommendations.ipynb.

Python Demo Data Script

Let's first find the users B who like the same things as A:

Python Script

The output is:

The following users rated David's items: ['pieter', 'david']

Now let's calculate the similarities and then propose the highest-rated items of users B:

Python Script

The result is that Pieter has a matching rating distance, and so 'Aqua Man' is a highly recommended comic for David (whatever this tells us about Pieter ;-) ):

The rating distance to pieter is [('batman', 1.0), ('superman', -1.0)]
The average distance (RMS) to pieter is 1.0
The following is highly recommended: [('aqua_man', 5.0)]

Social Collaborative Filtering

The previous examples used Sets and Sorted Sets. We are now exploring how to use graphs. Our example takes a social ('friend of') aspect into account. A mathematical graph is described as a set of vertices V and a set of edges E, whereby E is a subset of V x V. Graph database systems extend this mathematical definition to a so-called 'Property Graph Model', which means that vertices and edges can have properties (key-value pairs) associated with them.

Our idea is to find all comics of the friends B of a given user A who are interested in a specific comic category (Super Heroes). Comic books that are liked more often by the friends of user A are more relevant and should be recommended.

I am again skipping the helper functions in order to avoid blowing up this article even more. If you are interested, then the full source code can be found here: https://github.com/nosqlgeek/rl-recsys/blob/master/notebooks/Redis_for_Recommendations.ipynb. The function names are hopefully self-explanatory.

Python Demo Data Script

Here is the actual graph query:

Python Script

As 'Wonder Woman' is liked by 2 of David's friends it is more relevant than the other comics:

David has the following friends: [[['name'], ['Pieter'], ['Vassilis'], ['Katrin']]]
David likes [[['name'], ['Spiderman'], ['Batman']]]
Comic 'Wonder Woman' with relevance 2.000000
Comic 'Batman' with relevance 1.000000
Comic 'Superman' with relevance 1.000000
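The relevance calculation can be approximated without a graph database by counting how many friends like each comic of the wanted category. This is a plain-Python sketch with hypothetical data, not the graph query of the original script:

```python
from collections import Counter

# 'friend of' edges and 'likes' edges as adjacency lists (hypothetical data)
friends = {"david": ["pieter", "vassilis", "katrin"]}
likes = {
    "pieter": ["Wonder Woman", "Batman"],
    "vassilis": ["Wonder Woman", "Superman"],
    "katrin": ["Dragon Age"],
}
category = {
    "Wonder Woman": "Super Heros",
    "Batman": "Super Heros",
    "Superman": "Super Heros",
    "Dragon Age": "Fantasy",
}


def relevance(user, wanted_category):
    """Count, per comic of the wanted category, how many friends like it."""
    counts = Counter(
        comic
        for friend in friends[user]
        for comic in likes[friend]
        if category[comic] == wanted_category
    )
    return counts.most_common()  # most liked comics first


for comic, count in relevance("david", "Super Heros"):
    print(f"Comic '{comic}' with relevance {count}")
```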

Content Relevance via Full Text Search

RediSearch is a search engine module for Redis. It comes with multiple built-in scoring functions. We will look at TF-IDF (term frequency, inverse document frequency). It takes the following aspects into account:

Term Frequency: How often does a specific term appear?
Inverse Document Frequency: An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely (i.e. the relevance of the word 'the'

Further details about scoring can be found here:

https://oss.redislabs.com/redisearch/Scoring/

We are trying to identify how likely something belongs to a specific class/category by performing a text search for terms that are associated with this category.
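To make the scoring idea more tangible, here is a textbook TF-IDF calculation in plain Python. It is not RediSearch's exact scorer (which applies its own normalization), and the document texts are invented for illustration:

```python
import math

# document id -> description (hypothetical demo data)
docs = {
    "spiderman": "spiderman is a super hero a hero with spider powers",
    "batman": "batman is a rich man who fights crime as a hero",
    "avatar": "avatar tells the story of the last airbender",
}


def tf(term, text):
    """Term frequency: share of the words in the text that equal the term."""
    words = text.split()
    return words.count(term) / len(words)


def idf(term):
    """Inverse document frequency: rare terms get a higher weight."""
    containing = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / containing) if containing else 0.0


def search(term):
    """Documents with a positive TF-IDF score, most relevant first."""
    hits = [(doc_id, tf(term, text) * idf(term)) for doc_id, text in docs.items()]
    return sorted((h for h in hits if h[1] > 0), key=lambda h: h[1], reverse=True)


for doc_id, score in search("hero"):
    print(doc_id, round(score, 4))
```

In this made-up data set, 'hero' appears twice in the shorter Spiderman text and once in the Batman text, so Spiderman ranks first.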

Helper functions are again skipped in this article but can be found in the source code repo.

Python Script

Spiderman is more likely a super hero than Batman:

[2, 'spiderman', 0.10000000000000001,['name', 'Spiderman'], 'batman', 0.035714285714285712,['name', 'Batman']]

Probabilistic Data Structures

Probabilistic data structures are characterized in the following way: They ...

use hash functions for randomization purposes
return an approximated result
keep the error under a specific threshold
are much more space efficient than deterministic approaches
provide a constant query time

You would use them because sometimes …

speed is more important than correctness
compactness is more important than correctness
you only need certain data guarantees

It's possible to combine them with deterministic approaches (e.g., HLL + deterministic counter for discovering counter manipulations).

We will take a look at the following two structures:

HyperLogLog: Cardinality estimation of a set, i.e. unique visits
Bloom Filter: Check if an item is contained in a set, whereby false positives are possible

Our example will not use 'unique visits'; we are more interested in how many unique users 'touched' a specific comic. Just imagine a real-life comic book store. A bunch of nerds (including myself ...) are hanging around and browsing for comics. Interesting comics will be taken from the shelf in order to take a closer look. This is what I mean by 'touched'. A comic which is touched more often can be considered more interesting. We can count these unique touches quite space-efficiently by using a HyperLogLog:

Python Script

The output is:

HLL initial size: 31
Approx. count: 4
Please wait ...
Final HLL size: 10590 bytes
Approx. count: 99475

The Bloom filter can be used to check if a user is interested in a specific comic category without storing the users per category in a set:

Python Script

We are also printing out the sizes in order to demonstrate how space efficient Bloom filters are. The output of this script is:

BF size: 115 bytes
BF size: 76 bytes
Is Katrin interested in Fantasy?: 1
Is Katrin interested in Super Heros?: 0
Is David interested in Super Heros?: 1
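For readers who want to see how a Bloom filter works internally, here is a minimal pure-Python sketch. It is not the ReBloom implementation; the bit array size, the number of hash functions, and the use of SHA-256 are arbitrary choices for illustration:

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array plus k hash functions; no false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly contained", False means "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


fantasy = BloomFilter()
fantasy.add("katrin")
print("BF size:", len(fantasy.bits), "bytes")
print("Is Katrin interested in Fantasy?:", "katrin" in fantasy)
```

The filter only stores bits, not the user names, which is why its size stays constant no matter how many users are added (at the cost of a growing false-positive rate).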

Machine Learning for Classifications and Predictions

We are closing this blog article by circling back to its introduction. So far, classifications were often seen as a given (e.g., a comic book belonging to the 'Fantasy' category). Others could be derived by taking the existing user interests into account. We also mentioned in the section 'Data Model' that classifications might be derived from user attributes or item properties. Now, Machine Learning provides us with ways to describe more complex models by taking such attributes (= features) into account. Such features can be represented by structured data (e.g., the comic name, ...) or unstructured data (e.g., the images within a comic book, the colors used within a comic book). Feature vectors could, for instance, be derived from the bitmap of a scanned cover of a comic book. In the end, you can consider every ML approach as a way to approximate a function F(x) -> y, whereby x is a feature vector and y is the output vector. The idea is to create/train a model based on the known values y for given vectors x. These given vectors are called the training features. The goal is to derive a model which is able to approximate/predict a 'good' output vector y for an unknown input vector x.

Here are 2 examples of such models:

Decision tree ensembles (random forests): The idea is to construct a forest of decision trees at training time. RedisML can be used for model serving by leveraging these decision trees, e.g., for classification purposes. The class which appears most often will be the winner.
Neural networks: Train the weighted connections between neurons by using a learning algorithm (i.e. Backpropagation).
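The 'class which appears most often wins' rule of a decision tree ensemble can be sketched as a simple majority vote. The trees below are hand-written stand-ins rather than trained models or Redis-ML calls; apart from the age rule, the features `comics_per_week` and `watches_anime` are hypothetical:

```python
from collections import Counter


def tree_age(user):
    # Rule used later in this post: users with an age <= 20 like Manga comics.
    return "manga" if user["age"] <= 20 else "other"


def tree_reading_habit(user):
    # Hypothetical feature and split value, for illustration only.
    return "manga" if user["comics_per_week"] >= 3 else "other"


def tree_anime(user):
    # Hypothetical feature, for illustration only.
    return "manga" if user["watches_anime"] else "other"


FOREST = [tree_age, tree_reading_habit, tree_anime]


def classify(user, forest=FOREST):
    """Let every tree vote and return the class with the most votes."""
    votes = Counter(tree(user) for tree in forest)
    return votes.most_common(1)[0][0]


print(classify({"age": 16, "comics_per_week": 1, "watches_anime": True}))   # manga
print(classify({"age": 35, "comics_per_week": 5, "watches_anime": False}))  # other
```

An odd number of trees avoids ties in this two-class setting.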

The following example leverages 2 very small decision trees:

Users with an age <=20 like Manga comics
Users with more than 1000 do not like Manga comics

Python Script

I feel that this article could tell much more about 'Neural Networks and Artificial Intelligence', but I hope it's understandable that this is a very wide field, and so I think it is worth writing a dedicated article about Redis for Machine Learning and Artificial Intelligence at a later point in time.

Finally, here are some additional Redis modules that didn't make it into one of the earlier sections:

Neural Redis: A Redis module that implements feed forward neural networks as a native data type for Redis. The project goal is to provide Redis users with an extremely simple to use machine learning experience.
Countminsketch: An approximate frequency counter
Topk: An almost deterministic top k elements counter
Redis-tdigest: T-digest data structure which can be used for accurate online accumulation of rank-based statistics such as quantiles and cumulative distribution at a point.

Thanks for reading! Feedback regarding this article is very welcome!

Asynchronous Operation Execution with Netty on Redis

2018-06-19T18:21:00.005+02:00

Netty got my attention a while back and I just wanted to play a bit around with it. Given the fact that I am already fallen in love with Redis, what would be more fun than implementing a low level client for Redis based on Netty?

Let's begin by answering the question "What the hell is Netty?". Netty is an asynchronous, event-driven network application framework for Java. It helps you to develop high-performance protocol servers and clients.

We are obviously more interested in the client part here, meaning that this article focuses on how to interact with a Redis Server.

Netty already comes with RESP support. The package 'io.netty.handler.codec.redis' contains several Redis message formats:

RedisMessage: A general Redis message
ArrayRedisMessage: An implementation of the RESP Array message
SimpleStringRedisMessage: An implementation of a RESP Simple String message
...
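To make these message types a bit more tangible, here is a small sketch of what they look like on the wire. It is written in plain Python rather than with the Netty classes, and the function names `encode_command` and `decode_simple_string` are made up for this illustration. A command such as GET is sent as a RESP Array of Bulk Strings, and a reply like OK comes back as a Simple String.

```python
def encode_command(*args):
    """Encode a command as a RESP Array of Bulk Strings."""
    parts = [b"*%d\r\n" % len(args)]  # array header: number of elements
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode("utf-8")
        # bulk string: length header, payload, terminating CRLF
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def decode_simple_string(raw):
    """Decode a RESP Simple String reply, i.e. '+', the text and a CRLF."""
    if not (raw.startswith(b"+") and raw.endswith(b"\r\n")):
        raise ValueError("not a RESP simple string")
    return raw[1:-2].decode("utf-8")
```

For example, `encode_command("GET", "hello")` yields `*2\r\n$3\r\nGET\r\n$5\r\nhello\r\n`.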

So all we need to do is to:

Bootstrap a channel: A channel is a nexus to a network socket or component which is capable of I/O operations. Bootstrapping means assigning the relevant components to the channel (event loop group, handlers, listeners, ...) and establishing the socket connection. An example class can be found here.
Define a channel pipeline: We are using an initialization handler in order to add several other handlers to the channel's pipeline. The pipeline is a list of channel handlers, whereby each handler handles or intercepts inbound events or outbound operations. Our channel pipeline has the following handlers: RedisDecoder (inbound handler that decodes into a RedisMessage), RedisBulkStringAggregator (inbound handler that aggregates a BulkStringHeaderRedisMessage and its following BulkStringRedisContents into a single FullBulkStringRedisMessage), RedisArrayAggregator (aggregates RedisMessage parts into an ArrayRedisMessage) and RedisEncoder (outbound handler that encodes a RedisMessage into bytes by following RESP, the REdis Serialization Protocol). Netty first applies the outbound handlers to the passed-in value and then puts the encoded message on the socket. When the response is received, it applies the inbound handlers. The last handler is then able to work with the decoded (pre-handled) message. An example of such a pipeline definition can be found here.
Add a custom handler: We are also adding a custom duplex handler to the pipeline. It is used to execute custom logic when a message is received (channelRead) or sent (write). We are not yet planning to execute business logic based on the RedisMessage but instead just want to fetch it, which means that our handler just allows you to retrieve the result. My handler provides an asynchronous method to do so. The method 'sendAsyncMessage' returns a Future. It's then possible to check if the Future is completed. When it is completed, you can get the RedisMessage from it. This handler buffers the futures until they are completed. The source code of my example handler can be found here.

BTW: It's also possible to attach listeners to a channel. Although I initially found it a good idea to use listeners in order to react to new messages, I had to realize that channel listeners are invoked before the last handler (the last one is usually your custom one), which means that your received message has not yet gone through the channel pipeline when the listener is invoked. So my conclusion is that channel listeners are better suited for side tasks (inform someone that something was received, log a message, ...) than for the message processing itself, whereas handlers are designed to process the received messages. So if you want to use listeners, a better way is to let the handler work with promises and then attach the listener to the promise of a result.

In addition, the following classes were implemented for demo purposes:

GetMsg and SetMsg: They extend the class ArrayRedisMessage by defining what a GET and a SET message look like.
AsyncRedisMessageBuffer: A message buffer which uses a blocking queue in order to buffer outgoing and incoming messages. The Redis Client Handler (my custom handler) does the following: Sending a message puts the Future into the buffer. When the response arrives, the Future is updated and removed from the buffer. Whoever called the 'sendAsyncMessage' method hopefully still has a reference to the just dequeued Future. I used 'LinkedBlockingDeque', which means that the implementation should be thread-safe.
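The buffering idea can be sketched independently of Netty. The following is a simplified Python stand-in, not the Java class from my repository. The class name `AsyncMessageBuffer`, its methods and the `transport` callback are assumptions for this illustration. It relies on the fact that Redis answers the commands of one connection in the order in which they were sent, so a FIFO of pending futures is enough to match responses to requests.

```python
import threading
from collections import deque
from concurrent.futures import Future


class AsyncMessageBuffer:
    """FIFO buffer of pending futures, one per message sent."""

    def __init__(self):
        self._pending = deque()
        self._lock = threading.Lock()

    def send(self, message, transport):
        # Register the future and hand the message to the transport under
        # the same lock, so that buffer order equals send order.
        future = Future()
        with self._lock:
            self._pending.append(future)
            transport(message)
        return future

    def on_response(self, response):
        # Responses arrive in request order: complete the oldest future.
        with self._lock:
            future = self._pending.popleft()
        future.set_result(response)
```

A caller keeps the returned future, checks `future.done()` and later reads the reply via `future.result()`.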

Here is a code example showing how to use the handler in order to execute an asynchronous GET operation:

Hope you enjoyed reading this blog post! Feedback is welcome.

Data Encryption at Rest

2018-06-06T16:09:00.000+02:00

Data security and protection is currently a hot topic. It seems that we have reached the point where the pendulum is swinging back again. After years of voluntary openness by sharing personal information freely with social networks, people are getting more and more concerned about how their personal data is used in order to profile or influence them. Social network vendors are currently getting bad press, but maybe we should ask ourselves the fair question "Didn't we know all the time that their services are not free and that we are paying them with our data?". Maybe not strictly related to prominent (so-called) 'data scandals', but at least following the movement of the pendulum, is the new European GDPR regulation around data protection. Even if I think that it tends to 'overshoot the mark' (as we would say in German) and sometimes leaves data controllers and processors in the dark (unexpected rhyme ...), it is a good reason for me to address some security topics again from a technical point of view. So this article covers the subject of 'Data Encryption at Rest' on Linux servers.

To be more accurate this article is mainly focusing on how to ensure that folders are encrypted under Linux. Linux provides the following ways to encrypt your data:

Partition level: This allows you to define an encrypted partition on a hard drive. I think that this is the most commonly seen way of encrypting data with Linux. Most Linux distributions already provide this option during the installation.
Folder level: Allows you to encrypt specific folders by e.g. mounting them under a specific path. The 'ecryptfs' solution can be used for such a purpose.
File level: It is also possible to encrypt single files. The PGP (Pretty Good Privacy) tools can be used for this purpose.

This article focuses on the 'Folder level' encryption. It has the advantage that you can define encrypted folders on demand without the need to repartition your drives. It also doesn't just work on single files but allows you to mount your folder directly. Each file which is stored in the encrypted folder is encrypted separately. This is especially useful if you want to encrypt only specific data by giving only specific users unencrypted access. One use case would be to only allow your CIFS service (file server service) unencrypted access to the folder. I can also easily see that database systems could leverage this feature, although I didn't test which performance implications might be seen when using folder level encryption with a DBMS.

Step 0 - Install 'ecryptfs'

apt-get -y install ecryptfs-utils

Step 1 - Create a hidden directory: Let's assume that we have a folder /mnt/data which is the mount point of your main data partition (in my case an EXT4 partition on a RAID1 of 2 spinning HDDs). We create a hidden folder named .encrypted there:

mkdir /mnt/data/.encrypted

Step 2 - Create a second folder: This folder is used as our mount point. All access needs to happen via this second folder.

mkdir /mnt/encrypted

Step 3 - Mount the hidden folder as an encrypted one: Let's assume we want to access our encrypted folder under /mnt/encrypted. This means that each write to the newly mounted folder involves the encryption of the written data. Here is a small script which does the job:

#!/bin/bash

# Mount the hidden folder /mnt/data/.encrypted as an eCryptfs file system at /mnt/encrypted
mount -t ecryptfs \
  -o rw,relatime,ecryptfs_fnek_sig=82028e5be8a0a05b,ecryptfs_sig=55028e0be5a0a08a,ecryptfs_cipher=aes,ecryptfs_key_bytes=16,ecryptfs_unlink_sigs \
  /mnt/data/.encrypted /mnt/encrypted

The mount command will ask you for the passphrase. The passphrase will be needed for every remount.

WARNING: If you lose your passphrase, then you will no longer be able to read your previously encrypted data.

This is what's stored in your mounted folder:

root@ubuntu-server:/mnt/encrypted# ls

hello2.txt  hello.txt

root@ubuntu-server:/mnt/encrypted# cat hello.txt 

Hello world!

The original folder, in contrast, contains the encrypted data:

root@ubuntu-server:/mnt/data/.encrypted# ls

ECRYPTFS_FNEK_ENCRYPTED.FWYW-ctPtO0USURgl98vtKSoykT9hmQROUa3cBMaMT0UyWKbxkF7KQOiU---  ECRYPTFS_FNEK_ENCRYPTED.FWYW-ctPtO0USURgl98vtKSoykT9hmQROUa3TeggyUTAxFqhqUkBB.a-Bk--

root@ubuntu-server:/mnt/data/.encrypted# cat ECRYPTFS_FNEK_ENCRYPTED.FWYW-ctPtO0USURgl98vtKSoykT9hmQROUa3cBMaMT0UyWKbxkF7KQOiU---

??tY?ì

?"3DUfw`n6?

           ?3ﯙY7?_?_CONSOLE"?[堠zx?ŷZ?G??铅?ǈ*?9?.fEN??`????R?:??83?F???{???

                                                                            ??_Z&tx?,?2!?w

Access to the decrypted data is possible under the following circumstances:

The logged-in user has permission to read or write the folder /mnt/encrypted
The folder /mnt/data/.encrypted was mounted to /mnt/encrypted by providing the passphrase

In particular, it's no longer possible to read the unencrypted data after physically removing the hard disk from a machine. As said, it's necessary to know the passphrase in order to decrypt the data of this folder again.

Hope this article is helpful :-) .

To PubSub or not to PubSub, that is the question

2018-01-22T21:13:00.002+01:00

Introduction

The PubSub pattern is quite simple:

Publishers can publish messages to channels
Subscribers of these channels are able to receive the messages from them

The publisher has no knowledge about the functionality of any of the subscribers, so they act independently. The only thing which glues them together is a message within a channel.

Here is a very brief example with Redis:

Open a session via 'redis-cli' and enter the following command in order to subscribe to a channel with the name 'public':

In another 'redis-cli' session enter the following command in order to publish the message 'Hello world' to the 'public' channel:

The result in the first session is:

BTW: It's also possible to subscribe to a bunch of channels by using patterns, e.g. `PSUBSCRIBE pub*`

Fire and Forget

If we started additional subscribers after our experiment, then they would not receive the previous messages. So we can only receive messages while we are actively subscribed, meaning that we can't retrieve missed messages afterwards. In other words:

Only currently listening subscribers receive messages
A message is received by all active subscribers of a channel
If a subscriber dies and comes back later then it might have missed messages

PubSub is completely independent from the key space. So whatever is published to a channel will not directly affect the data in your database. Published messages are not persisted and there are no delivery guarantees. However, you can indeed use it to notify subscribers that something affected your key space (e.g. 'The value of item hello:world has changed, you might want to fetch the change!'). So what's the purpose of PubSub then? It's about message delivery and notifications. Each of the subscribers can decide for itself how to handle the received message. Because all subscribers of a channel receive the same message, it's obviously not about scaling the workload itself. This is an important difference in comparison to message queue use cases.
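The fire-and-forget behavior can be illustrated with a tiny in-memory model. This is only an analogy in plain Python, not Redis itself, and the class `InMemoryPubSub` is made up for this sketch. A message goes to whoever is subscribed at publish time and is not stored anywhere.

```python
class InMemoryPubSub:
    """Toy model of fire-and-forget PubSub: nothing is persisted."""

    def __init__(self):
        self._subscribers = {}  # channel name -> list of callbacks

    def subscribe(self, channel, callback):
        self._subscribers.setdefault(channel, []).append(callback)

    def publish(self, channel, message):
        # Deliver to every current subscriber and return the number of
        # receivers, similar to the integer reply of PUBLISH.
        receivers = list(self._subscribers.get(channel, []))
        for callback in receivers:
            callback(message)
        return len(receivers)


bus = InMemoryPubSub()
early, late = [], []
bus.subscribe("public", early.append)
bus.publish("public", "Hello world")     # only 'early' is listening
bus.subscribe("public", late.append)     # subscribes too late for message 1
bus.publish("public", "Second message")  # both receive this one
```

After running this, `early` holds both messages while `late` only holds the second one, which is exactly the 'missed messages' situation described above.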

Message Queues

Message queues, on the other hand, are intended to scale the workload. A list of messages is processed by a pool of workers. As the pool of workers is usually limited in size, it's important that messages are buffered until a worker is free to process them. Redis (Enterprise) features like

Persistency
High Availability

are considerably more important for such a queuing scenario, because such a queue should survive a node failure or a node restart. Fortunately, Redis comes with the List data structure for simple queues and the Sorted Set structure for priority queues.

It's important to state that there are already plenty of libraries and solutions out there for this purpose. Here two examples:

A very simple queue implementation would use a list. Because entries of the list are strings, it would be good to encode messages into e.g. JSON if they have a more complex structure.

Create a queue and inform the scheduler that a new queue is alive:

Add 2 messages to the queue:

Schedule the workers: We could indeed use a more complex scheduling approach. However, the simplest one would be to just assign the next worker of the pool to the next message. So in order to dequeue a message we can just use `LPOP`:

BTW: If our queue is initially empty, then we can wait for a while until something arrives by using the `BLPOP` command.
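The queue handling can be modeled in a few lines of plain Python. This is an in-memory analogy, not a Redis client. A `deque` stands in for the Redis list, appending at the tail is assumed to correspond to an `RPUSH`, taking from the head corresponds to `LPOP`, and the worker names are hypothetical.

```python
import json
from collections import deque
from itertools import cycle

queue = deque()  # stand-in for the Redis list holding the messages
workers = cycle(["worker-1", "worker-2"])  # hypothetical worker pool


def enqueue(message):
    # List entries are strings, so structured messages are JSON-encoded.
    queue.append(json.dumps(message))


def dequeue():
    # Take the oldest entry from the head; None signals an empty queue.
    return json.loads(queue.popleft()) if queue else None


def schedule():
    # Simplest possible scheduling: next worker gets the next message.
    assignments = []
    while queue:
        assignments.append((next(workers), dequeue()))
    return assignments
```

Because each message is taken off the list exactly once, every message is handled by exactly one worker, in contrast to PubSub where every subscriber sees every message.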

Using PubSub is actually optional for our message queue example. It's easy to see that the scheduler could also assign workers without getting notified because it can at any time access the queues and messages. However, I found it a bit more dynamic to combine our queue example with PubSub:

The scheduler gets notified when new work needs to be assigned to the workers
As these notifications are fire and forget, it would be also possible for the scheduler to check from time to time if there is something to do
If the scheduler dies then another instance can be started which can access the database in order to double check which work was already done by the workers and which work still needs to be done. An interrupted job can be restarted based on such state information.

Summary

Redis' PubSub is 'Fire and forget'. It's intended to be used to deliver messages from many (publishers) to many (subscribers). It's indeed a useful feature for notification purposes. However, it's important to understand the differences between a messaging and a message processing use case.

The way we used it was to inform a single scheduler that some work needs to be done. The scheduler would then hand over to a pool of worker threads in order to process the actual queue. The entire state of the queue was stored in our database as a list, because PubSub alone is not intended to be used for message queuing use cases. In fact, the usage of PubSub for our queuing example was optional.

Indexing with Redis

2017-07-27T11:29:00.004+02:00

If you follow my news on Twitter then you might have noticed that I just started to work more with Redis. Redis (= Remote Dictionary Server) is known as a Data Structure Store. This means that we can not only deal with Key-Value Pairs (called Strings in Redis) but also with data structures such as Hashes (Hash-Maps), Lists, Sets or Sorted Sets. Further details about data structures can be found here:

https://redis.io/topics/data-types-intro

Indexing in Key-Value Stores

With a pure Key-Value Store, you would typically maintain your index structures manually by applying some KV-Store patterns. Here are some examples:

Direct access via the primary key: The key itself is semantically meaningful and so you can access a value directly by knowing how the key is structured (by using key patterns). An example would be to access a user profile by knowing the user's id. The key looks like 'user::<uid>'.
Exact match by a secondary key: The KV-Store itself can be seen as a huge Hash-Map, which means that you can use lookup items in order to reference other ones. This gives you a kind of Hash Index. An example would be to find a user by his email address. The lookup item has the key 'email::<email_addr>', whereby the value is the key of the user. In order to fetch the user with a specific email address you just need to do a Get operation on the key with the email prefix and then another one on the key with the user prefix.
Range by a secondary key: This is where it is getting a bit more complicated with pure KV-Stores. Most of them allow you to retrieve a list of all keys, but doing a full 'key space scan' is not efficient (complexity of O(n), n=number of keys). You can indeed build your own tree structure by storing lists as values and by referencing between them, but maintaining these search trees on the application side is really not what you usually want to do.
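The first two patterns can be sketched with a plain Python dictionary acting as the KV store. This is only an illustration of the key patterns. The helper names `put_user` and `get_user_by_email` and the document fields are made up for this example.

```python
import json

kv = {}  # stand-in for a pure Key-Value store


def put_user(uid, email, name):
    user_key = f"user::{uid}"
    kv[user_key] = json.dumps({"uid": uid, "email": email, "name": name})
    # Lookup item: maps the secondary key (email) to the primary key.
    kv[f"email::{email}"] = user_key


def get_user_by_email(email):
    # First Get: resolve the lookup item to the user's key.
    user_key = kv.get(f"email::{email}")
    if user_key is None:
        return None
    # Second Get: fetch the actual user document.
    return json.loads(kv[user_key])
```

Note that the application is responsible for keeping the lookup item in sync, e.g. when a user changes the email address or gets deleted.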

The Redis Way

So how does Redis address these examples? We leverage the power of data structures such as Hashes and Sorted Sets.

Direct Access via the Primary Key

A Get operation already has a complexity of O(1). This is the same for Redis.

Exact Match by a Secondary Key

Hashes (as the name already indicates) can be directly used to build a hash index in order to support exact match 'queries'. The complexity of accessing an entry in a Redis Hash is indeed O(1). Here is an example:

In addition, Redis Hashes support operations such as HSCAN. This provides you with a cursor-based approach to scanning hashes. Further information can be found here:

https://redis.io/commands/scan

Here is an example:

Range By a Secondary Key

Sorted Sets can be used to support range 'queries'. This works by using the value for which we are searching as the score (order number). Scanning a range of such a Sorted Set then has a complexity of O(log(n)+m), where n is the number of elements in the set and m is the result set size.

Here is an example:

If you add 2 elements with the same score then they are sorted lexicographically. This is interesting for non-numeric values. The command ZRANGEBYLEX allows you to perform range 'queries' by taking the lexicographic order into account.
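To illustrate where the O(log(n)+m) comes from, here is a rough in-memory analogy in Python. It is not how Redis implements Sorted Sets internally, and the class `SortedIndex` is made up for this sketch. The entries are kept sorted by (score, member), so a range query is a binary search for the start position followed by a walk over the m matching entries. Unlike a real Sorted Set, this toy version has O(n) inserts and does not update the score of an existing member.

```python
import bisect


class SortedIndex:
    """Toy range index: a list of (score, member) kept in sorted order."""

    def __init__(self):
        self._entries = []

    def add(self, score, member):
        # Tuples sort by score first and by member second, so members
        # with equal scores end up in lexicographic order.
        bisect.insort(self._entries, (score, member))

    def range_by_score(self, low, high):
        # O(log n): binary search for the first entry with score >= low.
        i = bisect.bisect_left(self._entries, (low,))
        result = []
        # O(m): walk forward while the score is still within the range.
        while i < len(self._entries) and self._entries[i][0] <= high:
            result.append(self._entries[i][1])
            i += 1
        return result
```

For instance, indexing users by age and asking for the range 25 to 30 returns all members with a score in that range, ordered by score and then by name.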

Modules

Redis now supports Modules (since v4.0). Modules allow you to extend Redis' functionality. One module which perfectly matches the topic of this blog post is RediSearch. RediSearch basically provides Full Text Indexing and Searching capabilities to Redis. It uses an Inverted Index behind the scenes. Further details about RediSearch can be found here:

http://redisearch.io
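For readers who haven't come across inverted indexes yet, here is a deliberately naive Python sketch of the concept. It has nothing to do with RediSearch's actual implementation. The class `InvertedIndex`, the whitespace tokenizer and the AND semantics of the search are assumptions for this illustration. The index maps each term to the set of documents containing it, so a search only touches the posting sets of the query terms.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: term -> set of ids of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Naive tokenization: lower-case the text and split on whitespace.
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query):
        # A document matches if it contains every term of the query.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

A real full text engine adds things like stemming, scoring and stop words on top of this basic structure.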

Here is a very basic example from the RediSearch documentation:

As usual, I hope that you found this article useful and informative. Feedback is very welcome!

Kafka Connect with Couchbase

2017-04-06T16:44:00.000+02:00

About Kafka

Apache Kafka is a distributed persistent message queuing system. It is used to realize publish-subscribe use cases, to process streams of data in real-time and to store a stream of data safely in a distributed replicated cluster. That said, Apache Kafka is not a database system but can stream data from a database system in near-real-time. The data is represented as a message stream with Kafka. Producers put messages in a so-called message topic and Consumers take messages out of it for further processing. There is a variety of connectors available. A short introduction to Kafka can be found here: https://www.youtube.com/watch?v=fFPVwYKUTHs . This video explains the basic concepts and what Producers and Consumers look like. Couchbase supports 'Kafka Connect' since version 3.1 of its connector. The Kafka documentation says "Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka.". Kafka Connect provides a common framework for Kafka connectors. It can run in a distributed or standalone mode and is distributed and scalable by default.

Setup

Kafka uses Apache Zookeeper. Zookeeper is a cluster management service. The documentation states that "ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications ... ZooKeeper aims at distilling the essence of these different services into a very simple interface to a centralized coordination service."

After downloading and extracting the standard distribution of Apache Kafka, you can start a local Zookeeper instance by using the default configuration the following way:

The next step is to configure 3 Kafka message broker nodes. We will run these services on the same host for demo purposes, but it's obvious that they can also run more distributed. In order to do so we need to create configurations for the broker servers. So copy the config/server.properties file to server-1.properties and server-2.properties and then edit them. The file 'server.properties' has the following settings:

Let's assume that $i is the id of the broker. So the first broker has id '0', listens on port 9092 and logs to 'kafka-logs-0'. The second broker has the id '1', listens on port 9093 and logs to 'kafka-logs-1'. The third broker configuration is self-explanatory.
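The $i scheme can be spelled out with a small Python helper that renders the per-broker settings. This is a sketch under assumptions: the property names `broker.id`, `listeners` and `log.dirs` are the ones I know from Kafka's stock server.properties, while the `/tmp` log root and the helper `broker_properties` itself are made up for this illustration.

```python
def broker_properties(i, base_port=9092, log_root="/tmp"):
    """Render the settings that differ per broker, for broker id i."""
    return "\n".join([
        f"broker.id={i}",                           # unique id per broker
        f"listeners=PLAINTEXT://:{base_port + i}",  # 9092, 9093, 9094, ...
        f"log.dirs={log_root}/kafka-logs-{i}",      # separate log directory
    ])


# Print the settings for the three brokers of this setup.
for broker_id in range(3):
    print(broker_properties(broker_id), end="\n\n")
```

The important point is simply that the id, the port and the log directory must be unique for each broker when they all run on the same host.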

The next step is to download and install the Couchbase plug-in. Just copy the related libraries to the libs sub-folder and the configuration files to the config sub-folder of your Kafka installation.

Streaming data from Couchbase

Before we can stream data from Couchbase we need to create a topic to which we want to stream. So let's create a topic which is named 'test-cb'.

You can then describe this topic by using the following command:

The topic which we created has 3 partitions. Each node is the leader for 1 partition. The leader is the node responsible for all reads and writes for the given partition. The replicas are the nodes that replicate the log for this partition.

Now let's create a configuration file for distributed workers under 'config/couchbase-distributed.properties':

The Connect settings are more or less the default ones. Now we also have to provide the connector settings. If using the distributed mode then the settings have to be provided by registering the connector via the Connect REST service:

The configuration file 'couchbase-distributed.json' has a name attribute and an embedded object with the configuration settings:

The Couchbase settings refer to a Couchbase bucket and the topic name to which we want to stream DCP messages out of Couchbase. In order to run the Connect workers in distributed mode, we can now execute:

The log file contains information about the tasks. We configured 2 tasks to run. The output contains the information which task is responsible for which Couchbase shards (vBuckets):

For now let's just consume the 'test-cb' messages by using a console logging consumer:

One entry looks like this:

We just used the standard value converter. The value is in reality a JSON document but is represented as a Base64-encoded string in this case.
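Getting back to the JSON document is straightforward on the consumer side. The following Python sketch decodes such a Base64 string. The sample document is made up, and which field of the consumed entry carries the encoded content is not covered here.

```python
import base64
import json


def decode_value(encoded):
    """Turn a Base64-encoded JSON string back into a Python object."""
    raw = base64.b64decode(encoded)
    return json.loads(raw.decode("utf-8"))


# Hypothetical document, encoded the same way for demonstration purposes.
sample = base64.b64encode(
    json.dumps({"type": "demo", "value": 42}).encode("utf-8")
).decode("ascii")
document = decode_value(sample)
```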

Another article will explain how to use Couchbase via Kafka Connect as the sink for messages.

Visualizing time series data from Couchbase with Grafana

2016-09-12T20:29:00.002+02:00

Grafana is a quite popular tool for querying and visualizing time series data and metrics. If you follow my blog then you might have seen my earlier post about how to use Couchbase Server for managing time series data:

http://nosqlgeek.blogspot.de/2016/08/time-series-data-management-with.html

This post is about extending this idea by providing a Grafana Couchbase plug-in for visualization purposes.

After you have installed Grafana (I installed it on Ubuntu, but there are installation guides available here for several platforms), you are asked to configure a data source. Before we use Grafana's 'SimpleJson' data source, it's worth looking at what the backend of such a data source has to provide:

'/': Returns any successful response in order to test if the data source is available
'/search': Returns the available metrics. We will just return 'dax' in our example.
'/annotations': Returns an array of annotations. Such an annotation has a title, a time at which it occurs, a text and a tag. We just return an empty array in our example. But you can easily see that it would be possible to create an annotation if a specific value is exceeded or a specific time is reached.
'/query': The request contains a time range and a target metric. The result is an array which has an entry for every target metric, and each of these entries has an array of data points. Each data point is a tuple of the metric value and the time stamp.

We will just extend our example from before with a Grafana endpoint and then point Grafana's generic JSON data source plug-in to it, but I can already see a project on the horizon which standardizes the time series management in Couchbase via a standard REST service which can then be used by a dedicated Grafana Couchbase plug-in.

First let's look at our backend implementation:

As usual, the full code can be found here: https://github.com/dmaier-couchbase/cb-ts-demo/blob/master/routes/grafana.js

Here is how we implemented the backend:

'/': As you can see we just return a 'success:true' if the backend is accessible.
'/search': The only metric which our backend provides is the 'dax' one.
'/annotations': Only an example annotation is returned in this case.
'/query': We just check if the requested metric is the 'dax' one. In this first example, we don't take the aggregation documents into account. Instead we just request the relevant data points by using a multi-get based on the time range. Because Grafana expects the datapoints in time order, we have to finally sort them by time. Again, this code will be extended in order to take the several aggregation levels into account (Year->Month->Day->Hour).
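The shape of the '/query' result can be sketched as follows. This is a Python illustration of the response structure, not the Node.js code from the repository, and the function name `build_query_response` as well as the input format are assumptions. Each data point is a pair of value and time stamp (in milliseconds), and the points are sorted by time before they are returned.

```python
def build_query_response(target, points):
    """points: iterable of (timestamp_ms, value) pairs in any order."""
    if target != "dax":
        return []  # 'dax' is the only metric this backend knows
    ordered = sorted(points, key=lambda point: point[0])  # time order
    return [{
        "target": target,
        # Grafana expects [value, timestamp] pairs
        "datapoints": [[value, ts] for ts, value in ordered],
    }]
```

The returned list would then be serialized as the JSON body of the '/query' response.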

Now back to Grafana! Let's assume that you successfully installed the 'SimpleJson' data source:

Then the only thing you need to do is to add a new data source to Grafana by pointing to our backend service (To run the backend service, just execute 'node app.js' after you checked out the full repository and installed all necessary dependencies.):

In this example, I actually just loaded a bit of random data for testing purposes by using the demo_data.js script.

Then all you have to do is to create a Dashboard and place a panel on it:

The rest should work more or less the same as with any other Grafana data source. :-)

Time series data management with Couchbase Server

2016-08-26T09:27:00.001+02:00

Couchbase Server is a Key Value store and Document database. The combination of being able to store time series entries as KV pairs with the possibilities to aggregate data automatically in the background via Map-Reduce and the possibility to dynamically query the data via the query language N1QL makes Couchbase Server a perfect fit for time series management use cases.

The high transaction volume seen in time series use cases means that relational database systems are often not a good fit. A single Couchbase Cluster, on the other hand, might support hundreds of thousands (up to millions) of operations per second (depending on the node and cluster size).

Time series use cases seen with Couchbase are for instance:

Activity tracking: Track the activity of a user, whereby each data point is a vector of activity measurement values (e.g. location, ...)
Internet of things: Frequently gather data points of internet-connected devices (such as cars, alarm systems, home automation devices, ...), store them as a time series and aggregate them in order to monitor and analyse the device behavior
Financial: Store currency rates or stock prices ('courses') as time series in order to run analyses (e.g. predictive analysis) based on this data. A stock chart typically shows a time series.
Industrial Manufacturing: Getting measurement values from machine sensors in order to analyse the quality of parts.

But before we start digging deeper into an example, let's talk a bit about the background of time series data management:

A time series is a series of data points in time order. Mathematically speaking, a time series is expressed as a discrete function with (simplified) two dimensions. The first dimension (x-axis) is the time. The second dimension (y-axis) is the data point value, whereby a data point value can again be a vector (which actually makes it 1+n dimensional, whereby n is the vector size). Most commonly the values on the time axis are on an equidistant grid, which means that the distance between any two neighboring x values x_i and x_(i+1) is the same.

So what to do with such a time series?

Analyse the past: Statistics, reporting, ...
Real-time analysis: Monitor current activities, find anomalies, ...
Predictive analysis: Forecast, estimate, extrapolate, classify, ...

Good, time to look at an example. First we need a data source which is frequently providing changing data. Such data could be financial courses, a sensor measurement, a human heart beat and so on.

Let's take a financial course. Google is providing such information via 'Google Finance'. So in order to get the current course of the DAX (this might tell you where I am living ;-) ), you just have to open up https://www.google.com/finance?q=INDEXDB%3A.DAX. In order to get the same information as JSON you can just use https://www.google.com/finance/info?q=INDEXDB%3ADAX .

What we get by accessing this API is:

So far so good. Now let's write a little Node.js application (by using http://www.ceanjs.org) which polls every minute for the current course and then writes it into Couchbase. To be more accurate: we actually fetch every 30 seconds in order to reach the granularity of a minute. In this example we decided on minute granularity, but it would work in a similar way for e.g. a granularity of seconds. We also simply treat the last value fetched within a minute as the minute value. A more sophisticated approach would be to store the (at most) 2 gathered values in an array in our minute document and to aggregate over them (the average as the minute value instead of the last one). It's a question of accuracy. The key of such a data point depends on the time stamp. We are just interested in the course value 'l', the difference 'c' and the time stamp 'lt_dts'. The job logic then looks as follows:

BTW: The full source code can be found here: https://github.com/dmaier-couchbase/cb-ts-demo/blob/master/course_retrieval_job.js

This then looks as follows in Couchbase:

Fine, so what's next? Let's start with direct access to time series values. In order to fetch all values for a given range, you don't need any index structure because:

The discrete time value is part of the key. So our time-axis is directly expressed via the key space.

It's also easy to see that the JSON document value is more or less a vector (as defined above)

So let's write a little service which takes a start time stamp and an end time stamp as a parameter in order to provide you all the requested values.
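The reason why no index is needed can be shown with a few lines of Python. Since the minute is part of the key, all keys of a time range can simply be computed and then fetched via a multi-get. The key pattern `dax::<minute>` and the helper `minute_keys` are assumptions for this sketch (the real key format of the demo may differ), and the end of the range is treated as inclusive here.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M"  # e.g. 2016-08-25T13:15


def minute_keys(start, end, prefix="dax"):
    """Compute one key per minute between start and end (inclusive)."""
    current = datetime.strptime(start, TIME_FORMAT)
    last = datetime.strptime(end, TIME_FORMAT)
    keys = []
    while current <= last:
        keys.append(f"{prefix}::{current.strftime(TIME_FORMAT)}")
        current += timedelta(minutes=1)  # equidistant grid: one minute
    return keys
```

The resulting key list is what you would pass to a multi-get, and minutes without a stored data point simply come back empty.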

The service code could look like this:

The full code can be found here: https://github.com/dmaier-couchbase/cb-ts-demo/blob/master/routes/by_time.js

It just takes the start and end time in the following format:

http://localhost:9000/service/by_time?start=2016-08-25T13:15&end=2016-08-25T13:20

The output looks like this:

Let's next calculate some statistics based on these values. For this purpose we will create some aggregate documents. As you might already imagine, we will aggregate based on the time. The resulting time dimension for these aggregates will be 'Year -> Month -> Day -> Hour'. So there will be:

An hour aggregate: It aggregates based on the minutes time series. There are 60 minutes per hour to aggregate.
A day aggregate: It aggregates based on the hour aggregates. There are 24 hours per day.
A month aggregate: It aggregates based on the day aggregates. There are between 28 and 31 days per month.
A year aggregate: It aggregates based on the month aggregates. There are 12 months per year.

I guess you got it :-) ...

So how to build these aggregates? There are multiple ways to do it. Here just some of them:

Use the built-in views and write the view results for a specific time range back to Couchbase
Execute a N1QL query by using aggregate functions
Do the calculations on the client side by fetching the data and write the results back
Load or stream the data into Spark in order to do the necessary calculations there and write the results back to Couchbase

Let's have a look at Views first. Views provide built-in map-reduce. We want to calculate the following statistic values:

The average value of the course
The maximum value of the course
The minimum value of the course

We will just create one View for this. The following map and reduce functions are created on the Couchbase Server side:

The request parameters for aggregating directly over one hour look like this:

http://localhost:8092/ts/_design/time/_view/stats?stale=false&startkey=[2016,8,25,14,0,0]&endkey=[2016,8,25,15,0,0]&inclusive_end=false&reduce=true&group_level=4

The result is the hour aggregation based on the time series data points:

It's easy to see that this also gives us direct access to a time function that has the hour (and no longer the minute) as the step on the time axis. The data points are then the aggregated values. The same View can be used to get the daily, monthly, and yearly aggregation values. The trick is to set the range parameters and the group level correctly. In the example above, 'group_level=4' was used because the hour is at the fourth position of the date array that was emitted as the key. To get the daily aggregation, just use a query like this:

http://localhost:8092/ts/_design/time/_view/stats?stale=false&startkey=[2016,8,25,0,0,0]&endkey=[2016,8,26,0,0,0]&inclusive_end=false&reduce=true&group_level=3
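To see what the range parameters and the group level do, here is a small Python emulation. It is only a sketch under assumptions: it models the emitted key as the array [year, month, day, hour, minute, second], it compares keys lexicographically (which matches the ordering for arrays of integers), and the function name view_stats as well as the sample values are made up for illustration.

```python
from collections import defaultdict


def view_stats(rows, startkey, endkey, group_level):
    """Emulate a grouped reduce over rows of (key, value).

    Keys are [year, month, day, hour, minute, second]. The end key is
    exclusive, which corresponds to inclusive_end=false.
    """
    groups = defaultdict(list)
    for key, value in rows:
        if startkey <= key < endkey:
            # group_level=N groups by the first N elements of the key
            groups[tuple(key[:group_level])].append(value)
    return {
        group: {
            "avg": sum(values) / len(values),
            "max": max(values),
            "min": min(values),
            "count": len(values),
        }
        for group, values in groups.items()
    }


rows = [
    ([2016, 8, 25, 14, 0, 0], 10.0),
    ([2016, 8, 25, 14, 1, 0], 12.0),
    ([2016, 8, 25, 15, 0, 0], 20.0),
]

hourly = view_stats(rows, [2016, 8, 25, 14, 0, 0], [2016, 8, 25, 15, 0, 0], group_level=4)
daily = view_stats(rows, [2016, 8, 25, 0, 0, 0], [2016, 8, 26, 0, 0, 0], group_level=3)
```

With group_level=4 the two data points of hour 14 end up in one group; with group_level=3 all three data points of the day do.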

Now let's create an aggregation service that uses this View result to return the aggregation for a specific hour. It queries the aggregate for a given hour and stores the result as an aggregate document if the hour is already complete (i.e., if it has 60 data points). In reality, you could also run a job to make sure that the aggregates are built upfront. In this demo application, we just build them at access time. The next time, they are no longer read from the View but directly from the KV store.

Here is the code of the service:

The full code can be found here: https://github.com/dmaier-couchbase/cb-ts-demo/blob/master/routes/agg_by_hour.js

The result in Couchbase would then be:

You might be asking yourself: 'What if I want to aggregate at a specific aggregation level but also need to take the last minutes (the highest granularity in our example) into account?' The answer is to combine the two approaches: direct access to the minute data points and the lower-granularity aggregates. Here is an example: if you want to access everything from 14:00 until 15:02, where the hour starting at 15:00 is not yet complete, you can do this by using the following formula:

Agg(14:00) + Agg(t_15:00, t_15:01, t_15:02)

It's easy to see that you can derive additional formulas for other scenarios.
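The formula can be sketched in Python as follows. One assumption is worth stating loudly: to merge aggregates correctly, this sketch keeps the count and the sum in each aggregate instead of only the average, because two averages cannot simply be averaged again when they cover a different number of data points. The function names and the sample values are hypothetical.

```python
def aggregate(values):
    """Aggregate raw data points into a mergeable summary."""
    return {"count": len(values), "sum": sum(values),
            "min": min(values), "max": max(values)}


def merge(a, b):
    """Combine two summaries, e.g. a full hour and a few raw minutes."""
    return {"count": a["count"] + b["count"],
            "sum": a["sum"] + b["sum"],
            "min": min(a["min"], b["min"]),
            "max": max(a["max"], b["max"])}


def average(agg):
    return agg["sum"] / agg["count"]


hour_14 = aggregate([10.0] * 60)        # Agg(14:00), a complete hour
tail = aggregate([11.0, 12.0, 13.0])    # Agg(t_15:00, t_15:01, t_15:02)
combined = merge(hour_14, tail)
combined_avg = average(combined)
```

The combined average is weighted by the number of data points, so the 60 values of the full hour dominate the three trailing minutes.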

A related question is how long you should keep the highest-granularity values. One year has 525,600 minutes, so we would get 525,600 minute documents every year. For this use case, we could decide to remove the minute documents after some time (Couchbase even comes with a TTL feature that lets them expire automatically), because it's unlikely that someone is interested in more than the daily course after one year. How long you keep the finest-granularity data points depends on your requirements and on how fine your finest granularity actually is.

OK, so this blog article is already getting quite long. Another one will follow, which will cover the following topics:

Visualizing time series data
How to query time series data with N1QL
Predictive analysis of time series data with Couchbase and Apache Spark

Caching in JavaEE with Couchbase

2016-07-01T13:47:00.003+02:00

One of Couchbase Server's typical use cases is caching. As you might know, it is a KV store. The value of a KV pair can be a JSON document. It is not just the fact that Couchbase Server can store JSON documents that makes it a document database; it is rather the fact that you can index and query JSON data that defines its character as a JSON document database. Back to the KV store: if you configure the built-in managed cache so that all your data fits into memory, then Couchbase Server is used as a highly available distributed cache.

If you are a Java developer, one of your questions might be whether it makes sense to use Couchbase as a cache for your applications. I had several projects where EhCache was replaced by Couchbase because of the garbage collection implications. The performance was often considerably better with a centralized, low-latency (sub-millisecond) cache than with one that was colocated with the application instances. This depends on several factors (size of the cache entries, number of cache entries, access throughput). The next question might be how to best integrate such a cache into your application. A typical pattern is:

Try to read the data from the cache
If it is there, then use it
If it is not there, then get the data from the source system (e.g. a relational DBMS)
Put it into the cache
The next time you try to access the same data, it will most probably be in the cache
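The steps above can be sketched in a few lines of Python. This is only an illustration of the pattern: the two dictionaries stand in for the Couchbase bucket and for the source system, and all names are hypothetical.

```python
cache = {}                                 # stand-in for the Couchbase bucket
source = {"user::1": {"name": "Max"}}      # stand-in for the source system
source_reads = 0                           # counts the expensive lookups


def load_from_source(key):
    """Simulate the expensive read from the source system."""
    global source_reads
    source_reads += 1
    return source[key]


def get(key):
    # 1. Try to read the data from the cache
    if key in cache:
        return cache[key]          # 2. It is there, so use it
    # 3. It is not there, so get it from the source system
    value = load_from_source(key)
    # 4. Put it into the cache for the next access
    cache[key] = value
    return value


first = get("user::1")     # cache miss, reads from the source
second = get("user::1")    # cache hit, the source is not touched
```

A real cache would additionally need an expiration or invalidation strategy, which this sketch leaves out.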

Couchbase's Java SDK is quite simple for CRUD operations:

C: Insert
R: Get
U: Update, Replace
D: Remove

So as soon as you have established a connection to a Bucket (a data container), you can use it as a cache. However, this involves implementation work on your side.

I just looked at the Java standard JCache and also took the chance to play around a bit with CDI (Contexts and Dependency Injection). JCache is implemented by several providers and, look at that, there is already a Developer Preview of a Couchbase implementation available (http://blog.couchbase.com/jcache-dp2).

Side note: The Couchbase JCache implementation is not yet officially released. Couchbase also has a good Spring Data integration, which comes with caching support as well.

So let's get started. First we need to have a cache instance which we can use for caching purposes.

As you can see, we are creating a CacheProvider, then a Config, and finally a CacheManager in order to access the Cache. Our cache is an object cache in which objects are stored under a string key. The Factory ensures that we have only a single instance of our ObjectCache. It does not use CDI and so can also be used with Java SE. In the real world, you would probably not use constants for the factory configuration, but it seemed to be sufficient for this example.

Now let's use the factory. Actually we misuse it a bit here because we use it in a Producer. In a pure CDI world, you would just use the code for initializing the cache in the producer method. So the producer is actually your factory, whereby the producer method acts as a source of objects to be injected. The annotation 'CBObjectCache' was bound to the producer.

Now that we have a producer, we can just inject CBObjectCache somewhere else. Let's do this in an Interceptor. We will use this interceptor later in order to cache objects automatically when a method is called. The annotation 'Cached' is bound to our interceptor.

Now, in order to use our interceptor, we just have to annotate a method that should cache the passed data. The example below shows that the method 'createHelloMessage' is annotated with 'Cached'. So before the actual method code is executed, the value of the variable 'name' will be cached in Couchbase. To prove this, the value is fetched again in the method body and printed out by the 'HelloWorldServlet'.

Before I forget, here is how it looks in Couchbase:

Hope this small introduction to JCache and CDI was interesting for you. :-)

The full source code can be found here: https://github.com/dmaier-couchbase/cb-jboss/tree/master/hello-jcache-cdi .

How to build Couchbase Server

2016-06-01T11:07:00.000+02:00

Couchbase Server is open source under the Apache 2 license. Even though a user would normally not build it from the source code (in fact, custom-built versions are not officially supported by Couchbase), you might want to participate in the Couchbase community by contributing some lines of code. The first thing you need is the ability to build Couchbase Server from the source code.

The Couchbase Server source code is not in just one repository. Instead, it is spread over multiple Git repositories. A tool that can be used to abstract the access to these multiple Git repositories is 'repo'. So 'repo' is a repository management tool on top of Git. It's also used by Google for Android, so a short documentation can be found here: https://source.android.com/source/using-repo.html . The installation instructions are available at http://source.android.com/source/downloading.html#installing-repo .

Here are some 'repo' commands:

repo init: Installs the repository to the current directory
repo sync: Downloads the new changes and updates the working files in the local directory
repo start: Begins a new branch for development, starting from the revision specified in the manifest

Repo is using manifest files. The Couchbase manifest files can be found here: https://github.com/couchbase/manifest . Let's take a look into one of these files (e.g. /released/4.5.0-beta.xml):

<remote name="couchbase" fetch="git://github.com/couchbase/" review="review.couchbase.org" />
...
<default remote="couchbase" revision="master" />
<project name="bleve" remote="blevesearch" revision="760057afb67ba9d8d7ad52f49a87f2bf9d31a945" path="godeps/src/github.com/blevesearch/bleve"/>
...

As you can see, the manifest includes the Git repositories that contain Couchbase's dependencies. By default, the master branch is referenced here. Each dependency can be pinned to a specific Git hash or branch name in order to make sure that you build against the right version of the dependent library.
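How the default and the per-project revision interact can be illustrated with a few lines of Python that parse a manifest and resolve the effective revision of each project. This is only a sketch, not how 'repo' itself is implemented; the project named 'example-project' is made up in order to show the fallback to the default.

```python
import xml.etree.ElementTree as ET

# Shortened manifest. 'example-project' is a hypothetical entry that
# carries no revision of its own.
MANIFEST = """<manifest>
  <remote name="couchbase" fetch="git://github.com/couchbase/" review="review.couchbase.org" />
  <default remote="couchbase" revision="master" />
  <project name="bleve" remote="blevesearch" revision="760057afb67ba9d8d7ad52f49a87f2bf9d31a945" path="godeps/src/github.com/blevesearch/bleve"/>
  <project name="example-project" />
</manifest>"""

root = ET.fromstring(MANIFEST)
default = root.find("default").attrib


def resolve(project):
    """Use the project's own remote/revision, else the manifest default."""
    return {
        "name": project.get("name"),
        "remote": project.get("remote", default["remote"]),
        "revision": project.get("revision", default["revision"]),
    }


projects = [resolve(p) for p in root.findall("project")]
```

The pinned project keeps its Git hash, whereas the project without a revision falls back to the master branch.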

Before we build, it's required to have at least make and cmake installed on your build box. If they are missing, the build will fail and tell you what's required. I already had a C development environment, Python, and Go installed on my computer. The build of Couchbase is actually quite simple:

cd --
mkdir -p src/couchbase
cd src/couchbase
repo init -u git://github.com/couchbase/manifest.git -m 
repo sync
make

The built version of Couchbase is then available in the sub-folder 'install'.

Couchbase Server 4.5's new Sub-Document API

2016-05-13T13:12:00.001+02:00

Introduction

The Beta version of Couchbase Server 4.5 has just been released, so let's try it out! A complete overview of all the great new features can be found here: http://developer.couchbase.com/documentation/server/4.5/introduction/intro.html. This article will highlight the new Sub-Document API feature.

What's a sub-document? The following document contains a sub-document which is accessible via the field 'tags':

So far

With earlier Couchbase versions (<4.5), updating a document had to follow this pattern:

Get the whole document which needs to be updated
Update the document on the client side (e.g. by only updating a few properties)
Write the whole document back

A simple Java code example would be:

Now with 4.5

The new sub-document API is a server side feature which allows you to (surprise, surprise ...) only get or modify a sub-document of an existing document in Couchbase. The advantages are:

Better usability on the client side

CRUD operations can be performed based on paths
In cases where the modification doesn't rely on the previous value, you can update a document without the need to fetch it upfront
You can more easily maintain key references between documents

Improved performance

It saves network bandwidth and improves latency because you don't need to transfer the whole document over the wire

The sub-document API also allows you to get or modify inner values or arrays of a (sub-)document.

Lookup operations: Query one or multiple paths in a document, e.g. GET, EXISTS
Mutation operations: Modify one or multiple paths in a document, e.g. UPSERT, ARRAY_APPEND, COUNTER

A more detailed description of the API can be found in the Couchbase documentation: http://developer.couchbase.com/documentation/server/4.5-dp/sub-doc-api.html .

The update of a document can now follow this pattern:

Directly update a property or sub-document by specifying the path under which it can be found
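The difference between the two patterns can be illustrated with a small in-memory simulation in Python. To be clear about the assumptions: this is not the Couchbase SDK, the dictionary only stands in for the server, and the function names full_update and subdoc_upsert as well as the dotted path syntax are chosen for this sketch.

```python
import copy

# In-memory stand-in for the server-side document store
store = {"user::1": {"name": "Max", "tags": {"city": "Berlin", "team": "A"}}}


def full_update(key, change):
    """Old pattern: fetch the whole document, change it, write it back."""
    doc = copy.deepcopy(store[key])   # 1. the whole document travels to the client
    change(doc)                       # 2. modify it on the client side
    store[key] = doc                  # 3. the whole document travels back


def subdoc_upsert(key, path, value):
    """New pattern: set the value at a dotted path on the 'server' side."""
    node = store[key]
    *parents, leaf = path.split(".")
    for name in parents:
        node = node.setdefault(name, {})   # create intermediate objects if missing
    node[leaf] = value                     # only path and value travel over the wire


full_update("user::1", lambda d: d["tags"].update(city="Munich"))
subdoc_upsert("user::1", "tags.team", "B")
```

In the second call, the client never sees the document at all, which is where the bandwidth and latency savings come from.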

Our Java example would now be simplified to:

Optimistic "locking"

Couchbase Server does not have a built-in transaction manager, but when you talk about transactional behavior, the requirements are quite often lower than what an ACID transaction manager would provide (e.g. just handling concurrent access instead of being fully ACID compliant). In Couchbase, a document has a so-called C(ompare) A(nd) S(wap) value. This value changes as soon as the document is modified on the server side. A CAS-based update works as follows:

Get a document with a specific CAS value
Change the properties on the client side
Try to replace the document by passing the old CAS value. If the CAS value has changed on the server side in the meantime, then you know that someone else modified the document, and so you can retry applying your changes.

So CAS is used for an optimistic locking approach. It's optimistic because you expect that you can apply your changes, and you handle the case in which this wasn't possible because someone else changed the document before you. A pessimistic approach would be to lock the document upfront so that no one else can write it until the lock is released again.
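The optimistic approach boils down to a retry loop. The following Python sketch simulates it with an in-memory store. The assumptions: the Store class, the CasMismatch exception, and the use of a simple counter as the CAS value are all made up for illustration; a real CAS value should be treated as opaque.

```python
class CasMismatch(Exception):
    """Raised when the document changed since it was fetched."""


class Store:
    def __init__(self):
        self._data = {}   # key -> (cas, document)

    def upsert(self, key, doc):
        cas = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (cas, doc)
        return cas

    def get(self, key):
        cas, doc = self._data[key]
        return cas, dict(doc)   # hand out a copy, like a fetched document

    def replace(self, key, doc, cas):
        current, _ = self._data[key]
        if current != cas:
            raise CasMismatch(key)
        self._data[key] = (current + 1, doc)
        return current + 1


def update_with_retry(store, key, change, max_attempts=5):
    for _ in range(max_attempts):
        cas, doc = store.get(key)        # 1. get the document and its CAS value
        change(doc)                      # 2. change it on the client side
        try:
            return store.replace(key, doc, cas)   # 3. replace with the old CAS
        except CasMismatch:
            continue                     # someone else was faster, try again
    raise RuntimeError("too many concurrent modifications")


store = Store()
store.upsert("counter", {"value": 0})
stale_cas, _ = store.get("counter")      # another client remembers this CAS
update_with_retry(store, "counter", lambda d: d.update(value=d["value"] + 1))

try:
    store.replace("counter", {"value": 99}, stale_cas)
    conflict_detected = False
except CasMismatch:
    conflict_detected = True             # the stale write is rejected
```

The client holding the stale CAS value does not overwrite the increment; it is told about the conflict and can fetch the document again.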

You could now ask the following question:

What happens if I modify a sub-document and someone else updates the same or another sub-document of the same document?

Sub-document operations are atomic. Atomicity means all or nothing. So if you update a sub-document and don't receive an error message, then you can be sure that the update was performed on the server side. This means that if 5 clients each append an element to an embedded array, then you can be sure that all 5 values were appended. However, atomicity does not mean consistency regarding the state, so it doesn't tell you anything about conflicts. If 2 clients update the same sub-document, then both updates will be performed, but in order to find out whether there was a conflict between these updates, you would still need the CAS value (or use pessimistic locking instead). If, however, you are sure that the clients act on different sub-documents, then you know that there will be no conflict, and the CAS value is not required.

Summary

The new Sub-Document API is one of the great new features of Couchbase 4.5 (Beta). It allows you to avoid fetching the whole document in order to read or modify only a part of it. This means better usability from a client-side point of view. One of the main advantages is that it improves performance, especially when working with bigger documents.

Microservices and Polyglot Persistence

2016-05-06T13:35:00.002+02:00

Introduction

The idea behind Microservices is already described by its name. In summary, it means using multiple smaller, self-contained services to build up a system instead of one monolithic service. This explanation does sound simple, doesn't it? We will see that it is not, because breaking up one single big system into several services has quite a lot of implications.

Why Microservices?

A monolithic system would be a system that has only one main component. One of the disadvantages is usually that every change requires a deployment of the whole system. Today's systems are actually not completely monolithic, because they normally already consist of several sub-components. Often other decomposition mechanisms are already used. One way would be to build your system in a modular fashion. Such a module might actually be a good candidate for a microservice, whereby it should ideally provide business-domain-specific functionality and not purely technical functionality.

Another aspect, which you should already be familiar with as an object-oriented developer, is de-coupling (loose coupling). A component should, in a way, live on its own. Sure, there are well-defined dependencies on other components. De-coupling ensures that you can replace one component of your system without the need to rewrite the majority of the system.

If you split up a monolithic application into several parts, whereby specific functionalities are provided as services, you end up with a distributed system because each service is deployed on its own. The idea is exactly to be able to scale these services out independently.

So Microservices are, in a way, not a completely new invention. Microservices are often just a consequence of what we already know or aim for regarding software architectures. Service-oriented designs are also not completely new to us.

Polyglot character

One system made of multiple smaller services:

Can have a variety of communication protocols: About 10 years ago, I remember having discussions about SOAP vs. REST. I actually liked SOAP because it was well defined, so your service client could be generated just from the service definition. It has message-based communication, and there was a kind of standard message format (dependent on the binding). REST, on the other hand, had the charm of being less chatty and resource based. The protocol by which two parties communicate (which resources are accessed in exactly which way) was not predefined out of the box. Of course, you also define what a REST service exposes. But it seemed to happen more often that the service no longer spoke exactly the same language as the client. It was more like the service was partially speaking a weird dialect that the client could no longer understand, so the client had to learn it as well. There are libraries and frameworks that help you (e.g. annotations in JAX-RS). However, I'm pretty sure that today's greenfield solutions would rely on RESTful services. Sometimes you don't start from a green field, so you might still need to integrate a variety of different kinds of services.
Can be implemented by using several programming languages and frameworks: For another component of your system, it only matters how to communicate with a specific service. The actual implementation is completely hidden from the other components. So one service might be implemented in Node.js while another one is implemented in Python. There are sometimes good reasons to develop one part of a system in, e.g., C but other parts, with maybe less effort, in, e.g., Node.js. Not every component has the same resource and efficiency requirements (e.g. garbage collection vs. manual disposal).
Can be developed by different kinds of developers: This is related to the point about different programming languages and frameworks. From my personal experience, I would say that a C developer and a Java developer really speak different languages, not just regarding the programming language but also regarding how a specific problem is addressed or what the tool chains look like. There is no good or bad; it's just different. So given that different functionalities might be developed by multiple independent teams, this point especially makes sense if these teams already have skill sets around specific programming languages and frameworks.
Polyglot persistence: A modern application usually consists of 3 tiers: interface, service, and persistence. Given that we split the service tier up into multiple smaller services, it is a fair question what happens with your database/storage tier. We will discuss this in more depth a bit later. What's important is that the individual services can use different database/storage back-ends. One service might need to write content items, storing the content itself in a Blob Store and the metadata in a Document Database. Another service might need to store who knows whom and so uses a Graph Database. A third service might handle user sessions and so uses a KV store. This is what polyglot persistence means.

What's happening with my Database?

This is actually quite interesting. Even if your system was already built from several modules, you will quite often see that the modules integrate with each other on the database level. The reason is that the rules for your schema consolidation (regarding good database schema design) might conflict with the de-coupling requirements. In the end, each independent service should use its own database. Instead of integrating service functionality on the database tier, the services should talk to each other. Let's use a very simple, even silly, example. We talk about orders and customers. Let's assume that your shopping service is independent of the user profile service. Sure, shopping needs to know who the customer is, but not at the same level of detail as the user profile service does. In a monolithic application, you would have a 1-to-many relationship between customers and orders, so getting all the orders of a customer would be a JOIN query. In a more service-oriented world, you would ask the directory service for the customer (e.g. by email address), and then you would ask the shopping service for all the orders of this customer. If, e.g., a new order should be processed, then the shopping service would also need to talk to the payment service in order to complete the payment. The payment service also only knows the relevant information about the customer and not the complete user profile. Again, this is a very simple example, but the point is clear. A Microservices approach leads to a distributed system made of several services, which leads to split-up databases as well.
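The customer and order example can be sketched in Python as two independent services, each with its own tiny data store, and a caller that composes them instead of running a JOIN. All class names, method names, and sample records are hypothetical.

```python
class DirectoryService:
    """Owns the customer data; nobody else reads its store directly."""

    def __init__(self):
        self._customers = {"max@example.org": {"id": "c1", "name": "Max"}}

    def find_by_email(self, email):
        return self._customers[email]


class ShoppingService:
    """Owns the orders and only knows the customer by id."""

    def __init__(self):
        self._orders = [
            {"order": "o1", "customer_id": "c1"},
            {"order": "o2", "customer_id": "c2"},
            {"order": "o3", "customer_id": "c1"},
        ]

    def orders_of(self, customer_id):
        return [o["order"] for o in self._orders if o["customer_id"] == customer_id]


def orders_for_email(directory, shopping, email):
    # Two service calls replace the JOIN of the monolithic schema
    customer = directory.find_by_email(email)
    return shopping.orders_of(customer["id"])


orders = orders_for_email(DirectoryService(), ShoppingService(), "max@example.org")
```

The price of the de-coupling is visible here as well: the 'join' now happens in application code and costs two round trips.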

Now, the relational database was so far gluing your data-focused operations together via transactions. Doesn't the service-based approach mean that I lose these transactions on the database tier? Exactly! Given that you decoupled your services by no longer integrating on the database tier, you now also have to take care of transactions on the service level. Relational database vendors have been talking about ACID for decades, and you got the impression that you absolutely need it? In my experience, it's quite often the case that you give up on ACID anyway for performance reasons (weaker isolation levels, e.g. read uncommitted), and we tend to rely so much on the DB's transaction management (while unknowingly tolerating its overhead) that we forget that we often don't need ACID but only need to handle concurrent access to specific data items. The NoSQL system Couchbase Server, for instance, doesn't come with a built-in transaction manager, but it comes with a framework that helps you handle transactional behavior on the client side. You can, e.g., lock specific documents (JSON documents or KV pairs) so that everybody else has to wait until they are released again. Or you can be more optimistic and use C(ompare) A(nd) S(wap). A write operation is then successful if the CAS value of your document is still the same, which means that nobody else changed the document since you fetched it. Otherwise, you can just try again with the updated document until you are the winner. Sure, there are also strictly transactional cases out there. They can be addressed by using a service-side transaction manager (e.g. one implementing 2-phase commit).

Not using one single, big database is also an opportunity. We already mentioned that you want to be able to scale your services out independently (adding new service instances behind a load balancer, so web scale). Scaling out the service tier is only half of the story. More and more service instances might also raise the need to scale out on the storage/database tier. So instead of doing everything with your non-scalable relational DBMS, you can now follow the polyglot persistence idea and use the right database for the job, which means that you might introduce a highly scalable NoSQL database system for some of your services.

Summary

As explained, Microservices are self-contained services that provide business-domain-specific functionality. A system that uses Microservices is by definition a distributed system, with all its advantages and disadvantages. Making your system more scalable becomes easier, whereas distributed transactions are harder. Polyglot persistence is one benefit: you can now use the right storage or database system for the job, depending on the requirements of the specific service.

CBGraph now supports edge list compression

2016-03-26T09:38:00.004+01:00

About CBGraph

CBGraph (https://github.com/dmaier-couchbase/cb-graph) is a Graph API for the NoSQL database system Couchbase Server.

Adjacency list compression

The latest version of CBGraph (v0.9.1) now supports adjacency list compression. An adjacency list is the list of neighbors of a vertex in a Graph.

So far, the adjacency lists were stored directly at the vertices, but vertices can become quite big if they have a huge number of incoming or outgoing edges (such a vertex is called a supernode). One of the limitations that such a supernode introduces is that it simply takes longer to transfer, e.g., a 10MB vertex over the wire than, e.g., a 1KB one. In order to support such supernodes better by reducing the network latency, two optimization steps were introduced for CBGraph.

Compress the adjacency list while still storing it at the vertex (as a base64 string). Because of the base64 encoding, the lists take a bit more space for small vertices, but you save a lot for supernodes (I saw up to 50% with UUIDs as vertex IDs).
Externalize and compress the adjacency list as a binary

The used compression algorithm is GZIP.
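The embedded variant can be sketched in Python with the standard library. Please read it as an illustration only: serializing the adjacency list as JSON before compressing it is an assumption of this sketch and not necessarily CBGraph's actual format, and the function names are made up.

```python
import base64
import gzip
import json
import random
import uuid


def compress_adjacency(neighbours):
    """GZIP the serialized list and encode it as a base64 string."""
    raw = json.dumps(neighbours).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decompress_adjacency(encoded):
    """Reverse of compress_adjacency."""
    raw = gzip.decompress(base64.b64decode(encoded))
    return json.loads(raw.decode("utf-8"))


# A 'supernode' with 1000 neighbours, identified by (seeded) random UUIDs
rng = random.Random(42)
neighbours = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(1000)]

encoded = compress_adjacency(neighbours)
raw_size = len(json.dumps(neighbours))
encoded_size = len(encoded)

# A tiny list, where the GZIP header and base64 overhead dominate
tiny_encoded_size = len(compress_adjacency(["a"]))
tiny_raw_size = len(json.dumps(["a"]))
```

For the large list the encoded string is smaller than the raw one, whereas for the tiny list it is bigger, which is the trade-off described above.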

The following switches in the graph.properties file control the compression mode:

graph.compression.enabled
graph.compression.binary

The first property controls whether compression is used. The second one controls whether the compressed adjacency list is stored as a separate binary (as Couchbase Server is a KV store and a document database, you can directly store binaries as KV pairs).

Compression disabled

The following setting is used in order to disable compression:

graph.compression.enabled=false

The document model that is used in Couchbase then looks as follows:

Compression enabled by embedding the adjacency list

The following setting is used in order to enable compression:

graph.compression.enabled=true
graph.compression.binary=false

The document model that is used in Couchbase then looks as follows:

Compression enabled by storing the adjacency list as a separated binary

The following setting is used in order to enable the storage of the adjacency list as compressed binary:

graph.compression.enabled=true
graph.compression.binary=true

The document model that is used then looks as follows:

Conclusion

Edge list compression helps to handle supernodes. The obvious advantage is that you can store more edges at a vertex, but the average size of a node is also lower, so the latency behavior improves because the network latency for getting a vertex from Couchbase to CBGraph is a function of the size of the vertex. The overall performance for bigger graphs was improved by this feature. As you can see, the underlying model looks different depending on the compression mode. So the compression mode is a lifetime decision for a Graph. You can't access an uncompressed Graph via a CBGraph instance that is configured to use compression, and vice versa.

Large-scale data processing with Couchbase Server and Apache Spark

2016-02-09T16:02:00.001+01:00

I just had the chance to work a bit with Apache Spark. Apache Spark is a distributed computation framework, so the idea is to spread computation tasks across many machines in a computation cluster. The plan here is to load data from Couchbase, process it in Spark, and store the results back to Couchbase. Couchbase is the perfect companion for Spark because it is capable of handling huge amounts of data, provides high performance (hundreds of thousands of ops per second / sub-millisecond latency), and is horizontally scalable while also being fault tolerant (replica copies, failover, ...).

You might already know Hadoop for this purpose. Spark's approach is similar but different ;-). In Hadoop, you typically load everything into the Hadoop Distributed File System and then have it processed 'co-located' in parallel. In Spark, each worker node processes the data in memory by default. Your data is described by a R(esilient) D(istributed) D(ataset). Such an RDD is, in the first step, not the data itself. It rather describes where the data comes from and which kind of data is expected to be processed. The RDD API also describes how data can be processed (actions, transformations). In the next step, the data is retrieved based on this description and distributed across multiple workers (executors). RDD-s are not just sharded across the cluster; it is also possible to replicate them for fault tolerance.

Just in case you don't know Spark, here are the components of the Spark stack:

Spark Core: Handles RDD-s from several sources (and stores them to several targets). So you could easily combine the data from several data sources with the data in Couchbase in order to derive new data, which can then be stored back to your Couchbase bucket.
Spark SQL: Handles data frames (RDD-s with a schema), which are retrieved (e.g. from a database system) by using a SQL-like query syntax. Couchbase has its own SQL-like query language (N1QL), so the integration works like a charm.
Spark Streaming: Handles D(iscretized) Streams (RDD-s as micro batches). Couchbase allows you to consume the D(atabase) C(hange) P(rotocol) stream in order to react to changes in a bucket.

This blog post is focusing on these 3 main components, but there are also the following ones:

MLlib: An algorithm package for machine learning (to solve, e.g., classification, clustering, and regression problems)
GraphX: Graph processing and graph-parallel computations

The following diagram shows the main architecture of a Spark cluster:

Driver Program: Creates the Spark context, declares the transformation and actions on RDD-s of data and submits them to the Master
Cluster Manager: Acquires executors on worker nodes in the cluster.
Workers and their Executors: Execute tasks and return results to the driver

So what can we do with Couchbase? The following simple code example (Scala) shows how to read some keys from the database, filter the documents by country, log them to the console, and count them. Finally, the driver stores the result back to Couchbase.

You can also store the RDD-s directly to Couchbase by using 'rdd.saveToCouchbase()'. Here is an example from Couchbase's documentation:

The following very simple example shows how to retrieve data via SparkSQL:

And finally, here is a streaming example which retrieves a micro-batch every second:

The example source can be found here: https://github.com/dmaier-couchbase/cb-spark-example . It also provides the helper 'Contexts' which I used in order to initialize and access the Spark context.

It's clear that this blog post was just a very brief introduction to Spark. Further, more use case focused, articles will follow. :-)

Further examples can be found in Couchbase's documentation: http://developer.couchbase.com/documentation/server/4.1/connectors/spark-1.0/spark-intro.html .

Document Modeling Basics

2015-11-20T10:03:00.002+01:00

A question often asked by developers who are new to NoSQL is how to get started with document modeling. This article does not aim to answer all document modeling related questions. It is more of a starting point.

Flexible Schema

I am personally not a big fan of the term 'schema free'. My personal opinion is that if we talk about structured data, then we also talk about how to structure the data. (BTW: Couchbase also allows you to store unstructured data as binaries. Semi-structured data is supported as well, e.g. by embedding base64 encoded strings into JSON documents.) Couchbase Server does not force you (on the database side) to follow a specific schema. This gives you more flexibility. Some documents might have a specific property, others might not have it. You don't have to specify upfront that a property might be there and then set it to a NULL value if it is not. So what you have is a flexible schema (or better, data model) which is implicitly provided by the application. If your application/service is managing user profiles, then you will find user documents in which a user has a first name, a last name, and so on. So 'flexible data model' would be the better term.

Key Patterns

Best practice is to use meaningful key patterns. This helps you to access a document directly based on its context. A key can be a combination of a type and a unique attribute value, or it can be an artificial number. If possible, don’t use artificial numbers, although this cannot always be avoided. The following example shows the key of a user with the email address “mmustermann@domain.com”:

“user::mmustermann@domain.com”

Pattern is: $type::$email

If you know that users are accessible via their email address, then you can get the user directly without having to perform a more complex query. (Querying is supported in Couchbase, e.g. via the SQL-like query language N1QL.)

A more interesting pattern reflects a hierarchy. If you assume that an employee belongs to only one company (but a company can have multiple employees), then you can express this ‘belongs to’ relationship in the key itself. The following example shows the key of a user who belongs to the company ‘Foo’, which has the domain ‘foo.org’:

“user::foo.org::12345”

Pattern is: $type::$domain::$id
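The two key patterns above can be sketched with a pair of small helper functions. This is only an illustration: the names 'make_key' and 'parse_key' are hypothetical and not part of any Couchbase SDK, and the sketch assumes that no key part contains the separator '::' itself.

```python
SEPARATOR = "::"


def make_key(doc_type, *parts):
    """Build a key such as 'user::foo.org::12345' from a type prefix and further parts."""
    return SEPARATOR.join([doc_type, *(str(p) for p in parts)])


def parse_key(key):
    """Split a key into its type prefix and the list of remaining parts."""
    doc_type, *parts = key.split(SEPARATOR)
    return doc_type, parts


# Pattern $type::$email
user_key = make_key("user", "mmustermann@domain.com")

# Pattern $type::$domain::$id
employee_key = make_key("user", "foo.org", 12345)

print(user_key)
print(employee_key)
print(parse_key(employee_key))
```

Building all keys through one helper keeps the pattern in a single place, which makes it easier to change later.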

Types

We have already seen that the key pattern often has a type prefix. It is also best practice to store an extra type attribute. This later allows you to filter more specifically based on this type (e.g. to ask for all users). Here is an example of a user:

“user::mmustermann@domain.com” : {

“type” : “user”,

“first_name” : “Max”,

“last_name” : “Mustermann”,

“email” : “mmustermann@domain.com”

}

1:1 Relationships

In this case one entity X has a relationship to exactly one other entity, and vice versa. A one-to-one relationship can be modeled by embedding or by referencing documents. My recommendation is to model such a relationship as a key reference first and then to embed if there are atomicity requirements, that is, if the two entities have to be accessed together most of the time. The following shows an example of a user who has a session.

“user::mmustermann@domain.com” : {

“type” : “user”,

“first_name” : “Max”,

“last_name” : “Mustermann”,

“email” : “mmustermann@domain.com”,

“session” : {

“source” : “web”,

“token” : “123456”

}

}

Embedded Document

The same example, now expressed as a key reference:

“user::mmustermann@domain.com” : {

“type” : “user”,

“first_name” : “Max”,

“last_name” : “Mustermann”,

“email” : “mmustermann@domain.com”,

“session” : “session::mmustermann@domain.com”

}

“session::mmustermann@domain.com” : {

“type” : “session”,

“source” : “web”,

“token” : “123456”,

“user” : “user::mmustermann@domain.com”

}

Explicitly Referenced Document

It’s easy to see that there is a direct relationship via the id of the user, which is the email address in this case. Because the two documents are already correlated via their keys, we can simplify this to:

“user::mmustermann@domain.com” : {

“type” : “user”,

“first_name” : “Max”,

“last_name” : “Mustermann”,

“email” : “mmustermann@domain.com”

}

“session::mmustermann@domain.com” : {

“type” : “session”,

“source” : “web”,

“token” : “123456”

}

Documents that implicitly reference each other
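With implicitly referencing documents, the key of the related document is derived instead of stored. The following minimal sketch assumes the key pattern $type::$email used above; a plain Python dictionary stands in for the bucket, and the helper name 'related_key' is hypothetical.

```python
def related_key(key, target_type):
    """Swap the type prefix of a key, e.g. 'user::x' becomes 'session::x'."""
    _, separator, rest = key.partition("::")
    if not separator:
        raise ValueError(f"key has no type prefix: {key!r}")
    return f"{target_type}::{rest}"


# A dictionary as an in-memory stand-in for the bucket
store = {
    "user::mmustermann@domain.com": {
        "type": "user",
        "first_name": "Max",
        "last_name": "Mustermann",
        "email": "mmustermann@domain.com",
    },
    "session::mmustermann@domain.com": {
        "type": "session",
        "source": "web",
        "token": "123456",
    },
}

user_key = "user::mmustermann@domain.com"
session_key = related_key(user_key, "session")
session = store[session_key]
print(session_key, session["token"])
```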

Whether to embed or to reference is not a black-or-white decision; there are shades of grey in between. Depending on the access patterns, it is also possible to embed one part of a document and to reference another part.

1:Many Relationship

A one-to-many relationship means that one document references multiple other ones (1 up to n). A back reference from the referenced documents might be suitable. Again, I would reference by default and embed as an optimization step if there are any atomicity requirements. This is only my personal preference; you could also start by embedding documents and then externalize them as an optimization regarding cardinalities and data duplication. Here is a company that has multiple employees:

“company::domain.com” : {

“type” : “company”,

“name” : “Some name”,

“address” : “Some address”,

“employees” : [ “user::bart.simpson@domain.com”, “user::moe@domain.com”]

}

“user::bart.simpson@domain.com” : {

“type” : “user”,

“first_name” : “Bart”,

“last_name” : “Simpson”,

“email” : “bart.simpson@domain.com”,

“company” : “company::domain.com”

}

“user::moe@domain.com” : {

“type” : “user”,

“first_name” : “Moe”,

“email” : “moe@domain.com”,

“company” : “company::domain.com”

}

1-to-many via key references

Let’s now assume that we embed the users as sub-documents. An alternative would be to embed them in an array. An array is better if you only need to iterate over the list of embedded documents. If you need to access specific sub-documents by their id, then embedding them as nested documents is preferable. It is the same question you face as a developer when you choose between a List and a Map to reflect the associations between your classes.

“company::domain.com” : {

“type” : “company”,

“name” : “Some name”,

“address” : “Some address”,

“employees” : {

“user::bart.simpson@domain.com” : {

“type” : “user”,

“first_name” : “Bart”,

“last_name” : “Simpson”,

“email” : “bart.simpson@domain.com”

},

“user::moe@domain.com” : {

“type” : “user”,

“first_name” : “Moe”,

“email” : “moe@domain.com”

}

}

}

1-to-many as embedded document
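The List-versus-Map trade-off for embedded employees can be sketched as follows. Plain Python structures stand in for the JSON documents, and the variable names are only illustrative.

```python
# Employees embedded as an array: convenient for iteration
employees_as_list = [
    {"type": "user", "first_name": "Bart", "last_name": "Simpson",
     "email": "bart.simpson@domain.com"},
    {"type": "user", "first_name": "Moe", "email": "moe@domain.com"},
]

# Employees embedded as nested documents, keyed by their id
employees_as_map = {"user::" + e["email"]: e for e in employees_as_list}

# Iterating works well with the array ...
first_names = [e["first_name"] for e in employees_as_list]

# ... but finding one specific employee requires a scan over the array
moe_from_list = next(e for e in employees_as_list if e["email"] == "moe@domain.com")

# With nested documents, a specific employee is accessible directly by id
moe_from_map = employees_as_map["user::moe@domain.com"]

print(first_names)
print(moe_from_map["first_name"])
```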

What happens now if one employee works for multiple companies in the embedded case? Right, you get duplicated data because the user now needs to be fully embedded into 2 company documents.

In the end, it is a question of normalization. A completely normalized schema would contain a lot of relations, whereas a de-normalized schema would in the worst case have everything in one table. As in relational databases, the truth lies somewhere in the middle. You would not embed everything into one document, and you would normally also not express every property as an extra document and then use key references to glue them together. What works best depends on the actual requirements and on how you need to access the data.

So you reference in order to avoid duplicates, but you embed in order to have atomic access. Your documents are usually smaller on average if you reference, but you then might have to perform client-side joins (or server-side ones via N1QL since Couchbase 4.0).
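A client-side join over key references could look roughly like the following sketch. A plain dictionary stands in for the bucket and the function name 'get_company_with_employees' is hypothetical; with a real SDK the list comprehension would typically become a multi-get.

```python
store = {
    "company::domain.com": {
        "type": "company",
        "name": "Some name",
        "address": "Some address",
        "employees": ["user::bart.simpson@domain.com", "user::moe@domain.com"],
    },
    "user::bart.simpson@domain.com": {
        "type": "user", "first_name": "Bart", "last_name": "Simpson",
        "email": "bart.simpson@domain.com", "company": "company::domain.com",
    },
    "user::moe@domain.com": {
        "type": "user", "first_name": "Moe",
        "email": "moe@domain.com", "company": "company::domain.com",
    },
}


def get_company_with_employees(store, company_key):
    """Fetch a company and replace its employee keys by the referenced documents."""
    company = dict(store[company_key])  # shallow copy, the stored document stays untouched
    company["employees"] = [
        store[key] for key in company["employees"] if key in store
    ]
    return company


joined = get_company_with_employees(store, "company::domain.com")
print([employee["first_name"] for employee in joined["employees"]])
```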

The Many-Many relationship (via references) is quite similar to the 1-Many one. It just means that you have reference arrays (arrays of keys to express the relationships) on both sides.

Lookup Documents

A lookup document is a document whose only purpose is to provide a direct reference to one or multiple other documents. Lookup documents are quite useful for maintaining your own indexes (alternative access paths) in Couchbase’s cache. Let’s assume you want to access a user profile via a customer number:

“customer_ref::12345” : {

“ref” : “user::mmustermann@domain.com”

}

“user::mmustermann@domain.com” : {

“type” : “user”,

“first_name” : “Max”,

“last_name” : “Mustermann”,

“email” : “mmustermann@domain.com”

}

In order to get a user by their customer id, you can now perform 2 get operations. First you get the lookup document based on the customer id and read the ‘ref’ attribute, which gives you the key of the associated user document. In the next step you can then access the user directly. This way of access is often more efficient than an exact-match query because a key lookup does not depend on the number of documents in the bucket or on the number of entries in an index that has to be scanned as part of query processing.
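The two get operations can be sketched as follows. A dictionary stands in for the bucket, the function name 'get_user_by_customer_id' is hypothetical, and the sketch assumes that the 'ref' attribute holds the full key of the user document.

```python
store = {
    # Lookup document: assumed to hold the full key of the user document
    "customer_ref::12345": {"ref": "user::mmustermann@domain.com"},
    "user::mmustermann@domain.com": {
        "type": "user",
        "first_name": "Max",
        "last_name": "Mustermann",
        "email": "mmustermann@domain.com",
    },
}


def get_user_by_customer_id(store, customer_id):
    """Resolve a user via the lookup document, using two key-based gets."""
    lookup = store.get(f"customer_ref::{customer_id}")  # 1st get: lookup document
    if lookup is None:
        return None
    return store.get(lookup["ref"])  # 2nd get: the referenced user document


user = get_user_by_customer_id(store, 12345)
print(user["first_name"], user["last_name"])
```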

Atomic Counters

Couchbase allows you to increment counter values atomically. This is a useful feature that helps you, for instance, to generate IDs, so it is similar to sequences in the relational world. The following pseudo-code shows how to increment the counter value and then reuse it for the id generation.

//“count::user” : “0”

id = client.incr(“count::user”);

client.add(“user::” + id, doc);
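As a runnable variant of the pseudo-code, here is a minimal Python sketch. The class 'InMemoryStore' is a hypothetical stand-in for a client and not the Couchbase SDK; it only mimics an atomic 'incr' and an 'add' that fails if the key already exists.

```python
import threading


class InMemoryStore:
    """Tiny stand-in for a key-value client with an atomic counter operation."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def incr(self, key, delta=1, initial=0):
        """Atomically increment the counter stored under key and return the new value."""
        with self._lock:
            value = int(self._data.get(key, initial)) + delta
            self._data[key] = value
            return value

    def add(self, key, doc):
        """Store doc under key, failing if the key already exists."""
        with self._lock:
            if key in self._data:
                raise KeyError(f"key already exists: {key}")
            self._data[key] = doc

    def get(self, key):
        return self._data.get(key)


def add_user(store, doc):
    """Generate the next id from the counter and use it as part of the key."""
    user_id = store.incr("count::user")
    key = f"user::{user_id}"
    store.add(key, doc)
    return key


store = InMemoryStore()
first = add_user(store, {"type": "user", "first_name": "Max"})
second = add_user(store, {"type": "user", "first_name": "Moe"})
print(first, second)
```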

A typical pattern is to perform a multi-get based on a range (e.g. 0...count::user) by taking the counter value into account. You can then skip every non-existent document by ignoring the ‘DocNotFound’ error messages. This is preferable if you have evolving data with only a small number of deletes.

We saw in the section ‘Key patterns’ that keys can reflect hierarchies. So you can easily reflect a 1:Many relationship this way without using explicit references. A user document belongs to a company document if the corresponding key contains the company prefix. We can get all users of a company by knowing the number of users of the company. Here is some pseudo-code:

count = client.get(“count::foo.org::user”); //e.g. “563”

for ( i=0; i <= count; i++ ) {

doc = client.get(“user::foo.org::” + i);

if (! doc.err ) {

result.add(doc);

}

}
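The range-based access can also be sketched in Python. A dictionary stands in for the bucket and the function name 'get_users_of_company' is hypothetical. The loop scans the ids 0 up to and including the counter value, so it works whether the first generated id is 0 or 1, and missing ids (e.g. deleted users) are simply skipped.

```python
store = {
    "count::foo.org::user": "3",
    "user::foo.org::1": {"type": "user", "first_name": "Bart"},
    # user::foo.org::2 was deleted
    "user::foo.org::3": {"type": "user", "first_name": "Moe"},
}


def get_users_of_company(store, domain):
    """Collect all users of a company by scanning the id range given by the counter."""
    count = int(store.get(f"count::{domain}::user", 0))
    users = []
    for i in range(count + 1):
        doc = store.get(f"user::{domain}::{i}")
        if doc is not None:  # skip ids without a document
            users.append(doc)
    return users


users = get_users_of_company(store, "foo.org")
print([user["first_name"] for user in users])
```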

Conclusion

Basic document modeling techniques were presented here. Couchbase gives you a flexible data model. As mentioned, how to model your data is not always black or white. My personal preference is to:

Start with the logical data model (e.g. derived from Object Oriented Analysis)
Create a stupid and simple initial model (e.g. by using key references for 1:Many relationships all the time)
Evolve and optimize it step by step regarding the requirements (unnecessary reference lists because the reference is clear via the key pattern, access patterns, atomicity, duplicates, ...).

Here are some useful rules/thoughts:

Use meaningful keys and speaking key patterns if possible!
Use counters for the key generation if there is a need to use artificial ids!
Maintain a type attribute!
Embed documents into others so that they can be written/read together with the parent document (atomic access).
If you are not embedding, and if it is possible (e.g. by using the counter value as part of the key, or by having correlated keys), then express the relationship via the key directly, without referencing via key arrays.
Reference in order to avoid data duplicates and to keep the average document size smaller. Often we just live with duplicates because they bring other advantages (atomic access, no client-side joins). On the other hand, we might have such a large amount of data and such a high degree of connectivity that we can't duplicate all the time.
If the cardinalities (number of related documents) are too high and there is no requirement for atomicity, then referencing is preferable to embedding.
Externalize groups of properties from a document (by adding a 1:1 relationship) if you access this group of properties together all the time and if the overhead of transferring all the other properties of the document at the same time would be too high.
It might make sense to externalize reference arrays from a document if the number of references is very high and you usually only access a few properties of the document, so that you avoid the overhead of transferring these arrays.

Using a Key-Value Store for Full Text Indexing and Search

2015-11-02T21:08:00.000+01:00

Couchbase Server is a multi-purpose database system. One of its purposes is to serve as a simple key-value store. A key-value store allows you to store/retrieve any value by its key. Such a value can be a JSON document (Couchbase allows you to index and query such JSON documents, so another purpose is that of a document database), a small binary, or a full text index entry. This article explains why such a key-value store can also be used for full text indexing.

Let's explain how full text indexing works in general. A full text index is a so-called inverted index. The table below shows how the following sentences would be indexed: 'Tim is sitting next to Bob' and 'Jim is sitting next to Bob'. The word 'Tim' exists only in the first sentence, and there is exactly one occurrence of it.


	Term | Count | Reference
	------------------------
	Tim | 1 | #1
	is | 2 | #1, #2
	sitting | 2 | #1, #2
	next | 2 | #1, #2
	to | 2 | #1, #2
	Bob | 2 | #1, #2
	Jim | 1 | #2

There are a bunch of specialized systems out there for full text indexing. Couchbase has, for instance, a very good integration with Elasticsearch. In the future Couchbase will also have its own full text service, which is called 'cbft'. However, this article is not about Elasticsearch and also not about 'cbft'. We want to use Couchbase's key-value store features for full text indexing here.

Let's define the data model first:

	"fts::$field::$term" : { "count" : $numOf, "refs" : [ ...] }

It is actually quite simple. For a given term, a full text search index entry points back to the original key-value pairs, which are identified by their keys. The refs array contains these keys. The field is just the field on which we want to search; this could for instance be 'address' or 'message'. Let's say that the default field is called '_all', so if no specific field is given, '_all' is used as the fallback field.

In order to index based on a provided text, we can do the following:

Tokenize the text that should be indexed. This basically means breaking the text up into individual words (terms). In our case we assume that our text contains the word 'fox'.
Check whether the key, e.g. "fts::_all::fox", already exists. If not, create the index entry with a reference back to the id of the document that contains the word 'fox'.
If the full text index entry already exists, check whether its reference list already contains a reference to the document whose text is being indexed.
If the reference list does not yet contain the key of this document, extend it by adding the key of the document.
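The indexing steps above can be sketched in Python. This is not the code of the linked demo application: a dictionary stands in for the bucket, the function names are hypothetical, terms are lower-cased as an assumption, and 'count' is interpreted here as the number of referencing documents.

```python
import re


def tokenize(text):
    """Break a text up into lower-cased terms (lower-casing is an assumption of this sketch)."""
    return re.findall(r"\w+", text.lower())


def index_text(store, doc_key, text, field="_all"):
    """Create or extend one index entry per distinct term of the given text."""
    for term in set(tokenize(text)):
        entry_key = f"fts::{field}::{term}"
        entry = store.get(entry_key)
        if entry is None:
            # The entry does not exist yet: create it with a back reference
            store[entry_key] = {"count": 1, "refs": [doc_key]}
        elif doc_key not in entry["refs"]:
            # The entry exists but does not reference this document yet
            entry["refs"].append(doc_key)
            entry["count"] = len(entry["refs"])


store = {}
index_text(store, "the_fox", "The quick brown fox jumps over the lazy dog")
index_text(store, "the_cat", "The quick brown fox jumps over the lazy cat")
print(store["fts::_all::fox"])
print(store["fts::_all::cat"])
```

Indexing the same document twice leaves the entries unchanged, because the reference list is checked before it is extended.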

Now in order to search for the specific word 'fox', we just have to do the following:

Get "fts::_all::fox"
Perform a multi-get based on the keys in the array 'refs'
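The two search steps can be sketched as follows. Again a dictionary stands in for the bucket, the function name 'search_term' is hypothetical, and the dictionary comprehension plays the role of the multi-get. The stored sentences are the ones used in the demo further below.

```python
store = {
    # The indexed key-value pairs
    "the_fox": "The quick brown fox jumps over the lazy dog",
    "the_cat": "The quick brown fox jumps over the lazy cat",
    # Two of the full text index entries
    "fts::_all::fox": {"count": 2, "refs": ["the_fox", "the_cat"]},
    "fts::_all::cat": {"count": 1, "refs": ["the_cat"]},
}


def search_term(store, term, field="_all"):
    """Look up the index entry of a term and fetch all referenced values."""
    entry = store.get(f"fts::{field}::{term.lower()}")  # step 1: get the index entry
    if entry is None:
        return {}
    # Step 2: multi-get based on the keys in 'refs'
    return {key: store[key] for key in entry["refs"] if key in store}


print(list(search_term(store, "cat")))
print(list(search_term(store, "fox")))
```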

Some example source code (Node.js) can be found here: https://github.com/dmaier-couchbase/cb-node-fts . The service is implemented here: https://github.com/dmaier-couchbase/cb-node-fts/blob/master/routes/fts.js . This application was created by using CEAN stack tools (Couchbase + Express + Angular + Node.js ). They are available here: http://www.ceanjs.org .

Given that I already wrote this little demo application, it makes sense to try it out :-) . First let's add 2 sentences:

the_fox: The quick brown fox jumps over the lazy dog
the_cat: The quick brown fox jumps over the lazy cat

Now in the next step let's perform some searches. I implemented the word search in a way that you can enter any sequence of words (separated by white spaces). The following searches for 'cat':

As we can see, only the sentence with the id 'the_cat' contains the word 'cat'. Next let's search for the word 'fox':

Both sentences contain the word 'fox'. Last but not least let's search for multiple words:

I think you get it ... :-) . The data which is stored in Couchbase looks as follows:

This article explained how you can use Couchbase to store a full text search index. Such an index can be used for simple and basic text searches, which might be sufficient for some of your development projects. If you need more sophisticated text search or text analysis then a dedicated full text search service might be the better option.


