Friday, 8 February 2019

Building a Recommendation Engine with Redis

When I was asked which topic I would like to present at this year's OOP conference, my first thought was 'something with Machine Learning'. It was years ago at university that I had a secondary focus on 'Artificial Intelligence and Neural Networks', and I think it's fair to say that the topic was not as 'hot' back then as it is today. The algorithms were the same as today, but the frameworks were not the commodity they are now, and calculations happened either on paper, with MATLAB or with some very specialized software for neural network training. The discipline itself stayed fascinating, though, and even if I would not call myself a Data Scientist (I stuck with my primary focus, which was database implementation techniques, so I am more of a database guy :-) ), I am really amazed by the adoption and the number of frameworks arising in the field of Machine Learning and Artificial Intelligence.

Machine Learning and Artificial Intelligence form quite a wide field, so I decided to go with something more specific that has touch points with both. The topic I finally went with is:
  •  Redis Modules for Recommender Systems
Most of you might know Redis already, but it's worth mentioning what Redis actually is:
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence.
Redis is also very popular: rankings of the most popular database systems (such as the DB-Engines ranking) highlight just how widely it is used.
Redis is modular, which means that it can be extended by modules (like plug-ins). A list of Redis modules can be found here.
Agenda

As I can already see that this article is becoming a bit longer than a blog article should be, here is an outline of the topics. I hope it motivates one or the other of you to continue reading.
  1. Data Model
  2. Preparations
  3. Content Based Filtering (using Sets)
  4. Collaborative Filtering (using Sets)
  5. Ratings based Collaborative Filtering (using Sorted Sets)
  6. Social Collaborative Filtering (using RedisGraph)
  7. Content Relevance via Full Text Search (using RediSearch)
  8. Probabilistic Data Structures (using the built-in HyperLogLog structure + ReBloom)
  9. Machine Learning for Classifications and Predictions (using Redis-ML and other AI modules)
Data Model

Let's first discuss which problem needs to be solved. The following simplified data model is helpful here:
  • User: A real-life person who interacts with a system by showing interest in items. Users can be classified.
  • Item: The thing users can be interested in. Items can be classified.
  • Interest Classification: Classifications can happen based on item properties, user attributes, the relationship of users to existing items, or the relationship of users to other users and their items. A classification can, for instance, be expressed as a simple class membership or as a number that tells how likely something belongs to a given class.
  • Recommendation: The actual recommendation (which items could be interesting for a user) is derived from such classifications.
 Our example code basically uses the following terms:
  • Users are simply called users
  • We are looking at one specific kind of item: comic books
  • The various algorithms and approaches use different kinds of classifications

Preparations

If you want to follow the code samples of this blog post, you can also find a Jupyter notebook here.
I basically prepared the following Redis instances (one per module):
  • Redis (r)
  • Redis + Machine Learning (r_m)
  • Redis (Bloom Filters) + HyperLogLog (r_b)
  • Redis + Graph (r_g)
  • Redis + Full Text Search (r_s)


Python Prep Script
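The original prep script is part of the notebook linked above. A minimal sketch of it, assuming a recent redis-py and locally running instances (the ports are assumptions), could look like this:

import redis

# One connection per Redis instance/module; the later snippets reuse these
r   = redis.StrictRedis(host='localhost', port=6379, decode_responses=True)  # plain Redis
r_m = redis.StrictRedis(host='localhost', port=6380, decode_responses=True)  # Redis-ML
r_b = redis.StrictRedis(host='localhost', port=6381, decode_responses=True)  # ReBloom + HyperLogLog
r_g = redis.StrictRedis(host='localhost', port=6382, decode_responses=True)  # RedisGraph
r_s = redis.StrictRedis(host='localhost', port=6383, decode_responses=True)  # RediSearch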


Content Based Filtering

The idea is to look at what a specific user is interested in and then to recommend things that are similar (i.e. belong to the same class) to the things the user already likes.

Content-based filtering can be implemented with plain set operations. Redis comes with a 'Set' data structure that allows you to perform membership checks, scans, intersections, unions and so on.

Let's look at the following example: 

Python Script
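The embedded script is not reproduced verbatim here; the following is a rough sketch of the idea, reusing the connection r from the prep sketch above (key names and demo data are my own assumptions):

# Assumed demo data: one Set per category and one Set per user
r.sadd('category:super_heroes', 'Spiderman', 'Batman', 'Wonder Woman', 'Fantastic Four')
r.sadd('category:fantasy', 'Dragon Age', 'Avatar', 'Valerian')
r.sadd('user:david', 'Spiderman', 'Dragon Age')

# Determine which categories contain at least one item David likes ...
categories = [c for c in ('category:super_heroes', 'category:fantasy')
              if r.sinter(c, 'user:david')]

# ... and recommend everything from those categories
print('David could be also interested in:', r.sunion(categories))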

The output is:
David could be also interested in: { 'Fantastic Four',  'Wonder Woman',  'Batman',  'Dragon Age', 'Avatar', 'Valerian', 'Spiderman'}


Collaborative Filtering

The underlying idea is that if user A likes the same things as user B, then user A might also like the other items that user B likes. So for a proper classification it's mandatory to have collected details about many other users:

We again use Redis' Set data structure, especially the union and diff operations.

Let's add some demo data:

Python Demo Data Script
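This is not the original demo data script, only an assumed data set (starting from an empty database) that is consistent with the output shown below: one Set of liked items per user plus a reverse index with one Set of users per item:

users = {
    'david':  ['Spiderman', 'Batman'],
    'pieter': ['Spiderman', 'Batman', 'Wonder Woman'],
    'katrin': ['Dragon Age'],
}
for user, items in users.items():
    r.sadd('user:' + user, *items)       # items the user likes
    for item in items:
        r.sadd('item:' + item, user)     # reverse index: users per item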

Now let's look at the following example:

Python Script
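A condensed sketch of such a script (again reusing the assumed keys from the demo data above):

# Users who are interested in at least one of David's items
david_items = r.smembers('user:david')
others = r.sunion(['item:' + i for i in david_items])
print('Users interested in the same items as David:', others)
print('David is interested in:', david_items)

# Recommend what those users like, minus what David already likes
recommendation = set()
for user in others - {'david'}:
    recommendation |= r.sdiff('user:' + user, 'user:david')
print('David could be also interested in:', recommendation)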

We can now see which other users B are interested in the same items as user A, and then derive a recommendation for user A based on the other interests of those users B:
Users interested in the same items as David: {'david', 'pieter'}
David is interested in: {'Spiderman', 'Batman'}
David could be also interested in: {'Wonder Woman'}

Ratings based Collaborative Filtering

We are talking about collaborative filtering again, so the approach is to derive a recommendation from similarities to other users. In addition, we are now interested in 'How much does a user like an item?'. This allows us, for instance, to find out whether two or more users like similar things. Items that are liked by user B but not yet rated by user A could then also be interesting for user A.
 
The Redis structure used here is a 'Sorted Set'. An element in a Sorted Set has a score, and we will use this score as our rating value (1-5, i.e. stars). A very cool feature of Sorted Sets is that set operations can aggregate scores. By default, the resulting score of an element is the sum of its scores across the considered Sorted Sets. You can combine aggregations with weights, which are multipliers for the scores. The weights (1, -1) mean that the second score value is subtracted from the first one.
 
We will first find the users B who rated the same items as a specific user A. The idea is then to leverage this aggregation feature in order to calculate the distance vector of ratings between user A and the previously identified users B. We will then use RMS in order to reduce the average distance to a scalar. The R(oot) M(ean) S(quare) value of a set of values is the square root of the arithmetic mean of the squares of the values. Only users with an average rating distance of less than or equal to 1 (which means that their ratings were very similar) will be considered. Finally we will recommend items of users B to user A, whereby we only consider items with a rating of at least 4.

I added some helper functions to the following script. I am basically not pasting them here, hoping that their functionality is self-explanatory. The full source code can be found here: https://github.com/nosqlgeek/rl-recsys/blob/master/notebooks/Redis_for_Recommendations.ipynb .


Python Demo Data Script
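Again, only an assumed data set that is consistent with the output further below (ratings from 1 to 5, one Sorted Set of rated items per user and a reverse index per item):

ratings = {
    'david':  {'batman': 5, 'superman': 4},
    'pieter': {'batman': 4, 'superman': 5, 'aqua_man': 5},
}
for user, rated in ratings.items():
    r.zadd('ratings:' + user, rated)     # member = item, score = rating
    for item in rated:
        r.sadd('raters:' + item, user)   # reverse index: who rated the item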


Let's first find the users B who rated the same items as A:


Python Script
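A sketch of this step, based on the assumed key layout above:

# All users who rated at least one of the items that David rated
david_items = r.zrange('ratings:david', 0, -1)
raters = r.sunion(['raters:' + i for i in david_items])
print("The following users rated David's items:", list(raters))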

The output is:
The following users rated David's items: ['pieter', 'david']
Now let's calculate the similarities and then propose the highest-rated items of those users B:

Python Script
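Continuing the sketch (reusing raters from the previous snippet), the distance calculation via ZUNIONSTORE with the weights (1, -1), the RMS value and the final recommendation could look roughly like this:

from math import sqrt

def rms(values):
    # Root Mean Square: square root of the mean of the squared values
    return sqrt(sum(v * v for v in values) / len(values))

for other in raters - {'david'}:
    # Distance per item = rating(david) - rating(other), calculated by Redis
    r.zunionstore('dist:' + other, {'ratings:david': 1, 'ratings:' + other: -1})
    dist = [(i, r.zscore('dist:' + other, i))
            for i in r.zrange('ratings:david', 0, -1)
            if r.zscore('ratings:' + other, i) is not None]
    print('The rating distance to %s is %s' % (other, dist))

    avg = rms([d for (_, d) in dist])
    print('The average distance (RMS) to %s is %s' % (other, avg))

    if avg <= 1:
        # Recommend items the similar user rated with at least 4 stars
        # and that David has not rated yet
        candidates = r.zrangebyscore('ratings:' + other, 4, 5, withscores=True)
        recommended = [c for c in candidates if r.zscore('ratings:david', c[0]) is None]
        print('The following is highly recommended:', recommended)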

The result is that Pieter has a matching rating distance, so 'Aqua Man' is a highly recommended comic for David (whatever this tells us about Pieter ;-) ):
The rating distance to pieter is [('batman', 1.0), ('superman', -1.0)]
The average distance (RMS) to pieter is 1.0
The following is highly recommended: [('aqua_man', 5.0)]


Social Collaborative Filtering

The previous examples used Sets and Sorted Sets. We are now exploring how to use graphs. Our example takes a social ('friend of') aspect into account. A mathematical graph is described as a set of vertices V and a set of edges E, whereby E is a subset of VxV. Graph database systems extend this mathematical definition to a so-called 'Property Graph Model', which means that vertices and edges can have properties (key-value pairs) associated with them.

Our idea is to find, for a given user A, all comics of a specific category (Super Heroes) that A's friends B like. Comic books that are liked by more of user A's friends are more relevant and should be recommended.

I am again skipping the helper functions in order not to blow this article up even more. If you are interested, the full source code can be found here: https://github.com/nosqlgeek/rl-recsys/blob/master/notebooks/Redis_for_Recommendations.ipynb. The function names are hopefully self-explanatory.


Python Demo Data Script

Here is the actual graph query:

Python Script
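The graph is queried via the RedisGraph module (GRAPH.QUERY) with a Cypher-like query. The graph name, labels, relationship types, property names and the reply parsing below are assumptions; the real query is in the linked notebook:

query = """MATCH (d:person {name: 'David'})-[:friend]->(f:person)-[:likes]->(c:comic {category: 'Super Heroes'})
           RETURN c.name, count(f) AS relevance ORDER BY relevance DESC"""
result = r_g.execute_command('GRAPH.QUERY', 'comics', query)

# The first element of the reply is the result set (a header row followed
# by the records), the second element contains the query statistics
for record in result[0][1:]:
    print("Comic '%s' with relevance %s" % (record[0], record[1]))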

As 'Wonder Woman' is liked by two of David's friends, it is more relevant than the other comics:
David has the following friends: [[['name'], ['Pieter'], ['Vassilis'], ['Katrin']]]
David likes [[['name'], ['Spiderman'], ['Batman']]]
Comic 'Wonder Woman' with relevance 2.000000
Comic 'Batman' with relevance 1.000000
Comic 'Superman' with relevance 1.000000

Content Relevance via Full Text Search

RediSearch is a search engine module for Redis. It comes with multiple built-in scoring functions. We will look at TF-IDF (Term Frequency / Inverse Document Frequency). It takes the following aspects into account:
  1. Term Frequency: How often does a specific term appear?
  2. Inverse Document Frequency: A factor that diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely (think of the relevance of the word 'the').
Further details about scoring can be found here:
We are trying to identify how likely something belongs to a specific class/category by performing a text search for terms that are associated with this category.

Helper functions are again skipped in this article but can be found in the source code repo.


Python Script
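A rough sketch of the idea with the (pre-2.0) RediSearch API: every comic becomes a document with a 'name' and a 'story' field, and category membership is approximated by searching for terms that are typical for the category. Index name, fields and the sample texts are assumptions:

r_s.execute_command('FT.CREATE', 'comics_idx', 'SCHEMA', 'name', 'TEXT', 'story', 'TEXT')
r_s.execute_command('FT.ADD', 'comics_idx', 'spiderman', 1.0, 'FIELDS',
                    'name', 'Spiderman', 'story', 'A young super hero with spider-like superpowers ...')
r_s.execute_command('FT.ADD', 'comics_idx', 'batman', 1.0, 'FIELDS',
                    'name', 'Batman', 'story', 'A rich man who fights crime at night ...')

# The TF-IDF score tells how relevant each document is for the category terms
print(r_s.execute_command('FT.SEARCH', 'comics_idx', 'hero|superpower',
                          'WITHSCORES', 'RETURN', '1', 'name'))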

Spiderman is more likely a super hero than Batman:
[2, 'spiderman', 0.10000000000000001,['name', 'Spiderman'], 'batman', 0.035714285714285712,['name', 'Batman']]

Probabilistic Data Structures

Probabilistic data structures are characterized in the following way. They ...
  • use hash functions for randomization purposes
  • return an approximated result
  • keep the error under a specific threshold
  • are much more space efficient than deterministic approaches
  • provide a constant query time
You would use them because sometimes ...
  • speed is more important than correctness
  • compactness is more important than correctness
  • you only need certain data guarantees
It's possible to combine them with deterministic approaches (e.g. an HLL plus a deterministic counter for discovering counter manipulations).

We will take a look at the following two structures:
  • HyperLogLog: Cardinality estimation of a set, e.g. unique visits
  • Bloom Filter: Check whether an item is contained in a set, whereby false positives are possible
Our example will not use 'unique visits'; we are more interested in how many unique users 'touched' a specific comic. Just imagine a real-life comic book store. A bunch of nerds (including myself ...) are hanging around and browsing for comics. Interesting comics are taken off the shelf for a closer look. This is what I mean by 'touched'. A comic that is touched more often can be considered more interesting. We can count these unique touches quite space-efficiently by using a HyperLogLog:

Python Script
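A sketch of this experiment with plain PFADD/PFCOUNT (the HLL is stored in a String, so its size can be read with STRLEN; key and user names are made up):

r.delete('touched:spiderman')
for user in ('david', 'pieter', 'katrin', 'vassilis'):
    r.pfadd('touched:spiderman', user)

print('HLL initial size:', r.strlen('touched:spiderman'))
print('Approx. count:', r.pfcount('touched:spiderman'))

# Now simulate a huge number of distinct users touching the comic
print('Please wait ...')
for i in range(100000):
    r.pfadd('touched:spiderman', 'user:%d' % i)

print('Final HLL size: %d bytes' % r.strlen('touched:spiderman'))
print('Approx. count:', r.pfcount('touched:spiderman'))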

The output is:
HLL initial size: 31
Approx. count: 4
Please wait ...
Final HLL size: 10590 bytes
Approx. count: 99475

A Bloom filter can be used to check whether a user is interested in a specific comic category without storing the users per category in a set:

Python Script
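A sketch using the ReBloom commands BF.ADD and BF.EXISTS via the generic command interface, with one filter per category (MEMORY USAGE is used here for the size, which may differ from how the sizes below were originally measured):

r_b.execute_command('BF.ADD', 'bf:fantasy', 'katrin')
r_b.execute_command('BF.ADD', 'bf:super_heroes', 'david')
r_b.execute_command('BF.ADD', 'bf:super_heroes', 'pieter')

for key in ('bf:super_heroes', 'bf:fantasy'):
    print('BF size: %s bytes' % r_b.execute_command('MEMORY', 'USAGE', key))

print('Is Katrin interested in Fantasy?:', r_b.execute_command('BF.EXISTS', 'bf:fantasy', 'katrin'))
print('Is Katrin interested in Super Heros?:', r_b.execute_command('BF.EXISTS', 'bf:super_heroes', 'katrin'))
print('Is David interested in Super Heros?:', r_b.execute_command('BF.EXISTS', 'bf:super_heroes', 'david'))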

We also print the sizes in order to demonstrate how space efficient Bloom filters are. The output of this script is:
BF size: 115 bytes
BF size: 76 bytes
Is Katrin interested in Fantasy?: 1
Is Katrin interested in Super Heros?: 0
Is David interested in Super Heros?: 1

Machine Learning for Classifications and Predictions

We are closing this blog article by circling back to its introduction. So far, classifications were often seen as a given (e.g. a comic book belonging to the 'Fantasy' category). Others could be derived by taking the existing user interests into account. We also mentioned in the section 'Data Model' that classifications might be derived from user attributes or item properties. Machine Learning now provides us with ways to describe more complex models by taking such attributes (= features) into account. Such features can be represented by structured data (e.g. the comic name, ...) or unstructured data (e.g. the images within a comic book, or the colors used in it). Feature vectors could, for instance, be derived from the bitmap of a scanned comic book cover. In the end, you can consider every ML approach as a way to approximate a function F(x) -> y, whereby x is a feature vector and y is the output vector. The idea is to create/train a model based on the known values y for given vectors x; these given vectors are called the training features. The goal is to derive a model that is able to approximate/predict a 'good' output vector y for an unknown input vector x.

Here are two examples of such models:
  • Decision tree ensembles (random forests): The idea is to construct a forest of decision trees at training time. RedisML can be used for model serving by leveraging these decision trees, e.g. for classification purposes. The class which appears most often will be the winner.
  • Neural networks: Train the weighted connections between neurons by using a learning algorithm (e.g. backpropagation).
The following example leverages two very small decision trees:
  • Users with an age <= 20 like Manga comics
  • Users with more than 1000 do not like Manga comics
Python Script
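The original script is in the notebook; the following is only a rough sketch based on my reading of the RedisML README (the ML.FOREST.ADD path/splitter syntax and the feature names are assumptions, so please double-check them against the module documentation):

# Tree paths: '.' is the root, '.l'/'.r' are the left/right children.
# Tree 0 models the age rule, tree 1 the '> 1000' rule.
r_m.execute_command('ML.FOREST.ADD', 'manga_forest', 0,
                    '.', 'NUMERIC', 'age', 20, '.l', 'LEAF', 1, '.r', 'LEAF', 0)
r_m.execute_command('ML.FOREST.ADD', 'manga_forest', 1,
                    '.', 'NUMERIC', 'count', 1000, '.l', 'LEAF', 1, '.r', 'LEAF', 0)

# Classify an 18 year old user with a count of 50: the majority vote of the
# two trees predicts whether Manga comics should be recommended
print(r_m.execute_command('ML.FOREST.RUN', 'manga_forest', 'age:18,count:50'))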

I feel that this article could tell much more about 'Neural Networks and Artificial Intelligence', but I also hope it's understandable that this is a very wide field, so I think it is worth writing a dedicated article about Redis for Machine Learning and Artificial Intelligence at a later point in time.

Finally, here are some additional Redis modules that didn't make it into one of the earlier sections:

  • Neural Redis: A Redis module that implements feed-forward neural networks as a native data type for Redis. The project goal is to provide Redis users with an extremely simple-to-use machine learning experience.
  • Countminsketch: An approximate frequency counter
  • Topk: An almost deterministic top-k elements counter
  • Redis-tdigest: A t-digest data structure which can be used for accurate online accumulation of rank-based statistics such as quantiles and cumulative distributions at a point.
Thanks for reading! Feedback regarding this article is very welcome!

Tuesday, 19 June 2018

Asynchronous Operation Execution with Netty on Redis

Netty got my attention a while back and I just wanted to play around with it a bit. Given that I have already fallen in love with Redis, what could be more fun than implementing a low-level client for Redis based on Netty?

Let's begin by answering the question "What the hell is Netty?". Netty is an asynchronous (Java-based) event-driven network application framework. It helps you to develop high-performance protocol servers and clients.

We are obviously more interested in the client part here, meaning that this article focuses on how to interact with a Redis server.

Netty already comes with RESP support. The package 'io.netty.handler.codec.redis' contains several Redis message formats:


  • RedisMessage: A general Redis message
  • ArrayRedisMessage: An implementation of the RESP Array message
  • SimpleStringRedisMessage: An implementation of a RESP Simple String message
  • ...

So all we need to do is to:

  1. Bootstrap a channel: A channel is a nexus to a network socket or a component which is capable of I/O operations. Bootstrapping means assigning the relevant components to the channel (event loop group, handlers, listeners, ...) and establishing the socket connection. An example class can be found here.
  2. Define a channel pipeline: We use an initialization handler in order to add several other handlers to the channel's pipeline. The pipeline is a list of channel handlers, whereby each handler handles or intercepts inbound events or outbound operations. Our channel pipeline has the following handlers: RedisDecoder (an inbound handler that decodes into a RedisMessage), RedisBulkStringAggregator (an inbound handler that aggregates a BulkStringHeaderRedisMessage and its following BulkStringRedisContents into a single FullBulkStringRedisMessage), RedisArrayAggregator (aggregates RedisMessage parts into an ArrayRedisMessage) and RedisEncoder (an outbound handler that encodes a RedisMessage into bytes by following RESP, the REdis Serialization Protocol). Netty will first apply the outbound handlers to the passed-in value and then put the encoded message on the socket. When the response is received, it will apply the inbound handlers. The last handler is then able to work with the decoded (pre-handled) message. An example of such a pipeline definition can be found here.
  3. Add a custom handler: We also add a custom duplex handler to the pipeline. It is used to execute custom logic when a message is received (channelRead) or sent (write). We are not yet planning to execute business logic based on the RedisMessage; instead we just want to fetch it, which means that our handler simply allows retrieving the result. My handler provides an asynchronous method to do so: the method 'sendAsyncMessage' returns a Future. It's then possible to check if the Future is completed, and when it is completed you can get the RedisMessage from it. This handler buffers the futures until they are completed. The source code of my example handler can be found here.
BTW: It's also possible to attach listeners to a channel. While I initially found it a good idea to use listeners in order to react to new messages, I had to realize that channel listeners are invoked before the last handler (the last one is usually your custom one), which means that the received message has not yet gone through the channel pipeline when the listener is invoked. So my conclusion is that channel listeners are better suited for side tasks (informing someone that something was received, logging a message, ...) than for the message processing itself, whereas handlers are designed to process the received messages. If you still want to use listeners, a better way is to let the handler work with promises and then attach the listener to the promise of a result.

In addition, the following classes were implemented for demoing purposes:
  • GetMsg and SetMsg: Extend the class ArrayRedisMessage by defining what a GET and a SET message look like.
  • AsyncRedisMessageBuffer: A message buffer which uses a blocking queue in order to buffer outgoing and incoming messages. The Redis client handler (my custom handler) does the following: sending a message causes the Future to be put into the buffer; when the response arrives, the Future is completed and removed from the buffer. Whoever called the 'sendAsyncMessage' method hopefully still has a reference to the just-dequeued Future. I used a 'LinkedBlockingDeque', which means that the implementation should be thread-safe.
Here is a code example of how to use the handler in order to execute an asynchronous GET operation:



Hope you enjoyed reading this blog post! Feedback is welcome.


Wednesday, 6 June 2018

Data Encryption at Rest

Data security and protection is currently a hot topic. It seems that we have reached the point where the pendulum is swinging back again. After years of voluntary openness, of sharing personal information freely with social networks, people are getting more and more concerned about how their personal data is used in order to profile or influence them. Social network vendors are currently getting bad press, but maybe we should ask ourselves the fair question "Didn't we know all along that their services are not free and that we are paying with our data?". Maybe not strictly related to prominent (so-called) 'data scandals', but at least following the movement of the pendulum, is the new European GDPR regulation around data protection. Even if I think that it tends to 'overshoot the mark' (as we would say in German) and sometimes leaves data controllers and processors in the dark (unexpected rhyme ...), it is a good reason for me to address some security topics again from a technical point of view. So this article is about 'Data Encryption at Rest' on Linux servers.

To be more precise, this article mainly focuses on how to ensure that folders are encrypted under Linux. Linux provides the following ways to encrypt your data:
  • Partition level: This allows you to define an encrypted partition on a hard drive. I think this is the most commonly seen way of encrypting data with Linux. Most Linux distributions already provide this option during the installation.
  • Folder level: Allows you to encrypt specific folders, e.g. by mounting them under a specific path. The 'ecryptfs' solution can be used for such a purpose.
  • File level: It is also possible to encrypt single files. The PGP (Pretty Good Privacy) tools can be used for this purpose.

This article focuses on folder-level encryption. It has the advantage that you can define encrypted folders on demand without the need to repartition your drives. It also doesn't just work on single files but allows you to mount your folder directly. Each file that is stored in the encrypted folder is encrypted separately. This is especially useful if you want to encrypt only specific data and give only specific users unencrypted access. One use case would be to allow only your CIFS service (file server service) unencrypted access to the folder. I can also easily see database systems leveraging this feature, whereby I didn't test which performance implications might be seen when using folder-level encryption with a DBMS.

  • Step 0 - Install 'ecryptfs'
apt-get -y install ecryptfs-utils
  • Step 1 - Create a hidden directory: Let's assume that we have a folder /mnt/data which is the mount point of your main data partition (in my case an EXT4 partition on a RAID1 of 2 spinning HDDs). We create a hidden folder named '.encrypted' there:
mkdir /mnt/data/.encrypted

  • Step 2 - Create a second folder: This folder is used as our mount point. All access needs to happen via this second folder.
mkdir /mnt/encrypted
  • Step 3 - Mount the hidden folder as an encrypted one: Let's assume we want to access our encrypted folder under /mnt/encrypted. This means that each write to the newly mounted folder involves encrypting the written data. Here is a small script which does the job:
#!/bin/bash
mount -t ecryptfs \
  -o rw,relatime,ecryptfs_fnek_sig=82028e5be8a0a05b,\
ecryptfs_sig=55028e0be5a0a08a,ecryptfs_cipher=aes,\
ecryptfs_key_bytes=16,ecryptfs_unlink_sigs \
  /mnt/data/.encrypted /mnt/encrypted

The mount command will ask you for a passphrase. The same passphrase has to be provided for every remount.

WARNING: If you lose your passphrase, then you will no longer be able to read your previously encrypted data.

This is what's stored in your mounted folder:

root@ubuntu-server:/mnt/encrypted# ls
hello2.txt  hello.txt
root@ubuntu-server:/mnt/encrypted# cat hello.txt 
Hello world!

The original (hidden) folder, on the other hand, contains the encrypted data:

root@ubuntu-server:/mnt/data/.encrypted# ls
ECRYPTFS_FNEK_ENCRYPTED.FWYW-ctPtO0USURgl98vtKSoykT9hmQROUa3cBMaMT0UyWKbxkF7KQOiU---  ECRYPTFS_FNEK_ENCRYPTED.FWYW-ctPtO0USURgl98vtKSoykT9hmQROUa3TeggyUTAxFqhqUkBB.a-Bk--
root@ubuntu-server:/mnt/data/.encrypted# cat ECRYPTFS_FNEK_ENCRYPTED.FWYW-ctPtO0USURgl98vtKSoykT9hmQROUa3cBMaMT0UyWKbxkF7KQOiU---
??tY?ì
?"3DUfw`n6?
           ?3ﯙY7?_?_CONSOLE"?[堠zx?ŷZ?G??铅?Lj*?9?.fEN??`????R?:??83?F???{???
                                                                            ??_Z&tx?,?2!?w


Access to the decrypted data is possible under the following circumstances:

  • The logged-in user has permission to read or write the folder /mnt/encrypted
  • The folder /mnt/data/.encrypted was mounted to /mnt/encrypted by providing the passphrase
In particular, it is no longer possible to read the unencrypted data after physically removing the hard disk from the machine. As said, it's necessary to know the passphrase in order to decrypt the data in this folder again.

Hope this article is helpful :-) .

Monday, 22 January 2018

To PubSub or not to PubSub, that is the question


Introduction 

 The PubSub pattern is quite simple:
  • Publishers can publish messages to channels
  • Subscribers of these channels are able to receive the messages from them
The publisher has no knowledge about the functionality of any of the subscribers, so they act independently. The only thing which glues them together is a message within a channel.

Here is a very brief example with Redis:
  • Open a session via 'redis-cli' and enter the following command in order to subscribe to a channel with the name 'public':
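SUBSCRIBE public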

  • In another 'redis-cli' session, enter the following command in order to publish the message 'Hello world' to the 'public' channel:
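PUBLISH public "Hello world"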


The result in the first session is:
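1) "message"
2) "public"
3) "Hello world"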



BTW: It's also possible to subscribe to a bunch of channels by using patterns, e.g. `PSUBSCRIBE pub*` 

 

Fire and Forget 

If we started additional subscribers after our experiment, they wouldn't receive the previous messages. So we can see that we only receive messages while we are actively subscribed, meaning that we can't retrieve missed messages afterwards. In other words:
  • Only currently listening subscribers receive messages
  • A message is received by all active subscribers of a channel
  • If a subscriber dies and comes back later, then it might have missed messages
PubSub is completely independent of the key space. So whatever is published to a channel will not directly affect the data in your database. Published messages are not persisted and there are no delivery guarantees. However, you can indeed use it in order to notify subscribers that something affected your key space (e.g. "The value of item 'hello:world' has changed, you might fetch the change!"). So what's the purpose of PubSub then? It's about message delivery and notifications. Each of the subscribers can decide for itself how to handle the received message. Because all subscribers of a channel receive the same message, it's obviously not about scaling the workload itself. This is an important difference in comparison to message queue use cases.

 

Message Queues

Message queues, on the other hand, are intended to scale the workload. A list of messages is processed by a pool of workers. As the pool of workers is usually limited in size, it's important that messages are buffered until a worker is free to process them. Redis (Enterprise) features like
  • Persistence
  • High Availability
are much more important for such a queuing scenario: such a queue should survive a node failure or a node restart. Redis fortunately comes with the List data structure for simple queues and the Sorted Set structure for priority queues.

It's important to state that there are already plenty of libraries and solutions out there for this purpose. Here are two examples:
A very simple queue implementation would use a list. Because the entries of the list are strings, it would be good to encode messages as e.g. JSON if they have a more complex structure. (A small sketch of the whole flow follows after the list of steps below.)
  • Create a queue and inform the scheduler that a new queue is alive:

  • Add 2 messages to the queue:


  • Schedule the workers: We could indeed use a more complex scheduling approach. However, the simplest (and most naive) one is to just assign the next free worker of the pool to the next message. So in order to dequeue a message we can just use `LPOP`:



BTW: If our queue were initially empty, there is a way to wait for a while until something arrives, by using the `BLPOP` command.
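The original screenshots of these steps are not reproduced here; the following is a small redis-py sketch of the whole flow (queue and channel names are made up):

import json
import redis

r = redis.StrictRedis(decode_responses=True)

# Create a queue and tell the scheduler about it via PubSub
r.publish('scheduler', json.dumps({'event': 'queue_created', 'queue': 'q:demo'}))

# Add two messages to the queue ...
r.rpush('q:demo', json.dumps({'task': 'resize', 'img': 'a.png'}))
r.rpush('q:demo', json.dumps({'task': 'resize', 'img': 'b.png'}))
# ... and notify the scheduler that there is work to do
r.publish('scheduler', json.dumps({'event': 'new_work', 'queue': 'q:demo'}))

# A worker (picked by the scheduler) dequeues the next message;
# BLPOP would block for up to 5 seconds if the queue were empty
print('Processing:', r.lpop('q:demo'))
print('Processing:', r.blpop('q:demo', timeout=5))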

Using PubSub is actually optional for our message queue example. It's easy to see that the scheduler could also assign workers without getting notified, because it can access the queues and messages at any time. However, I found it a bit more dynamic to combine our queue example with PubSub:
  • The scheduler gets notified when new work needs to be assigned to the workers
  • As these notifications are fire and forget, it would also be possible for the scheduler to check from time to time if there is something to do
  • If the scheduler dies, then another instance can be started which can access the database in order to double-check which work was already done by the workers and which work still needs to be done. An interrupted job can be restarted based on such state information.

 

Summary 

Redis' PubSub is 'fire and forget'. It's intended to be used to deliver messages from many (publishers) to many (subscribers). It's indeed a useful feature for notification purposes. However, it's important to understand the differences between a messaging and a message processing use case.

The way we used it was to inform a single scheduler that some work needs to be done. The scheduler would then hand over to a pool of worker threads in order to process the actual queue. The entire state of the queue was stored in our database as a list, because PubSub alone is not intended to be used for message queuing use cases. In fact, the usage of PubSub in our queuing example was optional.

Thursday, 27 July 2017

Indexing with Redis

If you follow my news on Twitter, then you might have realized that I just started to work more with Redis. Redis (= Remote Dictionary Server) is known as a data structure store. This means that we can deal not just with key-value pairs (called Strings in Redis) but also with data structures such as Hashes (hash maps), Lists, Sets or Sorted Sets. Further details about the data structures can be found here:


Indexing in Key-Value Stores

With a pure key-value store, you would typically maintain your index structures manually by applying some KV-store patterns. Here are some examples:

  • Direct access via the primary key: The key itself is semantically meaningful, so you can access a value directly by knowing how the key is structured (by using key patterns). An example would be to access a user profile by knowing the user's id. The key looks like 'user::<uid>'.
  • Exact match by a secondary key: The KV-store itself can be seen as a huge hash map, which means that you can use lookup items in order to reference other ones. This gives you a kind of hash index. An example would be to find a user by his email address. The lookup item has the key 'email::<email_addr>', whereby the value is the key of the user. In order to fetch the user with a specific email address, you just need to do a Get operation on the key with the email prefix and then another one on the key with the user prefix.
  • Range by a secondary key: This is where it is getting a bit more complicated with pure KV-stores. Most of them allow you to retrieve a list of all keys, but doing a full 'key space scan' is not efficient (complexity of O(n), n = number of keys). You could indeed build your own tree structure by storing lists as values and referencing between them, but maintaining these search trees on the application side is really not what you usually want to do.


The Redis Way

So how does Redis address these examples? By leveraging the power of data structures such as Hashes and Sorted Sets.

Direct Access via the Primary Key

A Get operation already has a complexity of O(1). This is the same for Redis.

Exact Match by a Secondary Key

Hashes (as the name already indicates) can be used directly to build a hash index in order to support exact match 'queries'. The complexity of accessing an entry in a Redis Hash is indeed O(1). Here is an example:
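The original example is not shown here; a minimal redis-py sketch of the pattern (key names are assumptions) could look like this:

import redis

r = redis.StrictRedis(decode_responses=True)

# Primary data
r.set('user::1001', '{"name": "David", "email": "david@example.com"}')

# Hash index for exact match lookups by email
r.hset('idx::email', 'david@example.com', 'user::1001')

# 'Query': O(1) lookup in the index, then O(1) fetch of the user
user_key = r.hget('idx::email', 'david@example.com')
print(r.get(user_key))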

In addition, Redis Hashes support operations such as HSCAN. This gives you a cursor-based approach to scan hashes. Further information can be found here:


Here is an example:
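A small sketch, continuing with the index from above:

# Cursor-based scan over the index, optionally filtered by a pattern
cursor, entries = r.hscan('idx::email', 0, match='*@example.com')
print(entries)

# redis-py also offers a convenience iterator that handles the cursor for you
for email, user_key in r.hscan_iter('idx::email', match='*@example.com'):
    print(email, '->', user_key)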


Range By a Secondary Key

Sorted Sets can be used to support range 'queries'. The way this works is that we use the value we are searching for as the score (order number). Scanning such a Sorted Set then has a complexity of O(log(n)+m), whereby n is the number of elements in the set and m is the size of the result set.

Here is an example:
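A small sketch (the ages and keys are made up):

# Index users by age: the age is used as the score
r.zadd('idx::age', {'user::1001': 32, 'user::1002': 25, 'user::1003': 41})

# Range 'query': all users between 30 and 45
print(r.zrangebyscore('idx::age', 30, 45, withscores=True))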

If you add two elements with the same score, then they are sorted lexicographically. This is interesting for non-numeric values. The command ZRANGEBYLEX allows you to perform range 'queries' by taking the lexicographic order into account.


Modules

Redis now supports modules (since v4.0). Modules allow you to extend Redis' functionality. One module which perfectly matches the topic of this blog post is RediSearch. RediSearch basically provides full-text indexing and searching capabilities to Redis. It uses an inverted index behind the scenes. Further details about RediSearch can be found here:

Here is a very basic example along the lines of the RediSearch documentation:
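Roughly, using redis-py's generic command interface against an instance with the RediSearch module loaded (pre-2.0 API with FT.ADD; index and field names follow the documentation example):

r.execute_command('FT.CREATE', 'myIdx', 'SCHEMA', 'title', 'TEXT', 'WEIGHT', 5.0, 'body', 'TEXT', 'url', 'TEXT')
r.execute_command('FT.ADD', 'myIdx', 'doc1', 1.0, 'FIELDS',
                  'title', 'hello world', 'body', 'lorem ipsum', 'url', 'http://redis.io')
print(r.execute_command('FT.SEARCH', 'myIdx', 'hello world', 'LIMIT', 0, 10))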


As usual, I hope that you found this article useful and informative. Feedback is very welcome!


Thursday, 6 April 2017

Kafka Connect with Couchbase

About Kafka

Apache Kafka is a distributed, persistent message queuing system. It is used to realize publish-subscribe use cases, process streams of data in real time and store a stream of data safely in a distributed, replicated cluster. That said, Apache Kafka is not a database system, but it can stream data from a database system in near real time. The data is represented as a message stream in Kafka. Producers put messages into a so-called topic and Consumers take messages out of it for further processing. There is a variety of connectors available. A short introduction to Kafka can be found here: https://www.youtube.com/watch?v=fFPVwYKUTHs . This video explains the basic concepts and what Producers and Consumers look like. Couchbase supports 'Kafka Connect' since version 3.1 of its connector. The Kafka documentation says: "Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka." Kafka provides a common framework for Kafka connectors. It can run in a distributed or standalone mode and is distributed and scalable by default.

Setup

Kafka uses Apache Zookeeper. Zookeeper is a cluster management service. The documentation states that "ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications ... ZooKeeper aims at distilling the essence of these different services into a very simple interface to a centralized coordination service."

After downloading and extracting the standard distribution of Apache Kafka, you can start a local Zookeeper instance with the default configuration in the following way:
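Assuming you are in the Kafka installation directory:

bin/zookeeper-server-start.sh config/zookeeper.properties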


The next step is to configure 3 Kafka message broker nodes. We will run these services on the same host for demoing purposes, but it's obvious that they could also run in a more distributed way. In order to do so, we need to create configurations for the broker servers: copy the config/server.properties file to server-1.properties and server-2.properties and then edit them. The file 'server.properties' contains, among others, the following settings:
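The per-broker settings that matter here look roughly like this (shown for broker 0; the log directory path is an assumption and the real file contains many more defaults):

broker.id=0
listeners=PLAINTEXT://:9092
log.dirs=/tmp/kafka-logs-0
zookeeper.connect=localhost:2181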


Let's assume that $i is the id of the broker. So the first broker has the id '0', listens on port 9092 and logs to 'kafka-logs-0'. The second broker has the id '1', listens on port 9093 and logs to 'kafka-logs-1'. The third broker configuration is self-explanatory.


The next step is to download and install the Couchbase plug-in. Just copy the related libraries to the 'libs' sub-folder and the configuration files to the 'config' sub-folder of your Kafka installation.


Streaming data from Couchbase

Before we can stream data from Couchbase, we need to create the topic to which we want to stream. So let's create a topic named 'test-cb'.
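With the tooling that ships with Kafka this looks roughly as follows (the replication factor of 3 is an assumption):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 3 --topic test-cb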


You can then describe this topic by using the following command:
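bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test-cb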


The topic which we created has 3 partitions. Each node is the leader for one partition. The leader is the node responsible for all reads and writes for the given partition. 'Replicas' is the list of nodes that replicate the log for this partition.

Now let's create a configuration file for distributed workers under 'config/couchbase-distributed.properties':


The Connect settings are more or less the default ones. Now we also have to provide the connector settings. When using the distributed mode, the settings have to be provided by registering the connector via the Connect REST service:


The configuration file 'couchbase-distributed.json' has a name attribute and an embedded object with the configuration settings:


The Couchbase settings refer to a Couchbase bucket and the topic name to which we want to stream the DCP messages out of Couchbase. In order to run the Connect workers in distributed mode, we can now execute:
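Again from the Kafka installation directory:

bin/connect-distributed.sh config/couchbase-distributed.properties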


The log file contains information about the tasks. We configured 2 tasks to run. The output contains the information about which task is responsible for which Couchbase shards (vBuckets):


For now, let's just consume the 'test-cb' messages by using a console logging consumer:
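For example (reading the topic from the beginning):

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-cb --from-beginning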


One entry looks like the following:


We just used the standard value converter. The value is in reality a JSON document, but it is represented as a Base64-encoded string in this case.

Another article will explain how to use Couchbase via Kafka Connect as the sink for messages.

Monday, 12 September 2016

Visualizing time series data from Couchbase with Grafana

Grafana is quite a popular tool for querying and visualizing time series data and metrics. If you follow my blog, then you might have seen my earlier post about how to use Couchbase Server for managing time series data:




This blog post is now about extending this idea by providing a Grafana Couchbase plug-in for visualization purposes.

After you have installed Grafana (I installed it on Ubuntu, but there are installation guides available here for several platforms), you are asked to configure a data source. Before we use Grafana's 'SimpleJson' data source, it's worth looking at what the backend of such a data source looks like. It needs to provide the following endpoints (a minimal sketch follows after the list):


  • '/': Returns any successful response in order to test whether the data source is available.
  • '/search': Returns the available metrics. We will just return 'dax' in our example.
  • '/annotations': Returns an array of annotations. Such an annotation has a title, a time at which it occurs, a text and a tag. We just return an empty array in our example, but you can easily see that it would be possible to create an annotation if a specific value is exceeded or a specific time is reached.
  • '/query': The request contains a time range and a target metric. The result is an array which has an entry for every target metric, and each of these entries has an array of data points. Each data point is a tuple of the metric value and the timestamp.
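My backend is a small Node.js service (the full implementation is linked below); just to illustrate the contract of these four endpoints, here is a minimal, language-swapped sketch in Python/Flask that serves hard-coded data (the route shapes follow the SimpleJson data source, everything else is made up):

from time import time
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def health():
    # Any successful response tells Grafana that the data source is reachable
    return jsonify({'success': True})

@app.route('/search', methods=['POST'])
def search():
    # The only metric this backend offers
    return jsonify(['dax'])

@app.route('/annotations', methods=['POST'])
def annotations():
    # No annotations in this sketch
    return jsonify([])

@app.route('/query', methods=['POST'])
def query():
    req = request.get_json()
    now_ms = int(time() * 1000)
    # One entry per requested target; each data point is [value, timestamp in ms]
    datapoints = [[12000 + i, now_ms - (10 - i) * 60000] for i in range(10)]
    return jsonify([{'target': t['target'], 'datapoints': datapoints}
                    for t in req['targets'] if t['target'] == 'dax'])

if __name__ == '__main__':
    app.run(port=3333)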

We will just extend our earlier example with a Grafana endpoint and then point Grafana's generic JSON data source plug-in to it, but I can already see a project on the horizon which standardizes time series management in Couchbase via a standard REST service which could then be used by a dedicated Grafana Couchbase plug-in.

First let's look at our backend implementation:




As usual, the full code can be found here: https://github.com/dmaier-couchbase/cb-ts-demo/blob/master/routes/grafana.js

Here is how we implemented the backend:

  • '/': As you can see, we just return 'success: true' if the backend is accessible.
  • '/search': The only metric which our backend provides is the 'dax' one.
  • '/annotations': Only an example annotation is returned in this case.
  • '/query': We just check whether the requested metric is the 'dax' one. In this first example, we don't take the aggregation documents into account. Instead we just request the relevant data points by using a multi-get based on the time range. Because Grafana expects the data points in time order, we finally have to sort them by time. Again, this code will be extended in order to take the several aggregation levels into account (Year -> Month -> Day -> Hour).


 Now back to Grafana! Let's assume that you have successfully installed the 'SimpleJson' data source:


Then the only thing you need to do is to add a new data source to Grafana by pointing it to our backend service. (To run the backend service, just execute 'node app.js' after you have checked out the full repository and installed all necessary dependencies.)


In this example I actually just loaded a bit of random data for testing purposes by using the demo_data.js script.

Then all you have to do is to create a dashboard and place a panel on it:



The rest should work more or less the same as with any other Grafana data source. :-)