Building a Recommendation Engine with Redis
When I was asked which topic I would like to present at this year's OOP conference, I was out of the box thinking about 'Something with Machine Learning' involved. It was years ago at the university when I had a secondary focus on 'Artificial Intelligence and Neural Networks' and I think that's fair to say that the topic was not as 'hot' as it is today. The algorithms were the same as today but the frameworks were not that commodity and calculations happened either on paper, with MatLab or with some very specialized software for neural network training. However, the actual discipline stayed fascinating and even if I would not call myself a Data Scientist (I sticked more with my primary focus which was Database Implementation Techniques - so I am more a database guy :-) ) I am really amazed of the adoption and number of arising frameworks in the field of Machine Learning and Artifical Intelligence.
Machine Learning or Artificial Intelligence is quite a wide field and so I concluded to go with something more specific which has touch points to Machine Learning and AI. The topic with which I finally went is:
Redis is modular, which means that it can be extended by modules (like plug-ins). A list of Redis modules can be found here.
Agenda
As I am already seeing that this article is becoming a bit longer than a blog article should be, here an outlook regarding the topics. I hope this will motivate the one or the other to continue reading ... .
Let's discuss which problem needs to be solved. Therefore the following simplified data model might be interesting:
Preparations
If you want to follow the code samples of this blog post then you can also find a Jupyter notbook here.
I basically prepared the following Redis instances (one per module):
Python Prep Script
Content Based Filtering
The idea is to look at what a specific user is interested in and then to recommend things those are similar (i.e. having the same class) as other things the user is liking.
Content based filtering can be done based on 'real set' operations. Redis is coming with a 'Set' data structure which is allowing to perform membership checks, scans, intersections, unions and so on.
Let's look at the following example:
Python Script
The output is:
Collarborative Filtering
The underlying idea is that if person A likes the same things as person B, then person B might also like the other items those are liked by person A. So it's mandatory to have details about many other users collected for a proper classification:
We are using again Redis' Set data structure and especially the union and diff operations.
Let's add some demo data:
Python Demo Data Script
Now let's look at the following example:
Python Script
We can now see which other users B could be interested in the same items as A and then derive a recommendation for the given user A based on the other interests of users B:
Ratings based Collaborative Filtering
We are talking about collaborative filtering again. So the approach is to derive a recommendation from similarities to other users. In addition we are now interested in 'How much does a user like an item?'. This allows us i.e. to find out if two or more users are liking similar things. Items those are liked by user B but not yet liked by user A could be also interesting for user A.
The Redis structure which is used is a 'Sorted Set'. An element in a sorted has a score. We will use this score as our rating value (1-5, i.e. stars). A very cool feature of sorted sets is that set operations are allowing to aggregate scores. By default, the resulting score of an element is the sum of the scores across the considered sorted sets. You can combine aggregations with weights. Weights are multiplicators for scores. The weight (1,-1) means to subtract the second score value from the first score value.
We will first find users B those rated the same items as a specific user A. The idea is then to leverage this aggregation feature in order to calculate the distance vector of ratings between user A and the previously identified users B. We will then use RMS in order to calculate the average distance as a scalar. The R(oot) M(ean) S(quare) value of a set of values is the square root of the arithmetic mean of the squares of the values. Only users with an average rating distance less or equal to 1 (which means that the users rating was very similar) will be considered. Finally we will recommend items of users B to user A, whereby we are only considering items with a score of at least 4.
I added some helper functions to the following script. I am bascially not pasting them here again by hoping that their functionality is self-explaining. The full source code can be found here: https://github.com/nosqlgeek/rl-recsys/blob/master/notebooks/Redis_for_Recommendations.ipynb .
Python Demo Data Script
Let's first find users B those are liking the same things as A:
Python Script
The output is:
Python Script
The result is that Pieter has a matching rating distance and so 'Aqua Man' is a highly recommended Comic to David (whatever tells this about Pieter ;-) ):
Social Collaborative Filtering
The previous examples used Sets and Sorted Sets. We are now exploring how to use Graphs. Our example is taking a social ('friend of') aspect into account. A mathematical Graph is described as a set of vertices V and a set of Edges E, whereby E is a subset of VxV. Graph database systems are extending this mathematical definition to a so called 'Property Graph Model' which means that vertices and edges can have properties (KV pairs) associated.
Our idea is to find all comics of friends B of a given user A those are interested in a specific comic category (Super Heros). Comic books that are liked more often by the friends of user A are more relevant and should be recommended.
I am again skipping the helper functions in order to avoid to blow this article even more up. If you are interested, then the full source code can be found here: https://github.com/nosqlgeek/rl-recsys/blob/master/notebooks/Redis_for_Recommendations.ipynb. The function names are hopefully self-explaining.
Python Demo Data Script
Here the actual Graph Query:
Python Script
As 'Wonder Woman' is liked by 2 of David's friends it is more relevant than the other comics:
Content Relevance via Full Text Search
RediSearch is a search engine module for Redis. It comes with multiple built-in scoring functions. We will look at T(erm)F(requency)I(inverse)D(ocument)F(requency). It takes the following aspects into account:
We are trying to identify how likely something belongs to a specific class/category by performing a text search for terms those are associated to this category.
Helper functions are again skipped in this article but can be found in the source code repo.
Python Script
Spiderman is more likely a super hero than Batman:
Probabilistic Data Structures
Probabilistic data structures are characterized in the follwoing way: They ...
We will take a look at the following two structures:
Python Script
The output is:
The bloom filter can be used to check if a user is interested a specific comic category without storing the users per category in a set:
Python Script
We are also printing the sizes out in order to demonstrate how space efficient Bloom filters are. The output of this script is:
Machine Learning for Classifications and Predictions
We are closing this blog article by circling back to the introduction of it. Classifications were so far often seen as a given (i.e. a comic book belonging to the 'Fantasy' category). Others could be derived by taking the existing user interests into account. We also mentioned in the section 'Data Model' that classifications might be derived from user attributes or item properties. Now, Machine Learning is providing us ways to describe a more complex models by taking such attributes (=features) into account. Such features can be represented by structured data (i.e. the comic name, ...) or unstructured data (i.e. the images within a comic book, the used colors within comic book). Feature vectors could be for instance derived from the bitmap of a scanned cover of a comic book. At the end, you can consider every ML approach as a way to approximate a function F(x) -> y, whereby x is a feature vector and y is the output vector. The idea is to create/train a model based on the known values y for given vectors x. These given vectors are called the training features. The idea is to derive a model which is able to approximate/predict a 'good' output vector y for an unknown input vector x.
Here 2 examples for such models:
Machine Learning or Artificial Intelligence is quite a wide field and so I concluded to go with something more specific which has touch points to Machine Learning and AI. The topic with which I finally went is:
- Redis Modules for Recommender Systems
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence.Redis is very popular. Here a ranking of the most popular database systems (just to highlight how popular Redis is):
Redis is modular, which means that it can be extended by modules (like plug-ins). A list of Redis modules can be found here.
Agenda
As I am already seeing that this article is becoming a bit longer than a blog article should be, here an outlook regarding the topics. I hope this will motivate the one or the other to continue reading ... .
- Data Model
- Preparations
- Content Based Filtering (using Sets)
- Collaborative Filtering (using Sets)
- Ratings based Collaborative Filtering (using Sorted Sets)
- Social Collaborative Filtering (using RedisGraph)
- Content Relevance via Full Text Search (using RediSearch)
- Probabilistic Data Structures (using the built-in HyperLogLog structure + ReBloom)
- Machine Learning for Classifications and Predictions (using Redis-ML and other AI modules)
Let's discuss which problem needs to be solved. Therefore the following simplified data model might be interesting:
- User: A real-life person which is interacting with a system by showing some interests in items. Users can be classified.
- Item: The thing users can be interested in. Items can be classified.
- Interest Classification: Classifications can happen based on item properties, the user attributes, the relationship of users to existing items or the relationship of users to other users and their items. A classification can be for instance expressed as a simple 'Class membership' or as a number which is telling how likely something belongs to which class.
- Recommendation: The actual recommendation, so which items could be interesting for a user is derived from some classifications.
- Users are just named users
- We are looking at specific items, which means Comic books
- Several algorithms and approaches are using different kinds of classifications
Preparations
If you want to follow the code samples of this blog post then you can also find a Jupyter notbook here.
I basically prepared the following Redis instances (one per module):
- Redis (r)
- Redis + Machine Learning (r_m)
- Redis (Bloom Filters) + HyperLogLog (r_b)
- Redis + Graph (r_g)
- Redis + Full Text Search (r_s)
Python Prep Script
Content Based Filtering
The idea is to look at what a specific user is interested in and then to recommend things those are similar (i.e. having the same class) as other things the user is liking.
Content based filtering can be done based on 'real set' operations. Redis is coming with a 'Set' data structure which is allowing to perform membership checks, scans, intersections, unions and so on.
Let's look at the following example:
Python Script
The output is:
David could be also interested in: { 'Fantastic Four', 'Wonder Woman', 'Batman', 'Dragon Age', 'Avatar', 'Valerian', 'Spiderman'}
Collarborative Filtering
The underlying idea is that if person A likes the same things as person B, then person B might also like the other items those are liked by person A. So it's mandatory to have details about many other users collected for a proper classification:
We are using again Redis' Set data structure and especially the union and diff operations.
Let's add some demo data:
Python Demo Data Script
Now let's look at the following example:
Python Script
We can now see which other users B could be interested in the same items as A and then derive a recommendation for the given user A based on the other interests of users B:
Users interested in the same items as David: {'david', 'pieter'}
David is interested in: {'Spiderman', 'Batman'}
David could be also interested in: {'Wonder Woman'}
Ratings based Collaborative Filtering
We are talking about collaborative filtering again. So the approach is to derive a recommendation from similarities to other users. In addition we are now interested in 'How much does a user like an item?'. This allows us i.e. to find out if two or more users are liking similar things. Items those are liked by user B but not yet liked by user A could be also interesting for user A.
The Redis structure which is used is a 'Sorted Set'. An element in a sorted has a score. We will use this score as our rating value (1-5, i.e. stars). A very cool feature of sorted sets is that set operations are allowing to aggregate scores. By default, the resulting score of an element is the sum of the scores across the considered sorted sets. You can combine aggregations with weights. Weights are multiplicators for scores. The weight (1,-1) means to subtract the second score value from the first score value.
We will first find users B those rated the same items as a specific user A. The idea is then to leverage this aggregation feature in order to calculate the distance vector of ratings between user A and the previously identified users B. We will then use RMS in order to calculate the average distance as a scalar. The R(oot) M(ean) S(quare) value of a set of values is the square root of the arithmetic mean of the squares of the values. Only users with an average rating distance less or equal to 1 (which means that the users rating was very similar) will be considered. Finally we will recommend items of users B to user A, whereby we are only considering items with a score of at least 4.
I added some helper functions to the following script. I am bascially not pasting them here again by hoping that their functionality is self-explaining. The full source code can be found here: https://github.com/nosqlgeek/rl-recsys/blob/master/notebooks/Redis_for_Recommendations.ipynb .
Python Demo Data Script
Let's first find users B those are liking the same things as A:
Python Script
The output is:
The following users rated David's items: ['pieter', 'david']Now let's calculate the similarities by then proposing the highest rated items of users B:
Python Script
The result is that Pieter has a matching rating distance and so 'Aqua Man' is a highly recommended Comic to David (whatever tells this about Pieter ;-) ):
The rating distance to pieter is [('batman', 1.0), ('superman', -1.0)]
The average distance (RMS) to pieter is 1.0
The following is highly recommended: [('aqua_man', 5.0)]
Social Collaborative Filtering
The previous examples used Sets and Sorted Sets. We are now exploring how to use Graphs. Our example is taking a social ('friend of') aspect into account. A mathematical Graph is described as a set of vertices V and a set of Edges E, whereby E is a subset of VxV. Graph database systems are extending this mathematical definition to a so called 'Property Graph Model' which means that vertices and edges can have properties (KV pairs) associated.
Our idea is to find all comics of friends B of a given user A those are interested in a specific comic category (Super Heros). Comic books that are liked more often by the friends of user A are more relevant and should be recommended.
I am again skipping the helper functions in order to avoid to blow this article even more up. If you are interested, then the full source code can be found here: https://github.com/nosqlgeek/rl-recsys/blob/master/notebooks/Redis_for_Recommendations.ipynb. The function names are hopefully self-explaining.
Python Demo Data Script
Here the actual Graph Query:
Python Script
As 'Wonder Woman' is liked by 2 of David's friends it is more relevant than the other comics:
David has the following friends: [[['name'], ['Pieter'], ['Vassilis'], ['Katrin']]]
David likes [[['name'], ['Spiderman'], ['Batman']]]
Comic 'Wonder Woman' with relevance 2.000000
Comic 'Batman' with relevance 1.000000
Comic 'Superman' with relevance 1.000000
Content Relevance via Full Text Search
RediSearch is a search engine module for Redis. It comes with multiple built-in scoring functions. We will look at T(erm)F(requency)I(inverse)D(ocument)F(requency). It takes the following aspects into account:
- Term Frequency: How often does a specific term appear?
- Inverse Document Frequency: An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely (i.e. the relevance of the word 'the'
We are trying to identify how likely something belongs to a specific class/category by performing a text search for terms those are associated to this category.
Helper functions are again skipped in this article but can be found in the source code repo.
Python Script
Spiderman is more likely a super hero than Batman:
[2, 'spiderman', 0.10000000000000001,['name', 'Spiderman'], 'batman', 0.035714285714285712,['name', 'Batman']]
Probabilistic Data Structures
Probabilistic data structures are characterized in the follwoing way: They ...
- use hash functions for randamization purposes
- return an approximated result
- the error is under a specific threshold
- are much more space efficient than deterministic approaches
- provide a constant query time
- speed is more important than correctness
- compactness is more important than correctness
- you only need certain data guarantees
We will take a look at the following two structures:
- HyperLogLog: Cardinality estimation of a set, i.e. unique visits
- Bloom Filter: Check if an item is contained in a set whereby false-positves are possible
Python Script
The output is:
HLL initial size: 31
Approx. count: 4
Please wait ...
Final HLL size: 10590 bytes
Approx. count: 99475
The bloom filter can be used to check if a user is interested a specific comic category without storing the users per category in a set:
Python Script
We are also printing the sizes out in order to demonstrate how space efficient Bloom filters are. The output of this script is:
BF size: 115 bytes
BF size: 76 bytes
Is Katrin interested in Fantasy?: 1
Is Katrin interested in Super Heros?: 0
Is David interested in Super Heros?: 1
Machine Learning for Classifications and Predictions
We are closing this blog article by circling back to the introduction of it. Classifications were so far often seen as a given (i.e. a comic book belonging to the 'Fantasy' category). Others could be derived by taking the existing user interests into account. We also mentioned in the section 'Data Model' that classifications might be derived from user attributes or item properties. Now, Machine Learning is providing us ways to describe a more complex models by taking such attributes (=features) into account. Such features can be represented by structured data (i.e. the comic name, ...) or unstructured data (i.e. the images within a comic book, the used colors within comic book). Feature vectors could be for instance derived from the bitmap of a scanned cover of a comic book. At the end, you can consider every ML approach as a way to approximate a function F(x) -> y, whereby x is a feature vector and y is the output vector. The idea is to create/train a model based on the known values y for given vectors x. These given vectors are called the training features. The idea is to derive a model which is able to approximate/predict a 'good' output vector y for an unknown input vector x.
Here 2 examples for such models:
-
Decision Tree ensembles (random forests). The idea is to conduct a
forest of decision trees at training time. RedisML can be used for the
Model Serving by leveraging these decision trees for i.e. classification
purposes. The class which appears most often will be the winner.
-
Neural networks: Train the weighted connections between neurons by
using a learning algorithm (i.e. Backpropagation).
- Users with an age <=20 are liking Manga comics
- Users with more than 1000 are not liking Manga comics
Python Script
I am feeling that this article could tell much more about 'Neural Networks and Artificial Intelligence', but I am also hoping that it's understandable that this is a very wide field and so I am thinking that it is worth to write a dedicated article about Redis for Machine Learning and Artificial Intelligence at a later point in time.
Finally, here some attitional Redis modules those didn't make it into one of the earlier sections:
I am feeling that this article could tell much more about 'Neural Networks and Artificial Intelligence', but I am also hoping that it's understandable that this is a very wide field and so I am thinking that it is worth to write a dedicated article about Redis for Machine Learning and Artificial Intelligence at a later point in time.
Finally, here some attitional Redis modules those didn't make it into one of the earlier sections:
- Neural Redis: Is a Redis module that implements feed forward neural networks as a native data type for Redis. The project goal is to provide Redis users with an extremely simple to use machine learning experience.
- Countminsketch: An apporximate frequency counter
- Topk: An almost deterministic top k elements counte
- Redis-tdigest: T-digest data structure wich can be used for accurate online accumulation of rank-based statistics such as quantiles and cumulative distribution at a point.