Saturday, 5 July 2014

Some thoughts about Redis

Preamble

I was recently asked a question about Redis, so I looked into Redis a bit and want to share my thoughts.


Basics

Redis is a Key-Value-Store that supports specific Data Structures, which is why it is also called a Data-Structure-Server. This means that the values of your KV-Pairs can have specific types. The following types are available:
  • Strings
  • Hashes
  • Lists
  • Sets
  • Sorted Sets
So Redis allows you to store object data as, e.g., Hashes (HashMaps). No schema is enforced, so the schema is flexible. What stands out, however, is that even more flexible structures like JSON objects are not supported. Furthermore, there is no built-in concept of Indexing and Querying. You can query via 'Key Patterns' and you can build indexes manually by using, e.g., Sorted Sets, but this is less flexible than predefined, automatically maintained indexes that you can run queries against.
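
To make the 'manual index' idea a bit more concrete, here is a minimal sketch in plain Python. It does not talk to a Redis server: the dict `objects` only stands in for Hashes stored under keys, and the sorted list `age_index` only stands in for a Sorted Set whose score is the indexed field. All names (`save_user`, `users_by_age`, `user:1`, ...) are made up for this example.

```python
import bisect

objects = {}     # key -> dict of fields, standing in for Redis Hashes
age_index = []   # sorted (score, key) pairs, standing in for a Sorted Set


def save_user(key, name, age):
    """Store the object and maintain the index by hand."""
    old = objects.get(key)
    if old is not None:
        # Remove the stale entry, otherwise the index drifts out of sync.
        age_index.remove((old["age"], key))
    objects[key] = {"name": name, "age": age}
    bisect.insort(age_index, (age, key))


def users_by_age(min_age, max_age):
    """Range query on the index; returns the keys of matching objects."""
    lo = bisect.bisect_left(age_index, (min_age, ""))
    hi = bisect.bisect_right(age_index, (max_age, chr(0x10FFFF)))
    return [key for _, key in age_index[lo:hi]]


save_user("user:1", "Ada", 36)
save_user("user:2", "Bob", 25)
save_user("user:3", "Cy", 41)
save_user("user:2", "Bob", 38)   # update: the index entry has to move too
print(users_by_age(30, 40))      # ['user:1', 'user:2']
```

The point of the sketch is the extra bookkeeping in `save_user`: every write has to touch the object and the index, and nothing keeps the two consistent except your own code.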

I found the 'Query on Key Patterns by using Wildcards' feature quite interesting, but it is not usable in most cases, because Redis simply iterates over all Keys and matches each one against the pattern (a so-called Full Table Scan, a term from the relational world, even though we do not have tables here).
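
The cost of such a pattern query can be sketched in a few lines of plain Python. This is only a model: the dict `keyspace` stands in for the Redis key space, and Python's `fnmatch` glob syntax is assumed to be close enough to Redis key patterns for illustration. The function name `keys_matching` is hypothetical.

```python
from fnmatch import fnmatchcase

keyspace = {"user:1": "Ada", "user:2": "Bob", "order:1": "book", "order:2": "pen"}


def keys_matching(pattern):
    """Return the matching keys and the number of keys that were inspected."""
    inspected = 0
    matches = []
    for key in keyspace:          # no index available: every key is visited
        inspected += 1
        if fnmatchcase(key, pattern):
            matches.append(key)
    return matches, inspected


print(keys_matching("user:*"))    # (['user:1', 'user:2'], 4)
```

Two keys match, but all four were inspected. The work grows with the total number of keys, not with the number of hits.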

Side note: Other solutions are considerably more flexible regarding the 'Indexing/Querying' feature. For instance, Couchbase Server allows you to use Incremental Map Reduce (on JSON document values) in order to define Views (Indexes) that can be queried via a REST interface.

The following basic operations are available in Redis:
  • SET $key $value - To set a String value
  • GET $key - To get a value
  • HSET $key $field $value - To set a field value of a stored Hash
  • HGET $key $field - To get a field value
  • ...
  • LSET $key $index $value - To set the value at a given index of a list
  • LINSERT $key BEFORE|AFTER $pivot $value - To insert a value into a list before or after the pivot value
  • LPUSH $key $value - To add a value to the head of a list
  • ...
  • DEL $key - To delete by the key
  • ...
  • (The complete list is available here: http://redis.io/commands)
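
To illustrate what 'values have types' means for these operations, here is a toy in-memory model in plain Python. It is not a Redis client and only mimics a handful of the commands listed above; the class `ToyStore` and its method names are made up for this sketch.

```python
class ToyStore:
    """In-memory stand-in for a few Redis commands; not a Redis client."""

    def __init__(self):
        self.data = {}

    def _check(self, key, expected):
        # A typed command fails if the key holds a value of another type.
        value = self.data.get(key)
        if value is not None and not isinstance(value, expected):
            raise TypeError(f"key {key!r} holds a different data type")
        return value

    def set(self, key, value):            # SET $key $value
        self.data[key] = str(value)

    def get(self, key):                   # GET $key
        return self._check(key, str)

    def hset(self, key, field, value):    # HSET $key $field $value
        if self._check(key, dict) is None:
            self.data[key] = {}
        self.data[key][field] = str(value)

    def hget(self, key, field):           # HGET $key $field
        stored = self._check(key, dict)
        return None if stored is None else stored.get(field)

    def lpush(self, key, value):          # LPUSH $key $value
        if self._check(key, list) is None:
            self.data[key] = []
        self.data[key].insert(0, str(value))   # new values go to the head

    def delete(self, key):                # DEL $key
        return self.data.pop(key, None) is not None


store = ToyStore()
store.set("greeting", "hello")
store.hset("user:1", "name", "Ada")
store.lpush("queue", "a")
store.lpush("queue", "b")
print(store.get("greeting"), store.hget("user:1", "name"), store.data["queue"])
```

Calling `store.get("user:1")` raises an error in this model, because the key holds a Hash and not a String.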



Persistence

My first impression was that Redis does not really focus on persistence. However, Redis provides two options:
  • Point in time snapshot of your in-memory data
    • Single file
    • Because it is not differential, it causes a lot more I/O than would be required
    • Possible data loss, because all writes made since the last snapshot are lost if the server crashes
      • Depends on configuration
      • For example, 'save 60 1000' takes a snapshot every 60 seconds if at least 1000 keys have changed
  • Write persistence logs
    • Just contains an entry for every write operation
    • Replayed during the next server startup 
You can combine both options, but the approach seems to be more of a 'Warmup my Cache based on a previous State' one than a 'Store Data in a Database' one.
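
The difference between the two options can be sketched with a toy model in plain Python. Nothing here reflects Redis' actual file formats: the JSON string `snapshot` only stands in for the single snapshot file and the list `log` only stands in for the write log. The class `ToyPersistence` and its methods are made up for this example.

```python
import json


class ToyPersistence:
    """Models a point-in-time snapshot plus a write log as Python values."""

    def __init__(self):
        self.memory = {}
        self.snapshot = "{}"   # stands in for the single snapshot file
        self.log = []          # stands in for the write log file

    def set(self, key, value):
        self.memory[key] = value
        self.log.append(("SET", key, value))   # one entry per write operation

    def take_snapshot(self):
        # The whole data set is written out again, not just the differences.
        self.snapshot = json.dumps(self.memory)

    def restart_from_snapshot(self):
        return json.loads(self.snapshot)

    def restart_from_log(self):
        restored = {}
        for op, key, value in self.log:        # replay every logged write
            if op == "SET":
                restored[key] = value
        return restored


db = ToyPersistence()
db.set("a", 1)
db.take_snapshot()
db.set("b", 2)                       # written after the last snapshot
# ... imagine the server crashes here ...
print(db.restart_from_snapshot())    # {'a': 1}
print(db.restart_from_log())         # {'a': 1, 'b': 2}
```

Restarting from the snapshot loses the write to 'b', while replaying the log recovers it at the price of one log entry per write.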

Side note: Usually, a Database System uses specific storage structures on disk in order to allow later access to the persisted data. The simplest one would be a Heap Structure, which just stores data block-wise in sequence. Most DBS I know also support B-Trees for storage purposes in order to have a built-in primary index, which even allows you to perform more efficient range queries on the primary key of the data.



Memory Management

As far as I understand, all the data that you want to store in Redis has to fit into the main memory (RAM) allocated by the Redis server. How much memory should be allocated is configurable. Redis has a 'Virtual Memory' feature, but it just means that data will be swapped to disk as soon as there is no more physical main memory available. I also found the following note in their documentation:
"Redis VM is now deprecated. Redis 2.4 will be the latest Redis version featuring Virtual Memory (but it also warns you that Virtual Memory usage is discouraged). We found that using VM has several disadvantages and problems. In the future of Redis we want to simply provide the best in-memory database (but persistent on disk as usual) ever, without considering at least for now the support for databases bigger than RAM. Our future efforts are focused into providing scripting, cluster, and better persistence."

Side note: From my point of view, 'Memory Management' usually means a kind of 'Buffer Scheduling' and therefore includes a Replacement Strategy. This is not something you should expect from Redis.



Scalability

General terms regarding the 'Scalability' subject are:
  • Replication
  • Clustering
  • Sharding
Redis supports Master-Slave replication. You can configure a Server to become the Slave of another one (the Master). One Master can have multiple Slaves. The 'SLAVEOF' command can be used at run-time to add a Slave (although you also have to maintain it in the configuration file to make this topology change persistent). Making a node the Slave of a Master causes the data of the Master to be replicated to the Slave. Furthermore, future data changes will be asynchronously replicated from the Master to the Slave. So Replication allows you to keep copies of your data within a Cluster of nodes. So far, no Sharding (Partitioning) is involved. You could also add a Load Balancer on top of the cluster to spread the read load BUT because this is not a 'Shared Nothing' approach, you will obviously have consistency issues (in the sense of the CAP theorem).
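
Why asynchronous replication leads to consistency issues when reads are spread across Slaves can be shown with a small simulation in plain Python. This is a toy model, not Redis' replication protocol: the classes `Master` and `Slave` and the explicit `apply_pending` step are assumptions made for illustration, with `apply_pending` standing in for the replication delay.

```python
from collections import deque


class Master:
    def __init__(self):
        self.data = {}
        self.slaves = []

    def set(self, key, value):
        self.data[key] = value
        for slave in self.slaves:
            slave.pending.append((key, value))   # queued, not yet applied
        return "OK"   # acknowledged without waiting for any Slave


class Slave:
    def __init__(self, master):
        self.data = dict(master.data)   # initial full copy of the Master's data
        self.pending = deque()
        master.slaves.append(self)

    def apply_pending(self):
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value


master = Master()
master.set("counter", 1)
slave = Slave(master)             # comparable to issuing SLAVEOF at run-time
master.set("counter", 2)          # the client already got its "OK"

stale = slave.data["counter"]     # a read routed to the Slave right now
slave.apply_pending()             # the replication stream catches up
fresh = slave.data["counter"]
print(stale, fresh)               # 1 2
```

Between the acknowledgement and the moment the Slave catches up, a client reading from the Slave sees the old value.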

Side note: Because the replication happens asynchronously and the client gets no acknowledgement of the replication, Redis does not guarantee consistency for a write operation (in a replication scenario). Couchbase Server, for instance, allows you to specify the number of successfully performed replication operations that are required before the client gets a successful acknowledgement. Or simpler: The client can call a Set operation with a parameter 'ReplicateTo.$NUM' in order to make sure that the data was already replicated to another server. A 'PersistTo.$NUM' is also possible to make sure that the data was already persisted. However, it is only fair to say that the price for such a safer acknowledgement is a performance impact. So it depends on the requirements.

Although I already called the previous setup a Cluster (in fact it is only a High Availability Cluster, because you could promote one of the Slaves to Master when the Master fails), the Redis documentation says the following:

"Redis Cluster provides a way to run a Redis installation where data is automatically sharded across multiple Redis nodes"

BUT it also says

" It is currently not production ready, but finally entered beta stage, so we recommend you to start experimenting with it."
So the recommended way seems to be to use a third-party solution (Twemproxy) that sits between multiple Redis instances and the Clients.

However, let's shed some light on Redis' own Clustering approach: The data is sharded based on a hash function, namely 'CRC16(key) MOD 16384'. So there are 16384 shards that are evenly spread across several servers. Each server node of the cluster has approximately '16384 DIV n' partitions on it. Let's assume that you have the server nodes A, B and C. Then you can add replication Slaves to them (as mentioned before). So you have, for instance, the Master 'A' and the Slaves 'A1', 'A2', 'A3', ... . The number of Slaves for a Master controls the number of Replicas.
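
The formula 'CRC16(key) MOD 16384' can be sketched in plain Python as follows. The CRC16 variant used here is the XMODEM one (polynomial 0x1021, initial value 0), which is the variant the Redis Cluster specification names. The helper `slot_owner` is hypothetical: it simply assumes that each Master gets one contiguous, nearly equal range of shards, which is only one possible layout.

```python
NUM_SLOTS = 16384


def crc16(data: bytes) -> int:
    """CRC16, XMODEM variant: polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """'CRC16(key) MOD 16384'."""
    return crc16(key.encode("utf-8")) % NUM_SLOTS


def slot_owner(slot: int, nodes: list) -> str:
    """Assumed layout: contiguous ranges of about '16384 DIV n' per node."""
    per_node = NUM_SLOTS // len(nodes)
    # The last node also takes the few shards left over by the division.
    return nodes[min(slot // per_node, len(nodes) - 1)]


nodes = ["A", "B", "C"]
slot = key_slot("123456789")
print(slot, slot_owner(slot, nodes))   # 12739 C
```

With three Masters, each one holds roughly 5461 of the 16384 shards, and a client can compute the responsible node from the key alone.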

The setup of a (Sharded) Cluster can be performed by configuration, BUT in order to start a cluster you need a bunch of empty Redis nodes that have to be configured in a specific way. So, as far as I understood, it is not possible to just reuse your existing Redis setup (e.g. start with one normal Master and 2 Slaves). At least 3 Master nodes and 3 Slave nodes are recommended. Adding a new node means configuring an additional Master (and optionally an additional Slave) via the 'redis-trib' tool and then resharding the data across the Cluster. Automatic and manual failover are possible if one of the Masters fails.

Side note: Couchbase, for instance ;-) , gives each server node a double role. Each server is the Replication Master for its own active data but may be the Replication Slave for the active data of another node in the cluster. Replicas (similar to active data items) are also spread across the cluster by using a hashing approach. So cluster management is considerably simpler. All you have to do is to set the number of Replicas on a bucket (like a database) level and everything else will be handled automatically for you. Up to 3 replica copies are possible.



Summary

Redis is very strong as an In-Memory Key-Value-Store because it provides support for additional Data Structures. So you could consider it for caching purposes. An (from my point of view, even better) alternative for this use case is Couchbase Server (it allows you to create Memcached buckets, and so can also be used for pure in-memory use cases, or as a more mature Memcached replacement), which also provides JSON document support (including nested data, indexing and querying).

I think that even the Redis Development Team would agree that the Redis persistence could be improved. It is aimed more at the 'Warmup my cache based on previous state' scenario and says 'Hey, I am NOT a Storage Engine!'

In practice, Redis does not allow a data set bigger than the available RAM.

Redis is able to scale horizontally by providing features like High Availability (via Replicas) and hash-based Auto Sharding. However, the Cluster feature is not yet usable in production. And even though a hashing approach is used to distribute the data evenly across the Cluster, the Master-Slave-Replication approach seems somewhat inconsistent with it and more historically motivated than conceptual.


