Tuesday, 30 September 2014

An Asynchronous World

This article aims to shed some light on asynchronous data processing in the Java world. But first, let's talk a bit about the difference between synchronous and asynchronous operation execution.

A synchronous execution means that the client performs an operation and then waits until the result is returned by the server. The requests are in sync with the responses, so multiple operations are executed in sequence. This means that the client-side execution is blocked until the operation returns.

An asynchronous execution, on the other hand, means that the client just performs operation calls (requests) to the server without waiting for the immediate result. The client is not blocked and simply continues to perform operation calls. Therefore, the result which is returned by the server has to be handled as soon as it arrives on the client side. The order in which the results arrive may not be the same as the order of the original requests. The function which handles such a result that arrives 'in the future' is usually called a Callback Function.


Several programming languages (or, more specifically, the frameworks that are related to them) have this concept implemented. E.g. Java has 'Futures' and AngularJS (an MVC framework from the JavaScript world) has 'Promises'.

Let's focus on Java for now and first take a look at 'Futures'.

'java.util.concurrent.Future' is an interface which provides the following methods:

  • Future<V>
    • V get() - Waits for the result by blocking until the result arrives
    • V get(long timeout, TimeUnit unit) - Waits at most for the provided timeout for the result
    • boolean isDone() - Checks if the work is done (or was interrupted)
    • boolean cancel(boolean mayInterruptIfRunning) - Cancels the work
    • boolean isCancelled() - Was it cancelled?
So a Future is the result of an asynchronous computation which gives you the ability to check the state of the computation and to get the result of the execution (in the future).

BTW: Futures are used in the Java world in the context of multi-threaded execution. Work is submitted to an ExecutorService. A thread pool can be used to parallelize the workload. Because one thread may be faster than another, the results may not be returned in the same order as the work was submitted. This means that the results are returned asynchronously.
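Here is a minimal sketch of how this looks in code (the task and its result value are made up):

    import java.util.concurrent.*;

    public class FutureExample {
        public static void main(String[] args) throws Exception {
            // A thread pool which is used to parallelize the workload
            ExecutorService executor = Executors.newFixedThreadPool(2);

            // Submitting work returns immediately with a Future
            Future<Integer> future = executor.submit(() -> {
                Thread.sleep(500); // simulate some work
                return 42;
            });

            System.out.println("Work submitted, the client is not blocked ...");

            // get() blocks until the result arrives (here: for max. 2 seconds)
            Integer result = future.get(2, TimeUnit.SECONDS);
            System.out.println("Result: " + result);

            executor.shutdown();
        }
    }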

If you use Couchbase's earlier Java Client (1.4.x), you will notice that an asynchronous operation returns a Future object.

Furthermore, it extends 'AbstractListenableFuture', which allows you to attach a listener (similar to a Callback Function) to the operation.
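A sketch of how attaching such a listener looks (assuming an already connected CouchbaseClient instance named 'client'; details may vary between 1.4.x versions):

    OperationFuture<Boolean> future = client.set("greeting", 0, "Hello!");

    // The listener is invoked as soon as the result arrives
    future.addListener(new OperationCompletionListener() {
        @Override
        public void onComplete(OperationFuture<?> f) throws Exception {
            System.out.println("Set completed with status: " + f.getStatus());
        }
    });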


OK, so this is how things work with the Couchbase 1.4.x Java Client. Things become even cooler with the new 2.x version, because the new Java Client uses ReactiveX for Java (http://reactivex.io). So let's talk a bit about what ReactiveX is and which patterns can be implemented by using it.

The ReactiveX website says: "ReactiveX - The Observer pattern done right; ReactiveX is a combination of the best ideas from the Observer pattern, the Iterator pattern, and functional programming."

  • In the Observer pattern, Observers are used to observe a subject (Observable). The Observers are notified (signaled by the Observable) when the state of the Observable changes and can react to such an event. It's easy to see that this can also be used for asynchronous data access.
  • The Iterator pattern allows you to iterate through a collection of objects without the need to know the internal structure of the collection. Combining the Observer pattern with the Iterator pattern means that you can receive a data stream (instead of the whole collection of data) and then react to the arrival of new data items.
  • The functional aspect is that you may pass functions (e.g. as anonymous classes or lambda expressions) as arguments to an Observable.

In RxJava, such an Observable object can be used in a similar way as a Java Iterable (e.g. using methods like 'skip', 'map', 'forEach'). The difference is that you just consume data from an Iterable (pull access), whereas an Observable also produces data for you (push access). In addition, there are two methods via which an Observable signals its Observers:
  • onCompleted() - An Observable calls its Observer's onCompleted() method when it has finished emitting items
  • onError() - An Observable calls its Observer's onError() method if an error occurred
Here is an example of how you can interact with an Observable:
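(A minimal sketch, assuming RxJava 1.x and Java 8 lambdas; the emitted items are made up.)

    import rx.Observable;

    public class ObservableExample {
        public static void main(String[] args) {
            Observable.just("Tim", "Bob", "Jim")
                .skip(1)                      // skip the first item
                .map(name -> "Hello " + name) // transform each item
                .subscribe(
                    greeting -> System.out.println(greeting),  // onNext
                    error -> error.printStackTrace(),          // onError
                    () -> System.out.println("Completed!")     // onCompleted
                );
        }
    }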

So let's finish this article by showing how Observables are used in the new 2.x Couchbase Java Client:
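(A minimal sketch; the bucket name and document id are made up, and error handling is reduced to printing the stack trace.)

    Cluster cluster = CouchbaseCluster.create("127.0.0.1");
    Bucket bucket = cluster.openBucket("default");

    // The async API returns Observables instead of blocking
    bucket.async()
          .get("user::1")
          .map(doc -> doc.content().getString("name"))
          .subscribe(
              name -> System.out.println("Name: " + name),
              error -> error.printStackTrace()
          );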

A more advanced example can be found in Michael's blog post about the new Couchbase Java SDK or in Phil's demo application.




Saturday, 5 July 2014

Some thoughts about Redis

Preamble

I just got a question regarding Redis. So I investigated Redis a bit and want to share my thoughts.


Basics

Redis is a Key-Value-Store which supports specific data structures. This is the reason why it is also called a Data-Structure-Server. What this means is that your values (of the KV pairs) can have specific types. The following types are available:
  • Strings
  • Hashes
  • Lists
  • Sets
  • Sorted Sets
So Redis allows you to store object data, e.g. as Hashes. No schema is enforced, so it supports a flexible schema. However, what is eye-catching is the fact that even more flexible structures like JSON objects are not supported. Furthermore, there is no internal concept of indexing and querying, but you can indeed query via 'Key Patterns' and you can build indexes manually by using e.g. Sorted Sets. However, this is less flexible than predefined and automatically maintained indexes and the possibility to perform queries on them.

I found the 'Query on Key Patterns by using Wildcards' feature quite interesting, but it is not usable in most cases, because what Redis does is just iterate over all keys (a so-called Full Table Scan ... even if we do not have tables here - it's a term from the relational world) while matching the pattern.

Side note: Other solutions are quite a bit more flexible regarding the 'Indexing/Querying' feature. For instance: Couchbase Server allows you to use Incremental Map Reduce (on JSON document values) in order to define Views (indexes) that can be queried via a REST interface.

The following basic operations are available in Redis:
  • SET $key $value - To set a String value
  • GET $key - To get a value
  • HSET $key $field $value - To set a field value of a stored Hash
  • HGET $key $field - To get a field value
  • ...
  • LSET $key $index $value - To set the value at an index of a list
  • LINSERT $key BEFORE|AFTER $pivot $value - To insert a value into a list relative to the pivot element
  • LPUSH $key $value - To add a value to the head of a list
  • ...
  • DEL $key - To delete a value by its key
  • ...
  • (The complete list is available here: http://redis.io/commands)
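To illustrate the basic operations from Java, here is a minimal sketch using the Jedis client (one of several Java clients for Redis; keys and values are made up):

    import redis.clients.jedis.Jedis;

    public class RedisExample {
        public static void main(String[] args) {
            Jedis jedis = new Jedis("localhost", 6379);

            // SET/GET a String value
            jedis.set("user:1:name", "Bob");
            System.out.println(jedis.get("user:1:name"));

            // HSET/HGET a field of a stored Hash
            jedis.hset("user:1", "email", "bob@example.com");
            System.out.println(jedis.hget("user:1", "email"));

            // LPUSH a value to the head of a List
            jedis.lpush("users", "user:1");

            // DEL a value by its key
            jedis.del("user:1:name");

            jedis.close();
        }
    }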



Persistence

My first impression was that Redis does not really focus on persistence. However, Redis provides two options:
  • Point in time snapshot of your in-memory data
    • Single file
    • Because it is not differential, there is a lot more IO than would be required
    • Possible data loss, because everything (multiple inserts or changes) since the last snapshot can be lost
      • Depends on configuration
      • Default is: snapshot every 60 seconds or after 1000 writes (key changes)
  • A persistent write log (append-only file)
    • Just contains an entry for every write operation
    • Replayed during the next server startup
You can combine both options, but the approach seems to be more a kind of 'Warm up my Cache based on a previous State' instead of a 'Store Data in a Database' one.
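Both options are controlled via the configuration file; for example (the values are just for illustration, the actual defaults may differ):

    # Snapshot every 60 seconds if at least 1000 keys changed
    save 60 1000

    # Additionally write an append-only log of every write operation
    appendonly yes
    appendfsync everysec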

Side note: Usually, a Database System uses specific storage structures on disk in order to allow later access to the persisted data. The simplest one would be a Heap Structure. It just stores data block-wise in series. Most DBS I know also support B-Trees for storage purposes in order to have a built-in primary index, which allows you to perform even more efficient range queries on the primary key of the data.



Memory Management

As far as I understand, all the data which you want to store in Redis has to fit into the main memory (RAM) which is allocated by the Redis server. How much memory should be allocated is configurable. Redis has a 'Virtual Memory' feature, but it just means that data will be swapped to disk as soon as there is no more physical main memory available. I also found the following note in their documentation:
"Redis VM is now deprecated. Redis 2.4 will be the latest Redis version featuring Virtual Memory (but it also warns you that Virtual Memory usage is discouraged). We found that using VM has several disadvantages and problems. In the future of Redis we want to simply provide the best in-memory database (but persistent on disk as usual) ever, without considering at least for now the support for databases bigger than RAM. Our future efforts are focused into providing scripting, cluster, and better persistence."

Side note: From my point of view, 'Memory Management' usually means a kind of 'Buffer Scheduling' and so it includes a Replacement Strategy. This is not something you should expect from Redis.



Scalability

General terms regarding the 'Scalability' subject are:
  • Replication
  • Clustering
  • Sharding
Redis supports Master-Slave replication. You can configure a server to become the Slave of another one (the Master). One Master can have multiple Slaves. The 'SLAVEOF' command can be used at run-time to add a Slave (even if you have to maintain it in the configuration file as well to make this topology change persistent). Making a node the Slave of a Master causes the data of the Master to be replicated to the Slave. Furthermore, future data changes will be asynchronously replicated from the Master to the Slave. So Replication allows you to keep copies of your data within a cluster of nodes. So far, no Sharding (Partitioning) is involved. You could also add a Load Balancer on top of the cluster to spread the read load, BUT because this is not a 'Shared Nothing' approach, you will obviously have consistency issues (regarding the CAP theorem).
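For example (the host names are made up):

    # Executed against the designated Slave at run-time:
    redis-cli -h slave-host SLAVEOF master-host 6379

    # To promote the Slave back to a Master:
    redis-cli -h slave-host SLAVEOF NO ONE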

Side note: Because the replication happens asynchronously and there is no acknowledgement regarding the replication (to the client), Redis does not guarantee consistency regarding a write operation (in a replication scenario). Couchbase Server, for instance, allows you to specify a number of successfully performed replication operations before the client gets a successful acknowledgement. Or simpler: The client can call a Set operation with a parameter 'ReplicateTo.$NUM' in order to make sure that the data was already replicated to another server. A 'PersistTo.$NUM' is also possible to make sure that the data was already persisted. However, it is only fair to say that the price for such a safer acknowledgement is a performance impact. So it depends on the requirements.

While I already named the previous setup a Cluster (in fact it is only a High Availability Cluster, because you could switch one of the Slaves to a Master when the Master fails), the Redis documentation says the following:

"Redis Cluster provides a way to run a Redis installation where data is automatically sharded across multiple Redis nodes"

BUT it also says

" It is currently not production ready, but finally entered beta stage, so we recommend you to start experimenting with it."
So the recommended way seems to be to use a 3rd party solution (Twemproxy) which sits between multiple Redis instances and the Clients.

However, let's shed some light on Redis' own Clustering approach: The data is sharded based on a hash function, namely 'CRC16(key) MOD 16384'. So there are 16384 hash slots that are evenly spread across several servers. Each server node of the cluster holds approximately '16384 DIV n' of the slots. Let's assume that you have the server nodes A, B and C. Then you can add replication Slaves to them (as mentioned before). So you have, for instance, the Master 'A' and the Slaves 'A1', 'A2', 'A3', ... . The number of Slaves for a Master controls the number of Replicas.

The setup of a (sharded) Cluster can be performed by configuration, BUT in order to start a Cluster you need a bunch of empty Redis nodes that need to be configured in a specific way. So, as far as I understood, it is not possible to just reuse your existing Redis setup (e.g. to start with one normal Master and 2 Slaves). At least 3 Master nodes and 3 Slave nodes are recommended. To add a new node means to configure an additional Master (and optionally an additional Slave) via the 'redis-trib' tool and then to reshard the data across the Cluster. Automatic and manual failover is possible if one of the Masters fails.

Side note: Couchbase, for instance ;-) , gives each server node a double role. Each server is the Replication Master for its own active data but may be the Replication Slave for the active data of another node in the cluster. Replicas (like the active data items) are also spread across the cluster by using a hashing approach. So the cluster management is quite a bit simpler. All you have to do is to set the number of Replicas on a bucket (like a database) level and everything else will be handled automatically for you. Up to 3 replica copies are possible.



Summary

Redis is very strong as an In-Memory Key-Value-Store which provides support for additional data structures. So you could consider it for caching purposes. A (from my point of view, even better) alternative for this use case is Couchbase Server (it allows you to create Memcached buckets, and so can also be used for pure in-memory use cases - or as a more mature Memcached replacement), which also provides JSON document support (including nested data, indexing and querying).

I think that even the Redis Development Team would agree that the Redis persistence could be improved. It covers more the 'Warm up my cache based on a previous state' scenario and says 'Hey, I am NOT a Storage Engine!'

In practice, Redis does not allow you to store more data than fits into RAM.

Redis is able to scale horizontally by providing features like High Availability (via Replicas) and hash-based Auto Sharding. However, the Cluster feature is not yet usable in production. And even if a hashing approach is used to distribute the data evenly across the Cluster, the Master-Slave replication approach seems to be somewhat inconsistent and more historically influenced than conceptual.



Friday, 13 June 2014

Understanding Couchbase's Elasticsearch configuration

Preamble 

Couchbase Server can be used together with Elasticsearch. This makes a lot of sense because Full-Text-Search is not one of the main purposes of a Key-Value-Store. Instead, an external (also scalable) search engine should be used in order to tokenize and index the data for full-text search purposes. The way it works with Couchbase is that you install a transport plug-in into your Elasticsearch instance. You can then configure Couchbase's XDCR (Cross Data Centre Replication) feature in order to replicate the JSON documents to Elasticsearch and have your JSON data indexed there. Then you can perform full-text searches by using the Lucene Query Language, getting the resulting keys (of documents that are stored in Couchbase) back. Further details can be found at http://docs.couchbase.com/couchbase-elastic-search/.

Elasticsearch Basics

Let's talk a bit about the Elasticsearch terminology before we shed some light on Couchbase's default Elasticsearch configuration.

Document Index

You can have multiple indexes. Elasticsearch can store multiple types in one single index. An index is called an inverted index because it works in an inverted way compared to e.g. relational indexes. In non-inverted indexes, we index on one or multiple key values to find the address of a specific data item. So you can understand a usual index as a two-column table which maps a key to a data item. An inverted index instead maps each term to the items which contain it. As an example, let's index the 2 sentences "Tim is sitting next to Bob" and "Bob is sitting next to Jim".
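The inverted index for these two sentences then looks roughly like this (the terms are lowercased by the analyzer):

    Term    | Contained in sentence
    --------|----------------------
    tim     | 1
    is      | 1, 2
    sitting | 1, 2
    next    | 1, 2
    to      | 1, 2
    bob     | 1, 2
    jim     | 2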

Document Type

A type is just a logical container. It contains multiple fields. What's important is that two fields with the same name must always have the same primitive (core) data type if both fields are associated with the same type. To add a JSON document to the index, the following REST call can be used:
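(A sketch using curl; the placeholders stand for your index, type and document id, and the document content is made up.)

    curl -XPUT 'http://localhost:9200/${index}/${type}/${id}' -d '
    {
        "firstName" : "Tim",
        "lastName"  : "Smith"
    }'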
The ${id} part is optional; if it is not given (by using POST instead of PUT), Elasticsearch will generate an identifier automatically.

Document Field

A JSON document has multiple fields. The following example shows the fields firstName and lastName.
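For example (the values are made up):

    {
        "firstName" : "Tim",
        "lastName"  : "Smith"
    }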
The field type will be determined automatically, e.g. 1 is handled as a number but "1" as a String. However, an explicit mapping is recommended.

A mapping file basically looks like the following:
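(A sketch; the 'person' type and its fields are made up to show the basic options.)

    {
        "person" : {
            "properties" : {
                "firstName" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
                "age"       : { "type" : "long", "null_value" : 0, "include_in_all" : false }
            }
        }
    }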
Basic options are:
  • type: The core type, e.g. long, string, boolean; used to tell the search engine how to analyze the field
  • store: Whether the field value should be stored. This is a kind of confusing: as far as I understood, a field can be searched (via the inverted index) if it is indexed. If you set 'store' to 'no' and 'index' to 'yes', then you can search for it but not show the content of the field. BUT if the whole document is kept as '_source', then you can still access the content via the source
  • index: The index mode: 'analyzed', 'not_analyzed' or 'no'. The value 'no' means that the field is not searchable. The difference between 'no' and 'not_analyzed' is that with the latter the field is indexed but, let's say, not tokenized. Instead, only "perfect match" queries are possible on the field
  • null_value: The default value if the field value is not available
  • include_in_all: The _all field is a special field in which all the other fields are automatically included. If set to 'no', then the field will not be included in the _all field
Built-in fields are:
  • _uid: Unique identifier composed of the document's type and _id
  • _id: The identifier of the document
  • _type: The type of the document (as indexed by type)
  • _all: To store the data of all the other fields in a single field in order to simplify searching. By default every field will be added to the _all field
  • _source: Is used to store the original source document. This is enabled by default. To avoid storage overhead, the 'includes' and 'excludes' options can be used to override the default.
Other fields are:
  • _index
  • _size
  • _timestamp
  • _ttl

 

The Default Mapping

As mentioned before, Elasticsearch maps automatically by default. Let's call this 'implicit mapping', whereby the other one is an 'explicit' one. So if a new type is encountered (because you put data into it) and you do not have an explicit mapping for it, then the _default_ mapping is used. By default it looks like the following one:
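(A rough sketch illustrating the defaults - dynamic mapping, the _all field and the _source field are enabled; the exact file may differ between Elasticsearch versions.)

    {
        "_default_" : {
            "dynamic" : true,
            "_all"    : { "enabled" : true },
            "_source" : { "enabled" : true }
        }
    }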
You can override the default mapping (globally, or per index). Everything which is added just overrides the old default value and then becomes the new default value.

Dynamic Templates

These give you better control (if necessary). So you can apply mappings based on the field name by using patterns. An example is to use a different analyzer for a field whose name ends with '_es'. For this, the match parameter is used (match a field based on a pattern, e.g. "match" : "*_es"). Additionally, you could match on paths of properties in a JSON document (e.g. address.*.name).

The Couchbase Template

Elasticsearch allows you to perform the configuration (e.g. the mapping) via its RESTful interface.
So the Couchbase mapping file looks like the following one:
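(Reconstructed from the description below; details may differ between versions of the transport plug-in.)

    {
        "template" : "*",
        "order" : 10,
        "mappings" : {
            "couchbaseCheckpoint" : {
                "_source" : { "includes" : ["doc.*"] },
                "dynamic_templates" : [
                    {
                        "store_no_index" : {
                            "match" : "*",
                            "mapping" : {
                                "store" : "no",
                                "index" : "no",
                                "include_in_all" : false
                            }
                        }
                    }
                ]
            },
            "_default_" : {
                "_source" : { "includes" : ["meta.*"] },
                "properties" : {
                    "meta" : { "type" : "object", "include_in_all" : false }
                }
            }
        }
    }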
The setting '"template" : "*"' means that the template applies to every index. A value of e.g. 'cou*' would mean that the template is only applied to indexes whose names start with 'cou'.

Multiple templates can potentially match one index. In this case the mappings are merged. The "order" setting specifies the order of this merge operation: lower numbers are merged earlier.

The type 'couchbaseCheckpoint' first has a '_source' setting which causes only the 'doc' (and not the 'meta') part of documents of this type to be accessible as content. Furthermore, a dynamic template is created which is named 'store_no_index'. This template matches every field of the type 'couchbaseCheckpoint' and causes it not to be stored, not to be indexed and not to be made accessible via the '_all' field. BUT you should keep in mind that the content of the 'doc' part of such a JSON document is still accessible as the 'source'.

The '_default_' mapping overrides the global one in this case. The '_source' setting causes that, by default, just the content of the 'meta' part of the incoming JSON document is stored as the 'source'.

The meta property of a document is mapped to the type 'object' and it is not included in the '_all' field. Because the other defaults are not overridden, it is still indexed and stored. The same applies to all other fields: they are indexed and stored.

What may be interesting is the fact that Couchbase transfers documents by having them indexed as the 'couchbaseDocument' type. So the '_default_' section especially configures how Couchbase documents are handled.

Sunday, 13 April 2014

How to use Couchbase Lite with Android

This article just explains what you need to do in order to use Couchbase Lite in your Android app. It more or less follows the tutorial in the documentation, summarizing it a bit.

So what you need first is an Android project. I used Android Studio (and Gradle as the build framework). The build script looks like the following one:


Build script
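(A minimal sketch of the relevant parts; the exact version number and repository URL may differ.)

    // build.gradle (module level)
    repositories {
        mavenCentral()
        maven { url "http://files.couchbase.com/maven2/" }
    }

    dependencies {
        compile 'com.couchbase.lite:couchbase-lite-android:1.0.0'
    }

    android {
        // Couchbase Lite ships duplicate license files which would
        // otherwise break the packaging step
        packagingOptions {
            exclude 'META-INF/LICENSE'
            exclude 'META-INF/ASL2.0'
        }
    }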

It shows which additional dependencies are required. Additionally, you should download the following library and save it under the 'libs' directory of your project.


Additional dependency

After you have synchronized your Gradle project (all the other additional dependencies will be downloaded from the Maven repositories), you should be able to use the Couchbase Lite classes in your Android application. Let's begin with the layout definition of our mobile app:

Layout definition
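(A minimal sketch with a single text field; the ids are made up.)

    <LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:orientation="vertical">

        <TextView
            android:id="@+id/textView"
            android:layout_width="wrap_content"
            android:layout_height="wrap_content" />

    </LinearLayout>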

Our example app just connects to a database (by creating it) and then stores a document with a specific id. We then retrieve the document again in order to show some data in a text field. Following is the Java code for it:

Main class
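(A minimal sketch; the database name, document id and field values are made up, and error handling is reduced to showing the message.)

    import java.util.HashMap;
    import java.util.Map;

    import android.app.Activity;
    import android.os.Bundle;
    import android.widget.TextView;

    import com.couchbase.lite.Database;
    import com.couchbase.lite.Document;
    import com.couchbase.lite.Manager;
    import com.couchbase.lite.android.AndroidContext;

    public class MainActivity extends Activity {

        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.activity_main);
            TextView textView = (TextView) findViewById(R.id.textView);

            try {
                // Connect to (and thereby create) the database
                Manager manager = new Manager(new AndroidContext(this), Manager.DEFAULT_OPTIONS);
                Database database = manager.getDatabase("exampledb");

                // Store a document with a specific id
                Document doc = database.getDocument("user::1");
                Map<String, Object> props = new HashMap<String, Object>();
                props.put("name", "Bob");
                doc.putProperties(props);

                // Retrieve the document again and show some data
                Document retrieved = database.getDocument("user::1");
                textView.setText("Name: " + retrieved.getProperty("name"));
            } catch (Exception e) {
                textView.setText("Error: " + e.getMessage());
            }
        }
    }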

The result is a text field which shows the data of the stored document.



Couchbase Lite can do more than just store the local data of your app. The data can also be synchronized with a Couchbase Server Cluster. Another blog post will cover this subject.