Saturday, 14 December 2013

Who are the players?

This post gives a short overview of the NoSQL market. It does not aim to be complete, but it should give you an idea of the different kinds of NoSQL approaches. As we will see, there is no single NoSQL system that fits every case; each approach has advantages and disadvantages.

Key-Value Stores

Key-Value Stores are as simple as they sound. If you know how a Java Map works, then you basically know the concept behind them. They allow you to store, update and retrieve values by key. The key is often a string (or a number, or a date). Because there does not have to be a specific container (like a table in the relational world), keys may follow a naming convention (e.g. 'houses:name:1' to store the name of the first house). Document-oriented database systems can be seen as a special kind of Key-Value Store in which the value has the format of a JSON document. A major vendor seems to be:
  • Riak: This is a distributed Key-Value Store in which a value can be plain text, JSON, XML or even binary data. It is distributed without a single point of failure (via Consistent Hashing). Databases are called 'Buckets'. It supports Map-Reduce.
Advantages are:
  • Highly scalable via clustering
  • Easy to use
  • Some of them are in-memory and so really fast
  • Values can be stored in a format which is directly usable by Web Applications (JSON, HTML, ...)
  • Access via RESTful services
Disadvantages are:
  • In general, a Key-Value Store does not support links between objects (unless you use a more powerful value format like XML or JSON).
  • Some Key-Value Stores do not really allow queries by attribute. The only thing you can do is access the data by key. So, to find data more easily, you may need to build your own help structures within the store.
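The key convention and the "help structure" idea above can be sketched in a few lines. This is only an illustration: a plain Python dict stands in for the store, and the key layout 'houses:by_name:...' as well as the function names are assumptions made for this example, not part of any particular product.

```python
# A plain dict stands in for the key-value store (assumption for this sketch).
store = {}

def put_house(house_id, name):
    """Store a house name under a conventional key and maintain a name index."""
    store[f"houses:name:{house_id}"] = name
    # Help structure: a second key that maps a name back to the matching house ids.
    index_key = f"houses:by_name:{name}"  # hypothetical key layout
    store.setdefault(index_key, []).append(house_id)

def find_house_ids_by_name(name):
    """Find house ids by name via the help structure, without scanning all keys."""
    return store.get(f"houses:by_name:{name}", [])

put_house(1, "Villa Rosa")
put_house(2, "Seaside Cottage")
put_house(3, "Villa Rosa")

print(store["houses:name:1"])
print(find_house_ids_by_name("Villa Rosa"))
```

The price of such a help structure is that the application itself has to keep it in sync on every write, which is exactly the work a store with built-in indexes would take over.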

Document-Oriented

The difference from Key-Value Stores is, more or less, that Document-Oriented database systems focus on JSON documents. Because the format is now restricted, it becomes easier to perform ad hoc queries. In general you can also build indexes as alternative access paths to your data. So they address Web Application developers (JavaScript is currently the most widely used language for building Web Applications). Most JavaScript frameworks allow you to query data via HTTP (RESTful), and because JSON simply means JavaScript Object Notation, such a framework can immediately work with the data which is provided by the database system. (Maybe you remember: A few years ago MySQL and PHP were the dream team for web development, because PHP modules for accessing a MySQL database were available out of the box. This is kind of similar. I think a JavaScript developer just has to love these Document-Oriented Stores.) There are, for instance, the following players on the market:
  • MongoDB: MongoDB supports distribution. A container is called a 'collection'. You can simply add JSON documents (db.collection_name.insert($JSON_DATA)). Keys can be generated automatically. A find command allows you to list every document of a collection. You can define functions as shortcuts in order to filter the data. Queries work with a kind of pattern matching against the stored JSON documents. So to find every document with the name set to 'Max' you just filter for '{name:"Max"}'. It also supports Map-Reduce. The cluster setup is comparatively complicated.
  • Couchbase: It is also a highly scalable Document Store with a focus on Web Application development, but also on bigger scenarios (from the smartphone to the datacenter). In earlier versions no ad hoc queries were possible, because it has its roots in a Key-Value Store, but it now has full JSON support. It comes with a useful Web Interface. You can define Views. Views are defined by using Map-Reduce. A View provides an alternative ... view of your data. An example would be to define a View to find a specific document by its name property. Views can be queried (Min, Max, ...). So a View can be used to find specific data faster if you do not already know its keys.
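To make the two ideas above a bit more concrete, here is a minimal sketch of pattern-matching queries and of a view by name. It runs on a plain Python list of dicts, not on MongoDB or Couchbase; the functions find and build_view and the sample documents are assumptions made for this illustration and are not the APIs of those products.

```python
# Sample documents; the '_id' field plays the role of the document key (assumption).
documents = [
    {"_id": 1, "name": "Max", "city": "Berlin"},
    {"_id": 2, "name": "Anna", "city": "Munich"},
    {"_id": 3, "name": "Max", "city": "Hamburg"},
]

def find(collection, pattern):
    """Return every document whose fields equal all fields given in the pattern."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in pattern.items())]

def build_view(collection, field):
    """Map each value of `field` to the ids of the documents that carry it."""
    view = {}
    for doc in collection:
        if field in doc:
            view.setdefault(doc[field], []).append(doc["_id"])
    return view

matches = find(documents, {"name": "Max"})  # pattern-matching style query
by_name = build_view(documents, "name")     # alternative access path by name

print([doc["_id"] for doc in matches])
print(by_name["Anna"])
```

An empty pattern matches every document, which corresponds to listing the whole collection. The view is computed once and then answers lookups by name without touching the documents again.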
Advantages of Document-Oriented databases are:
  • Highly scalable via clustering
  • Still easy to use
  • Indexing and ad hoc queries
  • Flexible because more or less schemaless
  • E.g. Couchbase is in-memory and so really fast
  • Values are stored to be directly consumable by Web Applications or Mobile Apps
  • In most cases, the data is accessible via RESTful services
Disadvantages are:
  • Schemaless also means no constraints (no value constraints, no referential constraints)

Graph Databases

It should be quite easy for me to write something about GraphDBs because I designed the persistent storage engine for the 'sones GraphDB' some years ago. In summary, GraphDBs focus on the storage and processing of mathematical graphs, enriched with the ability to store data on the nodes and edges. A graph database is especially the right choice if you have highly connected data and want to describe the connections between the data more efficiently. Finding data usually means following edges in the graph or traversing the graph. The cool thing about graph databases is that graph algorithms work directly on the stored data. So you can ask for the shortest path from node x to node y. You can build the spanning tree of the graph or you can perform network optimizations. A lot of these algorithms are more generally usable than you would imagine. Even if the data model is fairly easy for humans to understand, an application developer might find it hard to see all these advantages out of the box. There are also some challenges regarding the distribution of the data. Because the data is highly connected, you would need to decide by which criteria partitions can be built (algorithms are available), but the result may be that you have one really big partition and some small ones, which would not scale perfectly.
  • Neo4j: I think they are currently the biggest player in this field. Neo4j provides distributed graphs, several graph algorithms, graph visualization and a graph API.
  • InfiniteGraph
  • Titan
  • (sones GraphDB: No longer on the market.)
Advantages are:
  • Very fast for connected data
  • Out of the box algorithms for processing data
  • Access via RESTful services
  • Intuitive visualization of data as networks
  • Index-free adjacency (you just need one starting point to explore the data close to it; this approach can be iterated via traversals)
Disadvantages are:
  • Harder to distribute in a useful way (regarding performance increase)
  • The API is more complex than for Key-Value Stores
  • Not that useful for scenarios where less connected data is stored
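As an illustration of "asking for the shortest path from node x to node y", here is a minimal sketch of a breadth-first search over an adjacency list. It assumes an unweighted, directed graph held in a plain Python dict; the node names and the function shortest_path are made up for this example and do not reflect the API of Neo4j or any other product.

```python
from collections import deque

# Adjacency list: each node points directly to its neighbours (sample data).
graph = {
    "x": ["a", "b"],
    "a": ["c"],
    "b": ["c", "y"],
    "c": ["y"],
    "y": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a path with the fewest edges, or None."""
    if start == goal:
        return [start]
    predecessors = {start: None}  # also serves as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in predecessors:
                predecessors[neighbor] = node
                if neighbor == goal:
                    # Walk the predecessor chain back to the start node.
                    path = [goal]
                    while predecessors[path[-1]] is not None:
                        path.append(predecessors[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None

print(shortest_path(graph, "x", "y"))
```

Note that the search only ever follows edges from one starting node, which is the same idea as the index-free adjacency mentioned above: no global index is consulted, only the neighbours of nodes that were already reached.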

Column-Oriented

The foundation for this kind of database system was provided by Google. They described a storage model named Bigtable, a distributed storage system for structured data. So, as you can see, distribution is in focus again.
  • HBase: You should not mix it up with Hadoop. HBase is the database system, whereas Hadoop is a Map-Reduce framework and a distributed file system on which HBase is built. You will, however, usually meet the two together. HBase is designed for Big Data, so you would usually not use it in smaller scenarios. In HBase, containers are named tables, even if they mean something a bit different. A table is like a HashMap of HashMaps (so it contains key-value pairs, and each value is again a structure which contains key-value pairs). So let's imagine it as a table. A table contains rows and columns. Columns are grouped into column families. Each row is identified by a row key. The data in a row is accessed via a column family, which in turn contains the values of its columns. So let's imagine that you have a car table. The row key column would contain values like "car id_i". There would be a column family named 'parts'. The 'parts' column family contains the values for the columns 'engine', 'tire', ... . An interesting feature is that every insert is versioned by using timestamps. So inserting into the same cell (row, column family, column) means that the old value is kept with an earlier timestamp. Column families allow you to fine-tune the system per column family. So you can make several configuration decisions depending on the characteristics of the data which is stored in a column family. You could therefore interpret a column family as a kind of subcontainer. A typical use case is log file analysis: derive structured data from unstructured data by running Map-Reduce, then store the result in an HBase database.
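The "HashMap of HashMaps" picture with versioned cells can be sketched as follows. This is a toy in-memory model built on nested Python dicts, not the HBase client API; the class Table, its methods, the row key 'car_1' and the sample values are assumptions made for this illustration.

```python
import time

class Table:
    """Toy model: row key -> column family -> column -> list of (timestamp, value)."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, family, column, value, timestamp=None):
        """Add a new version of a cell; older versions are kept."""
        ts = time.time() if timestamp is None else timestamp
        versions = (self.rows.setdefault(row_key, {})
                             .setdefault(family, {})
                             .setdefault(column, []))
        versions.append((ts, value))
        versions.sort(key=lambda version: version[0])  # oldest first

    def get(self, row_key, family, column):
        """Return the value with the latest timestamp, or None if the cell is empty."""
        versions = self.rows.get(row_key, {}).get(family, {}).get(column, [])
        return versions[-1][1] if versions else None

    def history(self, row_key, family, column):
        """Return all (timestamp, value) versions of a cell, oldest first."""
        return list(self.rows.get(row_key, {}).get(family, {}).get(column, []))

cars = Table()
cars.put("car_1", "parts", "engine", "V6", timestamp=1)
cars.put("car_1", "parts", "engine", "V8", timestamp=2)  # same cell, newer version
cars.put("car_1", "parts", "tire", "all-season", timestamp=1)

print(cars.get("car_1", "parts", "engine"))
print(cars.history("car_1", "parts", "engine"))
```

Writing to the same cell twice does not overwrite anything here; the read simply returns the newest version, while the older one stays available under its earlier timestamp.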
Advantages are:
  • Very scalable
  • Access to data on a column level
  • Versioning for free
Disadvantages are:
  • Not very handy, and so not well suited to small usage scenarios

Appendix

This post mentioned Map-Reduce and Consistent Hashing several times. I will blog about these two subjects later.
