1. Why NoSQL?
From our school days, most of us are taught to structure information so that it can be represented in tabular form. But not all information fits that structure, hence the existence of NULL values. A NULL value represents a cell without information. To avoid NULLs, we must split one table into several, which introduces the concept of normalization. In normalization, we split the tables according to the level of normalization we select. These levels include 1NF (first normal form), 2NF, 3NF, BCNF (Boyce-Codd normal form, or 3.5NF), 4NF, and 5NF. Every level dictates how the tables are split, and most commonly, people use 3NF, which is largely free of insert, update, and delete anomalies.
To achieve normalization, one must split information into multiple tables and then, at retrieval time, join those tables to make sense of the split information. This approach poses few problems, and it is still well suited to online transaction processing (OLTP).
A system that handles data arriving from multiple data streams, yet must adhere to one defined structure, is extremely difficult to implement and maintain. The volume of data is often huge and mostly unpredictable. In such cases, splitting data into multiple pieces on insert and joining the tables on retrieval adds excessive latency.
We can solve this problem by inserting the data in its natural form. Because little or no transformation is required, latency during inserts, updates, deletes, and retrieval is drastically reduced. Scaling up and scaling out also become quicker and more manageable. Given this flexibility, it is the most appropriate solution for the problem described. The solution is NoSQL, also read as "not only SQL" or "non-relational."
One can further prioritize performance over consistency, a trade-off that a NoSQL solution makes possible and that is framed by the CAP (consistency, availability, and partition tolerance) theorem. In this chapter, I will discuss NoSQL, its diverse types, its comparison with relational database management systems (RDBMS), and its future applications.
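As a rough illustration of the split-then-join pattern described above, the following sketch uses Python's built-in sqlite3 module. The customer and phone tables, their columns, and the sample values are hypothetical and chosen only for this example.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE phone (
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        number TEXT NOT NULL
    );
""")

# Ann has two phone numbers and Bob has none. Keeping phones in their
# own table means no customer row needs a NULL phone cell.
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO phone VALUES (?, ?)",
                 [(1, "555-0100"), (1, "555-0101")])
conn.commit()

# Retrieval has to join the tables back together.
rows = conn.execute("""
    SELECT c.name, p.number
    FROM customer AS c
    JOIN phone AS p ON p.customer_id = c.id
    ORDER BY p.number
""").fetchall()
print(rows)
```

The join is cheap here, but every additional table in the split adds another join at read time, which is the cost the following paragraphs are concerned with.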
Types of NoSQL
In NoSQL, data can be represented in multiple forms. Of the many forms that exist, the most commonly used are key-value, columnar, document, and graph. This section summarizes each of them.
Key-Value Pair
This is the simplest data structure form, yet it offers excellent performance. All data is referenced only through keys, which makes retrieval very straightforward. The most popular database in this category is Redis Cache. An example is shown in Table 1-1.
Table 1-1
Key-Value Representation
Key | Value |
---|---|
C1 | XXX XXXX XXXX |
C2 | 123456789 |
C3 | 10/01/2005 |
C4 | ZZZ ZZZZ ZZZZ |
The keys are kept in an ordered list, and a hash map is used to locate them efficiently.
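The idea can be sketched with a plain Python dict, which is itself a hash map. This is only a toy stand-in for a key-value database, not how any particular product is implemented; the put and get helper names are assumptions made for this example, and the data comes from Table 1-1.

```python
# A dict is hash-based, so a lookup by key does not scan other entries.
store = {}

def put(key, value):
    """Store or overwrite the value held under a key."""
    store[key] = value

def get(key, default=None):
    """Return the value for a key, or the default if the key is absent."""
    return store.get(key, default)

put("C1", "XXX XXXX XXXX")
put("C2", "123456789")
put("C3", "10/01/2005")
put("C4", "ZZZ ZZZZ ZZZZ")

print(get("C3"))  # value stored under C3
print(get("C9"))  # missing key falls back to the default
```

Note that the store knows nothing about what a value means; every read and write goes through the key alone.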
Columnar
This type of database stores data by column instead of by row (as an RDBMS does) and is optimized for querying large data sets. It is generally known as a wide column store. Some of the most popular databases in this category are Cassandra and Apache Hadoop's HBase.
Unlike key-value pair databases, columnar databases can store millions of attributes associated with a key, forming a table whose data is stored as columns. However, being a NoSQL database, it has no fixed names or number of columns, which makes it a true schema-free database.
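A very simplified model of this layout is sketched below. It keeps one mapping per column and lets each row key carry a different set of columns. It is a toy model under those assumptions only and does not reflect the actual storage format of Cassandra or HBase; all names and values are hypothetical.

```python
from collections import defaultdict

# column name -> {row key -> value}; each column is stored separately.
columns = defaultdict(dict)

def put(row_key, column, value):
    """Write one cell; the column is created on first use (no fixed schema)."""
    columns[column][row_key] = value

def get_row(row_key):
    """Reassemble a row from the columns that hold a cell for this key."""
    return {col: cells[row_key]
            for col, cells in columns.items() if row_key in cells}

put("user1", "name", "Ann")
put("user1", "city", "Pune")
put("user2", "name", "Bob")
put("user2", "last_login", "2018-01-05")

# Reading one column touches only that column's cells.
print(columns["name"])
print(get_row("user2"))
```

The two rows end up with different columns, and no row stores an empty cell for a column it does not use.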
Document
This type of NoSQL database manages data in the form of documents. Many implementations exist, and they use various types of document representation. Some of the most popular store data as JSON, XML, or BSON. The basic idea of storing data in document form is to retrieve it faster by matching against its meta information (see Figures 1-1 and 1-2).
Figure 1-1
Sample document structure (JSON) code
Figure 1-2
Sample document structure (XML) code
Documents can contain many different forms of data: key-value pairs, key-array pairs, or even nested documents. One of the popular databases in this category is MongoDB.
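A small sketch of such a document, using Python's standard json module, is given below. The field names, the sample values, and the find helper are hypothetical and exist only to illustrate the three forms of data mentioned above; they are not taken from the figures.

```python
import json

# One document mixing the three forms of data described above.
document = {
    "id": "C1",                                        # key-value pair
    "name": "Ann",
    "phones": ["555-0100", "555-0101"],                # key-array pair
    "address": {"city": "Pune", "country": "India"},   # nested document
}

# Serialize to JSON text and read it back, as a document store would.
text = json.dumps(document, indent=2)
restored = json.loads(text)

def find(docs, **criteria):
    """Return the documents whose top-level fields match all criteria."""
    return [d for d in docs
            if all(d.get(k) == v for k, v in criteria.items())]

matches = find([restored], name="Ann")
print(matches[0]["address"]["city"])
```

The whole record travels as one unit, so reading it back requires no join.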
Graph
This type of database stores data in the form of networks, e.g., social connections, family trees, etc. (see Figure 1-3). Its beauty lies in the way it stores the data: using a graph structure for semantic queries and representing it in the form of edges and nodes.
Nodes hold the information that represents an entity, and the relationship (or relationships) between two nodes is defined using edges. In the real world, our relationship to every other individual is different, and these differences can be captured by various attributes at the edge level.
Figure 1-3
Graph form of data representation
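The node and edge model can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a graph database and not the Gremlin API; the people, relations, and the neighbors helper are hypothetical.

```python
# Nodes carry the entity's own information.
nodes = {
    "ann": {"label": "person", "name": "Ann"},
    "bob": {"label": "person", "name": "Bob"},
    "cara": {"label": "person", "name": "Cara"},
}

# Directed edges: (source, target, attributes describing the relationship).
edges = [
    ("ann", "bob", {"relation": "friend", "since": 2010}),
    ("ann", "cara", {"relation": "sibling"}),
    ("bob", "cara", {"relation": "colleague"}),
]

def neighbors(node, relation=None):
    """Targets of the outgoing edges of a node, optionally filtered by relation."""
    return [dst for src, dst, attrs in edges
            if src == node and (relation is None or attrs["relation"] == relation)]

print(neighbors("ann"))
print(neighbors("ann", relation="sibling"))
```

Because the attributes live on the edges, the same two people could be linked by several edges, each describing a different relationship.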
The graph form of data usually follows the standards defined by Apache TinkerPop, and the most popular database in this category is Neo4j (see Figures 1-4a and 1-4b).
Figure 1-4a
Gremlin Query on TinkerPop Console to Fetch All the Records
Figure 1-4b
Result in TinkerPop console
What to Expect from NoSQL
To better understand the need for NoSQL, let's compare it to RDBMS from a transactional standpoint. In an RDBMS, every transaction has certain characteristics, known as ACID: atomicity, consistency, isolation, and durability.
Atomicity
This property ensures that a transaction either completes in full or does not take effect at all. If, for any reason, a transaction fails, the full set of changes made during the course of the transaction is removed. This is called rollback.
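Rollback can be observed with a short sqlite3 sketch. The account table, the CHECK constraint, and the transfer helper are hypothetical and serve only to force a failure halfway through a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        name TEXT PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    )
""")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(amount):
    """Move money from A to B; both updates apply or neither does."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'B'",
                (amount,))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'A'",
                (amount,))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(500)  # the debit would make A negative, so it fails
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, balances)
```

The credit to B succeeds as a statement, but because the debit from A fails, the rollback removes the credit as well and both balances stay as they were.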
Consistency
This property ensures that the system will be in a consistent state after completion of a transaction (failed or successful).
Isolation
This property ensures that every transaction will have exclusivity over the resources, e.g., tables, rows, etc. The reads and writes of the transaction will not be visible to reads and writes of any other transaction.
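The visibility rule can be sketched with two sqlite3 connections to the same database file. This assumes the sqlite3 module's default transaction handling, in which an INSERT opens a transaction that stays open until commit; the file name, the item table, and the count helper are hypothetical, and other database engines achieve isolation through different mechanisms.

```python
import os
import sqlite3
import tempfile

# Two connections to one database file act as two concurrent transactions.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
reader = sqlite3.connect(path)

writer.execute("CREATE TABLE item (name TEXT)")
writer.commit()

def count(conn):
    """Count rows, fetching all results so the read finishes completely."""
    return conn.execute("SELECT COUNT(*) FROM item").fetchall()[0][0]

# The INSERT opens a transaction on the writer that is not yet committed.
writer.execute("INSERT INTO item VALUES ('draft')")
before = count(reader)  # the uncommitted row is not visible to the reader

writer.commit()
after = count(reader)   # once committed, the row becomes visible

print(before, after)
writer.close()
reader.close()
```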
Durability
This property ensures that data is persistent and is not lost during a hardware, power, software, or any other failure. To achieve this, the system logs all the steps performed in the transaction, and the state is re-created from that log whenever required.
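The log-and-replay idea can be sketched as follows. This is a minimal illustration, assuming a one-line-per-write JSON log; the LoggedStore class and the file name are hypothetical, and real database engines use far more elaborate logging and recovery schemes.

```python
import json
import os
import tempfile

class LoggedStore:
    """Toy key-value store that logs every write before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        """Re-create the in-memory state from the log, if one exists."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def put(self, key, value):
        """Append the write to the log, force it to disk, then apply it."""
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

log_path = os.path.join(tempfile.mkdtemp(), "store.log")
store = LoggedStore(log_path)
store.put("C2", "123456789")
store.put("C3", "10/01/2005")

# A new instance stands in for a restart after a failure:
# its state is rebuilt purely from the log on disk.
recovered = LoggedStore(log_path)
print(recovered.data)
```

Writing to the log before updating memory is the design choice that matters here: if the process stops between the two steps, the write is still on disk and will be replayed.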