Appendix A. NOSQL Overview
Recent years have seen a meteoric rise in the popularity of a family of data storage technologies known as NOSQL (a cheeky acronym for Not Only SQL, or more confrontationally, No to SQL). But NOSQL as a term defines what those data stores are not (they're not SQL-centric relational databases) rather than what they are, which is an interesting and useful set of storage technologies whose operational, functional, and architectural characteristics are many and varied.
Why were these new databases created? What problems do they address? Here we'll discuss some of the new data challenges that have emerged in the past decade. We'll then look at four families of NOSQL databases, including graph databases.
The Rise of NOSQL
Historically, most enterprise-level web apps ran on top of a relational database. But in the past decade, we've been faced with data that is bigger in volume, changes more rapidly, and is more structurally varied than can be dealt with by traditional RDBMS deployments. The NOSQL movement has arisen in response to these challenges.
It's no surprise that as storage has increased dramatically, volume has become the principal driver behind the adoption of NOSQL stores by organizations. Volume may be defined simply as the size of the stored data.
As is well known, large datasets become unwieldy when stored in relational databases; in particular, query execution times increase as the size of tables and the number of joins grow (so-called join pain). This isn't the fault of the databases themselves; rather, it is an aspect of the underlying data model, which builds a set of all possible answers to a query before filtering to arrive at the correct solution.
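To make the cost of that conceptual model concrete, here is a minimal sketch of a join that builds every possible pairing of rows before filtering. The function `naive_join`, the table contents, and the column names are hypothetical, chosen only for illustration; real relational engines use indexes and more refined join strategies, so this shows the shape of the problem rather than how any particular database executes a query.

```python
from itertools import product


def naive_join(left, right, key):
    """Join two lists of dicts by building every pairing, then filtering."""
    # Conceptual model: the set of all possible answers comes first...
    candidates = list(product(left, right))
    # ...and only afterwards is it filtered down to the matching rows.
    matches = [{**l, **r} for l, r in candidates if l[key] == r[key]]
    return matches, len(candidates)


# Hypothetical tables: 100 users and 1,000 orders.
users = [{"user_id": i, "name": f"user{i}"} for i in range(100)]
orders = [{"user_id": i % 100, "order_id": i} for i in range(1000)]

rows, examined = naive_join(users, orders, "user_id")
print(f"{len(rows)} matching rows out of {examined} candidate pairings")
```

The number of candidate pairings is the product of the table sizes, so each additional joined table multiplies the work again, which is one way to picture why join-heavy queries slow down as data grows.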
In an effort to avoid joins and join pain, and thereby cope better with extremely large datasets, the NOSQL world has adopted several alternatives to the relational model. Though more adept at dealing with very large datasets, these alternative models tend to be less expressive than the relational one (with the exception of the graph model, which is actually more expressive).
But volume isn't the only problem modern web-facing systems have to deal with. Besides being big, today's data often changes very rapidly. Velocity is the rate at which data changes over time.
Velocity is rarely a static metric: internal and external changes to a system and the context in which it is employed can have considerable impact on velocity. Coupled with high volume, variable velocity requires data stores to not only handle sustained levels of high write loads, but also deal with peaks.
There is another aspect to velocity, which is the rate at which the structure of the data changes. In other words, in addition to the value of specific properties changing, the overall structure of the elements hosting those properties can change as well. This commonly occurs for two reasons. The first is fast-moving business dynamics: as the business changes, so do its data needs. The second is that data acquisition is often an experimental affair: some properties are captured just in case, others are introduced at a later point based on changed needs; the ones that prove valuable to the business stay around, others fall by the wayside. Both these forms of velocity are problematic in the relational world, where high write loads translate into a high processing cost, and high schema volatility has a high operational cost.
Although commentators have later added other useful requirements to the original quest for scale, the final key aspect is the realization that data is far more varied than the data we've dealt with in the relational world. For existential proof, think of all those nulls in our tables and the null checks in our code. This has driven out the final widely agreed upon facet, variety, which we define as the degree to which data is regularly or irregularly structured, dense or sparse, connected or disconnected.
ACID versus BASE
When we first encounter NOSQL we often consider it in the context of what many of us are already familiar with: relational databases. Although we know the data and query model will be different (after all, there's no SQL!), the consistency models used by NOSQL stores can also be quite different from those employed by relational databases. Many NOSQL databases use different consistency models to support the differences in volume, velocity, and variety of data discussed earlier.
Let's explore what consistency features are available to help keep data safe, and what trade-offs are involved when using (most) NOSQL stores.
In the relational database world, we're all familiar with ACID transactions, which have been the norm for some time. The ACID guarantees provide us with a safe environment in which to operate on data:
Atomic: All operations in a transaction succeed or every operation is rolled back.
Consistent: On transaction completion, the database is structurally sound.
Isolated: Transactions do not contend with one another. Contentious access to state is moderated by the database so that transactions appear to run sequentially.
Durable: The results of applying a transaction are permanent, even in the presence of failures.
These properties mean that once a transaction completes, its data is consistent (so-called write consistency) and stable on disk (or disks, or indeed in multiple distinct memory locations). This is a wonderful abstraction for the application developer, but requires sophisticated locking, which can cause logical unavailability, and is typically considered to be a heavyweight pattern for most use cases.
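As a small illustration of the atomic and consistent guarantees, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. The `account` table, the `transfer` and `balances` helpers, and the balance rule are assumptions made for this example, not part of any system discussed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint; the credit was undone too.
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM account"))


print(transfer(conn, "alice", "bob", 30))   # within alice's balance
print(transfer(conn, "alice", "bob", 500))  # would overdraw alice
print(balances(conn))
```

The second transfer credits one account before the debit fails, yet the credit never becomes visible: the whole transaction is rolled back, so the application never has to clean up a half-applied change.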
For many domains, ACID transactions are far more pessimistic than the domain actually requires. In the NOSQL world, ACID transactions have gone out of fashion as stores loosen the requirements for immediate consistency, data freshness, and accuracy in order to gain other benefits, like scale and resilience. Instead of using ACID, the term BASE has arisen as a popular way of describing the properties of a more optimistic storage strategy.
Basic availability: The store appears to work most of the time.
Soft-state: Stores don't have to be write-consistent, nor do different replicas have to be mutually consistent all the time.
Eventual consistency: Stores exhibit consistency at some later point (e.g., lazily at read time).
The BASE properties are considerably looser than the ACID guarantees, and there is no direct mapping between them. A BASE store values availability (because that is a core building block for scale), but does not offer guaranteed consistency of replicas at write time. BASE stores provide a less strict assurance: that data will be consistent in the future, perhaps at read time (e.g., Riak), or will always be consistent, but only for certain processed past snapshots (e.g., Datomic).
Given such loose support for consistency, we as developers need to be more knowledgeable and rigorous when considering data consistency. We must be familiar with the BASE behavior of our chosen stores and work within those constraints. At the application level we must choose on a case-by-case basis whether we will accept potentially inconsistent data, or whether we will instruct the database to provide consistent data at read time, incurring the latency penalty that this implies. (In order to guarantee consistent reads, the database will need to compare all replicas of a data element, and in the case of an inconsistent outcome even perform remedial repair work on that data.) From a development perspective this is a far cry from the simplicity of relying on transactions to manage consistent state on our behalf, and though that's not necessarily a bad thing, it does require effort.
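The choice between a fast but possibly stale read and a consistent read with repair can be sketched as a toy in-memory model. Everything here is hypothetical: the `EventuallyConsistentStore` class, its method names, and the single counter used as a version stand in for whatever replication and versioning scheme a real store uses, and do not reflect the API of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    data: dict = field(default_factory=dict)  # key -> (version, value)


class EventuallyConsistentStore:
    """Toy model of a replicated store with read-time repair."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.clock = 0  # assumption: one global counter orders all writes

    def write(self, key, value, reachable=None):
        """Apply the write only to reachable replicas (default: all)."""
        self.clock += 1
        targets = range(len(self.replicas)) if reachable is None else reachable
        for i in targets:
            self.replicas[i].data[key] = (self.clock, value)

    def read_fast(self, key, replica=0):
        """Ask a single replica: low latency, but possibly stale."""
        _version, value = self.replicas[replica].data.get(key, (0, None))
        return value

    def read_consistent(self, key):
        """Compare all replicas, return the newest value, repair stale copies."""
        copies = [r.data.get(key, (0, None)) for r in self.replicas]
        newest = max(copies, key=lambda copy: copy[0])
        for replica, copy in zip(self.replicas, copies):
            if copy[0] < newest[0]:
                replica.data[key] = newest  # remedial repair work
        return newest[1]


store = EventuallyConsistentStore()
store.write("cart", ["book"])
store.write("cart", ["book", "pen"], reachable=[0, 1])  # replica 2 misses this

stale = store.read_fast("cart", replica=2)     # served from the lagging replica
fresh = store.read_consistent("cart")          # consults every replica
repaired = store.read_fast("cart", replica=2)  # same replica after the repair
print(stale, fresh, repaired)
```

The consistent read touches every replica and may write to some of them, which is where the latency penalty comes from; the fast read avoids that cost but leaves it to the application to tolerate stale answers.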