Preface
Distributed database systems are an integral part of most businesses and the vast majority of software applications. These applications provide logic and a user interface, while database systems take care of data integrity, consistency, and redundancy.
Back in 2000, if you were to choose a database, you would have just a few options, and most of them would be within the realm of relational databases, so differences between them would be relatively small. Of course, this does not mean that all databases were completely the same, but their functionality and use cases were very similar.
Some of these databases have focused on horizontal scaling (scaling out)improving performance and increasing capacity by running multiple database instances acting as a single logical unit: Gamma Database Machine Project, Teradata, Greenplum, Parallel DB2, and many others. Today, horizontal scaling remains one of the most important properties that customers expect from databases. This can be explained by the rising popularity of cloud-based services. It is often easier to spin up a new instance and add it to the cluster than scaling vertically (scaling up) by moving the database to a larger, more powerful machine. Migrations can be long and painful, potentially incurring downtime.
Around 2010, a new class of eventually consistent databases started appearing, and terms such as NoSQL, and later, big data grew in popularity. Over the last 15 years, the open source community, large internet companies, and database vendors have created so many databases and tools that its easy to get lost trying to understand use cases, details, and specifics.
The Dynamo paper , published by the team at Amazon in 2007, had so much impact on the database community that within a short period it inspired many variants and implementations. The most prominent of them were Apache Cassandra, created at Facebook; Project Voldemort, created at LinkedIn; and Riak, created by former Akamai engineers.
Today, the field is changing again: after the time of key-value stores, NoSQL, and eventual consistency, we have started seeing more scalable and performant databases, able to execute complex queries with stronger consistency guarantees.
Audience of This Book
In conversations at technical conferences, I often hear the same question: How can I learn more about database internals? I dont even know where to start. Most of the books on database systems do not go into details of storage engine implementation, and cover the access methods, such as B-Trees, on a rather high level. There are very few books that cover more recent concepts, such as different B-Tree variants and log-structured storage, so I usually recommend reading papers.
Everyone who reads papers knows that its not that easy: you often lack context, the wording might be ambiguous, theres little or no connection between papers, and theyre hard to find. This book contains concise summaries of important database systems concepts and can serve as a guide for those whod like to dig in deeper, or as a cheat sheet for those already familiar with these concepts.
Not everyone wants to become a database developer, but this book will help people who build software that uses database systems: software developers, reliability engineers, architects, and engineering managers.
If your company depends on any infrastructure component, be it a database, a messaging queue, a container platform, or a task scheduler, you have to read the project change-logs and mailing lists to stay in touch with the community and be up-to-date with the most recent happenings in the project. Understanding terminology and knowing whats inside will enable you to yield more information from these sources and use your tools more productively to troubleshoot, identify, and avoid potential risks and bottlenecks. Having an overview and a general understanding of how database systems work will help in case something goes wrong. Using this knowledge, youll be able to form a hypothesis, validate it, find the root cause, and present it to other project maintainers.
This book is also for curious minds: for the people who like learning things without immediate necessity, those who spend their free time hacking on something fun, creating compilers, writing homegrown operating systems, text editors, computer games, learning programming languages, and absorbing new information.
The reader is assumed to have some experience with developing backend systems and working with database systems as a user. Having some prior knowledge of different data structures will help to digest material faster.
Why Should I Read This Book?
We often hear people describing database systems in terms of the concepts and algorithms they implement: This database uses gossip for membership propagation (see ).
We need abstractions to discuss complex concepts, and we cant have a discussion about terminology every time we start a conversation. Having shortcuts in the form of common language helps us to move our attention to other, higher-level problems.
One of the advantages of learning the fundamental concepts, proofs, and algorithms is that they never grow old. Of course, there will always be new ones, but new algorithms are often created after finding a flaw or room for improvement in a classical one. Knowing the history helps to understand differences and motivation better.
Learning about these things is inspiring. You see the variety of algorithms, see how our industry was solving one problem after the other, and get to appreciate that work. At the same time, learning is rewarding: you can almost feel how multiple puzzle pieces move together in your mind to form a full picture that you will always be able to share with others.