Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair power structures, to undermine human rights, and to protect vested interests. But they can also be used for good: to make underrepresented people's voices heard, to create opportunities for everyone, and to avert disasters. This book is dedicated to everyone working toward the good.
Computing is pop culture. [...] Pop culture holds a disdain for history. Pop culture is all about identity and feeling like you're participating. It has nothing to do with cooperation, the past or the future; it's living in the present. I think the same is true of most people who write code for money. They have no idea where [their culture came from].
Preface
If you have worked in software engineering in recent years, especially in server-side and backend systems, you have probably been bombarded with a plethora of buzzwords relating to storage and processing of data. NoSQL! Big Data! Web-scale! Sharding! Eventual consistency! ACID! CAP theorem! Cloud services! MapReduce! Real-time!
In the last decade we have seen many interesting developments in databases, in distributed systems, and in the ways we build applications on top of them. There are various driving forces for these developments:
Internet companies such as Google, Yahoo!, Amazon, Facebook, LinkedIn, Microsoft, and Twitter are handling huge volumes of data and traffic, forcing them to create new tools that enable them to efficiently handle such scale.
Businesses need to be agile, test hypotheses cheaply, and respond quickly to new market insights by keeping development cycles short and data models flexible.
Free and open source software has become very successful and is now preferred to commercial or bespoke in-house software in many environments.
CPU clock speeds are barely increasing, but multi-core processors are standard, and networks are getting faster. This means parallelism is only going to increase.
Even if you work on a small team, you can now build systems that are distributed across many machines and even multiple geographic regions, thanks to infrastructure as a service (IaaS) such as Amazon Web Services.
Many services are now expected to be highly available; extended downtime due to outages or maintenance is becoming increasingly unacceptable.
Data-intensive applications are pushing the boundaries of what is possible by making use of these technological developments. We call an application data-intensive if data is its primary challenge (the quantity of data, the complexity of data, or the speed at which it is changing), as opposed to compute-intensive, where CPU cycles are the bottleneck.
The tools and technologies that help data-intensive applications store and process data have been rapidly adapting to these changes. New types of database systems (NoSQL) have been getting lots of attention, but message queues, caches, search indexes, frameworks for batch and stream processing, and related technologies are very important too. Many applications use some combination of these.
The buzzwords that fill this space are a sign of enthusiasm for the new possibilities, which is a great thing. However, as software engineers and architects, we also need to have a technically accurate and precise understanding of the various technologies and their trade-offs if we want to build good applications. For that understanding, we have to dig deeper than buzzwords.
Fortunately, behind the rapid changes in technology, there are enduring principles that remain true, no matter which version of a particular tool you are using. If you understand those principles, you're in a position to see where each tool fits in, how to make good use of it, and how to avoid its pitfalls. That's where this book comes in.
The goal of this book is to help you navigate the diverse and fast-changing landscape of technologies for processing and storing data. This book is not a tutorial for one particular tool, nor is it a textbook full of dry theory. Instead, we will look at examples of successful data systems: technologies that form the foundation of many popular applications and that have to meet scalability, performance, and reliability requirements in production every day.
We will dig into the internals of those systems, tease apart their key algorithms, and discuss their principles and the trade-offs they have to make. On this journey, we will try to find useful ways of thinking about data systems: not just how they work, but also why they work that way, and what questions we need to ask.
After reading this book, you will be in a great position to decide which kind of technology is appropriate for which purpose, and understand how tools can be combined to form the foundation of a good application architecture. You won't be ready to build your own database storage engine from scratch, but fortunately that is rarely necessary. You will, however, develop a good intuition for what your systems are doing under the hood so that you can reason about their behavior, make good design decisions, and track down any problems that may arise.
Who Should Read This Book?
If you develop applications that have some kind of server/backend for storing or processing data, and your applications use the internet (e.g., web applications, mobile apps, or internet-connected sensors), then this book is for you.
This book is for software engineers, software architects, and technical managers who love to code. It is especially relevant if you need to make decisions about the architecture of the systems you work on (for example, if you need to choose tools for solving a given problem and figure out how best to apply them). But even if you have no choice over your tools, this book will help you better understand their strengths and weaknesses.