Foreword
Jeremy Edberg, Information Cowboy, December 2012
In mid-2008, I was handling operations for reddit.com, an online community for sharing and discussing links, serving a few tens of millions of page views per month. At the time, we were hosting the whole site on 21 1U HP servers (in addition to four of the original servers for the site) in two racks in a San Francisco data center. Around that time, Steve, one of the founders of reddit, came to me and suggested I check out this AWS thing that his buddies at Justin.tv had been using with some success; he thought it might be good for us, too. I set up a VPN; we copied over a set of our data, and started using it for batch processing.
In early 2009, we had a problem: we needed more servers for live traffic, and we had to make a choice between building out another rack of servers and moving to AWS. We chose the latter, partly because we didn't know what our growth was going to look like, and partly because it gave us enormous flexibility for resiliency and redundancy by offering multiple availability zones, as well as multiple regions if we ever got to that point. Also, I was tired of running to the data center every time a disk failed, a fan died, a CPU melted, etc.
When designing any architecture, one of the first assumptions one should make is that any part of the system can break at any time. AWS is no exception. Instead of fearing this failure, one must embrace it. At reddit, one of the things we got right with AWS from the start was making sure that we had copies of our data in at least two zones. This proved handy during the great EBS outage of 2011. While we were down for a while, it was for a lot less time than most sites, in large part because we were able to spin up our databases in the other zone, where we kept a second copy of all of our data. If not for that, we would have been down for over a day, like all the other sites in the same situation.
During that EBS outage, I, like many others, watched Netflix, also hosted on AWS. It is said that if you're on AWS and your site is down, but Netflix is up, it's probably your fault you are down. It was that reputation, among other things, that drew me to move from reddit to Netflix, which I did in July 2011. Now that I'm responsible for Netflix's uptime, it is my job to help the company maintain that reputation.
Netflix requires a superior level of reliability. With tens of thousands of instances and 30 million-plus paying customers, reliability is absolutely critical. So how do we do it? We expect the inevitable failure, plan for it, and even cause it sometimes. At Netflix, we follow our monkey theory: we simulate things that go wrong and find things that are different. And thus was born the Simian Army, our collection of agents that constructively muck with our AWS environment to make us more resilient to failure.
The most famous of these is the Chaos Monkey, which kills random instances in our production account, the same account that serves actual, live customers. Why wait for Amazon to fail when you can induce the failure yourself, right? We also have the Latency Monkey, which induces latency on connections between services to simulate network issues. We have a whole host of other monkeys too (most of them available on GitHub).
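To make the idea concrete, here is a minimal sketch (not Netflix's actual implementation) of a Chaos-Monkey-style agent that terminates one random running EC2 instance. The boto3 library, the us-east-1 region, and the chaos=optin tag are assumptions made purely for illustration; a filter like that opt-in tag is what keeps such a script from touching anything that hasn't agreed to be killed.

    import random

    import boto3  # assumption: AWS SDK for Python, with credentials already configured

    # Minimal Chaos-Monkey-style sketch: pick one random running instance
    # that has opted in via a hypothetical "chaos=optin" tag and terminate it.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag:chaos", "Values": ["optin"]},
        ]
    )["Reservations"]

    candidates = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]

    if candidates:
        victim = random.choice(candidates)
        print("Terminating %s to exercise our failure handling" % victim)
        ec2.terminate_instances(InstanceIds=[victim])
    else:
        print("No opted-in instances running; nothing to terminate.")

The essence of the approach is that failure is induced on your terms, while engineers are watching, rather than on Amazon's.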
The point of the Monkeys is to make sure we are ready for any failure modes. Sometimes it works and we avoid outages, and sometimes new failures come up that we haven't planned for. In those cases, our resiliency systems are truly tested, making sure they are generic and broad enough to handle the situation.
One failure that we weren't prepared for came in June 2012. A severe storm hit Amazon's complex in Virginia, and they lost power to one of their data centers (a.k.a. Availability Zones). Due to a bug in the mid-tier load balancer that we wrote, we did not route traffic away from the affected zone, which caused a cascading failure. This failure, however, was our fault, and we learned an important lesson. The incident also highlighted the need for the Chaos Gorilla, which we successfully ran just a month later, intentionally taking out an entire zone's worth of servers to see what would happen (everything went smoothly). We ran another test of the Chaos Gorilla a few months later and learned even more about what we are doing right and where we could do better.
A few months later, there was another zone outage, this time due to the Elastic Block Store. Although we generally don't use EBS, many of our instances use EBS root volumes. As such, we had to abandon an availability zone. Luckily for us, our previous run of Chaos Gorilla gave us not only the confidence to make the call to abandon a zone, but also the tools to make it quick and relatively painless.
Looking back, there are plenty of other things we could have done to make reddit more resilient to failure, many of which I have learned through ad hoc trial and error, as well as from working at Netflix. Unfortunately, I didn't have a book like this one to guide me. This book outlines in excellent detail exactly how to build resilient systems in the cloud. From the crash course in systems to the detailed instructions on specific technologies, this book includes many of the very same things we stumbled upon as we flailed wildly, discovering solutions to problems. If I had had this book when I was first starting on AWS, I would have saved myself a lot of time and headache, and hopefully you will benefit from its knowledge after reading it.
This book also teaches a very important lesson: to embrace and expect failure, and if you do, you will be much better off.
Preface
Thank you (again) for picking up one of our books! If you have read Programming Amazon EC2, you probably have some expectations about this book.
The idea behind this book came from Mike Loukides, one of our editors. He was fascinated with the idea of resilience and reliability in engineering. At the same time, Amazon Web Services (AWS) had been growing and growing.
As is the case for other systems, AWS does not go without service interruptions. The underlying architecture and available services are designed to help you deal with this. But as outages have shown, this is difficult, especially when you are powering the majority of the popular web services.
So how do we help people prepare? We already have a good book on the basics of engineering on AWS. But it deals with relatively simple applications, built solely from AWS's infrastructural components. What we wanted to show is how to build service components yourself and make them resilient and reliable.
The heart of this book is a collection of services we run in our infrastructures. We'll show things like Postgres and Redis, but also elasticsearch and MongoDB. But before we talk about these, we will introduce AWS and our approach to Resilience and Reliability.
We want to help you weather the next (AWS) outage!
Audience
If Amazon Web Services is new to you, we encourage you to pick up a copy of Programming Amazon EC2. Familiarize yourself with the many services AWS offers. It certainly helps to have worked (or played) with many of them.
Even though many of our components are nothing more than a collection of scripts (bash, Python, Ruby, PHP), don't be fooled. The lack of a development environment does not make it easier to engineer your way out of many problems.
Therefore, we feel this book is probably well-suited for software engineers. We use this term inclusively: not every programmer is a software engineer, and many system administrators are software engineers. But you at least need some experience building complex systems. It helps to have seen more than one programming language. And it certainly helps to have been responsible for operations.