Foreword
IT'S BEEN OVER A DECADE SINCE THE FIRST WEBSITES REACHED REAL SCALE. We were there then, in those early days, watching our sites growing faster than anyone had seen before or knew how to manage. It was up to us to figure out how to keep everything running, to make things happen, to get things done.
While everyone else was at the launch party, we were deep in the bowels of the datacenter racking and stacking the last servers. Then we sat at our desks late into the night, our faces lit with the glow of logfiles and graphs streaming by.
Our experiences were universal: Our software crashed or couldn't scale. The databases crashed and data was corrupted, while every server, disk, and switch failed in ways the manufacturer absolutely, positively said it wouldn't. Hackers attacked, first for fun and then for profit. And just when we got things working again, a new feature would be pushed out, traffic would spike, and everything would break all over again.
In the early days, we used what we could find because we had no budget. Then we grew from mismatched, scavenged machines hidden in closets to megawatt-scale datacenters spanning the globe filled with the cheapest machines we could find.
As we got to scale, we had to deal with the real world and its many dangers. Our datacenters caught fire, flooded, or were ripped apart by hurricanes. Our power failed. Generators didn't kick in, or started and then ran out of fuel, or were taken down when someone hit the Emergency Power Off. Cooling failed. Sprinklers leaked. Fiber was cut by backhoes and squirrels and strange creatures crawling along the seafloor.
Man, machine, and Mother Nature challenged us in every way imaginable and then surprised us in ways we never expected.
We worked from the instant our pagers woke us up or when a friend innocently inquired, "is the site down?" or when the CEO called scared and furious. We were always the first ones to know it was down and the last to leave when it was back up again.
Always.
Every day we got a little smarter, a little wiser, and learned a few more tricks. The scripts we wrote a decade ago have matured into tools and languages of their own, and whole industries have emerged around what we do. The knowledge, experiences, tools, and processes are growing into an art we call Web Operations.
We say that Web Operations is an art, not a science, for a reason. There are no standards, certifications, or formal schooling (at least not yet). What we do takes a long time to learn and longer to master, and everyone at every skill level must find his or her own style. There's no "right way," only what works (for now) and a commitment to doing it even better next time.
The web is changing the way we live and touches every person alive. As more and more people depend on the web, they depend on us.
Web Operations is work that matters.
Jesse Robbins
The contributors to this book have donated their payments to the 826 Foundation, which helps kids learn to love reading at places like the Superhero Supply Company, the Greenwood Space Travel Supply Company, and the Liberty Street Robot Supply & Repair Shop.
Preface
DESIGNING, BUILDING, AND MAINTAINING A GROWING WEBSITE has unique challenges when it comes to the fields of systems administration and software development. For one, the Web never sleeps. Because websites are globally used, there is no "good" time for changes, upgrades, or maintenance windows, only fewer "bad" times. This also means that outages are guaranteed to affect someone, somewhere using the site, no matter what time it is.
As web applications become an increasing part of our daily lives, they are also becoming more complex. With that complexity comes more parts to build and maintain and, unfortunately, more parts to fail. On top of that, there are requirements for being fast, secure, and always available across the planet. All these things add up to what's become a specialized field of engineering: web operations.
This book was conceived to gather insights into this still-evolving field from web veterans around the industry. Jesse Robbins and I came up with a list of tip-of-iceberg topics and asked these experts for their hard-earned advice and stories from the trenches.
How This Book Is Organized
The chapters in this book are organized as follows:
by Theo Schlossnagle, describes what this field actually encompasses and underscores how the necessary skills are gained more through experience than through formal education.
by Justin Huff, explains how Picnik.com went about deploying and sustaining its infrastructure on a mix of on-premise hardware and cloud services.
by Matt Massie and me, discusses the importance of gathering metrics from both your application and your infrastructure, and considerations on how to gather them.
by Eric Ries, gives his take on the advantages of deploying code to production in small batches, frequently.
by Adam Jacob, gives an overview of the theory behind, and approaches to, configuration and deployment management.
by Patrick Debois, discusses the various considerations when designing a monitoring system.
, is Dr. Richard Cook's whitepaper on systems failure and the nature of complexity that is often found in web architectures. He also adds some web operations-specific notes to his original paper.
, is my interview with Heather Champ on the topic of how outages and degradations should be handled on the human side of things.
by Brian Moon, talks about Dealnews.com's experiences with huge traffic deluges and what the team did to mitigate disaster.
by Paul Hammond, lists some of the places where development and operations can come together to enable the business, both technically and culturally.
by Alistair Croll and Sean Power, discusses metrics that can be used to illustrate what the real experience of your site is.
by Baron Schwartz, lays out common approaches to database architectures and some pitfalls that come with increasing scale.
by Jake Loomis, goes into what makes or breaks a good postmortem and root cause analysis process.