Anatomy of an Incident
Google's Approach to Incident Management for Production Services
Ayelet Sachto & Adrienne Walcer
with Jessie Yang
REPORT
Want to know more about SRE? To learn more, visit https://sre.google
Copyright © 2022 O'Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Acquisition Editor: John Devins
Proofreader: Piper Editorial Consulting, LLC
Development Editor: Virginia Wilson
Interior Designer: David Futato
Production Editor: Beth Kelly
Cover Designer: Karen Montgomery
Copyeditor: Audrey Doyle
Illustrator: Kate Dullea
January 2022: First Edition
Revision History for the First Edition
2022-01-24: First Release
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Anatomy of an Incident, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O'Reilly and Google. See our statement of editorial independence.
978-1-098-11372-8
CHAPTER 1
Introduction
Make no mistake: the coming N weeks are going to be personally and professionally stressful, and at times we will race to keep ahead of events as they unfold. But we have been preparing for crises for over a decade, and we're ready. At a time when people around the world need information, communication, and computation more than ever, we will ensure that Google is there to help them.
Benjamin Treynor Sloss, Vice President, Engineering,
Google's Site Reliability Engineering Team, March 3, 2020
Failure is an inevitability (kind of depressing, we know). As scientists and engineers, you look at problems on the long scale and design systems to be optimally sustainable, scalable, reliable, and secure. But you're designing systems with only the knowledge you currently have. And when implementing solutions, you do so without having complete knowledge of the future. You can't always anticipate the next zero-day event, viral media trend, weather disaster, config management error, or shift in technology. Therefore, you need to be prepared to respond when these things happen and affect your systems.
One of Google's biggest technical challenges of the decade was brought on by the COVID-19 pandemic. The pandemic created a series of rapidly emerging incidents that we needed to mitigate in order to continue serving our users. We had to aggressively boost service capacity, pivot our workforce to be productive at home, and build new ways to efficiently repair servers despite supply chain constraints. As the quotation from Ben Treynor Sloss details, Google was able to continue bringing services to the world during this paradigm-shifting sequence of incidents because we had prepared for it.
For more than a decade, Google has proactively invested in incident management. This preparation is the most important thing an organization can do to improve its incident response capability. Preparation builds resilience. And resilience and the ability to handle failure become a key discipline for measuring technological success on the long scale (think decades). Beyond doing the best engineering that you can, you also need to be continually prepared to handle failure when it happens.
Resiliency is one of the critical pillars in company operations. In that regard, incident management is a mandatory company process.
Incidents are expensive, not only in their impact on customers but also in the burden they place on human operators. Incidents are stressful, and they usually demand human intervention. Effective incident management, therefore, prioritizes preventive and proactive work over reactive work.
We know that managing incidents comes with a lot of stress, and finding and training responders is hard; we also know that some accidents are unavoidable and failures happen. Instead of asking "What do you do if an incident happens?" we want to address the question "What do you do when the incident happens?" Reducing the ambiguity in this way not only reduces human toil and responders' stress, it also improves resolution time and reduces the impact on your users.
We wrote this report to be a guide on the practice of technical incident response. We start by building some common language to discuss incidents, and then get into how you encourage engineers, engineering leaders, and executives to think about incident management within the organization. We aim to cover everything from preparing for incidents, responding to incidents, and recovering from incidents to some of that secret glue that maintains a healthy organization which can scalably fight fires. Let's get started.
What Is an Incident?
"Incident" is a loaded term. Its meaning can differ depending on the group using it. In ITIL, for example, an incident is any unplanned interruption, such as a ticket, a bug, or an alert. No matter how the word is used, it's important that you align on a specific definition.
At Google, incidents are issues that:
Are escalated (because they're too big to handle alone)
Require an immediate response
Require an organized response
Sometimes an incident can be caused by an outage, which is a period of service unavailability. Outages can be planned; for example, during a service maintenance window in which your system is intentionally unavailable in order to implement updates. If an outage is planned and communicated to users, it's not an incident: nothing is going on that requires an immediate, organized response. But usually, we'll be referring to unexpected outages caused by unanticipated failures. Most unexpected outages are incidents, or become incidents.
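The three criteria above, together with the planned-outage exception, can be sketched as a small triage check. This is a minimal illustration under stated assumptions, not a description of Google's tooling: the `Issue` class, its field names, and the `is_incident` function are hypothetical names introduced here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical record of a production issue (field names are assumptions)."""
    escalated: bool                  # too big to handle alone
    needs_immediate_response: bool
    needs_organized_response: bool
    planned_outage: bool = False     # e.g., a maintenance window
    communicated_to_users: bool = False


def is_incident(issue: Issue) -> bool:
    """Return True only if the issue meets all three criteria in the text."""
    # A planned outage that was communicated to users is not an incident.
    if issue.planned_outage and issue.communicated_to_users:
        return False
    return (
        issue.escalated
        and issue.needs_immediate_response
        and issue.needs_organized_response
    )


# Usage: an unexpected outage that was escalated and needs a coordinated response.
unexpected = Issue(escalated=True,
                   needs_immediate_response=True,
                   needs_organized_response=True)

# Usage: a maintenance window announced to users ahead of time.
maintenance = Issue(escalated=False,
                    needs_immediate_response=False,
                    needs_organized_response=False,
                    planned_outage=True,
                    communicated_to_users=True)

print(is_incident(unexpected))   # True
print(is_incident(maintenance))  # False
```

Requiring all three conditions keeps routine tickets, bugs, and alerts out of the incident process, which is the distinction the text draws against the broader ITIL usage.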