
Ayelet Sachto and Adrienne Walcer - Anatomy of an Incident: Google’s Approach to Incident Management for Production Services

Year: 2022. Publisher: O’Reilly Media, Inc.

Anatomy of an Incident: Google’s Approach to Incident Management for Production Services: summary, description and annotation


When it comes to system design, failure is inevitable. Scientists and engineers implement solutions based on the available information, without a complete knowledge of the future. You can’t always anticipate the next zero-day event, viral media trend, weather disaster, or shift in technology. But you can be prepared to respond when incidents like these affect your systems.

With this report, SRE and DevOps practitioners, IT managers, and engineering leaders will explore methods to help your organization prepare for, respond to, and recover from incidents. With advice from Ayelet Sachto, Adrienne Walcer, and Jessie Yang, you’ll learn how to be prepared to handle failure if and when it happens.

Learn the stages of the incident management lifecycle: preparedness, response, recovery, and mitigation

Deal proactively with incidents: issues that escalate beyond metrics and alerts

Be prepared: practice disaster role playing and incident response exercises

Learn the characteristics of the incident-response organizational structure

Examine steps to recovery and mitigation after an incident has occurred

Conduct postmortems to analyze what went wrong

Explore a real-world example from Google: The Mayan Apocalypse

Learn how to measure and reduce incidents’ impact

Use postmortems as a tool for prevention and psychological safety


Anatomy of an Incident

Google’s Approach to Incident Management for Production Services

Ayelet Sachto & Adrienne Walcer

with Jessie Yang

REPORT

Want to know more about SRE? To learn more, visit https://sre.google


Anatomy of an Incident

by Ayelet Sachto and Adrienne Walcer, with Jessie Yang

Copyright © 2022 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisition Editor: John Devins

Proofreader: Piper Editorial Consulting, LLC

Development Editor: Virginia Wilson

Interior Designer: David Futato

Production Editor: Beth Kelly

Cover Designer: Karen Montgomery

Copyeditor: Audrey Doyle

Illustrator: Kate Dullea

January 2022: First Edition

Revision History for the First Edition

2022-01-24: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Anatomy of an Incident, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Google. See our statement of editorial independence.

978-1-098-11372-8

[LSI]


CHAPTER 1

Introduction

Make no mistake: the coming N weeks are going to be personally and professionally stressful, and at times we will race to keep ahead of events as they unfold. But we have been preparing for crises for over a decade, and we’re ready. At a time when people around the world need information, communication, and computation more than ever, we will ensure that Google is there to help them.

Benjamin Treynor Sloss, Vice President, Engineering, Google’s Site Reliability Engineering Team, March 3, 2020

Failure is an inevitability (kind of depressing, we know). As scientists and engineers, you look at problems on the long scale and design systems to be optimally sustainable, scalable, reliable, and secure. But you’re designing systems with only the knowledge you currently have. And when implementing solutions, you do so without having complete knowledge of the future. You can’t always anticipate the next zero-day event, viral media trend, weather disaster, config management error, or shift in technology. Therefore, you need to be prepared to respond when these things happen and affect your systems.

One of Google’s biggest technical challenges of the decade was brought on by the COVID-19 pandemic. The pandemic created a series of rapidly emerging incidents that we needed to mitigate in order to continue serving our users. We had to aggressively boost service capacity, pivot our workforce to be productive at home, and build new ways to efficiently repair servers despite supply chain constraints. As the quotation from Ben Treynor Sloss details, Google was able to continue bringing services to the world during this paradigm-shifting sequence of incidents because we had prepared for it. For more than a decade, Google has proactively invested in incident management. This preparation is the most important thing an organization can do to improve its incident response capability. Preparation builds resilience. And resilience and the ability to handle failure become a key discipline for measuring technological success on the long scale (think decades). Beyond doing the best engineering that you can, you also need to be continually prepared to handle failure when it happens.

Resiliency is one of the critical pillars in company operations. In that regard, incident management is a mandatory company process.

Incidents are expensive, not only in their impact on customers but also in the burden they place on human operators. Incidents are stressful, and they usually demand human intervention. Effective incident management, therefore, prioritizes preventive and proactive work over reactive work.

We know that managing incidents comes with a lot of stress, and finding and training responders is hard; we also know that some accidents are unavoidable and failures happen. Instead of asking “What do you do if an incident happens?” we want to address the question “What do you do when the incident happens?” Reducing the ambiguity in this way not only reduces human toil and responders’ stress, it also improves resolution time and reduces the impact on your users.

We wrote this report to be a guide on the practice of technical incident response. We start by building some common language to discuss incidents, and then get into how you encourage engineers, engineering leaders, and executives to think about incident management within the organization. We aim to cover everything from preparing for incidents, responding to incidents, and recovering from incidents to some of that secret glue that maintains a healthy organization which can scalably fight fires. Let’s get started.

What Is an Incident?

Incident is a loaded term. Its meaning can differ depending on the group using it. In ITIL, for example, an incident is any unplanned interruption, such as a ticket, a bug, or an alert. No matter how the word is used, it’s important that you align on a specific [...]

At Google, incidents are issues that:

Are escalated (because they’re too big to handle alone)

Require an immediate response

Require an organized response

Sometimes an incident can be caused by an outage, which is a period of service unavailability. Outages can be planned; for example, during a service maintenance window in which your system is intentionally unavailable in order to implement updates. If an outage is planned and communicated to users, it’s not an incident: nothing is going on that requires an immediate, organized response. But usually, we’ll be referring to unexpected outages caused by unanticipated failures. Most unexpected outages are incidents, or become incidents.
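
The definition above can be read as a simple decision rule. The following is a minimal Python sketch of that rule, not a tool described in the report: the class names, field names, and functions are all hypothetical and chosen only for illustration. It assumes an issue counts as an incident when all three listed criteria hold, and that an outage which is both planned and communicated to users is never treated as an incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical record of a production issue (illustrative fields only)."""
    escalated: bool                 # too big for one person to handle alone
    needs_immediate_response: bool
    needs_organized_response: bool


@dataclass(frozen=True)
class Outage:
    """Hypothetical record of a period of service unavailability."""
    planned: bool
    communicated_to_users: bool


def is_incident(issue: Issue) -> bool:
    """Assumption: an issue is an incident only if all three criteria hold."""
    return (
        issue.escalated
        and issue.needs_immediate_response
        and issue.needs_organized_response
    )


def outage_is_incident(outage: Outage, issue: Issue) -> bool:
    """A planned, communicated outage is not an incident; otherwise the
    outage is judged by the same three criteria as any other issue."""
    if outage.planned and outage.communicated_to_users:
        return False
    return is_incident(issue)


if __name__ == "__main__":
    maintenance = Outage(planned=True, communicated_to_users=True)
    surprise = Outage(planned=False, communicated_to_users=False)
    severe = Issue(escalated=True,
                   needs_immediate_response=True,
                   needs_organized_response=True)
    print(outage_is_incident(maintenance, severe))
    print(outage_is_incident(surprise, severe))
```

The sketch deliberately does not classify every unexpected outage as an incident, since the text says only that most unexpected outages are incidents or become incidents.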
