97 Things Every SRE Should Know
by Emil Stolarsky and Jaime Woo
Copyright 2021 Incident Labs, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Acquisitions Editor: John Devins
- Developmental Editor: Corbin Collins
- Production Editor: Beth Kelly
- Copyeditor: nSight, Inc.
- Proofreader: Shannon Turlington
- Indexer: nSight, Inc.
- Interior Designer: David Futato
- Cover Designer: Randy Comer
- Illustrator: Kate Dullea
- November 2020: First Edition
Revision History for the First Edition
- 2020-11-18: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492081494 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. 97 Things Every SRE Should Know, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-08149-4
[LSI]
Preface
If there is one defining trait of an SRE, it would be curiosity. There's something about trying to understand how a system works, bringing it back from failure, or generally improving it that tickles the parts of our brains where curiosity lives. This trait is probably common through most, if not all, engineering practices. There's a story we both love that seems to encompass this trait perfectly.
On November 14, 1969, as Apollo 12 was lifting off from its launchpad in Cape Canaveral, Florida, it was struck by lightning. Twice. First at 36.5 seconds after liftoff and then again at 52 seconds. Later the incident reports would show that the lightning had caused a power surge and inadvertently disconnected the fuel cells, leading to a voltage drop.
In the moment though, there was anything but clarity.
In an instant, every alarm in the Apollo 12 command capsule went off. Telemetry readings in Houston were complete gibberish. For an organization that thinks through everything, they never thought to ask what to do when lightning strikes. What were the chances?
Even worse, the stakes couldn't be higher. If the mission is aborted, NASA loses a $1.2 billion rocket. If not, and the safety of the astronauts is compromised, you end up broadcasting a catastrophe to the whole world. When listening back to a recording of mission control, you can feel the tension and stress.
There's a moment of silence on the audio loop before someone cuts in: "Try SCE to Aux." This wasn't something that had ever been tried before. So much so that someone radios back, "What the hell is that?" With no better options, the command is relayed to the astronauts. And it worked. After searching for the switch, they flip it, and everything immediately returns to normal.
The NASA engineer John Aaron gave the obscure suggestion. A year earlier he'd been working in an Apollo capsule simulator and ended up with a similar mess of telemetry readings. Rather than reset the simulator, he decided to play around and try fixing the problem. He'd discover that by shifting the signal conditioning electronics, or SCE, system to its auxiliary setting, it could operate in low-voltage conditions, restoring telemetry. SCE to Aux.
The lightning strike was a black swan event, something NASA had never simulated before. What inspired John Aaron to dig around to uncover the cause of that specific data signature? In an oral history with NASA, he credits a natural curiosity with why things work and how they work.
Curiosity is a trait found in many SREs. We were reminded of a conversation with an SRE friend in Dublin who shared how she was the type to keep asking "why" about the systems she worked with. That echoes John Aaron talking about how he always wanted to know how things around him worked, and not stopping until he had a deep understanding.
That willingness to learn makes sense for SREs, given the need to work with complex systems. The systems change constantly, and the role requires someone who wants to ask questions about how they work. That inquisitiveness means that, rather than seeing one specific part of the system as their domain, SREs wonder about all the parts of the system and how they function together.
But it's not just the technical system. SREs need to be curious about people too, the "socio-" part of the sociotechnical system. Without that, you couldn't bring different teams together to create meaningful SLOs. You couldn't navigate personality types to properly respond to incidents. You'd be satisfied with just the five whys and miss out on uncovering the lessons to be learned post-incident.
We want this book to give you an opportunity to explore, play, and satisfy your curiosity. Here, we've laid out essays to do so. (You may notice there are actually 98 essays! We figured everyone likes a little something extra on the house.) They're written by experts from across the industry, guiding you through a range of topics from the fundamentals of SRE to the bleeding edge. This book was written and edited during the pandemic, and we are deeply grateful for everyone who contributed during such a trying time.
We believe that SRE needs to be filled with many voices, and that new voices should always be welcome. New ideas from different points of view and a wide range of experiences will help evolve this field that is, honestly, remarkably still in its early days. Our dream is that as you read these essays, they spark your curiosity, and move you forward in your SRE journey, no matter where you're currently at.
We're beyond curious to read what a batch of essays on SRE will look like in 5 or 10 years.
How We Structured the Book
SRE, although it deals with complex technical systems, is ultimately a cultural practice. Culture is the product of people, and that inspired us to organize this book into sections based on the number of SREs you have in your organization: what you specifically tackle and what your day looks like depend on how many SREs there are. We've broken the book's essays into "New to SRE," "0-1 SRE," "1-10 SREs," "10-100 SREs," and "The Future of SRE."
Readers looking for guidance on where to start can jump right to the section that applies most to them; however, you will still find value in reading essays from sections that don't currently apply to your day-to-day.
At 0 to 1 SRE, no one has been designated an SRE yet, or you have found your very first one, a role that can seem almost lonely.
At 1 to 10 SREs, you are forming a team, and there is sharing of knowledge and the ability to divvy up work.