Praise for Site Reliability Engineering
Google's SREs have done our industry an enormous service by writing up the principles, practices and patterns (architectural and cultural) that enable their teams to combine continuous delivery with world-class reliability at ludicrous scale. You owe it to yourself and your organization to read this book and try out these ideas for yourself.
Jez Humble, coauthor of Continuous Delivery and Lean Enterprise
I remember when Google first started speaking at systems administration conferences. It was like hearing a talk at a reptile show by a Gila monster expert. Sure, it was entertaining to hear about a very different world, but in the end the audience would go back to their geckos.
Now we live in a changed universe where the operational practices of Google are not so removed from those who work on a smaller scale. All of a sudden, the best practices of SRE that have been honed over the years are now of keen interest to the rest of us. For those of us facing challenges around scale, reliability and operations, this book comes none too soon.
David N. Blank-Edelman, Director, USENIX Board of Directors, and founding co-organizer of SREcon
I have been waiting for this book ever since I left Google's enchanted castle.
It is the gospel I am preaching to my peers at work.
Björn Rabenstein, Team Lead of Production Engineering at SoundCloud, Prometheus developer, and Google SRE until 2013
A thorough discussion of Site Reliability Engineering from the company that invented the concept. Includes not only the technical details but also the thought process, goals, principles, and lessons learned over time. If you want to learn what SRE really means, start here.
Russ Allbery, SRE and Security Engineer
With this book, Google employees have shared the processes they have taken, including the missteps, that have allowed Google services to expand to both massive scale and great reliability. I highly recommend that anyone who wants to create a set of integrated services that they hope will scale to read this book. The book provides an insider's guide to building maintainable services.
Rik Farrow, USENIX
Writing large-scale services like Gmail is hard. Running them with high reliability is even harder, especially when you change them every day. This comprehensive recipe book shows how Google does it, and you'll find it much cheaper to learn from our mistakes than to make them yourself.
Urs Hölzle, SVP Technical Infrastructure, Google
Site Reliability Engineering
Edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy
Copyright © 2016 Google, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editor: Brian Anderson
- Production Editor: Kristen Brown
- Copyeditor: Kim Cofer
- Proofreader: Rachel Monaghan
- Indexer: Judy McConville
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- April 2016: First Edition
Revision History for the First Edition
- 2016-03-21: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491929124 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Site Reliability Engineering, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-92912-4
[LSI]
Foreword
Google's story is a story of scaling up. It is one of the great success stories of the computing industry, marking a shift towards IT-centric business. Google was one of the first companies to define what business-IT alignment meant in practice, and went on to inform the concept of DevOps for a wider IT community. This book has been written by a broad cross-section of the very people who made that transition a reality.
Google grew at a time when the traditional role of the system administrator was being transformed. It questioned system administration, as if to say: we can't afford to hold tradition as an authority, we have to think anew, and we don't have time to wait for everyone else to catch up. In the introduction to Principles of Network and System Administration, I claimed that system administration was a form of human-computer engineering. This was strongly rejected by some reviewers, who said we are not yet at the stage where we can call it engineering. At the time, I felt that the field had become lost, trapped in its own wizard culture, and could not see a way forward. Then, Google drew a line in the silicon, forcing that fate into being. The revised role was called SRE, or Site Reliability Engineer. Some of my friends were among the first of this new generation of engineer; they formalized it using software and automation. Initially, they were fiercely secretive, and what happened inside and outside of Google was very different: Google's experience was unique. Over time, information and methods have flowed in both directions. This book shows a willingness to let SRE thinking come out of the shadows.
Here, we see not only how Google built its legendary infrastructure, but also how it studied, learned, and changed its mind about the tools and the technologies along the way. We, too, can face up to daunting challenges with an open spirit. The tribal nature of IT culture often entrenches practitioners in dogmatic positions that hold the industry back. If Google overcame this inertia, so can we.
This book is a collection of essays by one company, with a single common vision. The fact that the contributions are aligned around a single company's goal is what makes it special. There are common themes, and common characters (software systems) that reappear in several chapters. We see choices from different perspectives, and know that they correlate to resolve competing interests. The articles are not rigorous, academic pieces; they are personal accounts, written with pride, in a variety of personal styles, and from the perspective of individual skill sets. They are written bravely, and with an intellectual honesty that is refreshing and uncommon in industry literature. Some claim "never do this, always do that"; others are more philosophical and tentative, reflecting the variety of personalities within an IT culture, and how that too plays a role in the story. We, in turn, read them with the humility of observers who were not part of the journey, and do not have all the information about the myriad conflicting challenges. Our many questions are the real legacy of the volume: Why didn't they do X? What if they'd done Y? How will we look back on this in years to come? It is by comparing our own ideas to the reasoning here that we can measure our own thoughts and experiences.