Susan J. Fowler
Production-Ready Microservices
by Susan J. Fowler
Copyright © 2017 Susan Fowler. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editors: Nan Barber and Brian Foster
- Production Editor: Kristen Brown
- Copyeditor: Amanda Kersey
- Proofreader: Jasmine Kwityn
- Indexer: Wendy Catalano
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- December 2016: First Edition
Revision History for the First Edition
- 2016-11-23: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491965979 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Production-Ready Microservices, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-96597-9
[LSI]
Preface
This book was born out of a production-readiness initiative I began running several months after I joined Uber Technologies as a site reliability engineer (SRE). Uber's gigantic, monolithic API was slowly being broken into microservices, and at the time I joined, there were over a thousand microservices that had been split from the API and were running alongside it. Each of these microservices was designed, built, and maintained by an owning development team, and over 85% of these services had little to no SRE involvement, nor any access to SRE resources.
Hiring SREs and building SRE teams is an absurdly difficult task, because SREs are probably the hardest type of engineers to find: site reliability engineering as a field is still relatively new, and SREs must be experts (at least to some degree) in software engineering, systems engineering, and distributed systems architecture. There was no way to quickly staff all of the teams with their own embedded SRE team, and so my team (the Consulting SRE Team) was born. Our directive from above was simple: find a way to drive high standards across the 85% of microservices that had no SRE involvement.
Our mission was simple, and the directive was vague enough that it allowed me and my team a considerable amount of freedom to define a set of standards that every microservice at Uber could follow. Coming up with high standards that could apply to every single microservice running within this large engineering organization was not easy, and so, with some help from my amazing colleague Rick Boone (whose high standards for the microservices he supported inspired this book), I created a detailed checklist of the standards that I believed every service at Uber should meet before being allowed to host production traffic.
Doing so required identifying a set of overall, umbrella principles that every specific requirement would fall under, and we came up with eight such principles: every microservice at Uber, we said, should be stable, reliable, scalable, fault tolerant, performant, monitored, documented, and prepared for any catastrophe. Under each of these principles were separate criteria that defined what it meant for a service to be stable, reliable, scalable, fault tolerant, performant, monitored, documented, and prepared for any catastrophe. Importantly, we demanded that each principle be quantifiable, and that each criterion provide us with measurable results that dramatically increased the availability of our microservices. A service that met these criteria, a service that fit these requirements, we deemed production-ready.
Driving these standards across teams in an effective and efficient way was the next step. I created a careful process in which SRE teams met with business-critical services (services whose outages would bring the application down), ran architecture reviews with the teams, put together audits of their services (simple checklists that said yes or no to whether the service met each production-readiness requirement), created detailed roadmaps (step-by-step guides that detailed how to bring the service in question to a production-ready state), and assigned production-readiness scores to each service.
Running the architecture reviews was the most important part of the process: my team would gather all of the developers working on a service in a conference room and ask them to whiteboard the architecture of their service in 30 minutes or less. Doing this allowed both my team and the host team to quickly and easily identify where and why the service was failing: when a microservice was diagrammed in all of its glory (endpoints, request flows, dependencies and all), every point of failure stood out like a sore thumb.
Every architecture review produced a great deal of work. After each review, we'd work through the checklist and see if the service met any of the production-readiness requirements, and then we'd share this audit out with the managers and developers of the team. Scoring was added to the audits when I realized that the "production-ready or not" idea was simply not granular enough to be useful when we evaluated the production-readiness of services, so each requirement was assigned a certain number of points and then an overall score was given to the service.
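To make the scoring idea concrete, here is a minimal sketch of a points-based audit. It is only an illustration under stated assumptions: the requirement names, the point values, and the `score_audit` helper are all hypothetical and are not taken from the checklist used at Uber.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One production-readiness requirement (names and points are hypothetical)."""
    name: str
    points: int


def score_audit(requirements, results):
    """Return (earned, possible) points for a single service's audit.

    `results` maps a requirement name to True (met) or False (not met).
    A requirement missing from `results` is treated as not met.
    """
    possible = sum(r.points for r in requirements)
    earned = sum(r.points for r in requirements if results.get(r.name, False))
    return earned, possible


# Hypothetical checklist and audit answers for one service.
requirements = [
    Requirement("stable deployment pipeline", 10),
    Requirement("dashboards and alerting", 5),
    Requirement("on-call runbook", 5),
]
results = {
    "stable deployment pipeline": True,
    "dashboards and alerting": False,
    "on-call runbook": True,
}

earned, possible = score_audit(requirements, results)
print(f"production-readiness score: {earned}/{possible}")
```

A score of this kind keeps the yes-or-no answers of the audit but lets two services that both fall short be compared by how far they fall short.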
From the audits came roadmaps. Roadmaps contained a list of the production-readiness requirements that the service did not meet, along with links to information about recent outages caused by not meeting that requirement, descriptions of the work that needed to be done in order to meet the requirement, a link to an open task, and the name of the developer(s) assigned to the relevant task.
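A roadmap of this kind can be pictured as a small data structure built from the audit. The sketch below is an assumption-laden illustration, not the tooling we used: the field names, the `build_roadmap` helper, the example requirement names, and the `example.invalid` links are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RoadmapItem:
    """One unmet requirement and the work needed to meet it (hypothetical fields)."""
    requirement: str
    outage_links: list[str]   # links to recent outages caused by the gap
    work_description: str     # what needs to be done to meet the requirement
    task_link: str            # link to the open task
    assignees: list[str]      # developer(s) assigned to the task


def build_roadmap(audit, details):
    """Create one roadmap item per requirement that the audit marked as not met.

    `audit` maps requirement name -> bool; `details` maps requirement name ->
    a dict with the supporting information for that requirement.
    """
    roadmap = []
    for requirement, met in audit.items():
        if met:
            continue
        info = details.get(requirement, {})
        roadmap.append(
            RoadmapItem(
                requirement=requirement,
                outage_links=info.get("outage_links", []),
                work_description=info.get("work_description", ""),
                task_link=info.get("task_link", ""),
                assignees=info.get("assignees", []),
            )
        )
    return roadmap


# Hypothetical audit results and supporting details.
audit = {"dashboards and alerting": False, "on-call runbook": True}
details = {
    "dashboards and alerting": {
        "outage_links": ["https://example.invalid/outages/1"],
        "work_description": "Add a dashboard and alerts for key metrics.",
        "task_link": "https://example.invalid/tasks/1",
        "assignees": ["developer-a"],
    }
}

for item in build_roadmap(audit, details):
    print(item.requirement, "->", item.task_link)
```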
After doing my own production-readiness check on this process (also known as Susan-Fowler's-production-readiness-process-as-a-service), I knew that the next step would need to be the automation of the entire process that would run on all Uber microservices, all of the time. At the time of the writing of this book, this entire production-readiness system is being automated by an amazing SRE team at Uber led by the fearless Roxana del Toro.
Each of the production-readiness requirements within the production-readiness standards and the details of their implementation came out of countless hours of careful, deliberate work by myself and my colleagues in the Uber SRE organization. In making the list of requirements, and in trying to implement them across all Uber microservices, we took countless notes, argued with one another at great length, and researched whatever we could find in the current microservice literature (which is very sparse, and almost nonexistent). I met with a wide variety of microservice developer teams, both at Uber and at other companies, trying to determine how microservices could be standardized and whether there existed a universal set of standardization principles that could be applied to every microservice at every company and produce measurable, business-impactful results. From those notes, arguments, meetings, and research came the foundations of this book.