Foreword for Second Edition
Architecting for Scale is a comprehensive book for managers who realize that all companies have shifted away from simply calling themselves digital businesses and instead now recognize that if they don't actually operate as one, they will go out of business. Banking, insurance, and other industries that used to have huge moats are being disrupted by upstart companies that deliver amazing experiences because they operate like a digital business rather than merely talking about being a digital business.
Architecting for Scale is a definitive guide for directors, managers, and architects who want an actionable roadmap for operating at scale with high reliability, implementing modern operational principles (DevOps, site reliability engineering), and using current state-of-the-art concepts and services (microservices, cloud, edge).
I had the pleasure of working with Lee at New Relic, which enables companies to monitor their digital business across the globe. While at New Relic, Lee traveled around the world, helping companies navigate digital transformation, accelerate ideas into production, and deliver services that were up 100% of the time.
Time and time again, I have seen Lee leapfrog a company's transformation progress in a single thirty-minute meeting. Enjoy the book! It will be impactful to your company and your career!
Ken Gavranovic
Former EVP & GM, New Relic
CEO/Founder, Interland (now Web.com)
Foreword for First Edition
We are living in interesting times, a software Cambrian explosion if you will, where the cost of building new systems has fallen by orders of magnitude and the connectivity of systems has grown by equal orders of magnitude. Resources like Amazon's AWS, Microsoft's Azure, and Google's GCP make it possible for us to physically scale our systems to sizes that we could only have imagined a few years ago.
The economics of these resources and their seemingly limitless capacity are producing a uniquely rapid radiation of new ideas, new products, and new markets in ways that were never possible before. But all of these new explorations are possible only if the systems we build can scale. While it is easier than ever to build something small, building a system that can scale quickly and reliably proves to be a lot harder than just spinning up more hardware and more storage.
Software systems go through a predictable lifecycle starting with small well-crafted solutions fully understood by a single person, through the rapid growth into a monolith of technical debt, thence fissioning into an ad hoc collection of fragile services, and finally into a well-engineered distributed system able to scale reliably in both breadth (more users) and depth (more features). It's easy to see what needs to be done from the outside (make it more reliable!) and much harder to see the path from the inside. Fortunately, this book is the essential guidebook for the journey. From availability to service tiers, from game days to risk matrices, Lee describes the key decisions and practices for systems that scale.
Lee joined me at New Relic when we were first moving from being a single-product monolith into being a multiproduct company, all while enjoying the hypergrowth in satisfied customers that made New Relic so successful. Lee came with a lot of experience at Amazon, both on the retail side, where they grew a lot, and on the AWS side, where (guess what?) they grew a lot. Lee has been part of teams and led teams and been actively involved in a whole lot of scaling, and he has the scars to prove it. Fortunately for us, he's lived through the mistakes and suffered through fiendishly difficult outages and is now passing along those lessons so that we don't have to get those same scars.
When Lee joined New Relic, we were suffering through our awkward teenage "fail whale" years. Our primitive monolith was suffering from our success, and our availability, reliability, and performance were not good. By putting in place the techniques he's written about in this book, we graduated from those high school years and built the robust enterprise-level service that exists today. One of our tools was establishing four levels of availability engineering: Bronze, Silver, Gold, and Platinum. To earn the Bronze level, a team had to have a risk matrix and had to have defined SLAs. To earn the Silver level, a team had to be monitoring for the problems identified in the matrix and be using game days; Gold meant that the risks were mitigated; and Platinum was like a CMM Level 5, where the systems were self-healing and the focus was on continuous improvement. We prioritized these efforts for the Tier 1 services first, then the Tier 2 services, and so on, and we eventually got everyone to at least Silver and most of the teams through Gold (and a couple to Platinum).