The Art of Capacity Planning
Second Edition
Scaling Web Resources in the Cloud
Arun Kejariwal and John Allspaw
The Art of Capacity Planning
by Arun Kejariwal and John Allspaw
Copyright © 2018 Arun Kejariwal, John Allspaw. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editors: Brian Anderson and Virginia Wilson
- Production Editor: Nicholas Adams
- Copyeditor: Octal Publishing, Inc.
- Proofreader: Kim Cofer
- Indexer: Ellen Troutman-Zaig
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- September 2008: First Edition
- October 2017: Second Edition
Revision History for the Second Edition
- 2017-09-21: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491939208 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. The Art of Capacity Planning, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-93920-8
[LSI]
Preface
Prior to the 2014 FIFA World Cup, one of the common stories being discussed at Twitter was how the service routinely became unavailable during the previous FIFA World Cup. In particular, every time Brazil or Japan scored a goal in their matches, the spike in the tweet volume used to take down the service. The Fail Whale (shown below) had become popular with the availability issues during the early days of Twitter. So, one of the goals for the 2014 FIFA World Cup was to have absolutely zero downtime. Another key goal was to ensure high performance of the Twitter mobile app: sharing photos and the like should be blazingly fast. How does one go about achieving that?
Akin to the preceding anecdote, with the increasing use of Twitter during mega events such as the Super Bowl, another key emphasis was to ensure high availability in spite of spikes in traffic (tweets, retweets, favorites, DMs). Conceivably, we can analyze the magnitude of the past spikes relative to the normal traffic and then come up with a first-cut estimate of the magnitude of the spike going forward. Having said that, should you deploy capacity to handle such one-time events, particularly given that the capacity would most likely be underutilized for most of the year? How do you handle unplanned events such as the power failure that occurred during Super Bowl XLVII in 2013? Ensuring high availability during such events calls for a systematic approach toward architectural design and capacity planning.
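The first-cut estimate described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method prescribed here: the function names, the choice of the largest past spike ratio, the headroom factor, and all of the traffic numbers are assumptions made for the example.

```python
def spike_ratio(peak_rate: float, baseline_rate: float) -> float:
    """Return how many times larger a past peak was than normal traffic."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return peak_rate / baseline_rate


def first_cut_peak_estimate(past_events, current_baseline, headroom=1.0):
    """Estimate the next peak from past (peak_rate, baseline_rate) pairs.

    Assumption: the largest ratio seen so far is applied to today's
    baseline, then padded by a headroom factor. A median or percentile
    of the ratios would be an equally reasonable choice.
    """
    ratios = [spike_ratio(peak, base) for peak, base in past_events]
    return max(ratios) * current_baseline * headroom


# Hypothetical numbers for illustration only (requests per second).
past_events = [(12_000, 4_000), (20_000, 5_000), (21_000, 6_000)]
estimate = first_cut_peak_estimate(
    past_events, current_baseline=8_000, headroom=1.25
)
print(estimate)  # 40000.0
```

Such an estimate says nothing about whether that capacity should be deployed permanently, which is the question taken up next.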
Capacity planning has been around since ancient times, with roots in everything from economics to engineering. In a basic sense, capacity planning is resource management. When resources are finite and come at a cost, you need to do some capacity planning. When a civil engineering firm designs a new highway system, it's planning for capacity, as is a power company planning to deliver electricity to a metropolitan area. In some ways, their concerns have a lot in common with web operations; many of the basic concepts and concerns can be applied to all three disciplines.
Although systems administration has been around since the 1960s, the branch focused on serving websites is still emerging. A large part of web operations is capacity planning and management. Those are processes, not tasks, and they are composed of many different parts. Although every organization goes about it differently, the basic concepts are the same:
- Ensure that proper resources (servers, storage, network, etc.) are available to handle expected and unexpected loads
- Have a clearly defined procurement and approval system in place
- Be prepared to justify capital expenditures in support of the business
- Have a deployment and management system in place to manage the resources after they are deployed
Why We Wrote and Revised This Book
One of the common frustrations of engineers in an operations organization and of software developers is not having somewhere to turn for help when figuring out how much capacity is needed to keep the website or mobile app running. Existing books on the topic of computer capacity planning were focused on the mathematical theory of resource planning, rather than the practical implementation of the entire process. Further, in an Agile environment, which is the norm today, capacity planning is a continuous process and should be flexible and adaptive to the situation at hand. Basing capacity planning on static theoretical models would be a recipe for failure.
Much of the literature addressed only rudimentary models of website use cases and lacked specific information or advice. Instead, these works tended to offer mathematical models designed to illustrate the principles of queuing theory, which is the foundation of traditional capacity planning. This approach might be mathematically interesting and elegant (it also can be useful in determining what magnitude of a traffic spike can be absorbed by the various services, owing to built-in queues, without affecting the availability of a website/mobile app). But it doesn't help operations engineers or software developers who are told they have a week to prepare for some unknown amount of additional traffic, perhaps due to the launch of a super new feature, or who are watching the site die under the weight of a link from Facebook, the New York Times, Reddit, Digg, and so on.
We've found that most books on web capacity planning were written with the implied assumption that concepts and processes found in nonweb environments such as manufacturing or industrial engineering applied uniformly to website environments, as well. Even though some of the theory surrounding such planning might indeed be similar, the practical application of those concepts doesn't map very well to the short timelines of website development. In most web development settings, it's been our observation that change happens too fast and too often to allow for the detailed and rigorous capacity investigations common to other fields. By the time an operations engineer or a software developer comes up with the queuing model for her system, new code is deployed and the usage characteristics have likely already changed dramatically. In a 2016 Association for Computing Machinery (ACM) article titled "Why Google Stores Billions of Lines of Code in a Single Repository," authors R. Potvin and J. Levenberg (both of Google) mentioned the following: