Data Management at Scale
by Piethein Strengholt
Copyright 2023 Piethein Strengholt. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Acquisitions Editor: Michelle Smith
- Development Editor: Shira Evans
- Production Editor: Katherine Tozer
- Copyeditor: Rachel Head
- Proofreader: Piper Editorial Consulting, LLC
- Indexer: nSight, Inc.
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Kate Dullea
- April 2023: Second Edition
Revision History for the Second Edition
- 2023-04-10: First Release
See https://oreilly.com/catalog/errata.csp?isbn=9781098138868 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Data Management at Scale, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the author, and do not represent the publisher's views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O'Reilly and Microsoft. See our statement of editorial independence.
978-1-098-15207-9
[LSI]
Foreword
Whenever we talk about software, we inevitably end up talking about data: how much there is, where it all lives, what it means, where it came from or needs to go, and what happens when it changes. These questions have stuck with us over the years, while the technology we use to manage our data has changed rapidly. Today's databases provide instantaneous access to vast online datasets; analytics systems answer complex, probing questions; event-streaming platforms not only connect different applications but also provide storage, query processing, and built-in data management tools.
As these technologies have evolved, so have the expectations of our users. A user is often connected to many different backend systems, located in different parts of a company, as they switch from mobile to desktop to call center, change location, or move from one application to another. All the while, they expect a seamless and real-time experience. I think the implications of this are far greater than many may realize. The challenge involves a large estate of software, data, and people that must appear, at least to our users, to be a single joined-up unit.
Managing company-wide systems like this has always been a dark art, something I got a feeling for when I helped build the infrastructure that backs LinkedIn. All of LinkedIn's data is generated continuously, 24 hours a day, by processes that never stop. But when I first arrived at the company, the infrastructure for harnessing that data was often limited to big, slow, batch data dumps at the end of the day and simplistic lookups, jerry-rigged together with homegrown data feeds. The concept of end-of-the-day batch processing seemed to me to be some legacy of a bygone era of punch cards and mainframes. Indeed, for a global business, the day doesn't end.
As LinkedIn grew, it too became a sprawling software estate, and it was clear to me that there was no off-the-shelf solution for this kind of problem. Furthermore, having built the NoSQL databases that powered LinkedIn's website, I knew that there was an emerging renaissance of distributed systems techniques, which meant solutions could be built that weren't possible before. This led to Apache Kafka, which combined scalable messaging, storage, and processing over the profile updates, page visits, payments, and other event streams that sat at the core of LinkedIn.
While Kafka streamlined LinkedIn's dataflows, it also affected the way applications were built. Like many Silicon Valley firms at the turn of the last decade, we had been experimenting with microservices, and it took several iterations to come up with something that was both functional and stable. This problem was as much about data and people as it was about software: a complex, interconnected system that had to evolve as the company grew. Handling a problem this big required a new kind of technology, but it also needed a new skill set to go with it.
Of course, there was no manual for navigating this problem back then. We worked it out as we went along, but this book may well have been the missing manual we needed. In it, Piethein provides a comprehensive strategy for managing data not simply in a solitary database or application but across the many databases, applications, microservices, storage layers, and all other types of software that make up today's technology landscapes.
He also takes an opinionated view, with an architecture to match, grounded in a well-thought-out set of principles. These help to bound the decision space with logical guardrails, inside of which a host of practical solutions should fit. I think this approach will be very valuable to architects and engineers as they map their own problem domain to the trade-offs described in this book. Indeed, Piethein takes you on a journey that goes beyond data and applications into the rich fabric of interactions that bind entire companies together.
Jay Kreps
Cofounder and CEO at Confluent
Preface
Data management is an emerging and disruptive subject. Datafication is everywhere. This transformation is happening all around us: in smartphones, TV devices, ereaders, industrial machines, self-driving cars, robots, and so on. It's changing our lives at an accelerating speed.
As the amount of data generated skyrockets, so does its complexity. Disruptive trends like cloudification, API and ecosystem connectivity, microservices, open data, software as a service (SaaS), and new software delivery models have a tremendous effect on data management. In parallel, we see an enormous number of new applications transforming our businesses. All these trends are fragmenting the data landscape. As a result, we are seeing more point-to-point interfaces, endless discussions about data quality and ownership, and plenty of ethical and legal dilemmas regarding privacy, safety, and security. Agility, long-term stability, and clear data governance compete with the need to develop new business cases swiftly. We sorely need a clear vision for the future of data management.
This book's perspective on data management is informed by my personal experience driving the data architecture agenda for a large enterprise as chief data architect. Executing that role showed me clearly the impact a good data strategy can have on a large organization. After leaving that company, I started working as the chief data officer for Microsoft Netherlands. In this exciting new position, I've worked with over 50 large customers discussing and attempting to come up with a perfect data solution. Here are some of the common threads I've identified across all enterprises: