front matter
preface
This is the book I wish I had available to refer to over the past few years, while scaling out the big data platform of the Customer Growth and Analytics team in Azure. As our data science team grew and the insights generated by the team became more and more critical to the business, we had to ensure that our platform was robust.
The world of big data is relatively new, and the playbook is still being written. I believe our story is common: data teams start small with a handful of people, who first prove they can generate valuable insights. At this stage, a lot of work happens ad hoc, and there is no immediate need for big engineering investments. A data scientist can run a machine learning (ML) model on their machine, generate some predictions, and email the results.
Over time, the team grows and more workloads become mission critical. The same ML model now plugs into a system serving live traffic and needs to run on a daily basis with more than a hundred times the data it was originally prototyped with. At this point, solid engineering practices are critical; we need scale, reliability, automation, monitoring, etc.
This book contains several years of hard-learned lessons in data engineering. To name a few examples:
Empowering every data scientist on the team to deploy new analytics and data movement pipelines onto our platform while maintaining a reliable production environment
Architecting an ML platform to streamline and automate execution of dozens of ML models
Building a metadata catalog to make sense of the large number of available datasets
Implementing various ways to test the quality of the data and sending alerts when issues are identified
The underlying theme of this book is DevOps: bringing the decades-old best practices of software engineering to the world of big data. Data governance is another important topic; making sense of the data and ensuring quality, compliance, and access control are all critical parts of governance.
The patterns and practices described in this book are platform agnostic. They should be just as valid regardless of which cloud you use. That said, we can't be too abstract, so I provide some concrete examples through a reference implementation. The reference implementation is built on Azure. Even here, there is a wide selection of services we can pick from.
The reference implementation uses a set of services, but keep in mind, the book is less about the particular set of services and more about the data engineering practices realized through them. I hope you enjoy the book, and that you find some best practices you can apply to your environment and business space.
acknowledgments
Many thanks to my wife, Diana, and daughter, Ada, for their support. Thanks for bearing with me for a second round!
This book wouldn't be what it is without the great input and advice from Michael Stephens and Elesha Hyde. Also, thanks go to Danny Vinson for reviewing the early draft and to Karsten Strøbæk for checking all the code samples. I thank all the reviewers for their time and feedback: Albert Nogués, Arun Thangasamy, Dave Corun, Geoff Clark, Glenn Swonk, Hilde Van Gysel, Jesús A. Juárez Guerrero, Johannes Verwijnen, Kelum Senanayake, Krzysztof Kamyczek, Luke Kupka, Matthias Busch, Miranda Whurr, Oliver Korten, Peter Kreyenhop, Peter Morgan, Phil Allen, Philippe Van Bergen, Richard B. Ward, Richard Vaughan, Robert Walsh, Sven Stumpf, Todd Cook, Vishwesh Ravi Shrimali, and Zekai Otles.
Many thanks go to the Customer Growth and Analytics leadership team for their support and for giving me the opportunity to learn: Tim Wong, Greg Koehler, Ron Sielinski, Merav Davidson, Vivek Dalvi, and everyone else on the team.
I was also fortunate to partner with many other teams across Microsoft. I want to thank the IDEAs team, especially Gerardo Bodegas Martinez, Wayne Yim, and Ayyappan Balasubramanian; the Azure Data Explorer team, Oded Sacher and Ziv Caspi; the Azure Purview team, Naga Krishna Yenamandra and Gaurav Malhotra; and the Azure Machine Learning team, especially Tzvi Keisar.