A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form (the author's raw and unedited content as they write) so you can take advantage of these technologies long before the official release of these titles.
This will be the Preface of the final book. Please note that the GitHub repo is available here: https://github.com/gizm00/oreilly_dataeng_book.
If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the editor at vwilson@oreilly.com.
In my work on data pipelines, the biggest cost impact I have seen to date was due to a bug. For months our pipeline was incorrectly transforming data, undetected until our customers noticed the data was wrong.
We could have caught this bug with schema validation tests, which you'll learn about in this book. Instead, we spent a significant chunk of our annual cloud bill recomputing the bad data. It cost us the trust of our customers as well, to the point that the validity of the project as a whole was questioned. Sure, the cloud bill was bad, but the cost of inadequate development practices nearly scuttling the project was worse.
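To make the idea concrete, here is a minimal sketch of the kind of schema validation test that can catch a transformation bug before it reaches customers. The schema, field names, and records below are purely illustrative, not taken from the pipeline described above:

```python
# Hypothetical expected schema for records entering a transformation step.
EXPECTED_SCHEMA = {
    "user_id": int,
    "event_time": str,
    "amount": float,
}

def validate_record(record: dict) -> list[str]:
    """Return a list of schema violations for a single record."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors

good = {"user_id": 1, "event_time": "2024-01-01T00:00:00", "amount": 9.99}
bad = {"user_id": "1", "event_time": "2024-01-01T00:00:00"}

assert validate_record(good) == []
assert validate_record(bad) == [
    "user_id: expected int, got str",
    "missing field: amount",
]
```

In a real pipeline you would typically express checks like these with a dedicated validation library rather than hand-rolled type checks, and run them in CI as well as at ingestion time, so a silently changed upstream schema fails fast instead of quietly corrupting months of output.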
If you search the web for cloud cost optimization strategies, you may read horror stories about an Amazon Web Services (AWS) Lambda function gone awry, or get vague advice that you should right-size your compute resources without any specifics on how to do it. These are important strategies that you will learn in this book, but there's more to it than that.
My experience has been that costs in cloud data pipeline development come from the difficulty of wrangling a system that spans unknown third-party data sources, cloud services, extremely sophisticated big data processing engines, and multiple codebases. Couple this with a fast-paced production environment and you can quickly devolve into a reactive work mode where code turns into spaghetti, pipelines become difficult to evolve and test, and no one knows what's really going on because there's insufficient monitoring.
Altogether, this can create an environment where change is hard, resulting in longer lead times for bringing new functionality onboard. Bugs and burnout are common, eroding customer trust and adoption. These issues hit a company's bottom line in more ways than just the cloud bill.
I wrote this book because this isn't the way it has to be. With a focus on effective monitoring, software development best practices, and targeted advice on designing cloud compute and storage, this book will set you up for success from the outset and enable you to manage the evolution of data pipelines in a cost-effective way.
I've used these approaches in batch and streaming systems handling anywhere from a few thousand rows to petabytes of data, running the gamut from well-defined, structured data to semi-structured sources that change frequently.
Who this book is for
I've geared the content toward an intermediate to advanced audience, assuming you have some familiarity with software development best practices, some basics about working with cloud compute and storage, and a general idea of how batch and streaming data pipelines operate.
This book is written from my experience in the day-to-day development of data pipelines. If this is work you already do, or aspire to do in the future, you can consider this book a virtual mentor, advising you of common pitfalls and providing guidance honed from working on a variety of data pipeline projects.
If you're coming from a data analysis background, you'll find advice on software best practices to help you build testable, extensible pipelines. This will aid you in connecting analysis with data acquisition and storage to create end-to-end systems.
Developer velocity and cost-conscious design are areas everyone from individual contributors to technical leads should have in mind. In this book you'll find advice on how to build quality into the development process, make efficient use of cloud resources, and reduce costs. You'll also learn the elements of monitoring that not only keep tabs on system health and performance but also give insight into where redesign should be considered.
If you manage data engineering teams, you'll find helpful tips on effective development practices, areas where costs can escalate, and an overall approach to putting the right practices in place to help your team succeed.
What you will learn
If you would like to learn or improve at the following, this book will be a useful guide:
Drive pipeline evolution, head off performance issues, and quickly debug with effective monitoring
Reduce cloud spend with lower cost cloud service offerings and smart design strategies
Minimize waste without sacrificing performance by right sizing compute resources
Set up development and test environments that minimize cloud service dependencies
Create data pipeline codebases that are testable and extensible, fostering rapid development and evolution
Improve data quality and pipeline operation through validation and testing
What this book is not
This is not an architecture book. There are aspects of the guidance I provide that can tie back into architecture and system requirements, but I will not be discussing different architectural approaches or trade-offs. I do not cover topics such as data governance, data cataloging, or data lineage.