97 Things Every Data Engineer Should Know
by Tobias Macey
Copyright 2021 O'Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editors: Jessica Haberman and Jill Leonard
- Production Editor: Beth Kelly
- Copyeditor: FILL IN COPYEDITOR
- Proofreader: FILL IN PROOFREADER
- Indexer: FILL IN INDEXER
- Interior Designer: Monica Kamsvaag
- Cover Designer: FILL IN COVER DESIGNER
- Illustrator: Kate Dullea
Revision History for the First Edition
- YYYY-MM-DD: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492062417 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. 97 Things Every Data Engineer Should Know, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the author(s), and do not represent the publisher's views. While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-06241-7
[FILL IN]
Preface
Data engineering as a distinct role is relatively new, but the responsibilities have existed for decades. Broadly speaking, a data engineer makes data available for use in analytics, machine learning, business intelligence, etc. The introduction of big data technologies, data science, distributed computing, and the cloud have all contributed to making the work of the data engineer more necessary, more complex, and (paradoxically) more possible. No single book can encompass everything that you will need to know to be effective as a data engineer, but there are still a number of core principles that will help you in your journey.
This book is a collection of advice from a wide range of individuals who have learned valuable lessons about working with data the hard way. To save you the work of making their same mistakes, we have collected their advice to give you a set of building blocks that can be used to lay your own foundation for a successful career in data engineering.
In these pages you will find career tips for working in data teams, engineering advice for how to think about your tools, and fundamental principles of distributed systems. There are many paths into data engineering, and no two people will use the same set of tools, but we hope that you will find the inspiration that will guide you on your journey. So whether this is your first step on the road, or you have been walking it for years, we wish you the best of luck in your adventures.
O'Reilly Online Learning
For more than 40 years, O'Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O'Reilly's online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O'Reilly and 200+ other publishers. For more information, visit http://oreilly.com.
How to Contact Us
Please address comments and questions about this book to the publisher:
- O'Reilly Media, Inc.
- 1005 Gravenstein Highway North
- Sebastopol, CA 95472
- 800-998-9938 (in the United States or Canada)
- 707-829-0515 (international or local)
- 707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/97-things-data-eng.
Email us to comment or ask technical questions about this book.
Visit http://oreilly.com for news and information about our books and courses.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://youtube.com/oreillymedia
Acknowledgments
I would like to thank my wife for her help and support while I worked on this book, the numerous contributors for sharing their time and expertise, and the O'Reilly team for all of their hard work to make this book a reality.
Chapter 1. Three Distributed Programming Concepts to Be Aware of When Choosing an Open Source Framework
Adi Polak
Many data engineers create pipelines for extract, transform, and load (ETL) or extract, load, and transform (ELT) operations. During a transform (T) task, you might be working with data that fits in one machine's memory. Often, however, the data will require you to use frameworks or solutions that leverage distributed parallel computation to achieve the desired goal. To support that, many researchers have developed models of distributed programming and computation embodied in well-known frameworks such as Apache Spark, Apache Cassandra, Apache Kafka, TensorFlow, and many more. Let's look at the three most used distributed programming models for data analytics and distributed machine learning.
MapReduce Algorithm
MapReduce is a distributed computation algorithm developed by Google in 2004. As developers, we specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. This approach is an extension of the split-apply-combine strategy for data analysis.
In practice, every task is split into multiple map and reduce functions. Data is distributed over multiple nodes/machines, and each chunk of data is processed on a node by applying the map function to it. The shuffle mechanism then redistributes the intermediate data across nodes based on the keys emitted by the map function, so that the reduce operation can combine all values associated with the same key.
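The map/shuffle/reduce phases described above can be sketched in miniature with the classic word-count example. This is a hypothetical single-process illustration of the pattern, not how a real framework implements it; in Hadoop MapReduce or Spark, each phase runs in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Emit an intermediate (key, value) pair for each word."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Group intermediate values by key, as the shuffle mechanism does
    when redistributing data across nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Merge all intermediate values associated with the same key."""
    return (key, sum(values))

# Each document stands in for a chunk of data processed on one node.
documents = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Because each map call touches only its own chunk and each reduce call touches only one key's values, both phases parallelize naturally; only the shuffle requires moving data between nodes.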
Later we can apply more logic to the combined data or go for another round of split-apply-combine if necessary. Open source solutions implementing these concepts include Apache Spark, Hadoop MapReduce, and Apache Flink, among others.