Praise for Scaling Machine Learning with Spark
If there is one book the Spark community has been craving for the last decade, it's this. Writing about the combination of Spark and AI requires broad knowledge, a deep technical skillset, and the ability to break down complex concepts so they're easy to understand. Adi delivers all of this and more while covering big data, AI, and everything in between.
Andy Petrella, founder at Kensu and author of Fundamentals of Data Observability (O'Reilly)
Scaling Machine Learning with Spark is a wealth of knowledge for data and ML practitioners, providing a holistic and creative approach to building end-to-end scalable machine learning solutions. The author's expertise and knowledge, combined with a focus on collaboration and understanding, makes this book a must-read for anyone in the industry.
Noah Gift, Duke executive in residence
Adi's book is without any doubt a good reference and resource to have beside you when working with Spark and distributed ML. You will learn best practices she has to share along with her experience working in the industry for many years. Worth the investment and time reading it.
Laura Uzcategui, machine learning engineer at TalentBait
This book is an amazing synthesis of knowledge and experience. I consider it essential reading for both novice and veteran machine learning engineers. Readers will deepen their understanding of general principles for machine learning in distributed systems while simultaneously engaging with the technical details required to integrate and scale the most widely used tools of the trade, including Spark, PyTorch, and TensorFlow.
Matthew Housley, CTO and coauthor of Fundamentals of Data Engineering (O'Reilly)
Adi's done a wonderful job at creating a very readable, practical, and insanely detailed deep dive into machine learning with Spark.
Joe Reis, coauthor of Fundamentals of Data Engineering (O'Reilly) and recovering data scientist
Scaling Machine Learning with Spark
by Adi Polak
Copyright 2023 Adi Polak. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Acquisitions Editor: Nicole Butterfield
- Development Editor: Jill Leonard
- Production Editor: Jonathon Owen
- Copyeditor: Rachel Head
- Proofreader: Piper Editorial Consulting, LLC
- Indexer: Potomac Indexing, LLC
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Kate Dullea
- March 2023: First Edition
Revision History for the First Edition
- 2023-03-02: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781098106829 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Scaling Machine Learning with Spark, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the author and do not represent the publisher's views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-098-10682-9
[LSI]
Preface
Welcome to Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch. This book aims to guide you in your journey as you learn more about machine learning (ML) systems. Apache Spark is currently the most popular framework for large-scale data processing. It has numerous APIs implemented in Python, Java, and Scala and is used by many powerhouse companies, including Netflix, Microsoft, and Apple. PyTorch and TensorFlow are among the most popular frameworks for machine learning. Combining these tools, which are already in use in many organizations today, allows you to take full advantage of their strengths.
Before we get started, though, perhaps you are wondering why I decided to write this book. Good question. There are two reasons. The first is to support the machine learning ecosystem and community by sharing the knowledge, experience, and expertise I have accumulated over the last decade working as a machine learning algorithm researcher, designing and implementing algorithms to run on large-scale data. I have spent most of my career working as a data infrastructure engineer, building infrastructure for large-scale analytics with all sorts of formatting, types, schemas, etc., and integrating knowledge collected from customers, community members, and colleagues who have shared their experience while brainstorming and developing solutions. Our industry can use such knowledge to propel itself forward at a faster rate, by leveraging the expertise of others. While not all of this book's content will be applicable to everyone, much of it will open up new approaches for a wide array of practitioners.
This brings me to my second reason for writing this book: I want to provide a holistic approach to building end-to-end scalable machine learning solutions that extends beyond the traditional approach. Today, many solutions are customized to the specific requirements of the organization and specific business goals. This will most likely continue to be the industry norm for many years to come. In this book, I aim to challenge the status quo and inspire more creative solutions while explaining the pros and cons of multiple approaches and tools, enabling you to leverage whichever tools are used in your organization and get the best of all worlds. My overall goal is to make it simpler for data and machine learning practitioners to collaborate and understand each other better.
Who Should Read This Book?
This book is designed for machine learning practitioners with previous industry experience who want to learn about Apache Spark's MLlib and increase their understanding of the overall system and flow. It will be particularly relevant to data scientists and machine learning engineers, but MLOps engineers, software engineers, and anyone interested in learning about or building distributed machine learning models and building pipelines with MLlib, distributed PyTorch, and TensorFlow will also find value. Technologists who understand high-level concepts of working with machine learning and want to dip their feet into the technical side as well should also find the book interesting and accessible.
Do You Need Distributed Machine Learning?
As with every good thing, it depends. If you have small datasets that fit into your machine's memory, the answer is no. If at some point you will need to scale out your code and make sure you can train a model on a larger dataset that does not fit into a single machine's memory, then yes.
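As a rough illustration of this rule of thumb, the check below compares an estimate of a dataset's in-memory size against a share of one machine's RAM. It is only a sketch under stated assumptions: the function names, the `expansion_factor` (how much larger data tends to become once loaded and processed), and the `usable_fraction` (how much RAM is left for training after the OS and other processes) are all hypothetical values chosen for illustration, not figures from this book.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def fits_in_memory(dataset_bytes: int, memory_bytes: int,
                   expansion_factor: float = 3.0,
                   usable_fraction: float = 0.5) -> bool:
    """Estimate whether a dataset can be trained on within one machine's memory.

    expansion_factor and usable_fraction are illustrative assumptions;
    measure them for your own data format and workload.
    """
    if dataset_bytes < 0 or memory_bytes <= 0:
        raise ValueError("sizes must be non-negative and memory must be positive")
    estimated_in_memory = dataset_bytes * expansion_factor
    memory_budget = memory_bytes * usable_fraction
    return estimated_in_memory <= memory_budget


def needs_distributed_training(dataset_bytes: int, memory_bytes: int) -> bool:
    """The answer is 'yes' only when the data does not fit on a single machine."""
    return not fits_in_memory(dataset_bytes, memory_bytes)


if __name__ == "__main__":
    # A 2 GiB dataset versus a 200 GiB dataset on a 16 GiB machine.
    for size in (2 * GIB, 200 * GIB):
        print(size // GIB, "GiB ->", needs_distributed_training(size, 16 * GIB))
```

In practice the decision also depends on how the dataset is expected to grow, which is why the text frames it in terms of what you will need "at some point".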
It is often better to use the same tools across the software development lifecycle, from the local development environment to staging and production. Take into consideration, though, that this also introduces other complexities involved in managing a distributed system, which typically will be handled by a different team in your organization. It's a good idea to have a common language to collaborate with your colleagues.