Praise for Learning Spark, Second Edition
This book offers a structured approach to learning Apache Spark, covering new developments in the project. It is a great way for Spark developers to get started with big data.
Reynold Xin, Databricks Chief Architect and
Cofounder and Apache Spark PMC Member
For data scientists and data engineers looking to learn Apache Spark and how to build scalable and reliable big data applications, this book is an essential guide!
Ben Lorica, Databricks Chief Data Scientist,
Past Program Chair, O'Reilly Strata Conferences,
Program Chair for Spark + AI Summit
Learning Spark
by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee
Copyright © 2020 Databricks, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Acquisitions Editor: Jonathan Hassell
- Development Editor: Michele Cronin
- Production Editor: Deborah Baker
- Copyeditor: Rachel Head
- Proofreader: Penelope Perkins
- Indexer: Potomac Indexing, LLC
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- January 2015: First Edition
- July 2020: Second Edition
Revision History for the Second Edition
- 2020-06-24: First Release
- 2020-08-03: Second Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492050049 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Learning Spark, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O'Reilly and Databricks. See our statement of editorial independence.
978-1-492-05004-9
[GP]
Foreword
Apache Spark has evolved significantly since I first started the project at UC Berkeley in 2009. After moving to the Apache Software Foundation, the open source project has had over 1,400 contributors from hundreds of companies, and the global Spark meetup group has grown to over half a million members. Spark's user base has also become highly diverse, encompassing Python, R, SQL, and JVM developers, with use cases ranging from data science to business intelligence to data engineering. I have been working closely with the Apache Spark community to help continue its development, and I am thrilled to see the progress thus far.
The release of Spark 3.0 marks an important milestone for the project and has sparked the need for updated learning material. The idea of a second edition of Learning Spark has come up many times, and it was overdue. Even though I coauthored both Learning Spark and Spark: The Definitive Guide (both O'Reilly), it was time for me to let the next generation of Spark contributors pick up the narrative. I'm delighted that four experienced practitioners and developers, who have been working closely with Apache Spark from its early days, have teamed up to write this second edition of the book, incorporating the most recent APIs and best practices for Spark developers in a clear and informative guide.
The authors' approach to this edition is highly conducive to hands-on learning. The key concepts in Spark and distributed big data processing have been distilled into easy-to-follow chapters. Through the book's illustrative code examples, developers can build confidence using Spark and gain a greater understanding of its Structured APIs and how to leverage them. I hope that this second edition of Learning Spark will guide you on your large-scale data processing journey, whatever problems you wish to tackle using Spark.
Matei Zaharia, Chief Technologist,
Cofounder of Databricks, Asst. Professor at Stanford,
and original creator of Apache Spark
Preface
We welcome you to the second edition of Learning Spark. It's been five years since the first edition was published in 2015, originally authored by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. This new edition has been updated to reflect Apache Spark's evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.
Over the years since its first 1.x release, Spark has become the de facto unified engine for big data processing. Along the way, it has extended its scope to include support for various analytic workloads. Our intent is to capture and curate this evolution for readers, showing not only how you can use Spark but how it fits into the new era of big data and machine learning. Hence, we have designed each chapter to build progressively on the foundations laid by the previous chapters, ensuring that the content is suited for our intended audience.
Who This Book Is For
Most developers who grapple with big data are data engineers, data scientists, or machine learning engineers. This book is aimed at those professionals who are looking to use Spark to scale their applications to handle massive amounts of data.
In particular, data engineers will learn how to use Spark's Structured APIs to perform complex data exploration and analysis on both batch and streaming data; use Spark SQL for interactive queries; use Spark's built-in and external data sources to read, refine, and write data in different file formats as part of their extract, transform, and load (ETL) tasks; and build reliable data lakes with Spark and the open source Delta Lake table format.
For data scientists and machine learning engineers, Sparks MLlib library offers many common algorithms to build distributed machine learning models. We will cover how to build pipelines with MLlib, best practices for distributed machine learning, how to use Spark to scale single-node models, and how to manage and deploy these models using the open source library MLflow.
While the book is focused on learning Spark as an analytical engine for diverse workloads, we will not cover all of the languages that Spark supports. Most of the examples in the chapters are written in Scala, Python, and SQL. Where necessary, we have infused a bit of Java. For those interested in learning Spark with R, we recommend Javier Luraschi, Kevin Kuo, and Edgar Ruiz's Mastering Spark with R (O'Reilly).
Finally, because Spark is a distributed engine, building an understanding of Spark application concepts is critical. We will guide you through how your Spark application interacts with Spark's distributed components and how execution is decomposed into parallel tasks on a cluster. We will also cover which deployment modes are supported and in what environments.