Learning Spark
by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
Copyright 2015 Databricks. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editors: Ann Spencer and Marie Beaugureau
- Production Editor: Kara Ebrahim
- Copyeditor: Rachel Monaghan
- Proofreader: Charles Roumeliotis
- Indexer: Ellen Troutman
- Interior Designer: David Futato
- Cover Designer: Ellie Volckhausen
- Illustrator: Rebecca Demarest
- February 2015: First Edition
Revision History for the First Edition
- 2015-01-26: First Release
- 2015-03-27: Second Release
- 2015-05-08: Third Release
See http://oreilly.com/catalog/errata.csp?isbn=9781449358624 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Learning Spark, the cover image of a small-spotted catshark, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-449-35862-4
[LSI]
Foreword
In a very short time, Apache Spark has emerged as the next-generation big data processing engine, and is being applied throughout the industry faster than ever. Spark improves over Hadoop MapReduce, which helped ignite the big data revolution, in several key dimensions: it is much faster, much easier to use due to its rich APIs, and it goes far beyond batch applications to support a variety of workloads, including interactive queries, streaming, machine learning, and graph processing.
I have been privileged to be closely involved with the development of Spark all the way from the drawing board to what has become the most active big data open source project today, and one of the most active Apache projects! As such, I'm particularly delighted to see Matei Zaharia, the creator of Spark, teaming up with other longtime Spark developers Patrick Wendell, Andy Konwinski, and Holden Karau to write this book.
With Spark's rapid rise in popularity, a major concern has been the lack of good reference material. This book goes a long way toward addressing this concern, with 11 chapters and dozens of detailed examples designed for data scientists, students, and developers looking to learn Spark. It is written to be approachable by readers with no background in big data, making it a great place to start learning about the field in general. I hope that many years from now, you and other readers will fondly remember this as the book that introduced you to this exciting new field.
Ion Stoica, CEO of Databricks and Co-director, AMPlab, UC Berkeley
Preface
As parallel data analysis has grown common, practitioners in many fields have sought easier tools for this task. Apache Spark has quickly emerged as one of the most popular, extending and generalizing MapReduce. Spark offers three main benefits. First, it is easy to use: you can develop applications on your laptop, using a high-level API that lets you focus on the content of your computation. Second, Spark is fast, enabling interactive use and complex algorithms. And third, Spark is a general engine, letting you combine multiple types of computations (e.g., SQL queries, text processing, and machine learning) that might previously have required different engines. These features make Spark an excellent starting point to learn about Big Data in general.
This introductory book is meant to get you up and running with Spark quickly. You'll learn how to download and run Spark on your laptop and use it interactively to learn the API. Once there, we'll cover the details of available operations and distributed execution. Finally, you'll get a tour of the higher-level libraries built into Spark, including libraries for machine learning, stream processing, and SQL. We hope that this book gives you the tools to quickly tackle data analysis problems, whether you do so on one machine or hundreds.
Audience
This book targets data scientists and engineers. We chose these two groups because they have the most to gain from using Spark to expand the scope of problems they can solve. Spark's rich collection of data-focused libraries (like MLlib) makes it easy for data scientists to go beyond problems that fit on a single machine while using their statistical background. Engineers, meanwhile, will learn how to write general-purpose distributed programs in Spark and operate production applications. Engineers and data scientists will take away different details from this book, but both will be able to apply Spark to solve large distributed problems in their respective fields.
Data scientists focus on answering questions or building models from data. They often have a statistical or math background and some familiarity with tools like Python, R, and SQL. We have made sure to include Python and, where relevant, SQL examples for all our material, as well as an overview of the machine learning library in Spark. If you are a data scientist, we hope that after reading this book you will be able to use the same mathematical approaches to solve problems, except much faster and on a much larger scale.
The second group this book targets is software engineers who have some experience with Java, Python, or another programming language. If you are an engineer, we hope that this book will show you how to set up a Spark cluster, use the Spark shell, and write Spark applications to solve parallel processing problems. If you are familiar with Hadoop, you have a bit of a head start on figuring out how to interact with HDFS and how to manage a cluster, but either way, we will cover basic distributed execution concepts.
Regardless of whether you are a data scientist or engineer, to get the most out of this book you should have some familiarity with one of Python, Java, Scala, or a similar language. We assume that you already have a storage solution for your data and we cover how to load and save data from many common ones, but not how to set them up. If you don't have experience with one of those languages, don't worry: there are excellent resources available to learn these. We call out some of the books available in .
How This Book Is Organized
The chapters of this book are laid out in such a way that you should be able to go through the material front to back. At the start of each chapter, we will mention which sections we think are most relevant to data scientists and which sections we think are most relevant for engineers. That said, we hope that all the material is accessible to readers of either background.
The first two chapters will help you set up a basic Spark installation on your laptop and give you an idea of what you can accomplish with Spark. Once we've got the motivation and setup out of the way, we will dive into the Spark shell, a very useful tool for development and prototyping. Subsequent chapters then cover the Spark programming interface in detail, how applications execute on a cluster, and higher-level libraries available on Spark (such as Spark SQL and MLlib).