High Performance Spark
by Holden Karau and Rachel Warren
Copyright © 2017 Holden Karau, Rachel Warren. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Shannon Cutt
Production Editor: Kristen Brown
Copyeditor: Kim Cofer
Proofreader: James Fraleigh
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
Revision History for the First Edition
- 2017-05-22: First Release
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. High Performance Spark, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-94320-5
[LSI]
Preface
We wrote this book for data engineers and data scientists who are looking to get the most out of Spark. If you've been working with Spark and have invested in it, but your experience so far has been mired in memory errors and mysterious, intermittent failures, this book is for you. If you have been using Spark for some exploratory work or experimenting with it on the side but have not felt confident enough to put it into production, this book may help. If you are enthusiastic about Spark but have not seen the performance improvements from it that you expected, we hope this book can help. This book is intended for those who have some working knowledge of Spark, and may be difficult to understand for those with little or no experience with Spark or distributed computing. For recommendations of more introductory literature, see .
We expect this text will be most useful to those who care about optimizing repeated queries in production, rather than to those who are doing primarily exploratory work. While writing highly performant queries is perhaps more important to the data engineer, writing those queries with Spark, in contrast to other frameworks, requires a good knowledge of the data, which is usually more intuitive to the data scientist. Thus this book may be more useful to a data engineer who is less experienced with thinking critically about the statistical nature, distribution, and layout of data when considering performance. We hope that this book will help data engineers think more critically about their data as they put pipelines into production. We want to help our readers ask questions such as "How is my data distributed?", "Is it skewed?", "What is the range of values in a column?", and "How do we expect a given value to group?" and then apply the answers to those questions to the logic of their Spark queries.
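As a minimal sketch (not from the book itself) of how one might probe a few of these questions with Spark's DataFrame API, consider the following; the input path and the column names "userId" and "value" are hypothetical placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{approx_count_distinct, desc, max, min}

object DataShapeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataShapeCheck").getOrCreate()

    // Hypothetical input; substitute your own dataset.
    val df = spark.read.parquet("/path/to/events.parquet")

    // "How is my data distributed?" / "Is it skewed?":
    // count records per key and look at the heaviest keys.
    df.groupBy("userId")
      .count()
      .orderBy(desc("count"))
      .show(20)

    // "What is the range of values in a column?"
    df.agg(min("value"), max("value"), approx_count_distinct("value")).show()

    spark.stop()
  }
}

Checks like these are cheap relative to a misconfigured production job, and the answers often determine how a query should be partitioned, joined, or aggregated.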
However, even for data scientists using Spark mostly for exploratory purposes, this book should cultivate some important intuition about writing performant Spark queries, so that as the scale of the exploratory analysis inevitably grows, you have a better shot at getting something to run the first time. We hope to guide data scientists, even those who are already comfortable thinking about data in a distributed way, to think critically about how their programs are evaluated, empowering them to explore their data more fully and more quickly, and to communicate effectively with anyone helping them put their algorithms into production.
Regardless of your job title, it is likely that the amount of data with which you are working is growing quickly. Your original solutions may need to be scaled, and your old techniques for solving new problems may need to be updated. We hope this book will help you leverage Apache Spark to tackle new problems more easily and old problems more efficiently.
First Edition Notes
You are reading the first edition of High Performance Spark, and for that, we thank you! If you find errors or mistakes, or have ideas for ways to improve this book, please reach out to us at . If you wish to be included in a thanks section in future editions of the book, please include your preferred display name.
Supporting Books and Materials
For data scientists and developers new to Spark, by François Garillot may also be of use once it is available.
Beyond books, there is also a collection of intro-level Spark training material available. For individuals who prefer video, Paco Nathan has an excellent introduction video series on O'Reilly. Commercially, Databricks as well as Cloudera and other Hadoop/Spark vendors offer Spark training. Previous recordings of Spark camps, as well as many other great resources, have been posted on the Apache Spark documentation page.
If you don't have experience with Scala, we do our best to convince you to pick up Scala in
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
Tip
This element signifies a tip or suggestion.
Note
This element signifies a general note.
Warning
This element indicates a warning or caution.
Warning
Examples prefixed with Evil depend heavily on Apache Spark internals, and will likely break in future minor releases of Apache Spark. You've been warned, but we totally understand you aren't going to pay much attention to that, because neither would we.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download from .
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. The code is also available under an Apache 2 License. Incorporating a significant amount of example code from this book into your product's documentation may require permission.