Holden Karau - High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark


Book:
High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark
Author:
Holden Karau / Rachel Warren
Publisher:
O’Reilly Media
Genre:
Books / Home and family
Year:
2017

High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark: summary, description and annotation


Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.

Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.

With this book, you’ll explore:

How Spark SQL’s new interfaces improve performance over SQL’s RDD data structure
The choice between data joins in Core Spark and Spark SQL
Techniques for getting the most out of standard RDD transformations
How to work around performance issues in Spark’s key/value pair paradigm
Writing high-performance Spark code without Scala or the JVM
How to test for functionality and performance when applying suggested improvements
Using Spark MLlib and Spark ML machine learning libraries
Spark’s Streaming components and external community packages

High Performance Spark

by Holden Karau and Rachel Warren

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Production Editor: FILL IN PRODUCTION EDITOR
Copyeditor: FILL IN COPYEDITOR
Proofreader: FILL IN PROOFREADER
Indexer: FILL IN INDEXER
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

July 2016: First Edition

Revision History for the First Edition

2016-03-21: First Early Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491943205 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. High Performance Spark, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-94320-5

[FILL IN]

Preface

Who Is This Book For?

This book is for data engineers and data scientists who are looking to get the most out of Spark. If you’ve been working with Spark and invested in Spark but your experience so far has been marred by memory errors and mysterious, intermittent failures, this book is for you. If you have been using Spark for some exploratory work or experimenting with it on the side but haven’t felt confident enough to put it into production, this book may help. If you are enthusiastic about Spark but haven’t seen the performance improvements from it that you expected, we hope this book can help. This book is intended for those who have some working knowledge of Spark, and it may be difficult to understand for those with little or no experience with Spark or distributed computing. For recommendations of more introductory literature see .

We expect this text will be most useful to those who care about optimizing repeated queries in production, rather than to those who are doing merely exploratory work. While writing highly performant queries is perhaps more important to the data engineer, writing those queries with Spark, in contrast to other frameworks, requires a good knowledge of the data, which is usually more intuitive to the data scientist. Thus this book may be more useful to a data engineer who is less experienced with thinking critically about the statistical nature, distribution, and layout of the data when considering performance. We hope that this book will help data engineers think more critically about their data as they put pipelines into production. We want to help our readers ask questions such as “How is my data distributed?”, “Is it skewed?”, “What is the range of values in a column?”, and “How do we expect a given value to group?”, and then apply the answers to those questions to the logic of their Spark queries.
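The questions about distribution, skew, and value ranges can be made concrete with a small sketch. The following is plain Python over an in-memory sample, not Spark code and not an example from the book; the function names `profile_keys` and `value_range` and the sample data are hypothetical, chosen only to illustrate the kind of checks one might run on a sample before writing a query.

```python
from collections import Counter


def profile_keys(keys):
    """Summarize how records are spread across keys in a sample."""
    counts = Counter(keys)
    total = sum(counts.values())
    top_key, top_count = counts.most_common(1)[0]
    mean_count = total / len(counts)
    return {
        "distinct_keys": len(counts),
        "top_key": top_key,
        # Fraction of all records that land on the heaviest key.
        "top_share": top_count / total,
        # Heaviest key's count relative to the average count per key.
        "skew_ratio": top_count / mean_count,
    }


def value_range(values):
    """Return the (min, max) of a sample of column values."""
    values = list(values)
    return min(values), max(values)


# Hypothetical sample in which one "hot" key dominates.
sample_keys = ["a"] * 8 + ["b", "c"]
profile = profile_keys(sample_keys)
print(profile["top_key"], profile["top_share"])
print(value_range([3, 7, 1]))
```

A large `top_share` or `skew_ratio` on a representative sample suggests that grouping or joining on that key may concentrate work on a few partitions, which is the kind of answer the questions above are meant to surface.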

However, even for data scientists using Spark mostly for exploratory purposes, this book should cultivate some important intuition about writing performant Spark queries, so that as the scale of the exploratory analysis inevitably grows, you may have a better shot at getting something to run the first time. We hope to guide data scientists, even those who are already comfortable thinking about data in a distributed way, to think critically about how their programs are evaluated, empowering them to explore their data more fully and more quickly, and to communicate effectively with anyone helping them put their algorithms into production.

Regardless of your job title, it is likely that the amount of data with which you are working is growing quickly. Your original solutions may need to be scaled, and your old techniques for solving new problems may need to be updated. We hope this book will help you leverage Apache Spark to tackle new problems more easily and old problems more efficiently.

Early Release Note

You are reading an early release version of High Performance Spark, and for that, we thank you! If you find errors, mistakes, or have ideas for ways to improve this book, please reach out to us at . If you wish to be included in a thanks section in future editions of the book, please include your preferred display name.

Warning

This is an early release. While there are always mistakes and omissions in technical books, this is especially true for an early release book.

Supporting Books & Materials

For data scientists and developers new to Spark, is a great book for interested data scientists.

Beyond books, there is also a collection of intro-level Spark training material available. For individuals who prefer video, Paco Nathan has an excellent introduction video series on O’Reilly. Commercially, Databricks as well as Cloudera and other Hadoop/Spark vendors offer Spark training. Previous recordings of Spark camps, as well as many other great resources, have been posted on the Apache Spark documentation page.

If you don’t have experience with Scala, we do our best to convince you to pick up Scala in

Conventions Used in this Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This element signifies a tip or suggestion.

Note

This element signifies a general note.

Warning

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download from the High Performance Spark GitHub Repository, and some of the testing code is available at the Spark Testing Base GitHub Repository and the Spark Validator Repo.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. The code is also available under an Apache 2 License. Incorporating a significant amount of example code from this book into your product’s documentation may require permission.
