Karau Holden - High performance Spark: best practices for scaling and optimizing Apache Spark

  • Book:
    High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark
  • Authors:
    Holden Karau and Rachel Warren
  • Publisher:
    O'Reilly Media
  • Genre:
    Computer
  • Year:
    2017
  • City:
    Sebastopol, CA

Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected, or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.

Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you'll also learn how to make it sing.

With this book, you'll explore:

  • How Spark SQL's new interfaces improve performance over SQL's RDD data structure
  • The choice between data joins in Core Spark and Spark SQL
  • Techniques for getting the most out of standard RDD...


    High Performance Spark

    by Holden Karau and Rachel Warren

    Copyright © 2017 Holden Karau, Rachel Warren. All rights reserved.

    Printed in the United States of America.

    Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

    O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

    Editor: Shannon Cutt

    Indexer: Ellen Troutman-Zaig

    Production Editor: Kristen Brown

    Interior Designer: David Futato

    Copyeditor: Kim Cofer

    Cover Designer: Karen Montgomery

    Proofreader: James Fraleigh

    Illustrator: Rebecca Demarest

    • June 2017: First Edition
    Revision History for the First Edition
    • 2017-05-22: First Release

    The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. High Performance Spark, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

    While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

    978-1-491-94320-5

    [LSI]

    Preface

    We wrote this book for data engineers and data scientists who are looking to get the most out of Spark. If you've been working with Spark and invested in Spark but your experience so far has been mired by memory errors and mysterious, intermittent failures, this book is for you. If you have been using Spark for some exploratory work or experimenting with it on the side but have not felt confident enough to put it into production, this book may help. If you are enthusiastic about Spark but have not seen the performance improvements from it that you expected, we hope this book can help. This book is intended for those who have some working knowledge of Spark, and may be difficult to understand for those with little or no experience with Spark or distributed computing. For recommendations of more introductory literature see "Supporting Books and Materials".

    We expect this text will be most useful to those who care about optimizing repeated queries in production, rather than to those who are doing primarily exploratory work. While writing highly performant queries is perhaps more important to the data engineer, writing those queries with Spark, in contrast to other frameworks, requires a good knowledge of the data, usually more intuitive to the data scientist. Thus it may be more useful to a data engineer who may be less experienced with thinking critically about the statistical nature, distribution, and layout of data when considering performance. We hope that this book will help data engineers think more critically about their data as they put pipelines into production. We want to help our readers ask questions such as "How is my data distributed?", "Is it skewed?", "What is the range of values in a column?", and "How do we expect a given value to group?" and then apply the answers to those questions to the logic of their Spark queries.

    However, even for data scientists using Spark mostly for exploratory purposes, this book should cultivate some important intuition about writing performant Spark queries, so that as the scale of the exploratory analysis inevitably grows, you may have a better shot at getting something to run the first time. We hope to guide data scientists, even those who are already comfortable thinking about data in a distributed way, to think critically about how their programs are evaluated, empowering them to explore their data more fully, more quickly, and to communicate effectively with anyone helping them put their algorithms into production.

    Regardless of your job title, it is likely that the amount of data with which you are working is growing quickly. Your original solutions may need to be scaled, and your old techniques for solving new problems may need to be updated. We hope this book will help you leverage Apache Spark to tackle new problems more easily and old problems more efficiently.

    First Edition Notes

    You are reading the first edition of High Performance Spark, and for that, we thank you! If you find errors, mistakes, or have ideas for ways to improve this book, please reach out to us at . If you wish to be included in a thanks section in future editions of the book, please include your preferred display name.

    Supporting Books and Materials

    For data scientists and developers new to Spark, by François Garillot may also be of use once it is available.

    Beyond books, there is also a collection of intro-level Spark training material available. For individuals who prefer video, Paco Nathan has an excellent introduction video series on O'Reilly. Commercially, Databricks as well as Cloudera and other Hadoop/Spark vendors offer Spark training. Previous recordings of Spark camps, as well as many other great resources, have been posted on the Apache Spark documentation page.

    If you don't have experience with Scala, we do our best to convince you to pick up Scala in

    Conventions Used in This Book

    The following typographical conventions are used in this book:

    Italic

    Indicates new terms, URLs, email addresses, filenames, and file extensions.

    Constant width

    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

    Constant width bold

    Shows commands or other text that should be typed literally by the user.

    Constant width italic

    Shows text that should be replaced with user-supplied values or by values determined by context.

    Tip

    This element signifies a tip or suggestion.

    Note

    This element signifies a general note.

    Warning

    This element indicates a warning or caution.

    Warning

    Examples prefixed with "Evil" depend heavily on Apache Spark internals, and will likely break in future minor releases of Apache Spark. You've been warned, but we totally understand you aren't going to pay much attention to that because neither would we.

    Using Code Examples

    Supplemental material (code examples, exercises, etc.) is available for download from .

    This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. The code is also available under an Apache 2 License. Incorporating a significant amount of example code from this book into your product's documentation may require permission.
