
Jules S. Damji - Learning Spark: Lightning-Fast Data Analytics

Year: 2020. Publisher: O'Reilly Media.



Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matter. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

  • Learn Python, SQL, Scala, or Java high-level Structured APIs
  • Understand Spark operations and SQL Engine
  • Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
  • Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
  • Perform analytics on batch and streaming data using Structured Streaming
  • Build reliable data pipelines with open source Delta Lake and Spark
  • Develop machine learning pipelines with MLlib and productionize models using MLflow

Praise for Learning Spark, Second Edition

This book offers a structured approach to learning Apache Spark, covering new developments in the project. It is a great way for Spark developers to get started with big data.

Reynold Xin, Databricks Chief Architect and
Cofounder and Apache Spark PMC Member

For data scientists and data engineers looking to learn Apache Spark and how to build scalable and reliable big data applications, this book is an essential guide!

Ben Lorica, Databricks Chief Data Scientist,
Past Program Chair O'Reilly Strata Conferences,
Program Chair for Spark + AI Summit

Learning Spark

by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee

Copyright © 2020 Databricks, Inc. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Acquisitions Editor: Jonathan Hassell
  • Development Editor: Michele Cronin
  • Production Editor: Deborah Baker
  • Copyeditor: Rachel Head
  • Proofreader: Penelope Perkins
  • Indexer: Potomac Indexing, LLC
  • Interior Designer: David Futato
  • Cover Designer: Karen Montgomery
  • Illustrator: Rebecca Demarest
  • January 2015: First Edition
  • July 2020: Second Edition

Revision History for the Second Edition

  • 2020-06-24: First Release
  • 2020-08-03: Second Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492050049 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Learning Spark, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O'Reilly and Databricks. See our statement of editorial independence.

978-1-492-05004-9

[GP]

Foreword

Apache Spark has evolved significantly since I first started the project at UC Berkeley in 2009. After moving to the Apache Software Foundation, the open source project has had over 1,400 contributors from hundreds of companies, and the global Spark meetup group has grown to over half a million members. Spark's user base has also become highly diverse, encompassing Python, R, SQL, and JVM developers, with use cases ranging from data science to business intelligence to data engineering. I have been working closely with the Apache Spark community to help continue its development, and I am thrilled to see the progress thus far.

The release of Spark 3.0 marks an important milestone for the project and has sparked the need for updated learning material. The idea of a second edition of Learning Spark has come up many times, and it was overdue. Even though I coauthored both Learning Spark and Spark: The Definitive Guide (both O'Reilly), it was time for me to let the next generation of Spark contributors pick up the narrative. I'm delighted that four experienced practitioners and developers, who have been working closely with Apache Spark from its early days, have teamed up to write this second edition of the book, incorporating the most recent APIs and best practices for Spark developers in a clear and informative guide.

The authors' approach to this edition is highly conducive to hands-on learning. The key concepts in Spark and distributed big data processing have been distilled into easy-to-follow chapters. Through the book's illustrative code examples, developers can build confidence using Spark and gain a greater understanding of its Structured APIs and how to leverage them. I hope that this second edition of Learning Spark will guide you on your large-scale data processing journey, whatever problems you wish to tackle using Spark.

Matei Zaharia, Chief Technologist,
Cofounder of Databricks, Asst. Professor at Stanford,
and original creator of Apache Spark

Preface

We welcome you to the second edition of Learning Spark. It's been five years since the first edition was published in 2015, originally authored by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. This new edition has been updated to reflect Apache Spark's evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.

Over the years since its first 1.x release, Spark has become the de facto big data unified processing engine. Along the way, it has extended its scope to include support for various analytic workloads. Our intent is to capture and curate this evolution for readers, showing not only how you can use Spark but how it fits into the new era of big data and machine learning. Hence, we have designed each chapter to build progressively on the foundations laid by the previous chapters, ensuring that the content is suited for our intended audience.

Who This Book Is For

Most developers who grapple with big data are data engineers, data scientists, or machine learning engineers. This book is aimed at those professionals who are looking to use Spark to scale their applications to handle massive amounts of data.

In particular, data engineers will learn how to use Spark's Structured APIs to perform complex data exploration and analysis on both batch and streaming data; use Spark SQL for interactive queries; use Spark's built-in and external data sources to read, refine, and write data in different file formats as part of their extract, transform, and load (ETL) tasks; and build reliable data lakes with Spark and the open source Delta Lake table format.

For data scientists and machine learning engineers, Spark's MLlib library offers many common algorithms to build distributed machine learning models. We will cover how to build pipelines with MLlib, best practices for distributed machine learning, how to use Spark to scale single-node models, and how to manage and deploy these models using the open source library MLflow.

While the book is focused on learning Spark as an analytical engine for diverse workloads, we will not cover all of the languages that Spark supports. Most of the examples in the chapters are written in Scala, Python, and SQL. Where necessary, we have infused a bit of Java. For those interested in learning Spark with R, we recommend Javier Luraschi, Kevin Kuo, and Edgar Ruiz's Mastering Spark with R (O'Reilly).

Finally, because Spark is a distributed engine, building an understanding of Spark application concepts is critical. We will guide you through how your Spark application interacts with Spark's distributed components and how execution is decomposed into parallel tasks on a cluster. We will also cover which deployment modes are supported and in what environments.
