
Akash Tandon - Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark


The amount of data being generated today is staggering and growing. Apache Spark has emerged as the de facto tool to analyze big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark's Python API, and other best practices in Spark programming.

Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques, including classification, clustering, collaborative filtering, and anomaly detection, to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing.

If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.

  • Familiarize yourself with Spark's programming model and ecosystem
  • Learn general approaches in data science
  • Examine complete implementations that analyze large public datasets
  • Discover which machine learning tools make sense for particular problems
  • Explore code that can be adapted to many uses


Advanced Analytics with PySpark

by Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills

Copyright © 2022 Akash Tandon. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Acquisitions Editor: Jessica Haberman
  • Development Editor: Jeff Bleiel
  • Production Editor: Christopher Faucher
  • Copyeditor: Penelope Perkins
  • Proofreader: Kim Wimpsett
  • Indexer: Sue Klefstad
  • Interior Designer: David Futato
  • Cover Designer: Karen Montgomery
  • Illustrator: Kate Dullea
  • June 2022: First Edition
Revision History for the First Edition
  • 2022-06-14: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098103651 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Advanced Analytics with PySpark, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-10365-1

[LSI]

Preface

Apache Spark's long lineage of predecessors, from MPI (Message Passing Interface) to MapReduce, made it possible to write programs that take advantage of massive resources while abstracting away the nitty-gritty details of distributed systems. As much as data processing needs have motivated the development of these frameworks, in a way the field of big data has become so related to them that its scope is defined by what these frameworks can handle. Spark's original promise was to take this a little further: to make writing distributed programs feel like writing regular programs.

The rise in Spark's popularity coincided with that of the Python data (PyData) ecosystem, so it makes sense that Spark's Python API, PySpark, has grown significantly in popularity over the last few years. Although the PyData ecosystem has recently sprouted some distributed programming options of its own, Apache Spark remains one of the most popular choices for working with large datasets across industries and domains. Thanks to recent efforts to integrate PySpark with the other PyData tools, learning the framework can help you boost your productivity significantly as a data science practitioner.
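As a small taste of that interoperability, here is a minimal sketch (the application name, column names, and data are arbitrary) of moving a DataFrame between Spark and pandas over the Arrow-backed conversion path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pydata-interop").getOrCreate()

    # Arrow-backed serialization makes Spark <-> pandas conversion much faster.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])  # toy data
    pdf = df.toPandas()                   # distributed Spark DataFrame -> local pandas DataFrame
    df_back = spark.createDataFrame(pdf)  # and back into Spark
    df_back.show()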

We think that the best way to teach data science is by example. To that end, we have put together a book of applications, trying to touch on the interactions between the most common algorithms, datasets, and design patterns in large-scale analytics. This book isn't meant to be read cover to cover: flip to a chapter that looks like something you're trying to accomplish, or that simply ignites your interest, and start there.

Why Did We Write This Book Now?

Apache Spark experienced a major version upgrade in 2020: version 3.0. One of the biggest improvements was the introduction of Adaptive Query Execution. This feature takes away a big portion of the complexity around tuning and optimization. We do not refer to it in the book because it's turned on by default in Spark 3.2 and later versions, so you automatically get its benefits.
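If you are curious, you can inspect or toggle the feature yourself through the spark.sql.adaptive.enabled configuration key; a minimal sketch (the application name is arbitrary):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("aqe-check").getOrCreate()

    # Prints "true" on Spark 3.2+, where Adaptive Query Execution is on by default.
    print(spark.conf.get("spark.sql.adaptive.enabled"))

    # On earlier 3.x releases, it can be enabled explicitly.
    spark.conf.set("spark.sql.adaptive.enabled", "true")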

The ecosystem changes, combined with Spark's latest major release, make this edition a timely one. Unlike previous editions of Advanced Analytics with Spark, which chose Scala, we will use Python. We'll cover best practices and integrate with the wider Python data science ecosystem when appropriate. All chapters have been updated to use the latest PySpark API. Two new chapters have been added and multiple chapters have undergone major rewrites. We will not cover Spark's streaming and graph libraries. With Spark in a new era of maturity and stability, we hope that these changes will preserve the book as a useful resource on analytics for years to come.

How This Book Is Organized

The book opens by introducing the basics of data processing in PySpark and Python through a use case in data cleansing. The next few chapters delve into the meat and potatoes of machine learning with Spark, applying some of the most common algorithms in canonical applications. The remaining chapters are a bit more of a grab bag and apply Spark in slightly more exotic applications: for example, querying Wikipedia through latent semantic relationships in the text, analyzing genomics data, and identifying similar images.

This book is not about PySpark's merits and disadvantages. There are a few other things that it is not about either. It introduces the Spark programming model and the basics of Spark's Python API, PySpark. However, it does not attempt to be a Spark reference or provide a comprehensive guide to all of Spark's nooks and crannies. It does not try to be a machine learning, statistics, or linear algebra reference, although many of the chapters provide some background on these before using them.

Instead, this book will help the reader get a feel for what it's like to use PySpark for complex analytics on large datasets by covering the entire pipeline: not just building and evaluating models, but also cleansing, preprocessing, and exploring data, with attention paid to turning results into production applications. We believe that the best way to teach this is by example.

Here are examples of some tasks that will be tackled in this book:

Predicting forest cover

We predict the type of forest cover from relevant features such as location and soil type, using decision trees; a minimal sketch of such a pipeline follows this list.

Querying Wikipedia for similar entries

We identify relationships between entries and query the Wikipedia corpus by using natural language processing (NLP) techniques.

Understanding utilization of New York cabs

We compute average taxi waiting time as a function of location by performing temporal and geospatial analysis.

Reducing risk for an investment portfolio

We estimate financial risk for an investment portfolio using Monte Carlo simulation.
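To give a flavor of what these applications look like in code, here is a minimal sketch of the forest cover task, assuming a hypothetical covtype.csv file with numeric feature columns and a 1-based integer label column named Cover_Type:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import DecisionTreeClassifier

    spark = SparkSession.builder.appName("forest-cover-sketch").getOrCreate()

    # Hypothetical input: numeric features plus a 1-based label column "Cover_Type".
    data = spark.read.csv("covtype.csv", header=True, inferSchema=True)
    data = data.withColumn("label", col("Cover_Type") - 1).drop("Cover_Type")  # 0-based labels

    # Spark ML classifiers expect the features gathered into a single vector column.
    assembler = VectorAssembler(
        inputCols=[c for c in data.columns if c != "label"], outputCol="features"
    )
    train, test = assembler.transform(data).randomSplit([0.9, 0.1], seed=42)

    model = DecisionTreeClassifier(labelCol="label", featuresCol="features").fit(train)
    model.transform(test).select("label", "prediction").show(5)

The corresponding chapter goes much further than this first fit, exploring the data and iterating on, tuning, and evaluating models.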

When possible, we attempt not just to provide a solution but to demonstrate the full data science workflow, with all of its iterations, dead ends, and restarts. This book will be useful for getting more comfortable with Python, Spark, machine learning, and data analysis. However, these are in service of a larger goal, and we hope that most of all this book will teach you how to approach tasks like those described earlier. Each chapter, in about 20 measly pages, will try to get as close as possible to demonstrating how to build one piece of these data applications.

Conventions Used in This Book

The following typographical conventions are used in this book:
