MEAP VERSION 6
About this MEAP
You can download the most up-to-date version of your electronic books from your Manning Account at .
Manning Publications Co.
We welcome reader comments about anything in the manuscript - other than typos and other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders.
https://livebook.manning.com/#!/book/pyspark-in-action/discussion
Welcome
Thank you for purchasing the MEAP for PySpark in Action: Python data analysis at scale. Writing this book is a lot of fun (and work!), and I hope you'll enjoy reading it as much as I am enjoying writing it.
My journey with PySpark is pretty typical: the company I used to work for migrated its data infrastructure to a data lake and realized along the way that its usual warehouse-type jobs didn't work so well anymore. I spent most of my first months there figuring out how to make PySpark work for my colleagues and myself, starting from zero. This book is heavily influenced by the questions I got from my colleagues and students (and sometimes myself). I've found that combining practical experience through real examples with a little bit of theory brings not only proficiency in using PySpark, but also the ability to build better data programs. This book walks the line between the two by explaining important theoretical concepts without being too laborious.
This book covers a wide range of subjects, since PySpark is itself a very versatile platform. I divided the book into three parts.
- Part 1: Walk teaches how PySpark works, how to get started, and how to perform basic data manipulation.
- Part 2: Jog builds on the material contained in Part 1 and goes through more advanced subjects. It covers more exotic operations and data formats and explains more of what goes on under the hood.
- Part 3: Run tackles the cooler stuff: building machine learning models at scale, squeezing more performance out of your cluster, and adding functionality to PySpark.
To have the best time possible with the book, you should be at least comfortable using Python. It isn't enough to have learned another language and to plan on transferring your knowledge into Python on the fly. I cover the more niche corners of the language when appropriate, but you'll need to do some research on your own if you are new to Python.
Furthermore, this book covers how PySpark can interact with other data manipulation frameworks (such as Pandas), and those specific sections assume basic knowledge of Pandas.
Finally, for some subjects in Part 3, such as machine learning, having prior exposure will help you breeze through. It's hard to strike a balance between not enough explanation and too much; I do my best to make the right choices.
Your feedback is key to making this book the best it can be. I welcome your comments and thoughts in the liveBook discussion forum.
Thank you again for your interest and for purchasing the MEAP!
Jonathan Rioux
Brief Table of Contents
Part 1: Walk
Part 2: Jog
Faster big data processing: a primer
Part 3: Run
A foray into machine learning: logistic regression with PySpark
Simplifying your experiments with machine learning pipelines
Machine learning for unstructured data
PySpark for graphs: GraphFrames
Testing PySpark code
Going full circle: structuring end-to-end PySpark code
Appendixes:
Using PySpark with a cloud provider
Python essentials
PySpark data types
Efficiently using PySpark's API documentation
Introduction
In this chapter, you will learn:
- What PySpark is
- Why PySpark is a useful tool for analytics
- The versatility of the Spark platform and its limitations
- PySpark's way of processing data
According to pretty much every news outlet, data is everything, everywhere. It's the new oil, the new electricity, the new gold, plutonium, even bacon! We call it powerful, intangible, precious, dangerous. I prefer calling it "useful in capable hands." After all, for a computer, any piece of data is a collection of zeroes and ones, and it is our responsibility, as users, to make sense of how it translates into something useful.
Just like oil, electricity, gold, plutonium, and bacon (especially bacon!), our appetite for data is growing. So much, in fact, that computers aren't keeping up. Data is growing in size and complexity, yet consumer hardware has been stalling a little. RAM hovers around 8 to 16 GB for most laptops, and SSDs get prohibitively expensive past a few terabytes. Is the solution for the burgeoning data analyst to triple-mortgage their life to afford top-of-the-line hardware to tackle big data problems?
Introducing Spark, and its companion PySpark, the unsung heroes of large-scale analytical workloads. They take a few pages from the supercomputer playbook (powerful, but manageable, compute units meshed in a network of machines) and bring it to the masses. Add on top a powerful set of data structures ready for any work you're willing to throw at them, and you have a tool that will grow (pun intended) with you.
This book is a great introduction to data manipulation and analysis using PySpark. It tries to cover just enough theory to get you comfortable, while giving you enough opportunities to practice. Each chapter except this one contains a few exercises to anchor what you just learned. The exercises are all solved and explained in Appendix A.
1.1 What is PySpark?
What's in a name? Actually, quite a lot. Just by splitting PySpark in two, you can already deduce that it will be related to Spark and Python. And you would be right!
At its core, PySpark can be summarized as being the Python API to Spark. While this is an accurate definition, it doesn't give you much unless you know the meaning of Python and Spark. If we were in a video game, I certainly wouldn't win any prize for being the most useful NPC. Let's continue our quest to understand what PySpark is by first answering the question "What is Spark?"
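Before we get there, here is a first taste of what that Python API looks like in practice. This is a minimal sketch rather than one of the book's own examples: it assumes a working local PySpark installation, and the application name and column name are purely illustrative. Everything it uses is covered properly in later chapters.

```python
from pyspark.sql import SparkSession

# The SparkSession is the entry point to the PySpark API.
spark = SparkSession.builder.appName("hello-pyspark").getOrCreate()

# Build a tiny DataFrame from a plain Python list and display it.
df = spark.createDataFrame([("Hello",), ("PySpark",)], ["greeting"])
df.show()
```

The point is simply that you drive Spark with regular Python code: import a module, create an object, call methods on it.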
1.1.1 You saw it coming: What is Spark?
Spark, according to its authors, is "a unified analytics engine for large-scale data processing." This is a very accurate definition, if a little dry.
Digging a little deeper, we can compare Spark to an analytics factory. The raw material (here, data) comes in, and data products (insights, visualizations, models, you name it!) come out.
Just like a factory will often gain more capacity by increasing its footprint, Spark can process an increasingly vast amount of data by scaling out instead of scaling up. This means that, instead of buying thousands of dollars' worth of RAM to accommodate your data set, you'll rely on multiple computers, splitting the job between them. In a world where two modest computers are less costly than one large one, scaling out is less expensive than scaling up, keeping more money in your pockets.
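To make "splitting the job between them" a little more concrete, here is a minimal sketch (again assuming a local PySpark installation; the numbers are illustrative). PySpark splits a data set into partitions, the chunks that can be handed to different machines for processing.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A one-column DataFrame containing one million numbers.
df = spark.range(1_000_000)

# PySpark splits the data into partitions; each one can be processed by a
# different worker, which is what "scaling out" looks like in practice.
print(df.rdd.getNumPartitions())

# Asking for more partitions spreads the same data over more parallel tasks.
print(df.repartition(16).rdd.getNumPartitions())  # 16
```

You rarely have to think about partitions when you start out, but they are the mechanism that lets the same program run on a laptop or on a hundred machines.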
The problem with computers is that they crash or behave unpredictably once in a while. If instead of one you have a hundred, the chance that at least one of them goes down is now much higher. Spark therefore jumps through a lot of hoops to manage, scale, and babysit those poor, sometimes unstable little computers so you can focus on what you want, which is to work with data.