1. The Era of Big Data, Hadoop, and Other Big Data Processing Frameworks
When I first joined Orkut, I was happy. Orkut gave me a new platform enabling me to get to know the people around me, including their thoughts, their views, their purchases, and the places they visited. We were all gaining more knowledge than ever before and felt more connected to the people around us. Uploading pictures helped us share ideas about places to visit. I was becoming more and more addicted to understanding and expressing sentiments. After a few years, I joined Facebook, and day by day I was introduced to what became an infinite amount of information from all over the world. Next, I started purchasing items online, and I liked it more than shopping offline. I could easily find a lot of information about products, and I could compare prices and features. And I wasn't the only one; millions of people were feeling the same way about the Web.
More and more data was flooding onto the Web from every corner of the world. And thanks to advances in data storage systems, people could store this huge inflow of data.
More and more users joined the Web from all over the world, which increased the amount of data being added to these storage systems. The data came in the form of opinions, pictures, videos, and more. This data deluge forced users to adopt distributed systems, and distributed systems require distributed programming. We also know that distributed systems require extra care for fault tolerance and efficient algorithms. Distributed systems always need two things: reliability of the system and availability of all its components.
Apache Hadoop was introduced, ensuring efficient computation and fault-tolerance for distributed systems. Mainly, it concentrated on reliability and availability. Because Apache Hadoop was easy to program, many people became interested in big data. Big data became a popular topic for discussion everywhere. E-commerce companies wanted to know more about their customers, and the health-care industry was interested in gaining insights from the data collected, for example. More data metrics were defined. More data points started to be collected.
Many open source big data tools emerged, including Apache Tez and Apache Storm. This was also the time when many NoSQL databases emerged to deal with the huge data inflow. Apache Spark also evolved as a distributed system and became very popular during this period.
In this chapter, we are going to discuss big data as well as Hadoop as a distributed system for processing big data. In covering the components of Hadoop, we will also discuss Hadoop ecosystem frameworks such as Apache Hive and Apache Pig, and the usefulness of these ecosystem components, to give you an overview. Shedding light on some of the shortcomings of Hadoop will give you background on the development of Apache Spark. The chapter then moves on to a description of Apache Spark, and we will also discuss the various cluster managers that work with it. The chapter wouldn't be complete without discussing NoSQL, so a discussion of the NoSQL database HBase is also included. Sometimes we read data from a relational database management system (RDBMS); this chapter therefore also discusses PostgreSQL.
Big Data
Big data is one of the hot topics of this era. But what is big data? Big data describes a dataset that is huge and growing with amazing speed. Apart from this volume and velocity, big data is also characterized by its variety and veracity. Let's explore these terms (volume, velocity, variety, and veracity) in detail. They are also known as the 4V characteristics of big data, as illustrated in Figure 1-1.
Figure 1-1.
Characteristics of big data
Volume
The volume specifies the amount of data to be processed. A large amount of data requires large machines or distributed systems, and the time required for computation also increases with the volume of data. So it's better to go for a distributed system if we can parallelize our computation. The volume might consist of structured data, unstructured data, or a mix of both. If we have unstructured data, the situation becomes more complex and compute intensive. You might wonder: how big is big? What volume of data should be classified as big data? This is again a debatable question. But in general, we can say that an amount of data that we can't handle via a conventional system can be considered big data.
Velocity
Every organization is becoming more and more data conscious. A lot of data is collected every moment. This means that the velocity of data (the speed of the data flow and of data processing) is also increasing. How will a single system be able to handle this velocity? The problem becomes complex when we have to analyze a large inflow of data in real time. Each day, systems are being developed to deal with this huge inflow of data.
Variety
Sometimes the variety of data adds enough complexity that conventional data analysis systems can't analyze it well. What do we mean by variety? You might think data is just data, but this is not the case. Image data is different from simple tabular data, for example, because of the way it is organized and saved. In addition, a large number of file formats are in use, and every format requires a different way of being read and written. Reading and writing a JSON file, for instance, is different from the way we deal with a CSV file, as the short sketch below illustrates. Nowadays, a data scientist has to handle a combination of these data types: the data you are going to deal with might be a mix of pictures, videos, and text. This variety is what makes big data more complex to analyze.
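To make this concrete, here is a minimal Python sketch that reads the same kind of records from a CSV file and from a JSON file. The file names records.csv and records.json are hypothetical placeholders; the point is only that each format needs different parsing logic and yields differently structured objects.

import csv
import json

# CSV: rows come back as lists of strings, so any numeric or date
# fields must be converted by hand afterward.
with open("records.csv", newline="") as f:   # hypothetical file
    csv_rows = list(csv.reader(f))

# JSON: the parser returns nested dictionaries and lists, with
# numbers and booleans already converted to Python types.
with open("records.json") as f:              # hypothetical file
    json_data = json.load(f)

print(csv_rows[:2])
print(json_data[:2] if isinstance(json_data, list) else json_data)

Even for two plain-text formats, the reading code and the resulting in-memory structures differ; add images, video, and free text to the mix and the processing pipeline grows considerably more complex.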
Veracity
Can you imagine a logically incorrect computer program resulting in the correct output? Of course not. Similarly, data that is not accurate is going to provide misleading results. The veracity of data is one of the important concerns related to big data. When we consider the condition of big data, we have to think about any abnormalities in the data.
Hadoop
Hadoop is a distributed and scalable framework for solving big data problems. Hadoop, developed by Doug Cutting and Mike Cafarella, is written in Java. It can be installed on a cluster of commodity hardware, and it scales horizontally on distributed systems. Inspired by Google's research papers, Hadoop was developed to be easy to program. Its capability to work on commodity hardware makes it cost-effective. If we are working on commodity hardware, fault tolerance is an inevitable issue. But Hadoop provides a fault-tolerant system for data storage and computation, and this fault-tolerant capability has made Hadoop popular.
Hadoop has two components, as illustrated in Figure 1-2. The first component is the Hadoop Distributed File System (HDFS). The second component is MapReduce. HDFS is for distributed data storage, and MapReduce is for performing computation on the data stored in HDFS.
Figure 1-2.
Hadoop components
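To give a feel for the MapReduce programming model, here is a toy, single-machine Python sketch of a word count. This is not Hadoop code; a real Hadoop job distributes these phases across many nodes and reads its input from HDFS. The sketch only shows the shape of the logic: a map phase that emits key-value pairs, a shuffle that groups values by key, and a reduce phase that aggregates them.

from collections import defaultdict

# Toy input standing in for files stored in HDFS.
documents = [
    "big data needs distributed systems",
    "hadoop stores big data in hdfs",
]

# Map phase: emit (key, value) pairs -- here, (word, 1) for every word.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group all emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, 'hadoop': 1, ...}

Because each map call and each reduce call works independently on its own slice of the data, the framework can run them in parallel on different machines, which is what makes the model scale.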
HDFS
HDFS is used to store large amounts of data in a distributed and fault-tolerant fashion. HDFS is written in Java and runs on commodity hardware. It was inspired by a Google research paper about the Google File System (GFS). It is a write-once, read-many-times system that's effective for large amounts of data.
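The write-once, read-many pattern is easy to see from the HDFS command-line tools. The following Python sketch simply shells out to the standard hdfs dfs commands; it assumes a configured Hadoop client is on the PATH, and the file and directory names are hypothetical placeholders.

import subprocess

# Write once: copy a local file into HDFS (files are not edited in place).
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local_data.txt", "/user/demo/"], check=True)

# Read many: the stored file can be read back repeatedly, from any node.
result = subprocess.run(
    ["hdfs", "dfs", "-cat", "/user/demo/local_data.txt"],
    check=True, capture_output=True, text=True,
)
print(result.stdout)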