Big Data, MapReduce, Hadoop, and Spark with Python
Master Big Data Analytics and Data Wrangling with MapReduce Fundamentals using Hadoop, Spark, and Python
By: The LazyProgrammer (https://lazyprogrammer.me)

Introduction
What's the big deal with big data?
It was recently reported in the Wall Street Journal that the government is collecting so much data on its citizens that it can't even use it effectively. A few unicorns have popped up in the past decade or so, promising to solve the big data problems that billion-dollar corporations and the people running your country can't. It goes without saying that programming with frameworks that can do big data processing is a highly coveted skill. Machine learning and artificial intelligence algorithms, which have garnered increased attention (and fear-mongering) in recent years, mainly due to the rise of deep learning, are completely dependent on data to learn. The more data an algorithm learns from, the smarter it can become. The problem is that the amount of data we collect has outpaced gains in CPU performance.

Therefore, scalable methods for processing data are needed. In the early 2000s, Google invented MapReduce, a framework for systematically and methodically processing big data in a scalable way by distributing the work across multiple machines. Later, the technology was adopted into an open-source framework called Hadoop, and then Spark emerged as a newer big data framework that addressed some problems with MapReduce. In this book we will cover all three: the fundamental MapReduce paradigm, how to program with Hadoop, and how to program with Spark.

Advance your Career
If Spark is a better version of MapReduce, why are we even talking about MapReduce and Hadoop? Good question! Corporations, being slow-moving entities, are often still using Hadoop for historical reasons. Just search for "big data" and "Hadoop" on LinkedIn and you will see that there are a large number of high-salary openings for developers who know how to use Hadoop.

In addition to giving you deeper insight into how big data processing works, learning the fundamentals of MapReduce and Hadoop first will help you really appreciate how much easier Spark is to work with. Any startup or technical engineering team will appreciate a solid background in all of these technologies. Many will require you to know all of them, so that you can help maintain and patch their existing systems, and build newer, more efficient ones that improve on the performance and robustness of the old. Amazingly, all the technologies we discuss in this book can be downloaded and installed for FREE. That means all you need to invest after purchasing this book is your effort and your time. The only prerequisites are that you are comfortable with Python coding and the command-line shell.

For the machine learning chapter you'll want to be familiar with using machine learning libraries. BONUS: At the end of this book, I'll show you a super simple way to train a deep neural network on Spark with the classic MNIST dataset.

Formatting
I know that the e-book format can be quite limited on many platforms. If you find the formatting in this book lacking, particularly for the code or diagrams, please shoot me an email at along with a proof-of-purchase, and I will send you the original ePub from which this book was created.
Chapter 1: MapReduce Fundamentals
In this chapter we are going to look at how to program within the MapReduce framework and why it works. You'll see that MapReduce is effectively just a really fancy for-loop.

Once you master the MapReduce paradigm, coding with MapReduce simply becomes regular programming. Because MapReduce is designed to scale to big data, there is a lot of overhead involved in setting things up. Once you understand this overhead (both the computational overhead and the effort on your part), big data programming is just programming.

What is MapReduce?
The big question. You may have read that MapReduce is a framework that helps you process big data jobs in a scalable way. You may have read that MapReduce scales by parallelizing the work to be done.

You may have read that MapReduce proceeds in 2 steps: mapping, followed by reducing, which are done by processes called mappers and reducers, respectively. I don't find these descriptions particularly helpful. They certainly don't get you any closer to actually coding MapReduce jobs. They are more like a high-high-high-high-high level overview that perhaps a CEO who is not interested in the technical details would need to know. As a programmer, the above description is not useful. It is best to think about what the data looks like as we proceed from one step to the next.

The most straightforward way to think about big data processing is to think about text files. A text file consists of a number of lines, each of which can be read in one at a time. In Python, it might look something like this:

    for line in open('input.txt'):
        do_something(line)

As you can see, the time complexity of this code is O(N). It will take twice as long to process a file that is twice as large. An obvious solution would be to break up the work. If we have a dual-core CPU, we could split the file into 2 files of equal size: input1.txt and input2.txt.

Run the same program on each of the files; each will run on a separate core, thus taking half the time. If we have a quad-core CPU, we could split the file into 4 parts and it would take a quarter of the time. Before MapReduce came about, this is how you would scale: just split up the data, and spread the computation across different CPUs, even different machines (if they have network capabilities). The problem with this is that it's not systematic enough. It's not general enough that we can have just one framework that takes care of all the details for us.
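As a concrete illustration of this manual, pre-MapReduce style of scaling, here is a minimal sketch that splits the work across cores with Python's multiprocessing module. The file names and the do_something function are placeholders for illustration; they are not from the book.

    # Manual "split the data yourself" scaling: one worker process per file split.
    from multiprocessing import Pool

    def do_something(line):
        pass  # placeholder for whatever per-line work you actually need

    def process_file(filename):
        # sequentially process one split of the data
        for line in open(filename):
            do_something(line)

    if __name__ == '__main__':
        splits = ['input1.txt', 'input2.txt']   # pre-split input files
        pool = Pool(processes=len(splits))      # one worker per split
        pool.map(process_file, splits)          # each split runs on its own core
        pool.close()
        pool.join()

Notice everything this sketch does not handle: if one worker dies, nothing restarts its split, and if one split is much larger than the others, nothing rebalances the work. Those are exactly the details a framework should take care of.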

Some of these details might include:

- One machine dies during computation. How do you know what it was working on, so that you can then pass that job to another machine?
- How would you even keep track of the fact that it died?
- One machine is taking significantly longer than the others. How can you make sure you're spreading the work evenly?

The MapReduce framework includes a master node that keeps track of all these things, and even runs an HTTP server so that you can track the progress of a job. To understand the MapReduce framework as an algorithm, you don't need to understand which machine does what. What you do need to understand is the map function and the reduce function. [Side note: We focus on implementation in this book, because when you join a new company as a big data hire, these systems will likely already have been set up by the DevOps team.

Your job will be the coding part.] The best way is to learn by example, so that is what we are going to do now. The canonical example for MapReduce is word counting. You have a huge number of text documents (notice that this implies the data is already split up, not just one huge file), and you want to output a table showing each word and its corresponding count. If you were to do this in regular Python, it would look something like this:

    # count the words
    word_counts = {}
    for f in files:
        for line in open(f):
            words = line.split()
            for w in words:
                word_counts[w] = word_counts.get(w, 0) + 1

    # display the result
    for w, c in word_counts.iteritems():
        print "%s\t%s" % (w, c)

Note that in big data processing, we often print data out in a table format. Implementation-wise, this means we use CSV or some other character-delimited format (tabs and pipes are also common). What would this look like in MapReduce?

mapper.py

    for line in input:
        words = line.split()
        for w in words:
            print "%s\t%s" % (w, 1)

reducer.py

    for key, values in input:
        print "%s\t%s" % (key, sum(values))

Here, input is abstracted and can be different in different contexts.
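To make that data flow concrete, here is a minimal sketch that simulates the whole pipeline locally in plain Python: it runs the mapper logic over a few sample lines, groups the emitted (key, value) pairs by key (the shuffle step the framework normally performs for you), and then runs the reducer logic on each group. The sample lines and the helper names mapper and reducer are assumptions for illustration; a real job would run these as separate scripts under Hadoop or Spark, not in one process.

    # Local, single-process simulation of the word-count MapReduce job.
    from collections import defaultdict

    def mapper(line):
        # emit a (word, 1) pair for every word on the line
        for w in line.split():
            yield (w, 1)

    def reducer(key, values):
        # sum all the counts emitted for one word
        return (key, sum(values))

    lines = ["the quick brown fox", "the lazy dog"]   # stand-in for the input files

    # "Shuffle": group every emitted value by its key, as the framework would.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)

    # Reduce each group and print the results as tab-separated rows.
    for key in sorted(grouped):
        word, count = reducer(key, grouped[key])
        print("%s\t%s" % (word, count))

The point is that your code only ever defines the map step and the reduce step; distributing the splits across machines, grouping by key, and recovering from failures are the framework's job.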
