Apache Spark

INVENT THE FUTURE

ERNESTO LEE

APACHE SPARK

Copyright 2021 by ERNESTO LEE

All rights reserved.

No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written consent of the publisher. Short extracts may be used for review purposes.

Table of Contents

Processing Data with MapReduce

3Vs of Hadoop

Introduction to Spark

What is Spark?

Why Spark?

Components of Spark

Spark Data Storage

Various Spark Versions

Job Workflow in Spark

CHAPTER 1
INTRODUCTION TO APACHE SPARK
Theory

This chapter provides a comprehensive introduction to Apache Spark, the focus of this book. In the upcoming chapters, we will also introduce the Scala programming language, which we will use to interact with Spark. But before we begin with Spark, let's take a brief look at Hadoop and compare it with Spark.

An Overview of Big Data
Quick Introduction to Hadoop

Apache Hadoop is an open source distributed framework that allows the storage and processing of large datasets (also known as Big Data) across clusters of commodity machines. Hadoop overcomes the traditional limitations of data storage and computation by distributing data over a cluster of commodity machines, making it both scalable and cost-effective.

The basic idea of Hadoop originated when Google released a white paper about the Google File System (GFS), a distributed file system designed to provide efficient and reliable access to data using large clusters of commodity hardware. Doug Cutting and Mike Cafarella subsequently adopted this model for their search engine, Nutch, and Hadoop grew out of their work to support task and data distribution for the Nutch project. So what does the name Hadoop mean? The name has no particular significance, and it is not an acronym either: Hadoop is the name Doug Cutting's son gave to his yellow stuffed elephant. Doug found the name unique, easy to remember, and a little funny. Many of Hadoop's sub-projects carry similar animal-inspired names, such as Pig, for much the same reasons: the names are unique, not used anywhere else, and easy to remember.

Why Hadoop?

Companies today are realizing that a great deal of valuable information sits in unstructured documents spread across their networks. A lot of data is available as spreadsheets, text files, e-mails, logs, PDFs, and other formats, and it contains information that can help discover new trends, design new products, improve existing ones, and understand customers better. With the emergence of many technologies that rely on Big Data, research studies consistently report that data volumes are growing at an alarming rate, with no signs of slowing down in the near future. To deal with such data, we need a reliable, low-cost tool that can process it meaningfully. Hadoop was developed as exactly that: a tool that reliably processes Big Data in a variety of formats, quickly, flexibly, and cost-effectively.

Let us see why Hadoop is so popular and what it has in store for us.

Scalable: We can start with a single-node server and add more nodes as our storage and computing needs grow.
Fault-Tolerant: Hadoop helps prevent loss of data. All data stored in the Hadoop Distributed File System (HDFS) is broken into blocks and stored with a default replication factor of 3. If a node goes down while this data is being processed, the job does not stop; it continues, because the data still exists on other nodes thanks to replication.
Flexible: Hadoop does not require a schema. It can process unstructured, semi-structured, and structured data from any kind of source, or even from multiple sources.
Cost effective: Hadoop does not require expensive, high-end computing hardware. It works efficiently with a cluster of commodity machines, using parallel computing.
Quick Introduction to Hadoop Distributed File System

Hadoop Distributed File System (HDFS) is a file system that extends over a cluster of commodity machines rather than a single high-end machine (e.g., a supercomputer). HDFS is a distributed, large-scale storage component and is highly scalable. Moreover, HDFS can tolerate node failures without any loss of data, and it is widely known for its reliability. Let us now look at why HDFS stands out among distributed file systems.

Reliable Data Storage

HDFS is very reliable when it comes to data storage. Data stored in HDFS is replicated with a default replication factor of 3, which means that even if a machine fails, the data is still available on two other machines.
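To make this concrete, here is a minimal Scala sketch using Hadoop's standard FileSystem API to inspect and change the replication factor of a single file. The path /data/sample.txt is a hypothetical example, and we assume the cluster configuration is available on the classpath:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ReplicationCheck {
  def main(args: Array[String]): Unit = {
    // Assumes the cluster configuration (fs.defaultFS etc.) is on the
    // classpath, e.g. via HADOOP_CONF_DIR.
    val fs = FileSystem.get(new Configuration())
    val file = new Path("/data/sample.txt") // hypothetical path

    // Read the current replication factor of the file.
    val status = fs.getFileStatus(file)
    println(s"Replication factor of $file: ${status.getReplication}")

    // Request more copies for this one file; the NameNode creates the
    // additional replicas asynchronously.
    fs.setReplication(file, 5.toShort)
  }
}

Note that replication is a per-file property: the cluster-wide default of 3 applies only when a file is created without an explicit value.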

Cost Effective

HDFS does not require any high-end or expensive hardware to function. It can be deployed on clusters built from ordinary commodity hardware, which can save us a lot of money.

Big Datasets

HDFS is capable of storing petabytes of data over a cluster of machines, where a single file can range in size from gigabytes to terabytes. HDFS is not, however, designed to store a huge number of small files, because the file system metadata is held in the memory of the NameNode.
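To see why small files are a problem, consider a commonly cited (approximate) rule of thumb: every file and block object consumes on the order of 150 bytes of NameNode heap. The short Scala sketch below works through the arithmetic for roughly the same 10 TB of data stored as many small files versus a few large ones. The 150-byte figure and the file counts are illustrative assumptions, not exact measurements:

object NameNodeMemoryEstimate {
  // Illustrative rule of thumb: each file object and each block object
  // occupies roughly 150 bytes of NameNode heap (an approximation).
  val BytesPerObject = 150L

  // One metadata object per file plus one per block of that file.
  def estimatedHeapBytes(files: Long, blocksPerFile: Long): Long =
    files * (1 + blocksPerFile) * BytesPerObject

  def main(args: Array[String]): Unit = {
    // ~10 TB as ten million 1 MB files (1 block each)...
    val small = estimatedHeapBytes(files = 10000000L, blocksPerFile = 1L)
    // ...versus ~10 TB as eighty 128 GB files (1024 blocks of 128 MB each).
    val large = estimatedHeapBytes(files = 80L, blocksPerFile = 1024L)
    println(f"Small files: ${small / 1e9}%.1f GB of NameNode heap")
    println(f"Large files: ${large / 1e6}%.1f MB of NameNode heap")
  }
}

Under these assumptions, the small-file layout needs about 3 GB of NameNode heap for metadata alone, while the large-file layout needs only about 12 MB: the same data, but a difference of two orders of magnitude in metadata footprint.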

Streaming Data Access

HDFS provides streaming access to data. It is best suited for batch processing and is not suitable for interactive workloads. In particular, it is not designed for applications that require low-latency access to data, such as OnLine Transaction Processing (OLTP).
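As a small illustration, this Scala sketch opens a file and reads it sequentially from the beginning, which is the access pattern HDFS is optimized for. The path is hypothetical, and we assume the Hadoop client libraries and cluster configuration are available:

import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object StreamingRead {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    // open() returns a stream intended to be consumed sequentially,
    // start to finish; there is no efficient random read-modify cycle.
    val in = fs.open(new Path("/data/clickstream.log")) // hypothetical path
    try {
      // Stream the first few lines without loading the file into memory.
      Source.fromInputStream(in).getLines().take(10).foreach(println)
    } finally {
      in.close()
    }
  }
}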

Simple Coherency Model

HDFS is designed around a write-once, read-many access model for files. Appending content to the end of a file is supported, but files cannot be updated at an arbitrary offset, and multiple concurrent writers are not allowed: a file can be written by only a single writer at a time.
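The sketch below shows this coherency model in code: a single writer creates a file, closes it, and later appends to the end. There is no FileSystem call for overwriting bytes at an arbitrary offset. The path is hypothetical:

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object WriteOnceThenAppend {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val file = new Path("/logs/events.log") // hypothetical path

    // A single writer creates the file once and closes it.
    val out = fs.create(file)
    out.write("first record\n".getBytes(StandardCharsets.UTF_8))
    out.close()

    // Appending at the end is supported; a second concurrent writer
    // on the same file would be rejected by the NameNode.
    val appender = fs.append(file)
    appender.write("second record\n".getBytes(StandardCharsets.UTF_8))
    appender.close()

    // There is no API to rewrite bytes in the middle of the file:
    // random-offset updates are simply not part of the HDFS model.
  }
}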

Block Placement in HDFS

Hadoop is designed so that the first block replica is placed on the same node as the client, while the second replica is placed on a different rack from the first. The third replica is then placed on a random node on the same rack as the second replica. If the replication factor is greater than 3, random nodes in the cluster are selected for the remaining replicas. If a client running outside the cluster stores a file, a random node (one that is not busy) is automatically picked for the first replica. This way, if a node fails, the data is still available on other nodes of the cluster, and even if an entire rack fails, the data is still intact.
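We can observe the result of this placement policy from a client. The Scala sketch below asks the NameNode for the location of every block of a file and prints the hosts and racks holding each replica; the file path is a hypothetical example:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ShowBlockPlacement {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val file = new Path("/data/large-input.csv") // hypothetical path

    val status = fs.getFileStatus(file)
    // Ask the NameNode where every block of this file is stored.
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)

    blocks.zipWithIndex.foreach { case (block, i) =>
      println(s"Block $i: hosts = ${block.getHosts.mkString(", ")}; " +
              s"racks = ${block.getTopologyPaths.mkString(", ")}")
    }
  }
}

On a cluster with rack awareness configured, the printed topology paths should show each block's replicas spread across at least two racks, matching the policy described above.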

HDFS Architecture

HDFS is designed as a Master and Slave architecture, in which the Master node controls and assigns work to all of its Slave nodes. The following terminology is used to describe the Master and Slave nodes:

The Master Nodes in HDFS are called:

NameNode
Secondary NameNode

The Slave Nodes in HDFS are called:

DataNodes

These nodes perform the core serving roles in the HDFS architecture. Let us now look in detail at the role of each node to better understand them.
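Before we do, it is worth noting how a client fits into this architecture: every metadata operation goes to the Master (the NameNode) first. The Scala sketch below points a client at a NameNode and lists the root directory; the listing is answered entirely from the NameNode's in-memory metadata, without contacting any DataNode. The host name and port are placeholders for your own cluster:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListRootViaNameNode {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // The NameNode is the single entry point for metadata operations.
    // The host and port below are placeholders, not a real cluster address.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020")

    val fs = FileSystem.get(conf)
    // listStatus is served from the NameNode's in-memory namespace;
    // DataNodes are only involved when block data is actually read.
    fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
  }
}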
