Apache Spark
INVENT THE FUTURE
ERNESTO LEE
APACHE SPARK
Copyright 2021 by ERNESTO LEE
All rights reserved.
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written consent of the publisher. Short extracts may be used for review purposes.
Table of Contents
Processing Data with MapReduce
3Vs of Hadoop
Introduction to Spark
What is Spark?
Why Spark?
Components of Spark
Spark Data Storage
Various Spark Versions
Job Workflow in Spark
INTRODUCTION TO APACHE SPARK
This chapter introduces Apache Spark, which is the focus of this book. The upcoming chapters describe the Scala programming language, which we will use to interact with Spark. Before we begin with Spark, let's briefly introduce Hadoop so that we can compare it with Spark.
Apache Hadoop is an open source distributed framework for storing and processing large datasets (also known as Big Data) across clusters of commodity machines. Hadoop overcomes the storage and computation limits of a single machine by distributing the data over a cluster of commodity machines, which makes it scalable and cost-effective.
The basic idea of Hadoop originated when Google released a white paper about the Google File System (GFS), a distributed file system developed by Google to provide efficient and reliable access to data using large clusters of commodity hardware. Doug Cutting and Mike Cafarella subsequently adopted this model for their search engine, Nutch, and then developed Hadoop to support task and data distribution for the Nutch project. What does the name Hadoop mean? It has no particular significance and it is not an acronym. Hadoop is the name that Doug Cutting's son gave to his yellow stuffed elephant. Doug found the name unique, easy to remember, and a little funny. Hadoop's sub-projects, such as Pig, also tend to have animal-based names chosen for similar reasons: they are unique, not used anywhere else, and easy to remember.
Companies today are realizing that a lot of information sits in unstructured documents spread across their networks. Spreadsheets, text files, e-mails, logs, PDFs, and other data formats contain valuable information that can help discover new trends, design new products, improve existing products, and understand customers better. With many advanced technologies now relying on Big Data, many research studies report that the volume of data is growing rapidly and shows no sign of slowing down in the near future. To deal with such data, we need a reliable and low-cost tool that can process it meaningfully. Hadoop was developed as such a tool: it helps us reliably process Big Data, which arrives in a variety of formats, in far less time and in a flexible and cost-effective way.
Let us see why Hadoop is so popular and what it has in store for us.
The Hadoop Distributed File System (HDFS) is a file system that extends over a cluster of commodity machines rather than a single high-end machine (e.g., a supercomputer). HDFS is a distributed, large-scale storage component and is highly scalable. Moreover, HDFS can tolerate node failures without any loss of data, and it is widely known for its reliability. Let us now check out why HDFS stands out from the crowd when it comes to distributed file systems.
Reliable Data Storage | HDFS stores data very reliably. Data stored in HDFS is replicated with a default replication factor of 3, which means that even if a machine fails, the data is still available on two other machines. |
Cost Effective | HDFS can be deployed on custom-built clusters of commodity hardware. It does not require high-end or expensive hardware to function, which can save a lot of money. |
Big Datasets | HDFS can store petabytes of data across a cluster of machines, and a single file can range from gigabytes to terabytes in size. HDFS is not designed to store a huge number of small files, because the file system metadata is held in the memory of the NameNode. |
Streaming Data Access | HDFS provides streaming access to data. It is best suited to batch processing and is not suitable for interactive processing. It is also not designed for applications that require low-latency access to data, such as OnLine Transaction Processing (OLTP). |
Simple Coherency Model | HDFS is designed around a write-once, read-many access model for files. Appending content to the end of a file is supported, but a file cannot be updated at an arbitrary point, and a file can be written by only a single writer at a time. |
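The replication and small-file points in the table above can be made concrete with a little arithmetic. The following is only an illustrative sketch: the function names are hypothetical, and the 128 MB block size is an assumption that this chapter does not state (it is a common default in recent Hadoop releases, but it is configurable).

```python
# Illustrative arithmetic only; none of these names come from Hadoop itself.
MB = 1024 ** 2
GB = 1024 ** 3

# ASSUMPTION: 128 MB blocks. The chapter does not give a block size.
ASSUMED_BLOCK_BYTES = 128 * MB


def raw_storage_needed(logical_bytes, replication_factor=3):
    """Raw disk consumed when every byte is stored replication_factor times."""
    return logical_bytes * replication_factor


def tolerated_replica_losses(replication_factor=3):
    """How many copies of a block can be lost while one copy still survives."""
    return replication_factor - 1


def block_count(file_bytes, block_bytes=ASSUMED_BLOCK_BYTES):
    """Number of blocks a file of the given size is split into (ceiling division)."""
    return (file_bytes + block_bytes - 1) // block_bytes


# The same 1 GB of data, stored as one large file or as 1024 files of 1 MB each.
blocks_for_one_large_file = block_count(1 * GB)
blocks_for_many_small_files = sum(block_count(1 * MB) for _ in range(1024))
```

Under these assumptions, the single large file needs far fewer block entries than the many small files holding the same amount of data. Since the NameNode keeps this metadata in memory, this suggests why a huge number of small files is a poor fit for HDFS.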
Hadoop places the first block replica on the same node as the client, and the second replica on a node in a different rack from the first. The third replica is placed on a random node in the same rack as the second replica. If the replication factor is greater than 3, random nodes in the cluster are selected for the additional replicas. If a client running outside the cluster stores a file, a random node (one that is not busy) is automatically picked for the first replica. This way, if a node fails, the data is still available on other nodes of the cluster, and if an entire rack fails, a copy of the data still exists on another rack.
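The placement rules described above can be sketched as a small simulation. This is a simplified model, not Hadoop's actual implementation: the function `place_replicas`, the cluster layout, and the node names are all hypothetical, and the "not busy" check for outside clients is ignored. The fallback used when a rule cannot be satisfied (for example, in a single-rack cluster) is also an assumption of this sketch.

```python
import random


def place_replicas(cluster, client_node=None, replication_factor=3, rng=random):
    """Pick (rack, node) pairs for one block, following the rules in the text.

    cluster maps a rack name to the list of node names in that rack.
    client_node is the writer's node, or None for a client outside the cluster.
    """
    all_nodes = [(rack, node) for rack, nodes in cluster.items() for node in nodes]
    if replication_factor > len(all_nodes):
        raise ValueError("not enough nodes for the requested replication factor")

    placed = []

    def choose(preferred):
        # Prefer nodes that satisfy the rule; otherwise fall back to any unused
        # node (ASSUMPTION: the text does not say what happens in that case).
        unused = [p for p in preferred if p not in placed]
        if not unused:
            unused = [p for p in all_nodes if p not in placed]
        placed.append(rng.choice(unused))

    # Replica 1: the client's own node, or a random node for an outside client.
    local = [p for p in all_nodes if p[1] == client_node]
    choose(local if local else all_nodes)

    # Replica 2: a node on a different rack from replica 1.
    if replication_factor >= 2:
        choose([p for p in all_nodes if p[0] != placed[0][0]])

    # Replica 3: a different node on the same rack as replica 2.
    if replication_factor >= 3:
        choose([p for p in all_nodes if p[0] == placed[1][0]])

    # Any further replicas: random unused nodes anywhere in the cluster.
    while len(placed) < replication_factor:
        choose(all_nodes)

    return placed


# Hypothetical three-rack cluster; the client runs on node "n1" in "rack1".
example_cluster = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
    "rack3": ["n7", "n8", "n9"],
}
example_placement = place_replicas(example_cluster, client_node="n1")
```

With a cluster of at least two racks, every block ends up on two different racks, so the loss of a single node or a single rack leaves at least one replica reachable.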
HDFS follows a Master and Slave architecture, in which the Master node controls its Slave nodes and assigns work to them. The following terms are used to describe the Master and Slave nodes:
The Master Node in HDFS is called the NameNode.
The Slave Nodes in HDFS are called DataNodes.
These nodes perform the core serving roles in the HDFS architecture. Let us now look in detail at the role of each node.