Notion Press Media Pvt Ltd
No. 50, Chettiyar Agaram Main Road,
Vanagaram, Chennai, Tamil Nadu 600 095

First Published by Notion Press 2022
Copyright © Samiya Khan 2022
All Rights Reserved.

eISBN 979-8-88530-488-7

This book has been published with all efforts taken to make the material error-free after the consent of the author. However, the author and the publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause.

While every effort has been made to avoid any mistake or omission, this publication is being sold on the condition and understanding that neither the author nor the publishers or printers would be liable in any manner to any person by reason of any mistake or omission in this publication or for any action taken or omitted to be taken or advice rendered or accepted on the basis of this work. For any defect in printing or binding the publishers will be liable only to replace the defective copy by another copy of this work then available.
PREFACE

"Big data is at the foundation of all of the megatrends that are happening today, from social to mobile to the cloud to gaming."
– Chris Lynch

The data deluge has been one of the biggest concerns for the scientific community in recent times. As a result of mass digitization across domains and borders, the data pool has grown enormous, overwhelming traditional systems for storage and processing. Organizations were forced to decide whether to discard this data or archive it. The decision to maintain a data reserve rests on the fact that this data can steer many organization-level decisions and revolutionize how problem-solving and decision-making are performed.
In a popular article, Chris Anderson, editor of Wired magazine, explained how the scientific method itself has become obsolete in the big data era. The scientific method rests on proposing a hypothetical solution to a problem and then collecting data to test that solution. Today, however, we have an abundance of data, and instead of looking for solutions to problems, we are looking for problems that can be solved using the data already available to us. Big data has changed the equation remarkably. Since data is the heart and soul of this technology, its applicability is extensive; big data is used in varied streams, from geostatistics to bioinformatics.
Personalized medicine and smart cities have been among the most widely accepted real-life applications of big data technologies. Throughout the journey of big data, from the inception of the concept to its present-day position in the market, Hadoop has been a constant on the list of big data technologies. Moreover, Hadoop has evolved with changing demands and market understanding, adding solutions such as Spark to expand its capabilities and become a complete big data framework for application development. This book is a beginner's guide to big data and Hadoop. It is organized so that it first describes the big data problem and the need for Hadoop; thereafter, each component of the Hadoop ecosystem is covered individually.
Chapters 1 to 9 provide theory lessons, while Chapters 10 to 20 are lab tutorials that can be aligned with the theory chapters for a better understanding of the subject. Together, the theoretical and practical coverage of the book should help the reader connect the dots between the different processes of the big data lifecycle, facilitating the development of comprehensive solutions for complex big data problems. Chapter 1 introduces the big data problem and gives an eagle's-eye view of Hadoop, a distributed programming framework. It also provides a comparison between Hadoop and RDBMS. Chapter 2 delves deeper into the Hadoop ecosystem and HDFS architecture. There are two Hadoop versions available, Hadoop 1.x and Hadoop 2.x.
This chapter points out the differences between the two versions and describes the key components of the Hadoop ecosystem. Storage and processing are the two main functions of Hadoop; to show how Hadoop performs these functions, the cluster architecture and YARN are explained in detail. In addition, basic Hadoop operations such as data placement, reading, and writing are illustrated with the help of sequence diagrams. Finally, the chapter describes the cluster modes in which Hadoop operates and the configuration files that can be changed to modify the operational characteristics of the system. The MapReduce programming paradigm is implemented using Hadoop.
Chapter 3 first describes the basics of this programming paradigm and compares it with the conventional form of programming. Then, the workflows for application execution and job submission are explained diagrammatically. Finally, a working example of MapReduce code is provided to cement the concept. This chapter aims to help the reader understand how the MapReduce programming paradigm can be implemented in Hadoop using the Java programming language. Chapter 4 builds on the foundations of MapReduce programming laid in the previous chapter. Advanced programming concepts like Input Splits, Partitioners, Combiners, Counters, and Input Formats are discussed in this chapter.
In addition, Map-Side and Reduce-Side Joins are explained, along with a discussion of which type of join operation should be used in which programming scenario. Lastly, this chapter demonstrates the need for and use of the MRUnit testing framework, a tool commonly used to test and debug MapReduce code. Companies like Yahoo and Facebook use Hadoop at the back end to analyze their data. However, their data analysts are often not well versed in NoSQL and usually have little to no expert knowledge of advanced programming languages. As a result, abstraction tools like Pig and Hive have been developed to reduce the programming effort required for common data analysis tasks. Chapter 5 is dedicated to Pig and provides the reader with a complete theoretical background of the tool.
Topics such as its key characteristics, performance issues, limitations, and applications are covered, and the basics of Pig scripting are also elucidated. Although Pig and Hive serve the same purpose, they are dissimilar solutions developed by different companies. The differences between Pig and Hive are elaborated upon in Chapter 6. Moreover, Hive is also compared with traditional RDBMS to give the reader a holistic view. Finally, Hive's architecture, components, limitations, and scripting are explained to equip the reader with enough knowledge about the tool to get started with its practical aspects.
Chapter 7 is a detailed description of NoSQL databases and their classification. HBase is the NoSQL solution available as part of the Hadoop ecosystem; however, users can integrate other NoSQL solutions with Hadoop to achieve better performance. For this reason, NoSQL is discussed as a complete topic. The chapter focuses on HBase, covering its basic concepts, use cases, components, and storage architecture. ZooKeeper is also covered in this chapter because of its inseparable association with HBase processes.
Lastly, basic working knowledge of HBase is provided to help the reader get started with its practical facets. Oozie is the topmost-level component of the Hadoop ecosystem; it allows developers to specify workflows, or sets of Hadoop tasks, that must be executed in sequence to accomplish a specific task. If a task requires repeated execution, specifying the workflow and instructing Oozie to time its execution as per development requirements can prove highly efficient. Chapter 8 explains Oozie and its functional components, highlighting how jobs can be scheduled using this component of Hadoop. Hadoop is an effective framework for distributed data processing. However, it lacks the statistical and analytical capabilities required to address complex big data problems.