1. The Era of Big Data, Hadoop, and Other Big Data Processing Frameworks
When I first joined Orkut, I was happy. Orkut gave me a new platform enabling me to get to know the people around me, including their thoughts, their views, their purchases, and the places they visited. We were all gaining more knowledge than ever before and felt more connected to the people around us. Uploading pictures helped us share ideas about places to visit. I was becoming more and more addicted to understanding and expressing sentiments. After a few years, I joined Facebook, and day by day I was introduced to what became an infinite amount of information from all over the world. Next, I started purchasing items online, and I liked it more than shopping offline. I could easily find a lot of information about products, and I could compare prices and features. And I wasn't the only one; millions of people were feeling the same way about the Web.
More and more data was flooding onto the Web from every corner of the world. And thanks to advances in data storage systems, people could store this huge inflow of data.
More and more users joined the Web from all over the world, which increased the amount of data being added to these storage systems. The data came in the form of opinions, pictures, videos, and more. This data deluge forced users to adopt distributed systems, and distributed systems require distributed programming. We also know that distributed systems require extra care for fault tolerance and efficient algorithms. Distributed systems always need two things: reliability of the system and availability of all its components.
Apache Hadoop was introduced, ensuring efficient computation and fault-tolerance for distributed systems. Mainly, it concentrated on reliability and availability. Because Apache Hadoop was easy to program, many people became interested in big data. Big data became a popular topic for discussion everywhere. E-commerce companies wanted to know more about their customers, and the health-care industry was interested in gaining insights from the data collected, for example. More data metrics were defined. More data points started to be collected.
Many open source big data tools emerged, including Apache Tez and Apache Storm. This was also the time when many NoSQL databases emerged to deal with the huge data inflow. Apache Spark also evolved as a distributed system and became very popular during this period.
In this chapter, we are going to discuss big data as well as Hadoop as a distributed system for processing big data. In covering the components of Hadoop, we will also discuss Hadoop ecosystem frameworks such as Apache Hive and Apache Pig, and the usefulness of these ecosystem components, to give you an overview. Shedding light on some of the shortcomings of Hadoop will give you background on the development of Apache Spark. The chapter then moves on to a description of Apache Spark, and we will also discuss the various cluster managers that work with it. The chapter wouldn't be complete without discussing NoSQL, so a discussion of the NoSQL database HBase is also included. Sometimes we read data from a relational database management system (RDBMS); this chapter therefore also discusses PostgreSQL.
Big Data
Big data is one of the hot topics of this era. But what is big data? Big data describes a dataset that is huge and growing with amazing speed. Apart from this volume and velocity, big data is also characterized by its variety and veracity. Let's explore these terms (volume, velocity, variety, and veracity) in detail. They are also known as the 4V characteristics of big data, as illustrated in Figure 1-1.
Figure 1-1.
Characteristics of big data
Volume
The volume specifies the amount of data to be processed. A large amount of data requires large machines or distributed systems, and the time required for computation also increases with the volume of data. So it's better to go for a distributed system if we can parallelize our computation. The volume might consist of structured data, unstructured data, or a mix of both. If we have unstructured data, the situation becomes more complex and compute intensive. You might wonder: how big is big? What volume of data should be classified as big data? This is again a debatable question. But in general, we can say that an amount of data that we can't handle via a conventional system can be considered big data.
Velocity
Every organization is becoming more and more data conscious. A lot of data is collected every moment. This means that the velocity of data (the speed of the data flow and of data processing) is also increasing. How will a single system be able to handle this velocity? The problem becomes complex when we have to analyze a large inflow of data in real time. Each day, systems are being developed to deal with this huge inflow of data.
Variety
Sometimes the variety of data adds enough complexity that conventional data analysis systems can't analyze it well. What do we mean by variety? You might think data is just data, but this is not the case. Image data is different from simple tabular data, for example, because of the way it is organized and saved. In addition, a large number of file formats are in use, and every format requires a different way of being read and written. Reading and writing a JSON file, for instance, is different from the way we deal with a CSV file, as the short sketch below illustrates. Nowadays, a data scientist has to handle a combination of these data types: the data you are going to deal with might be a mix of pictures, videos, and text. This variety is what makes big data more complex to analyze.
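To make this concrete, here is a minimal Python sketch that reads the same kind of records from a CSV file and from a JSON file. The file names records.csv and records.json are hypothetical placeholders; the point is only that each format needs different parsing logic and yields differently structured objects.

import csv
import json

# CSV: rows come back as lists of strings, so any numeric or date
# fields must be converted by hand afterward.
with open("records.csv", newline="") as f:   # hypothetical file
    csv_rows = list(csv.reader(f))

# JSON: the parser returns nested dictionaries and lists, with
# numbers and booleans already converted to Python types.
with open("records.json") as f:              # hypothetical file
    json_data = json.load(f)

print(csv_rows[:2])
print(json_data[:2] if isinstance(json_data, list) else json_data)

Even for two plain-text formats, the reading code and the resulting in-memory structures differ; add images, video, and free text to the mix and the processing pipeline grows considerably more complex.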
Veracity
Can you imagine a logically incorrect computer program resulting in the correct output? Of course not. Similarly, data that is not accurate is going to provide misleading results. The veracity of data is one of the important concerns related to big data. When we consider the condition of big data, we have to think about any abnormalities in the data.
Hadoop
Hadoop is a distributed and scalable framework for solving big data problems. Hadoop, developed by Doug Cutting and Mike Cafarella, is written in Java. It can be installed on a cluster of commodity hardware, and it scales horizontally on distributed systems. Inspired by Google's research papers, Hadoop was developed to be easy to program. Its capability to work on commodity hardware makes it cost-effective. If we are working on commodity hardware, fault tolerance is an inevitable issue. But Hadoop provides a fault-tolerant system for data storage and computation, and this fault-tolerant capability has made Hadoop popular.
Hadoop has two components, as illustrated in Figure 1-2. The first component is the Hadoop Distributed File System (HDFS). The second component is MapReduce. HDFS is for distributed data storage, and MapReduce is for performing computation on the data stored in HDFS.
Figure 1-2.
Hadoop components
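To give a feel for the MapReduce programming model, here is a toy, single-machine Python sketch of a word count. This is not Hadoop code; a real Hadoop job distributes these phases across many nodes and reads its input from HDFS. The sketch only shows the shape of the logic: a map phase that emits key-value pairs, a shuffle that groups values by key, and a reduce phase that aggregates them.

from collections import defaultdict

# Toy input standing in for files stored in HDFS.
documents = [
    "big data needs distributed systems",
    "hadoop stores big data in hdfs",
]

# Map phase: emit (key, value) pairs -- here, (word, 1) for every word.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group all emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, 'hadoop': 1, ...}

Because each map call and each reduce call works independently on its own slice of the data, the framework can run them in parallel on different machines, which is what makes the model scale.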
HDFS
HDFS is used to store large amounts of data in a distributed and fault-tolerant fashion. HDFS is written in Java and runs on commodity hardware. It was inspired by a Google research paper about the Google File System (GFS). It is a write-once, read-many-times system that's effective for large amounts of data.
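The write-once, read-many pattern is easy to see from the HDFS command-line tools. The following Python sketch simply shells out to the standard hdfs dfs commands; it assumes a configured Hadoop client is on the PATH, and the file and directory names are hypothetical placeholders.

import subprocess

# Write once: copy a local file into HDFS (files are not edited in place).
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local_data.txt", "/user/demo/"], check=True)

# Read many: the stored file can be read back repeatedly, from any node.
result = subprocess.run(
    ["hdfs", "dfs", "-cat", "/user/demo/local_data.txt"],
    check=True, capture_output=True, text=True,
)
print(result.stdout)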