1. The Big Data Phenomenon
We are inundated with data. Data from Twitter microblogs, YouTube and surveillance videos, Instagram pictures, SoundCloud audio, enterprise applications, and many other sources are part of our daily life. Computing has come of age, making machine-readable data pervasive and enabling us to leverage it for the advancement of humanity. This Big Data phenomenon is the new information revolution, one that no IT professional can afford to miss being part of. Big Data Analytics has proven to be a game changer in the way businesses provide their services. Business models are getting better, operations are becoming intelligent, and revenue streams are growing.
Uncertainty is probably the biggest impeding factor in the economic evolution of mankind. Thankfully, Big Data helps to deal with uncertainty. The more we know about an entity, the more we can learn about it and thereby reduce the uncertainty. For instance, analyzing continuous data about customers' buying patterns enables stores to predict changes in demand and stock accordingly. Big Data is helping businesses understand customers better so that they can be served better. By analyzing consumer data from various sources, such as Online Social Networks (OSN), mobile app usage, and purchase transaction records, businesses are able to personalize their offerings. Computational statistics and Machine Learning algorithms can hypothesize patterns from Big Data to help achieve this personalization.
Web 2.0, which includes the OSN, is one significant source of Big Data. Another major contributor is the Internet of Things. The billions of devices connecting to the Internet generate Petabytes of data. It is a well-known fact that businesses collect as much data as they can about consumers: their preferences, purchase transactions, opinions, individual characteristics, browsing habits, and so on. Consumers themselves generate substantial chunks of data in the form of reviews, ratings, direct feedback, video recordings, pictures, and detailed documents such as demos, troubleshooting guides, and tutorials on using the products, exploiting the expressiveness of Web 2.0 and thus contributing to the Big Data.
From this list of data sources, it is easy to see that collecting data is relatively inexpensive. There are a number of other technology trends too that are fueling the Big Data phenomenon. High Availability systems and storage, drastically declining hardware costs, massive parallelism in task execution, high-speed networks, new computing paradigms such as cloud computing, high-performance computing, innovations in Analytics and Machine Learning algorithms, new ways of storing unstructured data, and ubiquitous access to computing devices such as smartphones and laptops are all contributing to the Big Data revolution.
Human beings are intelligent because their brains are able to collect inputs from various sources, connect them, and analyze them to look for patterns. Big Data and the algorithms associated with it help achieve the same using compute power. Fusing data from disparate sources can yield surprising insights into the entities involved. For instance, if plenty of flu symptoms are being reported on OSN from a particular geographical location and credit card transactions show a surge in purchases of flu medication in that area, it is quite likely that a flu outbreak is setting in. Given that Big Data makes no sense without the tools to collect, combine, and analyze data, some proponents even argue that Big Data is not really data, but a technology comprising tools and techniques to extract value from huge sets of data.
Note
Generating value from Big Data can be thought of as comprising two major functions: fusion, the coming together of data from various sources; and fission, analyzing that data.
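The flu example above can be turned into a minimal sketch of this fusion and fission idea. The region names, counts, and thresholds below are purely hypothetical, hand-made values; a real system would pull such signals from OSN feeds and transaction systems.

# Fusion: bring together two independent signals per region (hypothetical counts).
osn_flu_mentions = {"region_a": 540, "region_b": 35, "region_c": 410}
flu_med_purchases = {"region_a": 880, "region_b": 120, "region_c": 790}

# Fission: analyze the fused data, flagging regions where both signals
# exceed simple, purely illustrative thresholds.
MENTION_THRESHOLD = 300
PURCHASE_THRESHOLD = 500

for region, mentions in osn_flu_mentions.items():
    purchases = flu_med_purchases.get(region, 0)
    if mentions > MENTION_THRESHOLD and purchases > PURCHASE_THRESHOLD:
        print(f"{region}: possible flu outbreak "
              f"(mentions={mentions}, purchases={purchases})")

Neither signal alone is conclusive; it is the combination of the two sources that makes the inference plausible, which is precisely the point of fusion.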
There is a huge amount of data pertaining to the human body and its health. Genomic data science is an academic specialization that is gaining increasing popularity. It helps in studying disease mechanisms for better diagnosis and drug response. The algorithms used for analyzing Big Data are a game changer in genome research as well. This category of data is so huge that the famed international science journal Nature carried a news item about how genome researchers are worried that the computing infrastructure may not cope with the increasing amount of data that their research generates.
Science has a lot to gain from the developments in Big Data. Social scientists can leverage data from the OSN to identify both micro- and macrolevel details, such as psychiatric conditions at an individual level or group dynamics at a macrolevel. The same data from OSN can also be used to detect medical emergencies and pandemics. In the financial sector too, data from the stock markets, business news, and OSN can reveal valuable insights that help improve lending practices, set macroeconomic strategies, and avert recessions.
Big Data applications are expected to benefit a wide variety of other areas as well. Housing and real estate; actuarial work; and government functions such as national security, defense, education, disease control, law enforcement, and energy are all characterized by huge amounts of data and stand to gain from the Big Data phenomenon.
Note
Where there is humongous data and appropriate algorithms are applied to it, there is wealth, value, and prosperity.
Why Big Data
A common question that arises is this: Why Big Data, why not just data? For data to be useful, we need to be able to identify patterns in it and predict those patterns in future data that is yet to be seen. A typical analogy is predicting the brand of rice in a bag based on a given sample. The rice in the bag is unknown to us. We are given only a sample from it, along with samples of known brands of rice. The known samples are called training data in the language of Machine Learning. The sample of unknown rice is the test data.
It is common sense that the larger the sample, the better the prediction of the brand. If we are given just two grains of each brand as training data, we may base our conclusion solely on the characteristics of those two grains, missing out on other characteristics. In Machine Learning parlance, this is called overfitting. If we have a bigger sample, we can recognize a number of features and the possible range of values for each feature: in other words, the probability distributions of the values, and then look for similar distributions in the data that is yet to be known. Hence the need for humongous data, Big Data, and not just data.
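The rice analogy can be made concrete with a minimal sketch, assuming scikit-learn is available and using entirely synthetic, made-up grain measurements (length and width per grain); the brand characteristics are illustrative assumptions, not real data. Training on two grains per brand typically fits those particular grains and generalizes poorly, while a larger sample captures the underlying distributions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

def sample_grains(n, mean_length, mean_width):
    # Draw n grains as (length, width) pairs around a brand's typical size.
    return rng.normal([mean_length, mean_width], [0.4, 0.1], size=(n, 2))

# Assumed brands: brand 0 has longer, slimmer grains; brand 1 shorter, rounder ones.
test_X = np.vstack([sample_grains(500, 7.0, 1.8), sample_grains(500, 5.5, 2.2)])
test_y = np.array([0] * 500 + [1] * 500)

for n_train in (2, 200):  # two grains per brand vs. a much larger sample
    train_X = np.vstack([sample_grains(n_train, 7.0, 1.8),
                         sample_grains(n_train, 5.5, 2.2)])
    train_y = np.array([0] * n_train + [1] * n_train)
    model = DecisionTreeClassifier().fit(train_X, train_y)
    print(f"{n_train} grains per brand -> test accuracy {model.score(test_X, test_y):.2f}")

The tiny training set tends to score noticeably lower on the held-out test grains, which is the overfitting the paragraph describes.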
In fact, a number of algorithms that are popular with Big Data have been in existence for a long time. The Naïve Bayes technique, for instance, dates back to the 18th century, and the Support Vector Machine model was invented in the early 1960s. They gained prominence with the advent of the Big Data revolution for the reasons explained earlier.
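As a brief illustration of the Naïve Bayes technique in this setting, the following sketch uses scikit-learn's GaussianNB on a handful of made-up rice-grain measurements; all values are illustrative assumptions.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Training data: (length, width) of grains from two known brands (assumed values).
train_X = np.array([[7.1, 1.8], [6.9, 1.7], [7.2, 1.9],   # brand 0
                    [5.4, 2.2], [5.6, 2.3], [5.5, 2.1]])  # brand 1
train_y = np.array([0, 0, 0, 1, 1, 1])

# Test data: grains drawn from the unknown bag.
test_X = np.array([[7.0, 1.8], [5.5, 2.2]])

model = GaussianNB().fit(train_X, train_y)
print(model.predict(test_X))        # predicted brand for each test grain
print(model.predict_proba(test_X))  # posterior probability per brand

The classifier estimates a probability distribution of feature values per brand from the training grains and assigns each unknown grain to the brand whose distribution fits it best, exactly the distribution-matching idea described above.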
An often-cited heuristic to differentiate Big Data from conventional data is that Big Data is too big to fit into traditional Relational Database Management Systems (RDBMS). With the ambitious plan of the Internet of Things to connect every entity in the world to everything else, conventional RDBMS will not be able to handle the data upsurge. In fact, Seagate predicts that the world will not be able to cope with its storage needs within a couple of years; according to them, it is harder to manufacture capacity than to generate data. It will be interesting to see if and how the storage industry meets the capacity demands of the Volume of the Big Data phenomenon, which brings us to the Vs of Big Data.