Chapter 1
Introduction to Streaming Data
It seems like the world moves at a faster pace every day. People and places become more connected, and people and organizations try to react at an ever-increasing pace. Because this pace is reaching the limits of a human's ability to respond, tools are built to process the vast amounts of data available to decision makers, analyze it, present it, and, in some cases, respond to events as they happen.
The collection and processing of this data has a number of application areas, some of which are discussed later in this chapter. These applications require an infrastructure and method of analysis specific to streaming data. Fortunately, like batch processing before it, the state of the art of streaming infrastructure is focused on using commodity hardware and software to build its systems rather than the specialized systems required for real-time analysis prior to the Internet era. This, combined with flexible cloud-based environments, puts the implementation of a real-time system within the reach of nearly any organization. These commodity systems allow organizations to analyze their data in real time and scale that infrastructure to meet future needs as the organization grows and changes over time.
The goal of this book is to allow a fairly broad range of potential users and implementers in an organization to gain comfort with the complete stack of applications. When real-time projects reach a certain point, they should be agile and adaptable systems that can be easily modified, which requires that the users have a fair understanding of the stack as a whole in addition to their own areas of focus. Real time applies as much to the development of new analyses as it does to the data itself. Any number of well-meaning projects have failed because they took so long to implement that the people who requested the project had either moved on to other things or simply forgotten why they wanted the data in the first place. Making projects agile and incremental avoids this outcome as much as possible.
This chapter is divided into sections that cover three topics. The first section, Sources of Streaming Data, covers some of the common sources and applications of streaming data. They are arranged more or less chronologically and provide some background on the origin of streaming data infrastructures. This background is more than historically interesting: many of the tools and frameworks presented were developed to solve problems in these spaces, and their design reflects some of the challenges unique to the space in which they were born. Kafka, a data motion tool covered in Chapter 4, Flow Management for Streaming Analysis, for example, was developed as a web applications tool, whereas Storm, a processing framework covered in Chapter 5, Processing Streaming Data, was developed primarily at Twitter for handling social media data.
The second section, Why Streaming Data is Different, covers three of the important aspects of streaming data: continuous data delivery, loosely structured data, and high-cardinality datasets. The first, of course, defines a system to be a real-time streaming data environment in the first place. The other two, though not entirely unique, present a unique challenge to the designer of a streaming data application. All three combine to form the essential streaming data environment.
The third section, Infrastructures and Algorithms, briefly touches on the significance of how infrastructures and algorithms are used with streaming data.
Sources of Streaming Data
There are a variety of sources of streaming data. This section introduces some of the major categories of data. Although there are always more and more data sources being made available, as well as many proprietary data sources, the categories discussed in this section are some of the application areas that have made streaming data interesting. The ordering of the application areas is primarily chronological, and much of the software discussed in this book derives from solving problems in each of these specific application areas.
The data motion systems presented in this book got their start handling data for website analytics and online advertising at places like LinkedIn, Yahoo!, and Facebook. The processing systems were designed to meet the challenges of processing social media data from Twitter and social networks like LinkedIn.
Google, whose business is largely related to online advertising, makes heavy use of advanced algorithmic approaches similar to those presented in Chapter 11. Google seems to be especially interested in a technique called deep learning, which makes use of very large-scale neural networks to learn complicated patterns.
These systems are even enabling entirely new areas of data collection and analysis by making the Internet of Things and other highly distributed data collection efforts economically feasible. It is hoped that outlining some of the previous application areas provides some inspiration for as-yet-unforeseen applications of these technologies.
Operational Monitoring
Operational monitoring of physical systems was the original application of streaming data. Originally, this would have been implemented using specialized hardware and software (or even analog and mechanical systems in the pre-computer era). The most common use case today of operational monitoring is tracking the performance of the physical systems that power the Internet.
The datacenters that power the Internet house thousands, possibly even tens of thousands, of discrete computer systems. All of these systems continuously record data about their physical state, from the temperature of the processor to the speed of the fan and the voltage draw of their power supplies. They also record information about the state of their disk drives and fundamental metrics of their operation, such as processor load, network activity, and storage access times.
To make the monitoring of all of these systems possible and to identify problems, this data is collected and aggregated in real time through a variety of mechanisms. The first systems tended to be specialized ad hoc mechanisms, but as these techniques were applied to other areas, operational monitoring began to share the same collection systems used for other kinds of data.
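As a rough illustration of what aggregating this kind of data in real time can look like, the following minimal Python sketch keeps a rolling average of readings for each host and metric. The class name RollingMetric, the host name web-01, the metric name cpu_temp_c, and the 60-second window are all hypothetical choices made for this example. They are not taken from any system discussed in this book, and a production collector would also need to handle out-of-order and missing readings.

```python
from collections import defaultdict, deque


class RollingMetric:
    """Rolling average of recent readings per (host, metric) pair.

    Hypothetical helper for illustration only. It assumes readings for a
    given pair arrive in timestamp order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.readings = defaultdict(deque)

    def record(self, host, metric, timestamp, value):
        window = self.readings[(host, metric)]
        window.append((timestamp, value))
        # Evict readings that are at least window_seconds older than the newest one.
        cutoff = timestamp - self.window_seconds
        while window and window[0][0] <= cutoff:
            window.popleft()

    def average(self, host, metric):
        window = self.readings.get((host, metric))
        if not window:
            return None  # nothing recorded for this pair
        return sum(value for _, value in window) / len(window)


monitor = RollingMetric(window_seconds=60)
monitor.record("web-01", "cpu_temp_c", timestamp=0, value=60.0)
monitor.record("web-01", "cpu_temp_c", timestamp=45, value=70.0)
monitor.record("web-01", "cpu_temp_c", timestamp=90, value=80.0)
print(monitor.average("web-01", "cpu_temp_c"))
```

Because each new reading evicts the stale ones, the average is always available without rescanning historical data, which is the basic property that separates this style of aggregation from a batch job.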
Web Analytics
The introduction of the commercial web, through e-commerce and online advertising, led to the need to track activity on a website. Like the circulation numbers of a newspaper, the number of unique visitors who see a website in a day is important information. For e-commerce sites, the data is less about the number of visitors than about the various products they browse and the correlations between them.
To analyze this data, a number of specialized log-processing tools were introduced and marketed. With the rise of Big Data and tools like Hadoop, much of the web analytics infrastructure shifted to these large batch-based systems. They were used to implement recommendation systems and other analysis. It also became clear that it was possible to conduct experiments on the structure of websites to see how they affected various metrics of interest. This is called A/B testing because, in the same way an optometrist tries to determine the best prescription, two choices are pitted against each other to determine which is best. These tests were mostly conducted sequentially, but this has a number of problems, not the least of which is the amount of time needed to conduct the study.
As more and more organizations mined their website data, the need to reduce the time in the feedback loop and to collect data on a more continual basis became more important. Using the tools of the system-monitoring community, it became possible to also collect this data in real time and perform things like A/B tests in parallel rather than in sequence. As the number of dimensions being measured and the need for appropriate auditing (due to the metrics being used for billing) increased, the analytics community developed much of the streaming infrastructure found in this book to safely move data from their web servers spread around the world to processing and billing systems.
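One common way to run an A/B test in parallel rather than in sequence is to assign each visitor to a variant deterministically and to update the counts for both variants as events arrive. The Python sketch below illustrates the idea under stated assumptions: the function assign_variant, the class ExperimentStats, the experiment name checkout-button, and the sample events are all hypothetical, and hashing the visitor ID is only one possible assignment scheme, not necessarily the one used by the systems described in this book.

```python
import hashlib
from collections import Counter


def assign_variant(visitor_id, experiment, variants=("A", "B")):
    """Map a visitor to a variant deterministically by hashing the ID."""
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


class ExperimentStats:
    """Running view and conversion counts for each variant."""

    def __init__(self):
        self.views = Counter()
        self.conversions = Counter()

    def record(self, variant, converted):
        self.views[variant] += 1
        if converted:
            self.conversions[variant] += 1

    def rate(self, variant):
        views = self.views[variant]
        return self.conversions[variant] / views if views else None


# Hypothetical stream of (visitor_id, converted) events.
events = [("v1", True), ("v2", False), ("v3", False), ("v4", True)]

stats = ExperimentStats()
for visitor_id, converted in events:
    variant = assign_variant(visitor_id, "checkout-button")
    stats.record(variant, converted)

for variant in ("A", "B"):
    print(variant, stats.views[variant], stats.rate(variant))
```

Both variants collect data over the same time period, so the comparison is not distorted by whatever else changed between two sequential test runs, and a returning visitor always sees the same variant.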