1. Introduction
1.1 Big Network Data
There is hardly any other data set whose size can rival the big network data that flows on the Internet. Annual global IP traffic is expected to pass the zettabyte threshold by 2016 [], and major online retailers such as Alibaba process over a billion sales annually. As these data accumulate day after day and year after year, mining them for knowledge becomes a daunting task that requires tremendous resources. This book aims to develop new compact and fast online measurement methods that reduce big network data to measurement summaries orders of magnitude smaller than what traditional methods can produce. The new methods hold the promise of allowing routers to perform measurement on large volumes of network traffic in real time using the small cache memory on network processors, allowing enterprise systems to store their traffic records (in the form of summaries) over a far longer time frame, and allowing users with ordinary computing resources to perform analysis on big network data.
1.2 Online Challenge
Modern routers forward packets from incoming ports to outgoing ports via switching fabric. To process packets in real time, online modules for traffic measurement, packet scheduling, access control, and quality of service are implemented on network processors, bypassing main memory and CPU almost entirely []. The cache memory commonly used on network processor chips is SRAM, typically a few megabytes in size. Increasing on-chip memory beyond 10 MB is technically feasible, but it comes with a much higher price tag and a longer access time. There is a strong incentive to keep on-chip memory small because smaller memory can be made faster and cheaper. Off-chip SRAM or embedded DRAM (built on 3-D stacking interconnect or packaged on the same module) can be made larger. However, it is slower to access, and the bandwidth between a network processor and its off-chip memory can become a performance bottleneck. Hence, on-chip memory remains the first choice for online network functions that are designed to match the line speed.
To make matters more challenging, the limited on-chip memory may have to be shared among routing, performance, measurement, and security functions that are implemented on the same chip. Each function can use only a fraction of the available space. Depending on their relative importance, some functions may be allocated tiny portions of the on-chip memory, whereas the amount of data they have to process and store can be extremely large in high-speed networks. This great disparity between memory demand and supply requires us to implement online functions, including real-time traffic measurement, as compactly as possible. As an example, if the on-chip memory allocated to a traffic measurement function is 1 Mb but there are 1M concurrent flows, leaving only 1 bit per flow, can we still perform per-flow traffic measurement? How about 10M concurrent flows with the same memory allocation? This is what we want to achieve through this book.
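To see how tight this budget is, a minimal Python sketch of the arithmetic is given below; the 1 Mb allocation and the flow counts are the hypothetical figures from the example above, and the 32-bit-counter comparison is an added assumption for context.

```python
# Back-of-the-envelope check of the on-chip memory budget in the example,
# taking 1 Mb = 10^6 bits for simplicity. The numbers are hypothetical
# figures from the text, not measurements.

allocated_bits = 1_000_000            # 1 Mb of on-chip SRAM for measurement

for concurrent_flows in (1_000_000, 10_000_000):
    bits_per_flow = allocated_bits / concurrent_flows
    print(f"{concurrent_flows:>10,} flows -> {bits_per_flow:.1f} bits per flow")

# Output:
#  1,000,000 flows -> 1.0 bits per flow
# 10,000,000 flows -> 0.1 bits per flow
# Even a single conventional 32-bit counter per flow would need 32 Mb
# (or 320 Mb), far beyond the allocation, which is why compact designs
# are needed.
```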
1.3 Offline Challenge
The space problem also exists offline, where disks are used to store network traffic data over time for long-term analysis. Because such data are produced constantly, there is a limit on how long they can be stored. With a given amount of disk space, the more we can reduce the traffic data, the longer we can keep them before they have to be removed.
The space issue also arises when we analyze big data. Suppose an analyst who has access to web search records wants to profile the number of searches for each keyword/phrase/question/sentence. This information is useful to online social/economic/opinion trend studies []. Various data analysis systems at Google, such as Sawzall, Dremel, and PowerDrill, estimate the cardinalities of very large data sets on a daily basis, which presents a challenge in computational resources, and in memory in particular: for the PowerDrill system, a non-negligible fraction of queries historically could not be computed because they exceeded the available memory.
As another example, let's consider an analyst with access to billions of sale records from an online retailer. Suppose she wants to analyze purchase associations. Each association is defined as the purchase of one product followed by the purchase of another product by the same client. Profiling the frequency of each association helps the retailer follow up with product recommendations to its clients after they make purchases. However, such analysis requires pairing up the sale records. The multiplicative nature of pairing may result in an extraordinary number of purchase associations, much larger than the number of sale records. Although the analyst may resort to a datacenter for the needed resources, it would certainly be welcome if we could make the same job doable on a regular laptop, even when the number of available memory bits on the laptop is far smaller than the number of purchase associations. (The same is true for the previous example of profiling search records.) This is what we want to achieve through this book.
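As a concrete illustration of this pairing step, here is a minimal Python sketch that counts associations exactly; the record layout (client_id, product, timestamp) and the function name are illustrative assumptions rather than details from the text.

```python
# A minimal sketch of the pairing step described above; the record layout
# (client_id, product, timestamp) and the function name are illustrative.
from collections import Counter, defaultdict

def count_associations(sale_records):
    """Count (earlier_product, later_product) associations per client."""
    by_client = defaultdict(list)
    for client_id, product, ts in sale_records:
        by_client[client_id].append((ts, product))

    associations = Counter()
    for purchases in by_client.values():
        purchases.sort()                          # order purchases by time
        for i in range(len(purchases)):
            for j in range(i + 1, len(purchases)):
                associations[(purchases[i][1], purchases[j][1])] += 1
    return associations

# A client with k purchases contributes k*(k-1)/2 associations, so the number
# of associations can vastly exceed the number of sale records; this is the
# multiplicative blowup that rules out an exact table in small memory.
```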
1.4 Fundamental Primitives
In this book, we model network data as a set of flows, each of which is the abstraction of a data subset defined based on the measurement requirement. For example, we may treat all packets from the same source address as a flow, i.e., per-source flow. In this case, the flow identifier is the source address in the packet header. Similarly, we may define per-destination flows, per-source/destination flows, TCP flows, WWW flows, P2P flows, or other application-specific flows. We also need to define elements in the flows to be measured. Depending on the application needs, the elements may be destination addresses, source addresses, ports, or even keywords that appear in the packets of a flow.
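To make the flow abstraction concrete, the following sketch maps a packet to a flow identifier and to an element under a few of the flow definitions above; the dictionary-based packet representation and the field names are assumptions made for illustration, not taken from any particular library.

```python
# A minimal sketch of the flow abstraction, assuming each packet is a dict of
# header fields such as 'src' and 'dst'; field and function names are
# illustrative only.

def flow_id(packet, flow_type="per-source"):
    """Map a packet to the identifier of the flow it belongs to."""
    if flow_type == "per-source":
        return packet["src"]                     # per-source flow
    if flow_type == "per-destination":
        return packet["dst"]                     # per-destination flow
    if flow_type == "per-source/destination":
        return (packet["src"], packet["dst"])    # per-source/destination flow
    raise ValueError(f"unknown flow type: {flow_type}")

def element(packet, flow_type="per-source"):
    """Pick the element of the flow to be measured, per application needs,
    e.g., count the distinct destinations contacted by each source."""
    return packet["dst"] if flow_type == "per-source" else packet["src"]
```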
Big network data consists of millions or even billions of flows. We may measure the flow size, which is what NetFlow [] does, in number of bytes or packets; here, each byte (or packet) is considered as an element to be counted. We may measure the flow cardinality, which is what firewalls often do, in number of distinct elements in each flow. This is a harder problem because, in order to remove the duplicate elements in the flow, we need a way to remember which elements we have seen in the past. Or we may measure the persistent spread of a flow: For a certain number of consecutive periods, if an element of a flow appears in each period, we call it a persistent element. The persistent spread of the flow over a given number of periods is defined as the number of distinct persistent elements in the flow. This book presents three important fundamental online functions: per-flow size measurement, persistent spread measurement, and per-flow cardinality measurement.
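For reference, the sketch below computes these three quantities exactly, directly from their definitions, by keeping every element in memory; it is meant only to pin down the definitions, since such exact bookkeeping is precisely what does not fit in small on-chip memory. Function names are illustrative.

```python
# Exact (non-compact) reference computations of the three measurement
# primitives, written directly from the definitions above. They store every
# element and therefore do not scale, unlike compact measurement methods.

def flow_size(elements):
    """Size of a flow: total number of elements (e.g., packets or bytes)."""
    return sum(1 for _ in elements)

def flow_cardinality(elements):
    """Cardinality of a flow: number of distinct elements."""
    return len(set(elements))

def persistent_spread(elements_per_period):
    """Persistent spread over t consecutive periods: number of distinct
    elements that appear in every one of the t periods."""
    period_sets = [set(p) for p in elements_per_period]
    return len(set.intersection(*period_sets)) if period_sets else 0

# Example: only "a" appears in all three periods, so the persistent spread is 1.
# persistent_spread([["a", "b"], ["a", "c"], ["a", "b", "d"]]) == 1
```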
1.5 Scalable Counter Architectures for Per-Flow Size Measurement
Measuring flow size has many important applications. We may measure the number of packets in each TCP flow, the data rate of each voice-over-IP session, the number of bytes that each host downloads, the number of SYN packets from each source address, or the number of ACK packets sent to each address. Such information is very useful to service provision, capacity planning, accounting and billing, and anomaly detection []. For example, upon observing an abnormal change in traffic, a network operator can use per-flow information to compile the list of candidate bots that contribute to the change, helping to narrow down the scope for further investigation.