To my wife, Barbara, and our boys, Adam and Joel. Their support, encouragement, and sacrificed Saturdays have made this book possible.
To my wife, Jenny, my older son, Ethan, and my younger son, Charlie, who was born during the writing of this book.
Preface
Data is addictive. Our ability to collect and store it has grown massively in the last several decades, yet our appetite for ever more data shows no sign of being satiated. Scientists want to be able to store more data in order to build better mathematical models of the world. Marketers want better data to understand their customers' desires and buying habits. Financial analysts want to better understand the workings of their markets. And everybody wants to keep all their digital photographs, movies, emails, etc.
Before the computer and Internet revolutions, the US Library of Congress was one of the largest collections of data in the world. It is estimated that its printed collections contain approximately 10 terabytes (TB) of information. Today, large Internet companies collect that much data on a daily basis. And it is not just Internet applications that are producing data at prodigious rates. For example, the Large Synoptic Survey Telescope (LSST) under construction in Chile is expected to produce 15 TB of data every day.
Part of the reason for the massive growth in available data is our ability to collect much more of it. Every time someone clicks a link on a website, the web server can record information about what page the user was on and which link they clicked. Every time a car drives over a sensor in the highway, its speed can be recorded. But much of the reason is also our ability to store that data. Ten years ago, telescopes took pictures of the sky every night, but they could not store the collected data at the same level of detail that will be possible when the LSST is operational. The extra data was thrown away because there was nowhere to put it. The ability to collect and store vast quantities of data only feeds our data addiction.
One of the most commonly used tools for storing and processing data in computer systems over the last few decades has been the relational database management system (RDBMS). But as datasets have grown large, only the more sophisticated (and hence more expensive) RDBMSs have been able to reach the scale many users now desire. At the same time, many engineers and scientists involved in processing the data have realized that they do not need everything offered by an RDBMS. These systems are powerful and have many features, but many data owners who need to process terabytes or petabytes of data need only a subset of those features.
The high cost and unneeded features of RDBMSs have led to the development of many alternative data-processing systems. One such alternative system is Apache Hadoop. Hadoop is an open source project started by Doug Cutting. Over the past several years, Yahoo! and a number of other web companies have driven the development of Hadoop, which was based on papers published by Google describing how its engineers were dealing with the challenge of storing and processing the massive amounts of data they were collecting. Hadoop is installed on a cluster of machines and provides a means to tie together storage and processing in that cluster. For a history of the project, see Hadoop: The Definitive Guide, by Tom White (O'Reilly).
The development of new data-processing systems such as Hadoop has spurred the porting of existing tools and languages and the construction of new tools, such as Apache Pig. Tools like Pig provide a higher level of abstraction for data users, giving them access to the power and flexibility of Hadoop without requiring them to write extensive data-processing applications in low-level Java code.
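To give a feel for that abstraction, here is a minimal Pig Latin sketch (the input file, field names, and output path are hypothetical) that groups click records by user and counts them, a job that would otherwise require a substantial amount of Java MapReduce code:

-- Load click records from a hypothetical input file.
clicks = LOAD 'clicks.txt' AS (user:chararray, url:chararray);
-- Group the records by user.
by_user = GROUP clicks BY user;
-- Count the clicks each user made.
counts = FOREACH by_user GENERATE group AS user, COUNT(clicks) AS cnt;
-- Write the results out.
STORE counts INTO 'click_counts';

Pig compiles a script like this into one or more Hadoop jobs, so the script's author never has to write map and reduce functions directly.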
Who Should Read This Book
This book is intended for Pig programmers, new and old. Those who have never used Pig will find introductory material on how to run Pig and how to get started writing Pig Latin scripts. For seasoned Pig users, this book covers almost every feature of Pig: the different modes in which it can be run, complete coverage of the Pig Latin language, and how to extend Pig with your own user-defined functions (UDFs). Even those who have been using Pig for a long time are likely to discover features they have not used before.
Some knowledge of Hadoop will be useful for readers and Pig users. If you're not already familiar with it or want a quick refresher, the book walks through a very simple example of a Hadoop job.
Small snippets of Java, Python, and SQL are used in parts of this book. Knowledge of these languages is not required to use Pig, but knowledge of Python and Java will be necessary for some of the more advanced features. Those with a SQL background may find the book's comparison of Pig Latin and SQL to be a helpful starting point in understanding the similarities and differences between the two languages.
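For instance (the relation and field names below are invented for illustration), a SQL join such as SELECT u.name, c.url FROM users u JOIN clicks c ON u.id = c.user_id corresponds to an explicit sequence of steps in Pig Latin:

-- SQL: SELECT u.name, c.url FROM users u JOIN clicks c ON u.id = c.user_id;
-- Load the two hypothetical inputs with their schemas.
users  = LOAD 'users.txt'  AS (id:int, name:chararray);
clicks = LOAD 'clicks.txt' AS (user_id:int, url:chararray);
-- Join the two relations, as the SQL ON clause does.
joined = JOIN users BY id, clicks BY user_id;
-- Project the columns named in the SQL SELECT list.
result = FOREACH joined GENERATE users::name, clicks::url;

Where SQL declares the desired result in a single statement, Pig Latin spells out the data flow one step at a time.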
Whats New in This Edition
The second edition covers Pig 0.10 through Pig 0.16, the latest version at the time of writing. For features introduced before 0.10, we do not call out the initial version of the feature. For features introduced after 0.10, we point out the version in which the feature was introduced.
Pig runs on both Hadoop 1 and Hadoop 2 for all of the versions covered in this book. To simplify the discussion, we assume Hadoop 2 is the target platform and point out the differences for Hadoop 1 wherever applicable.
The second edition has two new chapters, including one on Pig on Tez. Other chapters have also been updated with the latest additions to Pig and with information on existing features not covered in the first edition. These include, but are not limited to: