Gates - Programming Pig

Programming Pig, by Alan Gates. City: Sebastopol, year: 2011, publisher: O'Reilly Media, genre: Computer.


Programming Pig: summary, description and annotation

This guide is an ideal learning tool and reference for Apache Pig, the open source engine for executing parallel data flows on Hadoop. With Pig, you can batch-process data without having to create a full-fledged application, making it easy for you to experiment with new datasets.

Programming Pig introduces new users to Pig, and provides experienced users with comprehensive coverage on key features such as the Pig Latin scripting language, the Grunt shell, and User Defined Functions (UDFs) for extending Pig. If you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig.

  • Delve into Pig's data model, including scalar and complex data types
  • Write Pig Latin scripts to sort, group, join, project, and filter your data
  • Use Grunt to work with the Hadoop Distributed File System (HDFS)
  • Build complex data processing pipelines with Pig's macros and modularity features
  • Embed Pig Latin in Python for iterative processing and other advanced tasks
  • Create your own load and store functions to handle data formats and storage mechanisms
  • Get performance tips for running scripts on Hadoop clusters in less time

Programming Pig
Alan Gates
Editors: Meghan Blanchette and Mike Loukides

Copyright 2011 Yahoo!, Inc.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. Programming Pig, the image of a domestic pig, and related trade dress are trademarks of O'Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

O'Reilly Media

Dedication

To my wife, Barbara, and our boys, Adam and Joel. Their support, encouragement, and sacrificed Saturdays have made this book possible.

Preface
Data Addiction

Data is addictive. Our ability to collect and store data has grown massively in the last several decades. Yet our appetite for ever more data shows no sign of being satiated. Scientists want to be able to store more data in order to build better mathematical models of the world. Marketers want better data to understand their customers' desires and buying habits. Financial analysts want to better understand the workings of their markets. And everybody wants to keep all their digital photographs, movies, emails, etc.

The computer and Internet revolutions have greatly increased our ability to collect and store data. Before these revolutions, the US Library of Congress was one of the largest collections of data in the world. It is estimated that its printed collections contain approximately 10 terabytes (TB) of information. Today large Internet companies collect that much data on a daily basis. And it is not just Internet applications that are producing data at prodigious rates. For example, the Large Synoptic Survey Telescope (LSST) planned for construction in Chile is expected to produce 20 TB of data every day.

Part of the reason for this massive growth in data is our ability to collect much more data. Every time someone clicks on a website's links, the web server can record information about what page the user was on and which link he clicked. Every time a car drives over a sensor in the highway, its speed can be recorded. But much of the reason is also our ability to store that data. Ten years ago, telescopes took pictures of the sky every night. But they could not store that data at the same level of detail that will be possible when the LSST is operational. The extra data was being thrown away because there was nowhere to put it. The ability to collect and store vast quantities of data only feeds our data addiction.

One of the most commonly used tools for storing and processing data in computer systems over the last few decades has been the relational database management system (RDBMS). But as data sets have grown large, only the more sophisticated (and hence more expensive) RDBMSs have been able to reach the scale many users now desire. At the same time, many engineers and scientists involved in processing the data have realized that they do not need everything offered by an RDBMS. These systems are powerful and have many features, but many data owners who need to process terabytes or petabytes of data need only a subset of those features.

The high cost and unneeded features of RDBMSs have led to the development of many alternative data-processing systems. One such alternative system is Apache Hadoop. Hadoop is an open source project started by Doug Cutting. Over the past several years, Yahoo! and a number of other web companies have driven the development of Hadoop, which was based on papers published by Google describing how their engineers were dealing with the challenge of storing and processing the massive amounts of data they were collecting. For a history of Hadoop, see Hadoop: The Definitive Guide, by Tom White (O'Reilly). Hadoop is installed on a cluster of machines and provides a means to tie together storage and processing in that cluster.

The development of new data-processing systems such as Hadoop has spurred the porting of existing tools and languages and the construction of new tools, such as Apache Pig. Tools like Pig provide a higher level of abstraction for data users, giving them access to the power and flexibility of Hadoop without requiring them to write extensive data-processing applications in low-level Java code.

Who Should Read This Book

This book is intended for Pig programmers, new and old. Those who have never used Pig will find introductory material on how to run Pig and how to start writing Pig Latin scripts. For seasoned Pig users, this book covers almost every feature of Pig: the different modes it can be run in, complete coverage of the Pig Latin language, and how to extend Pig with your own User Defined Functions (UDFs). Even those who have been using Pig for a long time are likely to discover features they have not used before.

will not be usable by those on 0.6 or earlier versions. However, the rest of the book will still be applicable.

walks through a very simple example of a Hadoop job. These sections will be helpful for those not already familiar with Hadoop.

to be a helpful starting point in understanding the similarities and differences between Pig Latin and SQL.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This icon signifies a tip, suggestion, or general note.

Caution

This icon indicates a warning or caution.

Code Examples in This Book

Many of the example scripts, User Defined Functions (UDFs), and data used in this book are available for download from my GitHub repository. README files are included to help you get the UDFs built and to understand the contents of the datafiles. Each example script in the text that is available on GitHub has a comment at the beginning that gives the filename. Pig Latin and Python script examples are organized by chapter in the examples directory. UDFs, both Java and Python, are in a separate directory, udfs. All data sets are in the data directory.

For brevity, each script is written assuming that the input and output are in the local directory. Therefore, when in local mode, you should run Pig in the directory that the input data is in. When running on a cluster, you should place the data in your home directory on the cluster.
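The path convention above can be sketched in a few lines of Python. This is only an illustration, not part of the book's example repository: the helper names pig_invocation and hdfs_upload_command, the script path examples/ch2/example.pig, and the file data/input.txt are hypothetical. The sketch assumes the pig and hadoop commands are on your PATH. It builds the command lines without running them.

```python
from pathlib import Path


def pig_invocation(script, data_dir, mode="local"):
    """Return (argv, cwd) for running a Pig script whose paths are relative."""
    script_path = str(Path(script).resolve())
    if mode == "local":
        # Local mode: start Pig inside the data directory so that relative
        # input and output paths in the script resolve against it.
        return ["pig", "-x", "local", script_path], Path(data_dir).resolve()
    if mode == "mapreduce":
        # Cluster mode: relative paths resolve against the HDFS home
        # directory, so the local working directory does not matter.
        return ["pig", "-x", "mapreduce", script_path], None
    raise ValueError(f"unknown mode: {mode!r}")


def hdfs_upload_command(local_file):
    """Return argv that copies a local file into the HDFS home directory."""
    # A relative HDFS destination is resolved against the user's home directory.
    return ["hadoop", "fs", "-put", str(local_file), Path(local_file).name]


if __name__ == "__main__":
    # Hypothetical script and data locations, used only for illustration.
    argv, cwd = pig_invocation("examples/ch2/example.pig", "data")
    print(cwd, " ".join(argv))
    print(" ".join(hdfs_upload_command("data/input.txt")))
```

To actually launch Pig, the returned values could be handed to subprocess.run(argv, cwd=cwd). Resolving the script to an absolute path first means it can still be found after the working directory changes to the data directory.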
