
Tom White, Hadoop: The Definitive Guide
  • Book:
    Hadoop: The Definitive Guide
  • Author:
    Tom White
  • Publisher:
    O'Reilly Media
  • Genre:
    Computer
  • Year:
    2015

Hadoop: The Definitive Guide: summary and description


Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you'll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You'll learn about recent changes to Hadoop, and explore new case studies on Hadoop's role in healthcare systems and genomics data processing.

  • Learn fundamental components such as MapReduce, HDFS, and YARN
  • Explore MapReduce in depth, including steps for developing applications with it
  • Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
  • Learn two data formats: Avro for data serialization and Parquet for nested data
  • Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
  • Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
  • Learn the HBase distributed database and the ZooKeeper distributed configuration service


Hadoop: The Definitive Guide
Tom White

For Eliane, Emilia, and Lottie

Foreword
Doug Cutting, April 2009
Shed in the Yard, California

Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They'd devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to re-create these systems as a part of Nutch.

We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web's massive scale, we'd need to run it on thousands of machines, and moreover, that the job was bigger than two half-time developers could handle.

Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.

In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he'd written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.

From the beginning, Tom's contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.

Initially, Tom specialized in making Hadoop run well on Amazon's EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.

Tom is now a respected senior member of the Hadoop developer community. Though he's an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand.

Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master not only of the technology, but also of common sense and plain talk.

Preface

Martin Gardner, the mathematics and science writer, once said in an interview:

Beyond calculus, I am lost. That was the secret of my column's success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.

But it doesn't need to be like this. Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there's a common theme, it is about raising the level of abstraction to create building blocks for programmers who have lots of data to store and analyze, and who don't have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.

The Apache Hadoop community has come a long way. Since the publication of the first edition of this book, the Hadoop project has blossomed. Big data has become a household term. In this time, the software has made great leaps in adoption, performance, reliability, scalability, and manageability. The number of things being built and run on the Hadoop platform has grown enormously. In fact, it's difficult for one person to keep track. To gain even wider adoption, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with even more systems; and writing new, improved APIs. I'm looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.

Administrative Notes

During discussion of a particular Java class in the text, I often omit its package name to reduce clutter. If you need to know which package a class is in, you can easily look it up in the Java API documentation for Hadoop (linked to from the Apache Hadoop home page), or the relevant project. Or if you're using an integrated development environment (IDE), its auto-complete mechanism can help find what you're looking for.

Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package may use the asterisk wildcard character to save space (for example, import org.apache.hadoop.io.*).
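For instance, a listing that works with several of the Writable types might begin as follows. This is a minimal illustrative sketch, not one of the book's listings; the class name and variables are invented for the example, while Text and IntWritable are real classes in org.apache.hadoop.io covered by the wildcard import.

    import org.apache.hadoop.io.*; // brings in IntWritable, Text, and the other Writables

    public class WildcardImportExample {
        public static void main(String[] args) {
            // Both classes live in org.apache.hadoop.io, so one import line suffices.
            Text word = new Text("hadoop");
            IntWritable count = new IntWritable(1);
            System.out.println(word + "\t" + count);
        }
    }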

The sample programs in this book are available for download from the book's website. You will also find instructions there for obtaining the datasets that are used in examples throughout the book, as well as further notes for running the programs in the book and links to updates, additional resources, and my blog.

What's New in the Fourth Edition?

The fourth edition covers Hadoop 2 exclusively. The Hadoop 2 release series is the current active release series and contains the most stable versions of Hadoop.

There are new chapters covering YARN ().

This edition includes two new case studies (Chapters ).

Many corrections, updates, and improvements have been made to existing chapters to bring them up to date with the latest releases of Hadoop and its related projects.

What's New in the Third Edition?

The third edition covers the 1.x (formerly 0.20) release series of Apache Hadoop, as well as the newer 0.22 and 2.x (formerly 0.23) series. With a few exceptions, which are noted in the text, all the examples in this book run against these versions.

This edition uses the new MapReduce API for most of the examples. Because the old API is still in widespread use, it continues to be discussed in the text alongside the new API, and the equivalent code using the old API can be found on the book's website.
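To give a flavor of the difference, here is a minimal mapper written against the new API in org.apache.hadoop.mapreduce (as opposed to the old org.apache.hadoop.mapred interfaces). This is a sketch for orientation, not one of the book's examples; the TokenMapper name and tokenizing logic are invented for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // In the new API, mappers extend the Mapper class rather than implementing
    // an interface, and they emit output through a Context object.
    public class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit a (token, 1) pair
                }
            }
        }
    }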

The major change in Hadoop 2.0 is the new MapReduce runtime, MapReduce 2, which is built on a new distributed resource management system called YARN. This edition includes new sections covering MapReduce on YARN: how it works ().

There is more MapReduce material, too, including development practices such as packaging MapReduce jobs with Maven, setting the user's Java classpath, and writing tests with MRUnit (all in ).
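As a rough sketch of what such a test looks like, MRUnit's MapDriver runs a mapper on in-memory input and asserts on its output, with no cluster involved. This assumes the hypothetical TokenMapper sketched above; it is an illustration of the testing style, not a listing from the book.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class TokenMapperTest {
        @Test
        public void emitsOnePerToken() throws Exception {
            // Feed the mapper a single record and verify the expected
            // (key, value) output, all in memory.
            MapDriver.newMapDriver(new TokenMapper())
                    .withInput(new LongWritable(0), new Text("hadoop"))
                    .withOutput(new Text("hadoop"), new IntWritable(1))
                    .runTest();
        }
    }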

The chapter on HDFS () now has introductions to high availability, federation, and the new WebHDFS and HttpFS filesystems.
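Because WebHDFS is exposed through Hadoop's standard FileSystem abstraction, a client can, for example, list a directory over HTTP with ordinary filesystem code. The sketch below uses a placeholder hostname and path; in the Hadoop 2 series the webhdfs:// scheme typically talks to the namenode's HTTP port (50070 by default).

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WebHdfsList {
        public static void main(String[] args) throws Exception {
            // The webhdfs:// scheme selects the HTTP-based filesystem client,
            // so no RPC connectivity to the cluster is required.
            FileSystem fs = FileSystem.get(
                    URI.create("webhdfs://namenode:50070/"), new Configuration());
            for (FileStatus stat : fs.listStatus(new Path("/user"))) {
                System.out.println(stat.getPath());
            }
        }
    }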

The chapters on Pig, Hive, Sqoop, and ZooKeeper have all been expanded to cover the new features and changes in their latest releases.

In addition, numerous corrections and improvements have been made throughout the book.

What's New in the Second Edition?

The second edition has two new chapters on Sqoop and Hive (Chapters ), and a new case study on analyzing massive network graphs using Hadoop.
