To my beautiful wife, Erin, for her endless patience, and my wonderful children, Dominic and Ivy, for keeping me in line.
To my wife, Nancy Sherman, for all her encouragement during our writing, rewriting, and then rewriting yet again. Also, many thanks go to that cute little yellow elephant, without whom we wouldn't even have thought about writing this book.
Preface
What is Hadoop and why should you care? This book will help you understand what Hadoop is, but for now, let's tackle the second part of that question. Hadoop is the most common single platform for storing and analyzing big data. If you and your organization are entering the exciting world of big data, you'll have to decide whether Hadoop is the right platform and which of the many components are best suited to the task. The goal of this book is to introduce you to the topic and get you started on your journey.
There are many books, websites, and classes about Hadoop and related technologies. This one is different. It does not provide a lengthy tutorial introduction to a particular aspect of Hadoop or to any of the many components of the Hadoop ecosystem. It certainly is not a rich, detailed discussion of any of these topics. Instead, it is organized like a field guide to birds or trees. Each chapter focuses on portions of the Hadoop ecosystem that have a common theme. Within each chapter, the relevant technologies and topics are briefly introduced: we explain their relation to Hadoop and discuss why they may be useful (and in some cases less than useful) for particular needs. To that end, this book includes various short sections on the many projects and subprojects of Apache Hadoop and some related technologies, with pointers to tutorials and links to related technologies and processes.
In each section, we have included a table that looks like this:
License |
Activity | None, Low, Medium, High |
Purpose |
Official Page |
Hadoop Integration | Fully Integrated, API Compatible, No Integration, Not Applicable |
Let's take a deeper look at what each of these categories entails:
License
While all of the technologies covered in the first version of this field guide are open source, the software comes under several different licenses, mostly alike but with some differences. If you plan to include this software in a product, you should familiarize yourself with the conditions of the license.
Activity
We have done our best to measure how much active development work is being done on the technology. We may have misjudged in some cases, and the activity level may have changed since we first wrote on the topic.
Purpose
What does the technology do? We have tried to group topics with a common purpose together, and sometimes we found that a topic could fit into different chapters. Life is about making choices; these are the choices we made.
Official Page
If those responsible for the technology have a site on the Internet, this is the home page of the project.
Hadoop Integration
When we started writing, we weren't sure exactly what topics we would include in the first version. Some on the initial list were tightly integrated or bound into Apache Hadoop. Others were alternative technologies or technologies that worked with Hadoop but were not part of the Apache Hadoop family. In those cases, we tried to best understand what the level of integration was at the time of our writing. This will no doubt change over time.
You should not think of this book as something you read from cover to cover. If you're completely new to Hadoop, you should start by reading the introductory chapter. Then you should look for topics of interest, read the section on that component, read the chapter header, and possibly scan other sections in the same chapter. This should help you get a feel for the subject. We have often included links to other sections in the book that may be relevant. You may also want to look at links to tutorials on the subject or to the official page for the topic.
We've arranged the topics into sections that follow the pattern in the diagram shown in Figure P-1. Many of the topics fit into Hadoop Common (formerly Hadoop Core), the basic tools and techniques that support all the other Apache Hadoop modules. However, the set of tools that play an important role in the big data ecosystem isn't limited to technologies in the Hadoop core. In this book we also discuss a number of related technologies that play a critical role in the big data landscape.
Figure P-1. Overview of the topics covered in this book
In this first edition, we have not included information on any proprietary Hadoop distributions. We realize that these projects are important and relevant, but the commercial landscape is shifting so quickly that we propose a focus on open source technology only. Open source has a strong hold on the Hadoop and big data markets at the moment, and many commercial solutions are heavily based on the open source technology we describe in this book. Readers who are interested in adopting the open source technologies we discuss are encouraged to look for commercial distributions of those technologies if they are so inclined.
This work is not meant to be a static document that is only updated every year or two. Our goal is to keep it as up to date as possible, adding new content as the Hadoop environment grows. We also expect some of the older technologies to disappear or go into maintenance mode as they are supplanted by others that meet newer technology needs or gain favor for other reasons.