Hadoop Application Architectures
by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira
Copyright 2015 Jonathan Seidman, Gwen Shapira, Ted Malaska, and Mark Grover. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editors: Ann Spencer and Brian Anderson
- Production Editor: Nicole Shelby
- Copyeditor: Rachel Monaghan
- Proofreader: Elise Morrison
- Indexer: Ellen Troutman
- Interior Designer: David Futato
- Cover Designer: Ellie Volckhausen
- Illustrator: Rebecca Demarest
Revision History for the First Edition
- 2015-06-26: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491900086 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Hadoop Application Architectures, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-90008-6
[LSI]
Foreword
Apache Hadoop has blossomed over the past decade.
It started in Nutch as a promising capability: the ability to scalably process petabytes. In 2005 it hadn't been run on more than a few dozen machines, and had many rough edges. It was only used by a few folks for experiments. Yet a few saw promise there, that an affordable, scalable, general-purpose data storage and processing framework might have broad utility.
By 2007 scalability had been proven at Yahoo!. Hadoop now ran reliably on thousands of machines. It began to be used in production applications, first at Yahoo! and then at other Internet companies, like Facebook, LinkedIn, and Twitter. But while it enabled scalable processing of petabytes, the price of adoption was high, with no security and only a Java batch API.
Since then Hadoop's become the kernel of a complex ecosystem. It's gained fine-grained security controls, high availability (HA), and a general-purpose scheduler (YARN).
A wide variety of tools have now been built around this kernel. Some, like HBase and Accumulo, provide online key/value stores that can back interactive applications. Others, like Flume, Sqoop, and Apache Kafka, help route data in and out of Hadoop's storage. Improved processing APIs are available through Pig, Crunch, and Cascading. SQL queries can be processed with Apache Hive and Cloudera Impala. Apache Spark is a superstar, providing an improved and optimized batch API while also incorporating real-time stream processing, graph processing, and machine learning. Apache Oozie and Azkaban orchestrate and schedule many of the above.
Confused yet? This menagerie of tools can be overwhelming. Yet, to make effective use of this new platform, you need to understand how these tools all fit together and which can help you. The authors of this book have years of experience building Hadoop-based systems and can now share with you the wisdom they've gained.
In theory there are billions of ways to connect and configure these tools for your use. But in practice, successful patterns emerge. This book describes best practices, where each tool shines, and how best to use it for a particular task. It also presents common use cases. At first users improvised, trying many combinations of tools, but this book describes the patterns that have proven successful again and again, sparing you much of the exploration.
These authors give you the fundamental knowledge you need to begin using this powerful new platform. Enjoy the book, and use it to help you build great Hadoop applications.
Doug Cutting
Shed in the Yard, California
Preface
It's probably not an exaggeration to say that Apache Hadoop has revolutionized data management and processing. Hadoop's technical capabilities have made it possible for organizations across a range of industries to solve problems that were previously impractical with existing technologies. These capabilities include:
- Scalable processing of massive amounts of data
- Flexibility for data processing, regardless of the format and structure (or lack of structure) in the data
Another notable feature of Hadoop is that it's an open source project designed to run on relatively inexpensive commodity hardware. Hadoop provides these capabilities at considerable cost savings over traditional data management solutions.
This combination of technical capabilities and economics has led to rapid growth in Hadoop and tools in the surrounding ecosystem. The vibrancy of the Hadoop community has led to the introduction of a broad range of tools to support management and processing of data with Hadoop.
Despite this rapid growth, Hadoop is still a relatively young technology. Many organizations are still trying to understand how Hadoop can be leveraged to solve problems, and how to apply Hadoop and associated tools to implement solutions to these problems. A rich ecosystem of tools, application programming interfaces (APIs), and development options provides choice and flexibility, but can make it challenging to determine the best choices to implement a data processing application.
The inspiration for this book comes from our experience working with numerous customers and conversations with Hadoop users who are trying to understand how to build reliable and scalable applications with Hadoop. Our goal is not to provide detailed documentation on using available tools, but rather to provide guidance on how to combine these tools to architect scalable and maintainable applications on Hadoop.
We assume readers of this book have some experience with Hadoop and related tools. You should be familiar with the core components of Hadoop, such as the Hadoop Distributed File System (HDFS) and MapReduce. If you need to come up to speed on Hadoop, or need refreshers on core Hadoop concepts, Hadoop: The Definitive Guide by Tom White remains, well, the definitive guide.
The following is a list of other tools and technologies that are important to understand in using this book, including references for further reading:
YARN
Up until recently, the core of Hadoop was commonly considered to be HDFS and MapReduce. This has been changing rapidly with the introduction of additional processing frameworks for Hadoop, and the introduction of YARN accelerates the move toward Hadoop as a big-data platform supporting multiple parallel processing models. YARN provides a general-purpose resource manager and scheduler for Hadoop processing, which includes MapReduce, but also extends these services to other processing models. This facilitates the support of multiple processing frameworks and diverse workloads on a single Hadoop cluster, and allows these different models and workloads to effectively share resources. For more on YARN, see
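To make YARN's role as a general-purpose resource manager concrete, here is a minimal sketch (our own illustration, not an example from this book) that uses Hadoop's `YarnClient` API to connect to the ResourceManager and list the applications currently sharing a cluster. It assumes the Hadoop client libraries are on the classpath and that a yarn-site.xml pointing at a running cluster is available; the class name is hypothetical.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical class name; any JVM process with cluster access could issue these calls.
public class ListYarnApplications {
  public static void main(String[] args) throws Exception {
    // Picks up yarn-site.xml from the classpath to locate the ResourceManager.
    YarnConfiguration conf = new YarnConfiguration();

    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Each report describes one application (a MapReduce job, a Spark job,
    // and so on) scheduled by YARN on the same cluster.
    List<ApplicationReport> apps = yarnClient.getApplications();
    for (ApplicationReport app : apps) {
      System.out.printf("%s [%s] state=%s%n",
          app.getApplicationId(),
          app.getApplicationType(),
          app.getYarnApplicationState());
    }

    yarnClient.stop();
  }
}
```

The point of the sketch is that MapReduce is just one application type among many in the output: the same ResourceManager schedules and reports on all of the processing frameworks sharing the cluster.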