Using Flume
by Hari Shreedharan
Copyright © 2015 Hari Shreedharan. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://www.safaribooksonline.com/). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editor: Ann Spencer
- Production Editor: Kara Ebrahim
- Copyeditor: Charles Roumeliotis
- Proofreader: Rachel Head
- Indexer: Meghan Jones
- Interior Designer: David Futato
- Cover Designer: Ellie Volckhausen
- Illustrator: Rebecca Demarest
- October 2014: First Edition
Revision History for the First Edition
- 2014-09-15: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781449368302 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Using Flume, the cover image of a burbot, and related trade dress are trademarks of O'Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-449-36830-2
[LSI]
Foreword
The past few years have seen tremendous growth in the development and adoption of Big Data technologies. Hadoop and related platforms are powering the next wave of data analytics over increasingly large amounts of data. The data produced today will be dwarfed by what is expected tomorrow, growing at an ever-increasing rate as the digital revolution engulfs all aspects of our existence. The barrier to entry in this new age of massive data volumes is, of course, the obvious one: how do you get all this data into your cluster to begin with? Clearly, this data is produced by a wide spectrum of sources spread across the enterprise, and comprises an interesting mix of interaction, machine, sensor, and social data, among others. Any operator who has dealt with similar challenges would no doubt agree that it is nontrivial, if not downright hard, to build a system that can route this data into your clusters in a cost-effective manner.
Apache Flume is built to handle exactly this challenge.
Back in 2011, when Flume went into incubation at The Apache Software Foundation, it was a project built by Cloudera engineers to address large-scale log data aggregation on Hadoop. Popular from the beginning, it saw a large number of new requirements, ranging from event ordering to guaranteed-delivery semantics, come up over its initial releases. Given its popularity and the high demand for complex requirements, we decided to refactor the project entirely to make it simpler, more powerful in its applicability and manageability, and easy to extend where necessary. Hari and I were on the Incubator project along with a handful of other engineers, working around the clock with the Flume community to drive this vision and implementation forward. From that time until now, Flume has graduated into its own top-level Apache project, made several stable releases, and grown significantly richer in functionality.
Today, Flume is actively deployed and in use across the world in large numbers of data centers, sometimes spanning continental boundaries. It continues to effectively provide a super-resilient, fault-tolerant, reliable, fast, and efficient mechanism to move massive amounts of data from a variety of sources over to destination systems such as HBase, HDFS, etc. A well-planned Flume topology operates with minimal or no intervention, practically running itself indefinitely. It provides contextual routing and is able to work through downtimes, network outages, and other unpredictable or unplanned interruptions by reliably storing messages and retransmitting them when connectivity is restored. It does all of this out of the box, and yet provides the flexibility to customize any component within its implementation using fairly stable and intuitive interfaces that are widely in use.
In Using Flume, Hari provides an overview of the various components within Flume, diving into details where necessary. Operators will find this book immensely valuable for understanding how to easily set up and deploy Flume pipelines. Developers will find it a handy reference for building or customizing components within Flume, and for better understanding its architecture and component designs. Above all, this book will give you the necessary insights for setting up continuous ingestion for HDFS and HBase, the two most popular storage systems today.
With Flume deployed, you can be sure that your data, no matter where it is produced in your enterprise or how large its volume, will make it safely and promptly into your Big Data platforms. And you can then focus your energy on getting the right insights out of your data. Good luck!
Arvind Prabhakar, CTO, StreamSets
Preface
Today, developers are able to write and deploy applications on a large number of servers in the cloud very easily. These applications are producing more data than ever, which, when stored and analyzed, gives valuable insights that can improve the applications themselves and the businesses that the applications are a part of. The data generated by such applications is often analyzed using systems like Hadoop and HBase.
Analyzing this data is possible only if you can get it into these systems from the frontend servers in the first place. Often, such analysis becomes less valid as the data grows older. To get the data into the processing system in near real time, systems like Apache Flume are used. Apache Flume is a system for moving large amounts of data from large numbers of data producers to systems that store, index, or analyze that data. Such systems also decouple the producers from the consumers of the data, making it easy to change either side without the other knowing about it. In addition to decoupling, they also provide failure isolation and an added buffer between the producer and the storage system. The data producers will not know about the storage or indexing system being inaccessible until all of the Flume buffers also fill up; this provides an additional buffer, which might be enough for the storage system to come back online and clear up the backlog of events in the Flume buffers.
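To make this decoupling concrete, here is a minimal sketch of a single-agent Flume configuration: a source feeds events into a channel (the buffer), and a sink drains that channel into HDFS. The agent and component names (agent1, netcatSrc, memCh, hdfsSink) and all property values here are illustrative assumptions, not settings prescribed by this book.

```
# Hypothetical single-agent pipeline: netcat source -> memory channel -> HDFS sink.
agent1.sources = netcatSrc
agent1.channels = memCh
agent1.sinks = hdfsSink

# The source accepts events over TCP and writes them to the channel.
agent1.sources.netcatSrc.type = netcat
agent1.sources.netcatSrc.bind = 0.0.0.0
agent1.sources.netcatSrc.port = 44444
agent1.sources.netcatSrc.channels = memCh

# The channel is the buffer that decouples producers from the storage system;
# capacity is the maximum number of events held if the sink falls behind.
agent1.channels.memCh.type = memory
agent1.channels.memCh.capacity = 100000

# The sink drains the channel into HDFS; if HDFS is unreachable, events
# accumulate in the channel until its capacity is exhausted.
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.hdfsSink.channel = memCh
```

If the HDFS sink cannot write, events simply back up in the channel and producers keep sending until the channel's capacity is exhausted, which is the buffering behavior described above.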
In this book, we will discuss in detail why systems like Flume are needed, the internals of a Flume agent, and how to configure and deploy Flume agents. We will also discuss the various ways in which Flume deployments can be customized and how to write plug-ins for Flume.
Chapter 1 gives a basic introduction to Apache Hadoop and Apache HBase. It is only meant to introduce the reader to Hadoop and HBase and give some details of their internals, and can be skipped if the reader is already familiar with both systems.