
Steve Hoffman - Apache Flume: Distributed Log Collection for Hadoop


  • Book:
    Apache Flume: Distributed Log Collection for Hadoop
  • Author:
    Steve Hoffman
  • Publisher:
    Packt Publishing
  • Genre:
    Computer
  • Year:
    2013

Apache Flume: Distributed Log Collection for Hadoop: summary, description and annotation


Stream data to Hadoop using Apache Flume

Overview

  • Integrate Flume with your data sources
  • Transcode your data en-route in Flume
  • Route and separate your data using regular expression matching
  • Configure failover paths and load-balancing to remove single points of failure
  • Utilize Gzip Compression for files written to HDFS
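As a rough sketch of the kind of configuration the book works through, a minimal single-agent pipeline might look like the following. The agent, source, channel, and sink names here are arbitrary, and the port and HDFS path are illustrative assumptions; the property keys themselves (netcat source, memory channel, HDFS sink with Gzip compression) come from standard Flume configuration.

```properties
# Hypothetical agent "a1": one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: accept newline-terminated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write Gzip-compressed files to date-partitioned HDFS paths
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

Such a file would typically be passed to the agent launcher, for example `flume-ng agent --name a1 --conf-file example.conf`.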

In Detail

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with many failover and recovery mechanisms.

Apache Flume: Distributed Log Collection for Hadoop covers problems with HDFS and streaming data/logs, and how Flume can resolve them. The book explains the generalized architecture of Flume, including moving data to and from databases and NoSQL data stores, as well as optimizing performance, and includes real-world scenarios on Flume implementation.

Apache Flume: Distributed Log Collection for Hadoop starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete installation process and compilation of Flume.

It introduces channels and channel selectors, and for each architectural component (sources, channels, sinks, channel processors, sink groups, and so on) covers the various implementations in detail, along with their configuration options, so you can tailor Flume to your specific needs. Pointers on writing custom implementations are also provided to help you learn and implement them.

By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.

What you will learn from this book

    • Understand the Flume architecture
    • Download and install open source Flume from Apache
    • Discover when to use a memory or file-backed channel
    • Understand and configure the Hadoop File System (HDFS) sink
    • Learn how to use sink groups to create redundant data flows
    • Configure and use various sources for ingesting data
    • Inspect data records and route to different or multiple destinations based on payload content
    • Transform data en-route to Hadoop
    • Monitor your data flows

    Approach

    A starter guide that covers Apache Flume in detail.

    Who this book is written for

    Apache Flume: Distributed Log Collection for Hadoop is intended for people who are responsible for moving datasets into Hadoop in a timely and reliable manner, such as software engineers, database administrators, and data warehouse administrators.



    Apache Flume: Distributed Log Collection for Hadoop

    Copyright 2013 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: July 2013

    Production Reference: 1090713

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78216-791-4

    www.packtpub.com

    Cover Image by Abhishek Pandey

    Credits

    Author

    Steve Hoffman

    Reviewers

    Subash D'Souza

    Stefan Will

    Acquisition Editor

    Kunal Parikh

    Commissioning Editor

    Sharvari Tawde

    Technical Editors

    Jalasha D'costa

    Mausam Kothari

    Project Coordinator

    Sherin Padayatty

    Proofreader

    Aaron Nash

    Indexer

    Monica Ajmera Mehta

    Graphics

    Valentina D'silva

    Abhinash Sahu

    Production Coordinator

    Kirtee Shingan

    Cover Work

    Kirtee Shingan

    About the Author

    Steve Hoffman has 30 years of software development experience and holds a B.S. in computer engineering from the University of Illinois Urbana-Champaign and an M.S. in computer science from DePaul University. He is currently a Principal Engineer at Orbitz Worldwide.

    More information on Steve can be found at http://bit.ly/bacoboy or on Twitter @bacoboy.

    This is Steve's first book.

    I'd like to dedicate this book to my loving wife Tracy. Her dedication to pursuing what you love is unmatched, and it inspires me to follow her excellent lead in all things.

    I'd also like to thank Packt Publishing for the opportunity to write this book and my reviewers and editors for their hard work in making it a reality.

    Finally, I want to wish a fond farewell to my brother Richard who passed away recently. No book has enough pages to describe in detail just how much we will all miss him. Good travels brother.

    About the Reviewers

    Subash D'Souza is a professional software developer with strong expertise in crunching big data using Hadoop/HBase with Hive/Pig. For several years before moving into Hadoop full time, he worked with Perl/PHP/Python for coding, with MySQL/Oracle as the backend. He has worked on scaling for load, code development, and optimization for speed, and also has experience optimizing SQL queries for database interactions. His specialties include Hadoop, HBase, Hive, Pig, Sqoop, Flume, Oozie, scaling, web data mining, PHP, Perl, Python, Oracle, SQL Server, and MySQL replication/clustering.

    I would like to thank my wife, Theresa, for her kind words of support and encouragement.

    Stefan Will is a computer scientist with a degree in machine learning and pattern recognition from the University of Bonn. For over a decade, he has worked for several startup companies in Silicon Valley and Raleigh, North Carolina, in the area of search and analytics. Presently, he leads the development of the search backend and the Hadoop-based product analytics platform at Zendesk, the customer service software provider.

    www.PacktPub.com
    Support files, eBooks, discount offers and more

    You might want to visit www.PacktPub.com for support files and downloads related to your book.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? See www.PacktPub.com for details on upgrading to the eBook version.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.


    http://PacktLib.PacktPub.com

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

    Why Subscribe?
    • Fully searchable across every book published by Packt
    • Copy and paste, print and bookmark content
    • On demand and accessible via web browser
    Free Access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

    Preface

    Hadoop is a great open source tool for sifting tons of unstructured data into something manageable, so that your business can gain better insight into your customers' needs. It is cheap (can be mostly free), scales horizontally as long as you have space and power in your data center, and can handle problems your traditional data warehouse would be crushed under. That said, a little-known secret is that your Hadoop cluster requires you to feed it with data; otherwise, you just have a very expensive heat generator. You will quickly find, once you get past the playing-around phase with Hadoop, that you will need a tool to automatically feed data into your cluster. In the past, you had to come up with a solution for this problem, but no more! Flume started as a project out of Cloudera when their integration engineers had to keep writing tools over and over again for their customers to import data automatically. Today the project lives with the Apache Foundation, is under active development, and boasts users who have been using it in their production environments for years.

    In this book I hope to get you up and running quickly with an architectural overview of Flume and a quick start guide. After that we'll deep-dive into the details of many of the more useful Flume components, including the very important File Channel for persistence of in-flight data records and the HDFS Sink for buffering and writing data into HDFS, the Hadoop Distributed File System. Since Flume comes with a wide variety of modules, chances are that the only tool you'll need to get started is a text editor for the configuration file.
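As a small taste of the two components called out above, a durable File Channel paired with an HDFS Sink might be configured along these lines. The directories, paths, and roll sizes are illustrative assumptions; the property keys are standard Flume configuration for the file channel and HDFS sink.

```properties
# Hypothetical durable pipeline: file-backed channel feeding an HDFS sink
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/logs/%Y-%m-%d
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
```

Because the channel spools events to local disk, in-flight records survive an agent restart, at the cost of some throughput compared with the memory channel.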

    By the end of the book, you should know enough to build out a highly available, fault tolerant, streaming data pipeline feeding your Hadoop cluster.

    What this book covers

    Overview and Architecture introduces the reader to Flume and the problem space that it is trying to address (specifically with regard to Hadoop). An architectural overview is given of the various components to be covered in the later chapters.

