Hadoop Application Architectures: description

Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many sources explain how to use various components in the Hadoop ecosystem, this practical book takes you through architectural considerations necessary to tie those components together into a complete tailored application, based on your particular use case.

To reinforce those lessons, the book's second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether you're designing a new Hadoop application, or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.

This book covers:

  • Factors to consider when using Hadoop to store and model data
  • Best practices for moving data in and out of the system
  • Data processing frameworks, including MapReduce, Spark, and Hive
  • Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
  • Giraph, GraphX, and other tools for large graph processing on Hadoop
  • Using workflow orchestration and scheduling tools such as Apache Oozie
  • Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
  • Architecture examples for clickstream analysis, fraud detection, and data warehousing


Hadoop Application Architectures

Mark Grover, Ted Malaska,
Jonathan Seidman & Gwen Shapira

Hadoop Application Architectures

by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira

Copyright © 2015 Jonathan Seidman, Gwen Shapira, Ted Malaska, and Mark Grover. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Editors: Ann Spencer and Brian Anderson
  • Production Editor: Nicole Shelby
  • Copyeditor: Rachel Monaghan
  • Proofreader: Elise Morrison
  • Indexer: Ellen Troutman
  • Interior Designer: David Futato
  • Cover Designer: Ellie Volckhausen
  • Illustrator: Rebecca Demarest
  • July 2015: First Edition
Revision History for the First Edition
  • 2015-06-26: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491900086 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Hadoop Application Architectures, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-90008-6

[LSI]

Foreword

Apache Hadoop has blossomed over the past decade.

It started in Nutch as a promising capability: the ability to scalably process petabytes. In 2005 it hadn't been run on more than a few dozen machines, and had many rough edges. It was only used by a few folks for experiments. Yet a few saw promise there, that an affordable, scalable, general-purpose data storage and processing framework might have broad utility.

By 2007 scalability had been proven at Yahoo!. Hadoop now ran reliably on thousands of machines. It began to be used in production applications, first at Yahoo! and then at other Internet companies, like Facebook, LinkedIn, and Twitter. But while it enabled scalable processing of petabytes, the price of adoption was high, with no security and only a Java batch API.

Since then Hadoop's become the kernel of a complex ecosystem. It's gained fine-grained security controls, high availability (HA), and a general-purpose scheduler (YARN).

A wide variety of tools have now been built around this kernel. Some, like HBase and Accumulo, provide online key-value stores that can back interactive applications. Others, like Flume, Sqoop, and Apache Kafka, help route data in and out of Hadoop's storage. Improved processing APIs are available through Pig, Crunch, and Cascading. SQL queries can be processed with Apache Hive and Cloudera Impala. Apache Spark is a superstar, providing an improved and optimized batch API while also incorporating real-time stream processing, graph processing, and machine learning. Apache Oozie and Azkaban orchestrate and schedule many of the above.

Confused yet? This menagerie of tools can be overwhelming. Yet, to make effective use of this new platform, you need to understand how these tools all fit together and which can help you. The authors of this book have years of experience building Hadoop-based systems and can now share with you the wisdom they've gained.

In theory there are billions of ways to connect and configure these tools for your use. But in practice, successful patterns emerge. This book describes best practices, where each tool shines, and how best to use it for a particular task. It also presents common use cases. At first users improvised, trying many combinations of tools, but this book describes the patterns that have proven successful again and again, sparing you much of the exploration.

These authors give you the fundamental knowledge you need to begin using this powerful new platform. Enjoy the book, and use it to help you build great Hadoop applications.

Doug Cutting
Shed in the Yard, California

Preface

It's probably not an exaggeration to say that Apache Hadoop has revolutionized data management and processing. Hadoop's technical capabilities have made it possible for organizations across a range of industries to solve problems that were previously impractical with existing technologies. These capabilities include:

  • Scalable processing of massive amounts of data
  • Flexibility for data processing, regardless of the format and structure (or lack of structure) in the data

Another notable feature of Hadoop is that it's an open source project designed to run on relatively inexpensive commodity hardware. Hadoop provides these capabilities at considerable cost savings over traditional data management solutions.

This combination of technical capabilities and economics has led to rapid growth in Hadoop and tools in the surrounding ecosystem. The vibrancy of the Hadoop community has led to the introduction of a broad range of tools to support management and processing of data with Hadoop.

Despite this rapid growth, Hadoop is still a relatively young technology. Many organizations are still trying to understand how Hadoop can be leveraged to solve problems, and how to apply Hadoop and associated tools to implement solutions to these problems. A rich ecosystem of tools, application programming interfaces (APIs), and development options provides choice and flexibility, but can make it challenging to determine the best choices to implement a data processing application.

The inspiration for this book comes from our experience working with numerous customers and conversations with Hadoop users who are trying to understand how to build reliable and scalable applications with Hadoop. Our goal is not to provide detailed documentation on using available tools, but rather to provide guidance on how to combine these tools to architect scalable and maintainable applications on Hadoop.

We assume readers of this book have some experience with Hadoop and related tools. You should have a familiarity with the core components of Hadoop, such as the Hadoop Distributed File System (HDFS) and MapReduce. If you need to come up to speed on Hadoop, or need refreshers on core Hadoop concepts, Hadoop: The Definitive Guide by Tom White remains, well, the definitive guide.

The following is a list of other tools and technologies that are important to understand in using this book, including references for further reading:

YARN
Up until recently, the core of Hadoop was commonly considered to be HDFS and MapReduce. This has been changing rapidly with the introduction of additional processing frameworks for Hadoop, and the introduction of YARN accelerates the move toward Hadoop as a big-data platform supporting multiple parallel processing models. YARN provides a general-purpose resource manager and scheduler for Hadoop processing, which includes MapReduce, but also extends these services to other processing models. This facilitates the support of multiple processing frameworks and diverse workloads on a single Hadoop cluster, and allows these different models and workloads to effectively share resources. For more on YARN, see
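As a minimal illustration of the point above (not an example from the book), the following sketch uses the YARN client API to list the applications currently running on a cluster. Because every framework sharing the cluster, whether MapReduce, Spark, or something else, runs as a YARN application, a single call against the ResourceManager surfaces all of them. It assumes a Hadoop 2.x client on the classpath and a reachable ResourceManager configured through the standard yarn-site.xml; the class name is arbitrary.

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListYarnApplications {
      public static void main(String[] args) throws Exception {
        // Picks up yarn-site.xml and core-site.xml from the classpath,
        // including the ResourceManager address.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        try {
          // Each framework job sharing the cluster appears as its own
          // YARN application, with its own type, state, and queue.
          for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.printf("%s  type=%s  state=%s  queue=%s%n",
                app.getApplicationId(),
                app.getApplicationType(),
                app.getYarnApplicationState(),
                app.getQueue());
          }
        } finally {
          yarnClient.stop();
        }
      }
    }

Run against a live cluster, the output mirrors the ResourceManager web UI: a mix of application types drawing on the same pool of memory and vCPUs, which is the multi-framework resource sharing this preface describes.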