Philip (flip) Kromer - Big Data for Chimps: A Guide to Massive-Scale Data Processing in Practice

Big Data for Chimps: A Guide to Massive-Scale Data Processing in Practice: Summary and Description

Finding patterns in massive event streams can be difficult, but learning how to find them doesn't have to be. This unique hands-on guide shows you how to solve this and many other problems in large-scale data processing with simple, fun, and elegant tools that leverage Apache Hadoop. You'll gain a practical, actionable view of big data by working with real data and real problems.

Perfect for beginners, this book's approach will also appeal to experienced practitioners who want to brush up on their skills. Part I explains how Hadoop and MapReduce work, while Part II covers many analytic patterns you can use to process any data. As you work through several exercises, you'll also learn how to use Apache Pig to process data.

  • Learn the necessary mechanics of working with Hadoop, including how data and computation move around the cluster
  • Dive into map/reduce mechanics and build your first map/reduce job in Python
  • Understand how to run chains of map/reduce jobs in the form of Pig scripts
  • Use a real-world dataset (baseball performance statistics) throughout the book
  • Work with examples of several analytic patterns, and learn when and where you might use them

Big Data for Chimps

Philip Kromer and Russell Jurney

Big Data for Chimps

by Philip Kromer and Russell Jurney

Copyright 2016 Philip Kromer and Russell Jurney. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Acquisitions Editor: Mike Loukides
  • Editors: Meghan Blanchette and Amy Jollymore
  • Production Editor: Matthew Hacker
  • Copyeditor: Jasmine Kwityn
  • Proofreader: Rachel Monaghan
  • Indexer: Wendy Catalano
  • Interior Designer: David Futato
  • Cover Designer: Ellie Volckhausen
  • Illustrator: Rebecca Demarest
  • October 2015: First Edition

Revision History for the First Edition

  • 2015-09-25: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491923948 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Big Data for Chimps, the cover image of a chimpanzee, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-92394-8

[LSI]

Preface

Big Data for Chimps will explain a practical, actionable view of big data. This view will be centered on tested best practices, and will also give readers street-fighting smarts with Hadoop.

Readers will come away with a useful, conceptual idea of big data. Insight is data in context. The key to understanding big data is scalability: infinite amounts of data can rest upon distinct pivot points. We will teach you how to manipulate data about these pivot points.

Finally, the book will contain examples with real data and real problems that will bring the concepts and applications for business to life.

What This Book Covers

Big Data for Chimps shows you how to solve important problems in large-scale data processing using simple, fun, and elegant tools.

Finding patterns in massive event streams is an important, hard problem. Most of the time, there aren't earthquakes, but the patterns that will let you predict one in advance lie within the data from those quiet periods. How do you compare the trillions of subsequences in billions of events, each to each other, to find the very few that matter? Once you have those patterns, how do you react to them in real time?

We've chosen case studies that anyone can understand and that are general enough to apply to whatever problems you're looking to solve. Our goal is to provide you with the following:

  • The ability to think at scale, equipping you with a deep understanding of how to break a problem into efficient data transformations, and of how data must flow through the cluster to effect those transformations
  • Detailed example programs applying Hadoop to interesting problems in context (a first taste of what such a program looks like is sketched just after this list)
  • Advice and best practices for efficient software development
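To give that a concrete shape right away, here is a minimal sketch of a first map/reduce job in Python written for Hadoop Streaming. It is our illustration, not one of the book's worked examples: the tab-separated input layout, the file names, and the choice of counting records per key are assumptions made purely for the sketch.

    #!/usr/bin/env python
    # mapper.py -- sketch of a Hadoop Streaming mapper (illustrative only; the
    # tab-separated input layout is an assumption, not a dataset from the book).
    # Reads raw records from stdin and emits one "key <TAB> 1" pair per record.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(fields[0] + "\t1")

    #!/usr/bin/env python
    # reducer.py -- the matching reducer. Hadoop sorts the mapper output by key
    # before it reaches the reducer, so identical keys arrive on consecutive
    # lines and a running count is all that is needed.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print(current_key + "\t" + str(count))
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(current_key + "\t" + str(count))

Hadoop Streaming runs the two scripts as separate mapper and reducer processes; an invocation typically looks something like hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input events -output counts, with the jar path and the input and output locations depending on your installation. You can also test the pipeline locally with cat events.tsv | ./mapper.py | sort | ./reducer.py.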

All of the examples use real data, and describe patterns found in many problem domains (one of which is sketched just after this list), as you:

  • Create statistical summaries
  • Identify patterns and groups in the data
  • Search, filter, and herd records in bulk
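For instance, a statistical summary of the baseball data boils down to filtering the records you care about, grouping them by player, and computing an aggregate per group. The following small, self-contained Python sketch shows that shape; it is not taken from the book (which expresses these patterns in Pig), and the column layout of the input file is an assumption made only for illustration.

    #!/usr/bin/env python
    # summarize_batting.py -- illustrative only. Filter records, group them by
    # player, and emit a per-player summary (total at-bats and batting average).
    # Assumed tab-separated input columns: player_id, year, at_bats, hits.
    import sys
    from collections import defaultdict

    at_bats_by_player = defaultdict(int)
    hits_by_player = defaultdict(int)

    for line in sys.stdin:
        player_id, year, at_bats, hits = line.rstrip("\n").split("\t")
        if int(at_bats) == 0:
            continue  # filter: skip seasons with no at-bats
        at_bats_by_player[player_id] += int(at_bats)
        hits_by_player[player_id] += int(hits)

    for player_id in sorted(at_bats_by_player):
        total_ab = at_bats_by_player[player_id]
        batting_avg = hits_by_player[player_id] / float(total_ab)
        print("%s\t%d\t%.3f" % (player_id, total_ab, batting_avg))

In Pig, the same shape is roughly a FILTER, a GROUP, and a FOREACH ... GENERATE over the grouped records; the point of the sketch is only the pattern, which stays the same whether it runs on one machine or across a cluster.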

The emphasis on simplicity and fun should make this book especially appealing to beginners, but this is not an approach you'll outgrow. We've found it's the most powerful and valuable approach for creative analytics. One of our maxims is robots are cheap, humans are important: write readable, scalable code now and find out later whether you want a smaller cluster. The code you see is adapted from programs we write at Infochimps and Data Syndrome to solve enterprise-scale business problems, and these simple high-level transformations meet our needs.

Many of the chapters include exercises. If you're a beginning user, we highly recommend you work through at least one exercise from each chapter. Deep learning will come less from having the book in front of you as you read it than from having the book next to you while you write code inspired by it. There are sample solutions and result datasets on the book's website.

Who This Book Is For

We'd like for you to be familiar with at least one programming language, but it doesn't have to be Python or Pig. Familiarity with SQL will help a bit, but isn't essential. Some exposure to working with data in a business intelligence or analysis background will be helpful.

Most importantly, you should have an actual project in mind that requires a big-data toolkit to solve a problem that requires scaling out across multiple machines. If you don't already have a project in mind but really want to learn about the big-data toolkit, take a look at , which uses baseball data. It makes a great dataset for fun exploration.

Who This Book Is Not For

This is not Hadoop: The Definitive Guide (that's already been written, and well); this is more like Hadoop: A Highly Opinionated Guide. The only coverage of how to use the bare Hadoop API is to say, in most cases, don't. We recommend storing your data in one of several highly space-inefficient formats and in many other ways encourage you to willingly trade a small performance hit for a large increase in programmer joy. The book has a relentless emphasis on writing scalable code, but no content on writing performant code beyond the advice that the best path to a 2x speedup is to launch twice as many machines.

That is because for almost everyone, the cost of the cluster is far less than the opportunity cost of the data scientists using it. If you have not just big data but huge data (let's say somewhere north of 100 terabytes), then you will need to make different trade-offs for jobs that you expect to run repeatedly in production. However, even at petabyte scale, you will still develop in the manner we outline.

The book does include some information on provisioning and deploying Hadoop, and on a few important settings. But it does not cover advanced algorithms, operations, or tuning in any real depth.

What This Book Does Not Cover

We are not currently planning to cover Hive. The Pig scripts will translate naturally for folks who are already familiar with Hive.

This book picks up where the Internet leaves off. We're not going to spend any real time on information well covered by basic tutorials and core documentation. Other things we do not plan to include:

  • Installing or maintaining Hadoop.
  • Other MapReduce-like platforms (Disco, Spark, etc.) or other frameworks (Wukong, Scalding, Cascading).
  • At a few points, we'll use Unix text utils (cut/wc/etc.), but only as tools for an immediate purpose. We can't justify going deep into any of them; there are whole O'Reilly books covering these utilities.