Readers will come away with a useful conceptual picture of big data. Insight is data in context. The key to understanding big data is scalability: arbitrarily large amounts of data can be organized around distinct pivot points, and we will teach you how to manipulate data about those pivot points.
Finally, the book contains examples with real data and real problems that bring the concepts and their business applications to life.
What This Book Covers
Big Data for Chimps shows you how to solve important problems in large-scale data processing using simple, fun, and elegant tools.
Finding patterns in massive event streams is an important, hard problem. Most of the time, there aren't earthquakes, but the patterns that will let you predict one in advance lie within the data from those quiet periods. How do you compare the trillions of subsequences in billions of events, each against the others, to find the very few that matter? Once you have those patterns, how do you react to them in real time?
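To make that concrete, here is a minimal sketch; it is not one of the book's examples, and the window length and exact-match grouping are assumptions for illustration. It shows why comparing subsequences pairwise is hopeless, and how grouping windows by a shared key sidesteps the problem:

```python
# Illustrative sketch only: comparing n windows each-to-each is O(n^2), which
# is intractable at billions of events; keying each window by its contents
# lets identical subsequences meet in one place in O(n).
from collections import defaultdict

def windows(events, size=5):
    """Yield every consecutive subsequence ("window") of length `size`."""
    for i in range(len(events) - size + 1):
        yield tuple(events[i:i + size])

def group_by_window(events, size=5):
    """Count how often each distinct window occurs; on a cluster, this
    grouping is exactly what the shuffle/reduce step does for us."""
    counts = defaultdict(int)
    for w in windows(events, size):
        counts[w] += 1   # the window itself serves as the grouping key
    return counts

# e.g. group_by_window([1, 2, 1, 2, 1, 2, 1], size=3)
# -> {(1, 2, 1): 3, (2, 1, 2): 2}
```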
We've chosen case studies anyone can understand, and that are general enough to apply to whatever problems you're looking to solve. Our goal is to provide you with the following:
- The ability to think at scale: a deep understanding of how to break a problem into efficient data transformations, and of how data must flow through the cluster to effect those transformations
- Detailed example programs applying Hadoop to interesting problems in context
- Advice and best practices for efficient software development
All of the examples use real data and describe patterns found in many problem domains, as you do the following (sketched briefly after this list):
- Create statistical summaries
- Identify patterns and groups in the data
- Search, filter, and herd records in bulk
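As a taste of what those transformations look like, here is a toy sketch in plain Python; it is not one of the book's examples, and the record fields and values are made up for illustration:

```python
# Toy illustration of the three transformations above, on made-up web-log
# records. In the book these become Pig or Hadoop jobs over real datasets,
# but the shape of the work is the same.
from collections import defaultdict
from statistics import mean

records = [
    {"page": "/home",  "status": 200, "ms": 41},
    {"page": "/home",  "status": 200, "ms": 37},
    {"page": "/login", "status": 500, "ms": 903},
]

# Search, filter, and herd records in bulk: keep only successful requests.
ok = [r for r in records if r["status"] == 200]

# Identify patterns and groups: gather response times per page.
by_page = defaultdict(list)
for r in ok:
    by_page[r["page"]].append(r["ms"])

# Create statistical summaries: one row per group.
summary = {page: {"count": len(times), "avg_ms": mean(times)}
           for page, times in by_page.items()}
print(summary)  # one summary row per page: request count and average latency
```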
The emphasis on simplicity and fun should make this book especially appealing to beginners, but this is not an approach you'll outgrow. We've found it's the most powerful and valuable approach for creative analytics. One of our maxims is "robots are cheap, humans are important": write readable, scalable code now and find out later whether you want a smaller cluster. The code you see is adapted from programs we write at Infochimps and Data Syndrome to solve enterprise-scale business problems, and these simple high-level transformations meet our needs.
Many of the chapters include exercises. If you're a beginning user, we highly recommend you work through at least one exercise from each chapter. Deep learning will come less from having the book in front of you as you read it than from having it next to you while you write code inspired by it. There are sample solutions and result datasets on the book's website.
Who This Book Is For
We'd like you to be familiar with at least one programming language, but it doesn't have to be Python or Pig. Familiarity with SQL will help a bit, but isn't essential. Some exposure to working with data in a business intelligence or analysis setting will also be helpful.
Most importantly, you should have an actual project in mind that requires a big-data toolkit to solve: a problem that requires scaling out across multiple machines. If you don't already have a project in mind but really want to learn about the big-data toolkit, take a look at the examples built on baseball data; it makes a great dataset for fun exploration.
Who This Book Is Not For
This is not Hadoop: The Definitive Guide (that's already been written, and well); this is more like Hadoop: A Highly Opinionated Guide. The only coverage of how to use the bare Hadoop API is to say, in most cases, don't. We recommend storing your data in one of several highly space-inefficient formats, and in many other ways we encourage you to willingly trade a small performance hit for a large increase in programmer joy. The book has a relentless emphasis on writing scalable code, but no content on writing performant code beyond the advice that the best path to a 2x speedup is to launch twice as many machines.
That is because for almost everyone, the cost of the cluster is far less than the opportunity cost of the data scientists using it. If you have not just big data but huge data (let's say somewhere north of 100 terabytes), then you will need to make different trade-offs for jobs that you expect to run repeatedly in production. However, even at petabyte scale, you will still develop in the manner we outline.
The book does include some information on provisioning and deploying Hadoop, and on a few important settings. But it does not cover advanced algorithms, operations, or tuning in any real depth.
What This Book Does Not Cover
We are not currently planning to cover Hive. The Pig scripts will translate naturally for folks who are already familiar with Hive.
This book picks up where the Internet leaves off. We're not going to spend any real time on information well covered by basic tutorials and core documentation. Other things we do not plan to include:
- Installing or maintaining Hadoop
- Other MapReduce-like platforms (Disco, Spark, etc.) or other frameworks (Wukong, Scalding, Cascading)
At a few points, we'll use Unix text utilities (cut, wc, etc.), but only as tools for an immediate purpose. We can't justify going deep into any of them; there are whole O'Reilly books covering these utilities.