Preface
Welcome to MapReduce Design Patterns! This book will be unique in some ways and familiar in others. First and foremost, this book is about design patterns, which are templates or general guides to solving problems. We looked to earlier design patterns books for inspiration, particularly Design Patterns: Elements of Reusable Object-Oriented Software, by Gamma et al. (1995), which is commonly referred to as "The Gang of Four" book. For each pattern, you'll see a template that we reuse over and over and that we loosely based on their book. Repeatedly seeing a similar template will help you get to the specific information you need. This will be especially useful in the future when using this book as a reference.
This book is a bit more open-ended than a book in the "cookbook" series of texts, as we don't call out specific problems. However, like the cookbooks, the lessons in this book are short and categorized. You'll have to go a bit further than just copying and pasting our code to solve your problems, but we hope that you will find a pattern to get you at least 90% of the way for just about all of your challenges.
This book is mostly about the analytics side of Hadoop or MapReduce. We intentionally try not to dive into too much detail on how Hadoop or MapReduce works or talk too long about the APIs that we are using. These topics have been written about quite a few times, both online and in print, so we decided to focus on analytics.
In this preface, we'll talk about how to read this book, since its format might be a bit different from most books you've read.
Intended Audience
Our motivation for writing this book was to fill a gap we saw in a lot of new MapReduce developers. They had learned how to use the system and had gotten comfortable with writing MapReduce, but lacked the experience to understand how to do things right or well. The intent of this book is to spare you from making some of those mistakes yourself by showing you how experts have figured out how to solve problems with MapReduce. So, in some ways, this book can be viewed as an intermediate or advanced MapReduce developer resource, but we think early beginners and gurus will find it useful as well.
This book is also intended for anyone wanting to learn more about the MapReduce paradigm. The book goes deeply into the technical side of MapReduce with code examples and detailed explanations of the inner workings of a MapReduce system, which will help software engineers develop MapReduce analytics. However, quite a bit of time is spent discussing the motivation of some patterns and the common use cases for these patterns, which could be interesting to someone who just wants to know what a system like Hadoop can do.
To get the most out of this book, we suggest you have some knowledge of Hadoop, as all of the code examples are written for Hadoop and many of the patterns are discussed in a Hadoop context. A brief refresher will be given in the first chapter, along with some suggestions for additional reading material.
Pattern Format
The patterns in this book follow a single template format so they are easier to read in succession. Some patterns will omit some of the sections if they don't make sense in the context of that pattern.
Intent: This section is a quick description of the problem the pattern is intended to solve.
Motivation: This section explains why you would want to solve this problem or where it would appear. Some use cases are typically discussed in brief.
Applicability: This section contains a set of criteria that must be true to be able to apply this pattern to a problem. Sometimes these are limitations in the design of the pattern and sometimes they help you make sure this pattern will work in your situation.
Structure: This section explains the layout of the MapReduce job itself. It'll explain what the map phase does, what the reduce phase does, and also let you know if it'll be using any custom partitioners, combiners, or input formats. This is the meat of the pattern and explains how to solve the problem.
Consequences: This section is pretty short and just explains what the output of the pattern will be. This is the end goal the pattern works toward.
Resemblances: For readers who have some experience with SQL or Pig, this section will show analogies of how this problem would be solved with these other languages. You may even find yourself reading this section first, as it gets straight to the point of what this pattern does.
Sometimes SQL, Pig, or both are omitted if what we are doing with MapReduce is truly unique.
Known Uses: This section outlines some common use cases for this pattern.
Performance Analysis: This section explains the performance profile of the analytic produced by the pattern. Understanding this is important because every MapReduce analytic needs to be tweaked and configured properly to maximize performance. Without knowing what resources it is using on your cluster, it would be difficult to do this.
The Examples in This Book
All of the examples in this book are written for Hadoop version 1.0.3. MapReduce is a paradigm that is seen in a number of open source and commercial systems these days, but we had to pick one to make our examples consistent and easy to follow, so we picked Hadoop. Hadoop was a logical choice since it is a widely used system, but we hope that users of MongoDB's MapReduce and other MapReduce implementations will be able to extrapolate the examples in this text to their particular system of choice.
Caution
In general, we try to use the newer mapreduce API for all of our examples, not the deprecated mapred API. Just be careful when mixing code from this book with other sources, as plenty of people still use mapred and their APIs are not compatible.
Our examples generally omit any sort of error handling, mostly to make the code more terse. In real-world big data systems, you can expect your data to be malformed and you'll want to be proactive in handling those situations in your analytics.
We use the same data set throughout this text: a dump of StackOverflow's databases. StackOverflow is a popular website where software developers can go to ask and answer questions about any coding topic (including Hadoop). This data set was chosen because it is reasonable in size, yet not so big that you can't use it on a single node. This data set also contains human-generated natural language text as well as structured elements like usernames and dates.
Throughout the examples in this book, we try to break out parsing logic of this data set into helper functions to clearly distinguish what code is specific to this data set and which code is general and part of the pattern. Since the XML is pretty simple, we usually avoid using a full-blown XML parser and just parse it with some string operations in our Java code.
The data set contains five tables, of which we only use three: comments, posts, and users. All of the data is in well-formed XML, with one record per line.
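To give a feel for the string-based parsing described above, here is a minimal sketch of what such a helper might look like. It is not the helper used in the rest of the book: the class name XmlRowParser, the method name parse, and the sample row with its attribute names are all assumptions made for illustration. The sketch assumes each line is a single self-closing row element with double-quoted attributes, and it only decodes the five predefined XML entities, so numeric character references are left untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns one XML record line into an attribute map
// using plain string operations instead of a full XML parser.
public class XmlRowParser {

    // Assumes a line shaped like: <row Name1="value1" Name2="value2" />
    // Returns an empty map for null or malformed lines rather than throwing,
    // so the caller can decide how to count or skip bad records.
    public static Map<String, String> parse(String line) {
        Map<String, String> attrs = new LinkedHashMap<>();
        if (line == null) {
            return attrs;
        }
        String trimmed = line.trim();
        if (!trimmed.startsWith("<row") || !trimmed.endsWith("/>")) {
            return attrs;
        }
        // Drop the leading "<row" and the trailing "/>".
        String body = trimmed.substring(4, trimmed.length() - 2);
        int pos = 0;
        while (true) {
            int eq = body.indexOf("=\"", pos);
            if (eq < 0) {
                break; // no more attributes
            }
            int close = body.indexOf('"', eq + 2);
            if (close < 0) {
                break; // unterminated value: keep what we have so far
            }
            String name = body.substring(pos, eq).trim();
            String value = body.substring(eq + 2, close);
            if (!name.isEmpty()) {
                attrs.put(name, unescape(value));
            }
            pos = close + 1;
        }
        return attrs;
    }

    // Decodes only the five predefined XML entities; "&amp;" goes last so
    // that text such as "&amp;lt;" correctly becomes the literal "&lt;".
    private static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Made-up sample row; real records carry more attributes.
        String line = "<row Id=\"1\" UserId=\"7\" Text=\"a &lt; b\" />";
        System.out.println(parse(line));
    }
}
```

Keeping the parsing in one small function like this means a mapper can call it on each input line and then work only with attribute names, which keeps the data-set-specific code separate from the pattern itself.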
We use the following three StackOverflow tables in this book:
comments