Real-World Hadoop
Ted Dunning
Ellen Friedman
Beijing Cambridge Farnham Köln Sebastopol Tokyo
Dedication
The authors dedicate this book with gratitude to Yorick Wilks, Fellow of the British Computing Society and Professor Emeritus in the Natural Language Processing Group at University of Sheffield, Senior Research Fellow at the Oxford Internet Institute, Senior Research Scientist at the Florida Institute for Human and Machine Cognition, and an extraordinary person.
Yorick mentored Ted Dunning as Department Chair and his graduate advisor during Ted's doctoral studies in Computing Science at the University of Sheffield. He also provided guidance as Ted's supervisor while Yorick was Director of the Computing Research Laboratory, New Mexico State University, where Ted did research on statistical methods for natural language processing (NLP). Yorick's strong leadership showed that critical and open examination of a wide range of ideas is the foundation of real progress. Ted can only hope to try to live up to that ideal.
We both are grateful to Yorick for his outstanding and continuing contributions to computing science, especially in the fields of artificial intelligence and NLP, through a career that spans five decades. His brilliance in research is matched by a sparkling wit, and it is both a pleasure and an inspiration to know him.
These links provide more details about Yorick's work:
http://staffwww.dcs.shef.ac.uk/people/Y.Wilks/
http://en.wikipedia.org/wiki/Yorick_Wilks
Preface
This book is for you if you are interested in how Apache Hadoop and related technologies can address problems involving large-scale data in cost-effective ways. Whether you are new to Hadoop or a seasoned user, you should find the content in this book both accessible and helpful.
Here we speak to business team leaders, CIOs, business analysts, and technical developers to explain in basic terms how NoSQL Apache Hadoop and Apache HBase-related technologies work to meet big data challenges and the ways in which people are using them, including using Hadoop in production. Detailed knowledge of Hadoop is not a prerequisite for this book. We do assume you are roughly familiar with what Hadoop and HBase are, and we focus mainly on how best to use them to advantage. The book includes some suggestions for best practice, but it is intended neither as a technical reference nor as a comprehensive guide to how to use these technologies, and people can easily read it whether or not they have a deeply technical background. That said, we think that technical adepts will also benefit, not so much from a review of tools, but from a sharing of experience.
Based on real-world situations and experience, in this book we aim to describe how Hadoop-based systems and new NoSQL database technologies such as Apache HBase have been used to solve a wide variety of business and research problems. These tools have grown to be very effective and production-ready. Hadoop and associated tools are being used successfully in a variety of use cases and sectors. Choosing to move to these new approaches is a big decision, and the first step is to recognize how these solutions can help you achieve your own specific goals. For those just getting started, we describe some of the pre-planning and early decisions that can make the process easier and more productive. People who are already using Hadoop and NoSQL-based technologies will find suggestions for new ways to gain the full range of benefits possible from employing Hadoop well.
In order to help inform the choices people make as they consider these new solutions, we've put together:
- An overview of the reasons people are turning to these technologies
- A brief review of what the Hadoop ecosystem tools can do for you
- A collection of tips for success
- A description of some widely applicable prototypical use cases
- Stories from the real world to show how people are already using Hadoop and NoSQL successfully for experimentation, development, and in production
This book is a selection of various examples that should help guide decisions and spark your ideas for how best to employ these technologies. The examples we describe are based on how customers use the Hadoop distribution from MapR Technologies to solve their big data needs in many situations across a range of different sectors. The uses for Hadoop we describe are not, however, limited to MapR. Where a particular capability is MapR-specific, we call that to your attention and explain how this would be handled by other Hadoop distributions. Regardless of the Hadoop distribution you choose, you should be able to see yourself in these examples and gain insights into how to make the best use of Hadoop for your own purposes.
How to Use This Book
If you are inexperienced with Apache Hadoop and NoSQL non-relational databases, you will find basic advice to get you started, as well as suggestions for planning your use of Hadoop going forward.
If you are a seasoned Hadoop user and are familiar with Hadoop-based tools, you may want to mostly skim, or even skip, the overview of the Hadoop ecosystem tools except as a quick review.
For all readers, the stories from the real world show how Hadoop users are putting the available options together in real-world settings to address many different problems.
We hope you find this approach helpful.
Ted Dunning and Ellen Friedman, January 2015
Chapter 1. Turning to Apache Hadoop and NoSQL Solutions
Some questions are easier to answer than others. In response to the question, "Is Hadoop ready for production?" the answer is, simply, yes.
This answer may surprise you, given how young the Apache Hadoop technology actually is. You may wonder on what basis we offer this definitive response to the question of Hadoop's readiness for production. The key reason we say that it is ready is simply that so many organizations are already using Hadoop in production and doing so successfully. Of course, being ready for production is not the same thing as being a mature technology.
Will Hadoop-based technologies change over the next few years? Of course they will. This is a rapidly expanding new arena, with continual improvements in the underlying technology and the appearance of innovative new tools that run in this ecosystem. The level of experience and understanding among Hadoop users is also rapidly increasing. As Hadoop and its related technologies continue to progress toward maturity, there will be a high rate of change. Not only will new features and capabilities be added, but these technologies will also generally become easier to use as they become more refined.
Are these technologies a good choice for you? The answer to that question is more complicated, as it depends on your own project goals, your resources, and your willingness to adopt new approaches. Even with a mature technology, there would be a learning curve to account for in planning the use of something different; with a maturing technology you also have to account for a cost of novelty and stay adaptable to rapid change in the technology. Hadoop and NoSQL solutions are still young, so not only are the tools themselves still somewhat short of maturity, there is also a more limited pool of experienced users from which to select when building out your own team than with some older approaches.
Even so, Hadoop adoption is widespread and growing rapidly. For many, the question is no longer whether or not to turn to Hadoop and NoSQL solutions for their big data challenges but rather,