Programming Elastic MapReduce
Kevin Schmidt
Christopher Phillips
Preface
Many organizations have a treasure trove of data stored away in the many silos of information within them. To unlock this information and use it to compete in the marketplace, organizations have begun looking to Hadoop and Big Data as the key to gaining an advantage over their competition. Many organizations, however, lack the knowledgeable resources and data center space to launch large-scale Hadoop solutions for their data analysis projects.
Amazon Elastic MapReduce (EMR) is Amazon's Hadoop solution, running in Amazon's data center. Amazon's solution allows organizations to focus on the data analysis problems they want to solve without the need to plan data center buildouts and maintain large clusters of machines. Amazon's pay-as-you-go model is another benefit: it allows organizations to start these projects with no upfront costs and scale quickly as the project grows. We hope this book inspires you to explore Amazon Web Services (AWS) and Amazon EMR, and helps you launch your next great project, using the power of Amazon's cloud to solve your biggest data analysis problems.
This book focuses on the core Amazon technologies needed to build an application using AWS and EMR. We chose an application that analyzes log data as our case study throughout this book to demonstrate the power of EMR. Log analysis is a good case study for many of the data analysis problems that organizations face. Computer logfiles contain large amounts of diverse data from different sources and can be mined to gain valuable intelligence. More importantly, logfiles are ubiquitous across computer systems and provide a readily available data set with which you can start solving data analysis problems.
Here is an outline of what this book provides:
- Sample configurations for third-party software
- Step-by-step configurations for AWS
- Sample code
- Best practices
- Gotchas
The intent is not to provide a book that has all the code, configuration, and so on, to be able to plop this application on AWS and start going. Instead, we will provide guidance to help you see how to put together a system or application in a cloud environment and describe core issues you may face in working within AWS in building your own project.
You will get the most out of this book if you have some experience developing or managing applications for the traditional data center, but now want to learn how you can move your applications and data into a cloud environment. You should be comfortable using development toolsets and reviewing code samples, architecture diagrams, and configuration examples to understand the basic concepts covered in this book. We will use the command line and command-line tools in Unix in a number of the examples we present, so it would not hurt to be familiar with navigating the command line and using basic Unix command-line utilities. The examples in this book can be used on Windows systems too, but you may need to load third-party utilities like Cygwin to follow along.
This book will challenge you with new ways of looking at your applications outside of your traditional data center walls, but hopefully it will open your eyes to the possibilities of what you can accomplish when you focus on the problems you are trying to solve rather than the many administrative issues of building out new servers in a private data center.
What Is AWS?
Amazon Web Services is the name of the computing platform started by Amazon in 2006. AWS offers a suite of services to companies and third-party developers to build solutions using the computing and software resources hosted in Amazon's data centers around the globe. Amazon Elastic MapReduce is one of many available AWS services. Developers and companies pay only for the resources they use with the pay-as-you-go model in AWS. This model is changing how many businesses approach new projects and initiatives. New initiatives can get started and scale within AWS as they build a customer base and grow, without much of the usual upfront cost of buying new servers and infrastructure. Using AWS, companies can focus less on building and maintaining data centers and physical infrastructure, and more on innovation and on developing great solutions.
Cloud Services and Their Impacts
Throughout this book, we discuss the many benefits of AWS and cloud services. Although these services provide tremendous value to organizations in many ways, they are not the best option for every project. Running your application in the cloud comes with many of the same impacts and effects as using VMware or other virtualization technology stacks. These impacts can affect application performance and security, and your application in the cloud may be running alongside those of multiple other customers on the same machine. For most applications, the benefits of cloud computing greatly outweigh these impacts. Still, weigh them before starting your own application to make sure it will be a good fit for AWS and cloud computing.
What's in This Book?
This book is organized as follows. Among other topics, we review project cost estimation for AWS and EMR applications and how to perform cost analysis of a project.
Sign Up for AWS
To get started, you need to sign up for AWS. If you are already an AWS user, you can skip this section because you already have access to each of the AWS services used throughout this book. If you are a new user, we will get you started in this section.
To sign up for AWS, go to the Amazon Web Services home page.
Figure 1. Amazon Web Services home page
You will need to provide a phone number to verify that you are setting up a valid account, and you will also need to provide a credit card number to allow Amazon to bill you for the usage of AWS services. We will cover how to estimate, review, and set up billing alerts within AWS later in this book.
After signing up for an AWS account, go to your My Account page to review the services to which you now have access. Figure 2 shows the available services under our account, but your results will likely look somewhat different.
Tip
Remember, there are charges associated with the use of AWS, and a number of the examples and exercises in this book will incur charges to your account. With a new AWS account, there is a
Figure 2. AWS services available after signup
Code Samples in This Book
There are numerous code samples and examples throughout this book. Many of the examples are built using the Java programming language or Hadoop Java libraries. To get the most out of this book and follow along, you need to have a system set up to do Java development and Hadoop Java JAR files to build an application that Amazon EMR can consume and execute. To get ready to develop and build your next application, review
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.