Moving Hadoop to the Cloud
by Bill Havanki
Copyright 2017 Bill Havanki Jr. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Marie Beaugureau
Production Editor: Colleen Cole
Copyeditor: Kim Cofer
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
Revision History for the First Edition
- 2017-07-05: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491959633 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Moving Hadoop to the Cloud, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-95963-3
[LSI]
Foreword
Apache Hadoop as software is a simple framework that allows for distributed processing of data across many machines. As a technology, Hadoop and the surrounding ecosystem have changed the way we think about data processing at scale. No longer does our data need to fit in the memory of a single machine, nor are we limited by the I/O of a single machine's disks. These are powerful tenets.
So too has cloud computing changed our way of thinking. While the notion of colocating machines in a faraway data center isn't new, allowing users to provision machines on-demand is, and it's changed everything. No longer are developers or architects limited by the processing power installed in on-premise data centers, nor do we need to host small web farms under our desks or in that old storage closet. The pay-as-you-go model has been a boon for ad hoc testing and proof-of-concept efforts, eliminating time spent in purchasing, installation, and setup.
Both Hadoop and cloud computing represent major paradigm shifts, not just in enterprise computing, but affecting many other industries. Much has been written about how these technologies have been used to make advances in retail, public sector, manufacturing, energy, and healthcare, just to name a few. Entire businesses have sprung up as a result, dedicated to the care, feeding, integration, and optimization of these new systems.
It was inevitable that Hadoop workloads would be run on cloud computing providers' infrastructure. The cloud offers incredible flexibility to users, often complementing on-premise solutions, enabling them to use Hadoop in ways simply not possible previously.
Ever the conscientious software engineer, author Bill Havanki has a strong penchant for documenting. He's able to break down complex concepts and explain them in simple terms, without making you feel foolish. Bill writes the kind of documentation that you actually enjoy, the kind you find yourself reading long after you've discovered the solution to your original problem.
Hadoop and cloud computing are powerful and valuable tools, but aren't simple technologies by any means. This stuff is hard. Both have a multitude of configuration options and it's very easy to become overwhelmed. All major cloud providers offer similar services like virtual machines, network attached storage, relational databases, and object storage (all of which can be utilized by Hadoop), but each provider uses different naming conventions and has different capabilities and limitations. For example, some providers require that resource provisioning occurs in a specific order. Some providers create isolated virtual networks for your machines automatically while others require manual creation and assignment. It can be confusing. Whether you're working with Hadoop for the first time or a veteran installing on a cloud provider you've never used before, knowing about the specifics of each environment will save you a lot of time and pain.
Cloud computing appeals to a dizzying array of users running a wide variety of workloads. Most cloud providers' official documentation isn't specific to any particular application (such as Hadoop). Using Hadoop on cloud infrastructure introduces additional architectural issues that need to be considered and addressed. It helps to have a guide to demystify the options specific to Hadoop deployments and to ease you through the setup process on a variety of cloud providers, step by step, providing tips and best practices along the way. This book does precisely that, in a way that I wish had been available when I started working in the cloud computing world.
Whether code or expository prose, Bill's creations are approachable, sensible, and easy to consume. With this book and its author, you're in capable hands for your first foray into moving Hadoop to the Cloud.
Alex Moundalexis,
May 2017
Preface
It's late 2015, and I'm staring at a page of mine on my employer's wiki, trying to think of an OKR. An OKR is something like a performance objective, a goal to accomplish paired with a way to measure if it's been accomplished. While my management chain defines OKRs for the company as a whole and major organizations in it, individuals define their own. We grade ourselves on them, but they do not determine how well we performed because they are meant to be aspirational, not necessary. If you meet all your OKRs, they weren't ambitious enough.
My coworkers had already been impressed with writing that I'd done as part of my job, both in product documentation and in internal presentations, so focusing on a writing task made sense. How aspirational could I get? So I set this down.
Begin writing a technical book! On something! That is, begin working on one myself, or assist someone else in writing one.
Outright ridiculous, I thought, but why not? How's that for aspirational.
Well, I have an excellent manager who is willing to entertain the ridiculous, and so she encouraged me to float the idea to someone else in our company who dealt with things like employees writing books, and he responded.
Here's an idea: there is no book out there about Running Hadoop in the Cloud. Would you have enough material at this point?
I work on a product that aims to make the use of Hadoop clusters in the cloud easier, so it was admittedly an extremely good fit. It didn't take long at all for this ember of an idea to catch, and the end result is the book you are reading right now.
Who This Book Is For
Between the twin subjects of Hadoop and the cloud, there is more than enough to write about. Since there are already plenty of good Hadoop books out there, this book doesn't try to duplicate them, and so you should already be familiar with running Hadoop. The details of configuring Hadoop clusters are only covered as needed to get clusters up and running. You can apply your prior Hadoop knowledge with great effectiveness to clusters in the cloud, and much of what other Hadoop books cover still applies.