Practical Machine Learning with H2O
by Darren Cook
Copyright 2017 Darren Cook. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editor: Nicole Tache
- Production Editor: Colleen Lobner
- Copyeditor: Kim Cofer
- Proofreader: Charles Roumeliotis
- Indexer: WordCo Indexing Services, Inc.
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- December 2016: First Edition
Revision History for the First Edition
- 2016-12-01: First Release
- 2017-01-06: Second Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491964606 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Practical Machine Learning with H2O, the cover image of a crayfish, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-96460-6
[LSI]
Preface
It feels like machine learning has finally come of age. It has been a long childhood, stretching back to the 1950s and the first program to learn from experience (playing checkers), as well as the first neural networks. We've been told so many times by AI researchers that the breakthrough is "just around the corner" that we long ago stopped listening. But maybe they were on the right track all along, maybe an idea just needs one more order of magnitude of processing power, or a slight algorithmic tweak, to go from being pathetic and pointless to productive and profitable.
In the early '90s, neural nets were being hailed as the new AI breakthrough. I did some experiments applying them to computer go, but they were truly awful when compared to the (still quite mediocre) results I could get using a mix of domain-specific knowledge engineering and heavily pruned tree searches. And the ability to scale looked poor, too. When, 20 years later, I heard talk of this new and shiny "deep learning" thing that was giving impressive results in computer go, I was confused how this was different from the neural nets I'd rejected all those years earlier. "Not that much" was the answer; sometimes you just need more processing power (five or six orders of magnitude in this case) for an algorithm to bear fruit.
H2O is software for machine learning and data analysis. Wanting to see what other magic deep learning could perform was what personally led me to H2O (though it does more than that: trees, linear models, unsupervised learning, etc.), and I was immediately impressed. It ticks all the boxes:
With the high-quality team that H2O.ai (the company behind H2O) has put together, it is only going to get better. There is the attitude of not just "How do we get this to work?" but "How do we get this to work efficiently at big data scale?" permeating the whole development.
If machine learning has come of age, H2O looks to be not just an economical family car for it, but simultaneously the large load delivery truck for it. Stretching my vehicle analogy a bit further, this book will show you not just what the dashboard controls do, but also the best way to use them to get from A to B. It will be as practical as possible, with only the bare minimum explanation of the maths or theory behind the learning algorithms.
Of course H2O is not perfect; here are a few issues I've noticed people mutter about. There is no GPU support (which could make deep learning, in particular, quicker). The cluster support is all 'bout that bass (big data), no treble (complex but relatively small data), so for the latter you may be limited to a single fast machine with lots of cores. There is also no high availability (HA) for clusters. H2O compiles to Java; it is well-optimized and the H2O algorithms are known for their speed but, theoretically at least, carefully optimized C++ could be quicker. There is no SVM algorithm. Finally, it tries to support numerous platforms, so each has some rough edges, and development is sometimes slowed by trying to keep them all in sync.
In other words, and wringing the last bit of life out of my car analogy: a Formula 1 car might beat it on the straights, and it isn't yet available in yellow.
Who Uses It and Why?
A number of well-known companies are using H2O for their big data processing, and the website claims that over 5,000 organizations currently use it. The company behind it, H2O.ai, has over 80 staff, more than half of whom are developers.
But those are stats to impress your boss, not a no-nonsense developer. For R and Python developers, who already feel they have all the machine learning libraries they need, the primary things H2O brings are ease of use and efficient scalability to data sets too large to fit in the memory of your largest machine. For SparkML users, who feel they already have that, H2O algorithms are fewer in number but apparently significantly quicker. As a bonus, the intelligent defaults mean your code is very compact and clear to read: you can literally get a well-tuned, state-of-the-art, deep learning model as a one-liner. One of the goals of this book was to show you how to tune the models, but as we will see, sometimes I've just had to give up and say "I can't beat the defaults."
About You
To bring this book in at under a thousand pages, I've taken some liberties. I am assuming you know either R or Python. Advanced language features are not used, so competence in any programming language should be enough to follow along, but the examples throughout the book are only in one of those two languages. Python users would benefit from being familiar with pandas, not least because it will make all your data science easier.
I'm also assuming a bit of mental flexibility: to save repeating every example twice, I'm hoping R users can grasp what is going on in a Python example, and Python users can grasp an R example. These slides on Python for R users are a good start (for R users too).
Some experience with manipulating data is assumed, even if just using spreadsheet software or SQL tables. And I assume you have a fair idea of what machine learning and AI are, and how they are being used more and more in the infrastructure that runs our society. Maybe you are reading this book because you want to be part of that and make sure the transformations to come are done ethically and for the good of everyone, whatever their race, sex, nationality, or beliefs. If so, I salute you.