Kubeflow for Machine Learning
by Trevor Grant, Holden Karau, Boris Lublinsky, Richard Liu, and Ilan Filonenko
Copyright 2021 Trevor Grant, Holden Karau, Boris Lublinsky, Richard Liu, and Ilan Filonenko. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Acquisitions Editor: Jonathan Hassell
- Development Editor: Amelia Blevins
- Production Editor: Deborah Baker
- Copyeditor: JM Olejarz
- Proofreader: Justin Billing
- Indexer: Sue Klefstad
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Kate Dullea
- November 2020: First Edition
Revision History for the First Edition
- 2020-10-12: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492050124 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Kubeflow for Machine Learning, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-05012-4
[LSI]
Foreword
Occasionally over the years people will ask me what skills are most in demand in tech. Ten years ago I would tell them to study machine learning, which can scale automated decision making in ways previously impossible. However, these days I have a different answer: machine learning engineering.
Even just a few years ago if you knew machine learning and started at an organization, you would likely walk in the door as the only person with that skill set, allowing you to have an outsized impact. However, a side effect of the proliferation of books, tutorials, e-courses, and boot camps (some of which I have written myself) teaching an entire generation of technologists the skills required is that now machine learning is being used across tens of thousands of companies and organizations.
These days a more likely scenario is that, walking into your new job, you find an organization using machine learning locally but unable to deploy it to production, or able to deploy models but unable to manage them effectively. In this setting, the most valuable skill is not the ability to train a model, but rather the ability to manage all those models and deploy them in ways that maximize their impact.
In this volume, Trevor Grant, Holden Karau, Boris Lublinsky, Richard Liu, and Ilan Filonenko have put together what I believe is an important cornerstone in the education of data scientists and machine learning engineers. For the foreseeable future the open source Kubeflow project will be a common tool in an organization's toolkit for training, management, and deployment of machine learning models. This book represents the codification of a lot of knowledge that previously existed scattered around internal documentation, conference presentations, and blog posts.
If you believe, as I do, that machine learning is only as powerful as how we use it, then this book is for you.
Chris Albon
Director of Machine Learning,
The Wikimedia Foundation
https://chrisalbon.com
Preface
We wrote this book for data engineers and data scientists who are building machine learning systems/models they want to move to production. If you've ever had the experience of training an excellent model only to ask yourself how to deploy it into production or keep it up to date once it gets there, this is the book for you. We hope this gives you the tools to replace Untitled_5.ipynb with something that works relatively reliably in production.
This book is not intended to serve as your first introduction to machine learning. The next section points to some resources that may be useful if you are just getting started on your machine learning journey.
Our Assumption About You
This book assumes that you either understand how to train models locally, or are working with someone who does. If neither is true, there are many excellent introductory books on machine learning to get you started, including Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron (O'Reilly).
Our goal is to teach you how to do machine learning in a repeatable way, and how to automate the training and deployment of your models. The difficulty is that this goal spans a wide range of topics, and it is more than reasonable that you may not be intimately familiar with all of them.
Since we can't delve deeply into every topic, we would like to provide you with a short list of our favorite primers on several of the topics you will see covered here:
Python for Data Analysis, 2nd Edition, by Wes McKinney (O'Reilly)
Data Science from Scratch, 2nd Edition, by Joel Grus (O'Reilly)
Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido (O'Reilly)
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron (O'Reilly)
Kubernetes: Up and Running by Brendan Burns et al. (O'Reilly)
Learning Spark by Holden Karau et al. (O'Reilly)
Feature Engineering for Machine Learning by Alice Zheng and Amanda Casari (O'Reilly)
Building Machine Learning Pipelines by Hannes Hapke and Catherine Nelson (O'Reilly)
Apache Mahout: Beyond MapReduce by Dmitriy Lyubimov and Andrew Palumbo (CreateSpace)
R Cookbook, 2nd Edition, by J. D. Long and Paul Teetor (O'Reilly)
Serving Machine Learning Models by Boris Lublinsky (O'Reilly)
Continuous Delivery for Machine Learning by Danilo Sato et al.
Interpretable Machine Learning by Christoph Molnar (self-published)
A Gentle Introduction to Concept Drift in Machine Learning by Jason Brownlee
Model Drift and Ensuring a Healthy Machine Learning Lifecycle by A. Besir Kurtulmus
The Rise of the Model Servers by Alex Vikati
An Overview of Model Explainability in Modern Machine Learning by Rui Aguiar
Machine Learning with Python Cookbook by Chris Albon (O'Reilly)
Machine Learning Flashcards by Chris Albon
Of course, there are many others, but those should get you started. Please don't be overwhelmed by this list; you certainly don't need to be an expert in each of these topics to effectively deploy and manage Kubeflow. In fact, Kubeflow exists to streamline many of these tasks. However, there may be some topic into which you wish to delve deeper, and so this should be thought of as a "getting started" list.