Data Science on AWS
by Chris Fregly and Antje Barth
Copyright 2021 Antje Barth and Flux Capacitor, LLC. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Acquisitions Editor: Jessica Haberman
- Development Editor: Gary O'Brien
- Production Editor: Katherine Tozer
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
Revision History for the Early Release
- 2020-05-05: First Release
- 2020-06-10: Second Release
- 2020-07-17: Third Release
- 2020-08-03: Fourth Release
- 2020-08-26: Fifth Release
- 2020-10-02: Sixth Release
- 2020-11-30: Seventh Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492079392 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Data Science on AWS, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-07932-3
Chapter 1. Automated Machine Learning
A note for Early Release readers
With Early Release ebooks, you get books in their earliest form (the authors' raw and unedited content as they write) so you can take advantage of these technologies long before the official release of these titles.
This will be the 3rd chapter of the final book. Please note that the GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the author at support@pipeline.ai.
In this chapter, we show how to use the fully managed Amazon AI and machine learning services to offload the undifferentiated heavy lifting of building AI pipelines. We dive deep into two Amazon services for automated machine learning, Amazon SageMaker Autopilot and Amazon Comprehend, both designed for users who want to build powerful predictive models from their datasets with just a few clicks.
What is Automated Machine Learning?
Automated machine learning (AutoML) commonly refers to the effort of automating the typical steps of a machine learning pipeline shown in Figure 1-1.
Figure 1-1. Typical machine learning pipeline.
Machine learning practitioners spend a lot of time building and managing such pipelines. They need to prepare the data and decide on the framework and algorithm to use. Seasoned data scientists use years of experience and intuition to choose the best algorithm for a given dataset. In an iterative process, ML practitioners then try to find the best-performing model configuration, that is, the best set of hyper-parameters. Unfortunately, there is no cheat sheet for choosing the algorithm or its hyper-parameters. We still need experience, intuition, and patience to run many experiments and find the best hyper-parameters for our algorithm and dataset.
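To make the idea of "running many experiments" concrete, here is a minimal sketch of a random search over two hyper-parameters. The scoring function `toy_validation_score`, the parameter names, and the search ranges are all hypothetical stand-ins for "train a model and measure its validation score"; this is not how Autopilot searches, only an illustration of the manual loop that AutoML tries to take off our hands.

```python
import random


def toy_validation_score(learning_rate, max_depth):
    """Hypothetical stand-in for training a model and scoring it on validation data.

    The score peaks at learning_rate=0.1 and max_depth=6; the shape is made up.
    """
    return 1.0 - abs(learning_rate - 0.1) - 0.02 * abs(max_depth - 6)


def random_search(score_fn, n_trials, seed=0):
    """Try n_trials random configurations and return (best_score, best_params)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": rng.uniform(0.01, 0.3),  # assumed search range
            "max_depth": rng.randint(2, 10),          # assumed search range
        }
        score = score_fn(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


best_score, best_params = random_search(toy_validation_score, n_trials=50)
print(best_score, best_params)
```

In a real project each call to the scoring function is a full training run, which is why this loop is expensive when done by hand.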
What if we could just use a service that automatically finds and trains the best model for our dataset and deploys the model to production with a single click? Amazon SageMaker Autopilot offers us exactly this functionality. Autopilot simplifies the model training and tuning process by handling many aspects of the model development life cycle (MDLC), including feature transformation, algorithm selection, model training, tuning, and deployment.
Simply point Autopilot to your dataset, and out comes a set of fully trained and optimized predictive models. Autopilot explores many algorithms and configurations based on many years of AI and machine learning experience at Amazon. Autopilot summarizes the model candidates in a set of generated Jupyter notebooks and Python scripts. You have full control over these generated notebooks and scripts. You can modify them, automate them, and share them with colleagues. You can select the top model candidate based on your desired balance of model accuracy, model size, and prediction latency. Let's dive deeper into the process of automated machine learning with Autopilot.
Automated Machine Learning with Autopilot
Autopilot is the name of Amazon SageMaker's AutoML service. You simply provide your raw data in an S3 bucket, for example in the form of a tabular CSV file, and tell Autopilot which column you want to predict. As the name implies, Autopilot then does the rest automatically.
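As a rough sketch of what "provide your data and name the target column" looks like in code, the helper below assembles the request for SageMaker's `CreateAutoMLJob` API call. The helper name `build_autopilot_request`, the job name, the bucket paths, the column name, and the IAM role ARN are all placeholders chosen for illustration; substitute your own values.

```python
def build_autopilot_request(job_name, input_s3_uri, target_column,
                            output_s3_uri, role_arn):
    """Assemble keyword arguments for the SageMaker CreateAutoMLJob API call."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [
            {
                # Where the raw tabular data (for example, CSV files) lives.
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": input_s3_uri,
                    }
                },
                # The column Autopilot should learn to predict.
                "TargetAttributeName": target_column,
            }
        ],
        # Where Autopilot writes models, notebooks, and other artifacts.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # IAM role that SageMaker assumes to read and write the S3 data.
        "RoleArn": role_arn,
    }


# All values below are placeholders, not real resources.
request = build_autopilot_request(
    job_name="example-autopilot-job",
    input_s3_uri="s3://example-bucket/data/train/",
    target_column="label",
    output_s3_uri="s3://example-bucket/autopilot-output/",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)

# Submitting the job requires boto3 and valid AWS credentials:
#   import boto3
#   boto3.client("sagemaker").create_auto_ml_job(**request)
```

Building the request as a plain dictionary keeps it easy to inspect before anything is sent to AWS.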
Note
S3 is Amazon's Simple Storage Service. S3 provides a simple web service interface that you can use to store and retrieve any amount of data. We will discuss this service in more detail in the next chapter.
Autopilot uses automated machine learning to analyze the data and identify the best algorithm and model configuration for your data, as shown in Figure 1-2.
Figure 1-2. Amazon SageMaker Autopilot.
You can tell Autopilot how many model candidates to explore. In the process of building those candidates, Autopilot tries different algorithms and algorithm settings. Autopilot also applies the data transformations needed to optimize the input for each algorithm. The algorithm, configuration, and data transformation code are then combined into a single ML pipeline definition. Autopilot selects the most promising pipelines and uses them to find the best-performing model. Lastly, Autopilot shares the results in a model leaderboard. You can use the best-performing model as a baseline and optimize it even further, or you can simply deploy the model and start predicting.
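The two ends of this process, capping the number of candidates and reading the leaderboard, can be sketched as follows. The `MaxCandidates` setting and the `FinalAutoMLJobObjectiveMetric` field come from the SageMaker AutoML API; the helper names `completion_criteria` and `leaderboard`, and the candidate names and metric values in `sample_candidates`, are made up for illustration and are not real results.

```python
def completion_criteria(max_candidates):
    """Build the AutoMLJobConfig fragment that caps how many candidates are explored."""
    if max_candidates < 1:
        raise ValueError("max_candidates must be at least 1")
    return {"CompletionCriteria": {"MaxCandidates": max_candidates}}


def leaderboard(candidates, higher_is_better=True):
    """Sort candidate summaries by their final objective metric value."""
    return sorted(
        candidates,
        key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"],
        reverse=higher_is_better,
    )


# Illustrative candidate summaries, shaped like the entries returned by
# list_candidates_for_auto_ml_job; the names and values are invented.
sample_candidates = [
    {"CandidateName": "candidate-a",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy", "Value": 0.81}},
    {"CandidateName": "candidate-b",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy", "Value": 0.87}},
    {"CandidateName": "candidate-c",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy", "Value": 0.84}},
]

for candidate in leaderboard(sample_candidates):
    metric = candidate["FinalAutoMLJobObjectiveMetric"]
    print(candidate["CandidateName"], metric["MetricName"], metric["Value"])
```

The dictionary returned by `completion_criteria` would be passed as the `AutoMLJobConfig` argument when creating the job. For metrics where lower is better, such as an error measure, call `leaderboard` with `higher_is_better=False`.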
Another highlight of Autopilot is that it provides full visibility into each of those steps and shares all code needed to reproduce the results. AWS calls this a white-box approach, and it is a distinctive feature among AutoML services. Let's explore the white-box vs. black-box approach to AutoML a bit further.
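One place this visibility surfaces is the job description, which reports where the generated notebooks were written. The sketch below pulls those locations out of a `describe_auto_ml_job` response. The helper name `generated_notebooks` and the S3 paths in `example_response` are hypothetical; a real response contains many more fields.

```python
def generated_notebooks(describe_response):
    """Extract the S3 locations of the notebooks Autopilot generated for a job."""
    artifacts = describe_response.get("AutoMLJobArtifacts", {})
    return {
        "data_exploration": artifacts.get("DataExplorationNotebookLocation"),
        "candidate_definition": artifacts.get("CandidateDefinitionNotebookLocation"),
    }


# Hypothetical, trimmed-down response; the S3 paths are placeholders.
example_response = {
    "AutoMLJobName": "example-autopilot-job",
    "AutoMLJobArtifacts": {
        "DataExplorationNotebookLocation":
            "s3://example-bucket/autopilot-output/notebooks/data-exploration.ipynb",
        "CandidateDefinitionNotebookLocation":
            "s3://example-bucket/autopilot-output/notebooks/candidate-definition.ipynb",
    },
}

print(generated_notebooks(example_response))

# With boto3 and AWS credentials, the response would come from:
#   import boto3
#   boto3.client("sagemaker").describe_auto_ml_job(AutoMLJobName="example-autopilot-job")
```

The helper returns `None` for a notebook that has not been generated yet, for example while the job is still analyzing the data.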
Understand Autopilot's White-Box Approach to AutoML
In a black-box approach (see the accompanying figure), you don't have control over or visibility into the chosen algorithms, applied data transformations, or hyper-parameter choices. You point the AutoML service to your data and receive a trained model.