Practical Weak Supervision
by Wee Hyong Tok, Amit Bahree, and Senja Filipi
Copyright 2022 Wee Hyong Tok, Amit Bahree, and Senja Filipi. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Acquisitions Editor: Rebecca Novack
- Development Editor: Jeff Bleiel
- Production Editor: Kristen Brown
- Copyeditor: nSight, Inc.
- Proofreader: Piper Editorial Consulting, LLC
- Indexer: Ellen Troutman-Zaig
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Kate Dullea
- October 2021: First Edition
Revision History for the First Edition
- 2021-09-30: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492077060 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Practical Weak Supervision, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-07706-0
[LSI]
Foreword by Xuedong Huang
In specific industry scenarios, AI systems can be brittle, and they often require heavy customization with lots of additional data to build machine learning models that solve for those scenarios. However, with diverse data points and the ability to combine these disparate data types, there are new opportunities to pretrain machine learning models that are foundational to the downstream workloads. These models often require much less supervision for customization, allowing for greater speed and agility at lower cost.
Transfer learning is a winning approach when combined with weak supervision, and the foundational model can be strengthened with impressive gains. For example, with very large pretrained speech, NLP, and computer vision models, weak supervision from big data can often produce labels of competent and sufficient quality, allowing one to further compensate for limited data in the downstream task.
Finally, when building AI systems, one key challenge is to understand and act on user engagement signals. These signals are dynamic and weak by their nature. Combining weak supervision and reinforcement learning enables AI systems to learn which actions can solve for which tasks. The result is a high-quality dataset and an optimized model.
Over the last 30 years, I have had the privilege of working with many world-class researchers and engineers in creating what the world now sees as Microsoft's Azure AI services. Amit Bahree, Senja Filipi, and Wee Hyong Tok are some of my amazing colleagues who have dealt with practical AI challenges in serving our customers. In this book, they show techniques for weak supervision that will benefit anyone involved in creating production AI systems.
I hope you enjoy this book as much as I have. Amit, Senja, and Wee Hyong show us a practical approach to help address many of the AI challenges that we face in the industry.
Xuedong Huang
Technical Fellow and Azure AI CTO, Microsoft
Bellevue, WA
September 2021
Foreword by Alex Ratner
The real-world impact of artificial intelligence (AI) has grown substantially in recent years, largely due to the advent of deep learning models. These models are more powerful and push-button than ever before, learning their own powerful, distributed representations directly from raw data with minimal to no manual feature engineering across diverse data and task types. They are also increasingly commoditized and accessible in the open source.
However, deep learning models are also more data hungry than ever, requiring massive, carefully labeled training datasets to function. In a world where the latest and greatest model architectures are downloadable in seconds and the powerful hardware needed to train them is a click away in the cloud, access to high-quality labeled training data has become a major differentiator across both industry and academia. More succinctly, we have left the age of model-centric AI and are entering the era of data-centric AI.
Unfortunately, labeling data at the scale and quality required to train (or supervise) useful AI models tends to be both expensive and time-consuming because it requires manual human input over huge numbers of examples. Person-years of data labeling per model is not uncommon, and when model requirements change (say, to classify medical images as normal, abnormal, or emergent rather than just normal or abnormal), data must often be relabeled from scratch. When organizations are deploying tens, hundreds, or even thousands of ML models that must be constantly iterated upon and retrained to keep up with ever-changing real-world data distributions, hand-labeling simply becomes untenable even for the world's largest organizations.
For the new data-centric AI reality to become practical and productionized, the next generation of AI systems must embody three key principles:
Data as the central interface
Data, and specifically training data, is often the key to success or failure in AI today; it can no longer be treated like a second-class citizen. Data must be at the center of iterative development in AI, and properly supported as the key interface to building and managing successful AI applications.
Data as a programmatic interface
For data to be the center point of AI development, we must move beyond the inefficient status quo of labeling and manipulating it by hand, one data point at a time. Users must be able to develop and manage the training data that defines AI models programmatically, like they would in developing any other type of practical software system.
Data as a collaborative hub
For AI to be data-centric, the subject-matter experts who actually understand the data and how to label it must be first-class citizens of the development process alongside data scientists and ML engineers.
Enter weak supervision. Instead of hand-labeling data, researchers have developed techniques that leverage more efficient, programmatic, and sometimes noisier forms of supervision (for example, rules, heuristics, knowledge bases, and more) to create weakly labeled datasets upon which high-quality AI models can be rapidly built. These weaker forms of supervision can often be defined programmatically and can often be directly developed by subject-matter experts. AI models that used to need person-years of labeled data can now be built using only person-days of effort and managed programmatically in more transparent and adaptable ways, without impacting performance or quality. Organizations large and small have taken note of this fundamental change in how AI models are built and managed; in fact, in the last hour you have almost certainly used a weakly supervised AI system in your day-to-day life. In the world of data-centric AI, weak supervision has become a foundational tool.
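To make the idea concrete, here is a minimal sketch of programmatic supervision: a few hand-written rules (labeling functions) each vote on an example or abstain, and their noisy votes are combined into a single weak label. The task, data, and labeling functions below are hypothetical illustrations, and the simple majority vote stands in for the learned label models used by real systems such as Snorkel.

```python
# Toy spam-detection task; labels and labeling functions are illustrative.
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):
    # Heuristic: messages containing a URL are often spam.
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_short_message(text):
    # Heuristic: very short messages are usually legitimate.
    return HAM if len(text.split()) < 4 else ABSTAIN

def lf_keyword_prize(text):
    # Rule: "prize" is a common spam keyword.
    return SPAM if "prize" in text.lower() else ABSTAIN

def weak_label(text, lfs):
    """Combine labeling-function votes by majority vote,
    ignoring abstentions (real label models instead learn
    and weight each function's accuracy)."""
    votes = [lf(text) for lf in lfs]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_contains_link, lf_short_message, lf_keyword_prize]
print(weak_label("Claim your prize at http://example.com now", lfs))  # 1 (SPAM)
print(weak_label("see you soon", lfs))                                # 0 (HAM)
```

Because the supervision lives in small, inspectable functions rather than in thousands of hand-applied labels, a subject-matter expert can add, fix, or remove a rule and regenerate the entire training set in seconds.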