Practical Synthetic Data Generation
by Khaled El Emam, Lucy Mosquera, and Richard Hoptroff
Copyright 2020 K Sharp Technology Inc., Lucy Mosquera, and Richard Hoptroff. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Acquisitions Editor: Jonathan Hassell
- Development Editor: Corbin Collins
- Production Editor: Christopher Faucher
- Copyeditor: Piper Editorial
- Proofreader: JM Olejarz
- Indexer: Potomac Indexing, LLC
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Jenny Bergman
Revision History for the First Edition
- 2020-05-19: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492072744 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Practical Synthetic Data Generation, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-07274-4
[LSI]
Preface
Interest in synthetic data has been growing rapidly over the last few years. This interest has been driven by two simultaneous trends. The first is the demand for large amounts of data to train and build artificial intelligence and machine learning (AIML) models. The second is recent work that has demonstrated effective methods for generating high-quality synthetic data. Both have resulted in the recognition that synthetic data can solve some difficult problems quite effectively, especially within the AIML community. Companies like NVIDIA, IBM, and Alphabet, as well as agencies such as the US Census Bureau, have adopted different types of data synthesis methodologies to support model building, application development, and data dissemination.
This book provides you with a gentle introduction to methods for the following: generating synthetic data, evaluating the data that has been synthesized, understanding the privacy implications of synthetic data, and implementing synthetic data within your organization. We show how synthetic data can accelerate AIML projects. Some of the problems that can be tackled by having synthetic data would be too costly or dangerous to solve using more traditional methods (e.g., training models controlling autonomous vehicles), or simply cannot be done otherwise. We also explain how to assess the privacy risks from synthetic data, even though they tend to be minimal if synthesis is done properly.
While we want this book to be an introduction, we also want it to be applied. Therefore, we will discuss some of the issues that will be encountered with real data, not curated or cleaned data. Real data is complex and messy, and data synthesis needs to be able to work within that context.
Our intended audience is analytics leaders who are responsible for enabling AIML model development and application within their organizations, as well as data scientists who want to learn how data synthesis can be a useful tool for their work. We will use examples of different types of data synthesis to illustrate the broad applicability of this approach. Our main focus here is on the synthesis of structured data.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic: Indicates new terms, URLs, email addresses, filenames, and file extensions.
O'Reilly Online Learning
For more than 40 years, O'Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O'Reilly's online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O'Reilly and 200+ other publishers. For more information, visit http://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
- O'Reilly Media, Inc.
- 1005 Gravenstein Highway North
- Sebastopol, CA 95472
- 800-998-9938 (in the United States or Canada)
- 707-829-0515 (international or local)
- 707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/practical-synthetic-data-generation.
Email to comment or ask technical questions about this book.
For news and information about our books and courses, visit http://oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://youtube.com/oreillymedia
Acknowledgments
The preparation of this book benefited from a series of interviews with subject matter experts. I would like to thank the following individuals for making themselves available to discuss their experiences and thoughts on the synthetic data market and technology: Fernanda Foertter, Jim Karkanias, Alexei Pozdnoukhov, Rev Lebaradian, John Ashley, Rob Csonger, and Simson Garfinkel.
Rob Csonger and his team provided the content for the section on autonomous vehicles.
Mike Hintze from Hintze Law LLC prepared the legal analysis in the identity disclosure chapter.
We wish to thank Janice Branson for reviewing earlier versions of the manuscript.
Our clients and collaborators, who often give us challenging problems, have been key to driving our innovations in the methods of data synthesis and the implementation of the technology in practice.
Chapter 1. Introducing Synthetic Data Generation
We start this chapter by explaining what synthetic data is and its benefits. Artificial intelligence and machine learning (AIML) projects run in various industries, and the use cases that we include in this chapter are intended to give a flavor of the broad applications of data synthesis. We define an AIML project quite broadly as well, to include, for example, the development of software applications that have AIML components.
Defining Synthetic Data
At a conceptual level, synthetic data is not real data, but data that has been generated from real data and that has the same statistical properties as the real data. This means that if an analyst works with a synthetic dataset, they should get analysis results similar to what they would get with real data. The degree to which a synthetic dataset is an accurate proxy for real data is a measure of