Building an Anonymization Pipeline
by Luk Arbuckle and Khaled El Emam
Copyright 2020 K Sharp Technology, Inc., and Luk Arbuckle. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Acquisitions Editor: Jonathan Hassell
- Development Editor: Melissa Potter
- Production Editor: Christopher Faucher
- Copyeditor: Sonia Saruba
- Proofreader: Charles Roumeliotis
- Indexer: Angela Howard
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- April 2020: First Edition
Revision History for the First Edition
- 2020-04-10: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492053439 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Building an Anonymization Pipeline, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-05343-9
[LSI]
Preface
A few years ago we partnered with O'Reilly to write a book of case studies and methods for anonymizing health data, walking readers through practical methods to produce anonymized data sets in a variety of contexts. Since that time, interest in anonymization, sometimes also called de-identification, has increased due to the growth and use of data, evolving and stricter privacy laws, and expectations of trust by privacy regulators, by private industry, and by citizens from whom data is being collected and processed.
Why We Wrote This Book
The sharing of data for the purposes of data analysis and research can have many benefits. At the same time, concerns and controversies about data ownership and data privacy elicit significant debate. O'Reilly's Data Newsletter on January 2, 2019, recognized that tools for secure and privacy-preserving analytics are a trend on the O'Reilly radar. Thus an idea was born: write a book that provides strategic opportunities to leverage the spectrum of identifiability to disassociate the personal from data in a variety of contexts to enhance privacy while providing useful data. The result is this book, in which we explore end-to-end solutions to reduce the identifiability of data. We draw on various data collection models and use cases that are enabled by real business needs, have been learned from working in some of the most demanding data environments, and are based on practical approaches that have stood the test of time.
The central question we are consistently asked is how to utilize data in a way that protects individual privacy, but still ensures the data is of sufficient granularity that analytics will be useful and meaningful. By incorporating anonymization methods to reduce identifiability, organizations can establish and integrate secure, repeatable anonymization processes into their data flows and analytics in a sustainable manner. We will describe different technologies that reduce identifiability by generalizing, suppressing, or randomizing data, to produce outputs of data or statistics. We will also describe how these technologies fit within the broader theme of risk-based methods to drive the degree of data transformations needed based on the context of data sharing.
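As a rough illustration of the three kinds of transformation named above, the following minimal Python sketch applies generalization, suppression, and randomization to a single made-up record. The field names, the helper functions, the 10-year age bands, and the 7-day shift window are all assumptions chosen for illustration; they are not taken from this book, and the sketch says nothing about how much transformation a given data-sharing context actually requires.

```python
import random


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a band of `width` years."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress(record, fields):
    """Suppression: blank out the values of the listed fields."""
    return {k: (None if k in fields else v) for k, v in record.items()}


def shift_day(day_number, max_shift, rng):
    """Randomization: shift a day index by a random offset within +/- max_shift."""
    return day_number + rng.randint(-max_shift, max_shift)


# Hypothetical record; every value here is invented for the example.
record = {
    "name": "Alice Example",
    "age": 37,
    "postal_code": "K1A 0B1",
    "visit_day": 120,
}

rng = random.Random(0)  # seeded only so the sketch is repeatable

transformed = suppress(record, {"name"})
transformed["age"] = generalize_age(record["age"])
transformed["postal_code"] = record["postal_code"][:3]  # generalize to a prefix
transformed["visit_day"] = shift_day(record["visit_day"], 7, rng)

print(transformed)
```

In this sketch the parameters are picked by hand; under a risk-based approach they would instead be driven by a measured level of identifiability for the context in which the data is shared.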
Note
The purpose of a risk-based approach is to replace an otherwise subjective gut check with a more guided decision-making approach that is scalable and proportionate, resulting in solutions that ensure data is useful while being sufficiently protected. Statistical estimators are used to provide objective support, with greater emphasis placed on empirical evidence to drive decision making.
We have a combined three decades of experience in data privacy, from academic research and authorship to training courses, seminars, and presentations, as well as leading highly skilled teams of researchers, data scientists, and practitioners. We've learned a great deal, and we continue to learn a great deal, about how to put privacy technology into practice. We want to share that knowledge to help drive best practice forward, demonstrating that it is possible to achieve the win-win of data privacy that has been championed by the likes of former privacy commissioner Dr. Ann Cavoukian in her highly influential concept of Privacy by Design. There are many privacy advocates who believe that we can and should treat privacy as a societal good that is encouraged and enforced, and that there are practical ways we can achieve this while meeting the wants and needs of our modern society.
This is, however, a book of strategy, not a book of theory. Consider this book your advisor on how to plan for and use the full spectrum of anonymization tools and processes. The book will guide you in using data for purposes other than those originally intended, helping to ensure that data is not only richer but also that its use is legal and defensible. We will work through different scenarios based on three distinct classes of identifiability of the data involved, and provide details to understand some of the strategic considerations that organizations are struggling with.
Warning
Our aim is to help match privacy considerations to technical solutions. This book is generic, however, touching on a variety of topics relevant to anonymization. Legal interpretations are contextual, and we urge you to consult with your legal and privacy team! Materials presented in this book are for informational purposes only, and not for the purpose of providing legal advice. Okay, now that we've given our disclaimer, we can breathe easy.
Who This Book Was Written For
When conceptualizing this book, we divided the audience into two groups: those who need strategic support (our primary audience) and those who need to understand strategic decisions (our secondary audience). Whether in government or industry, delivering on the promise of data is a functional need. We assume that our audience is ready to do great things, beyond compliance with data privacy and data protection laws. And we assume that they are looking for data access models that enable the safe and responsible use of data.
Primary audience (concerned with crafting a vision and ensuring the successful execution of that vision):
Executive teams concerned with how to make the most of data, e.g., to improve efficiencies, derive new insights, and bring new products to market, all in an effort to make their services broader and better while enhancing the privacy of data subjects. They are more likely to skim this book to nail down their vision and how anonymization fits within it.