Sharing Big Data Safely
Ted Dunning and Ellen Friedman
Copyright 2015 Ted Dunning and Ellen Friedman. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editors: Holly Bauer and Tim McGovern
- Cover Designer: Randy Comer
- September 2015: First Edition
Revision History for the First Edition
- 2015-09-02: First Release
- 2015-12-11: Second Release
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Sharing Big Data Safely, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Images copyright Ellen Friedman unless otherwise specified in the text.
978-1-491-95212-2
[LSI]
Preface
This is not a book to tell you how to build a security system. It's not about how to lock data down. Instead, we provide solutions for how to share secure data safely.
The benefit of collecting large amounts of many different types of data is now widely understood, and it's increasingly important to keep certain types of data locked down securely in order to protect it against intrusion, leaks, or unauthorized eyes. Big data security techniques are becoming very sophisticated. But how do you keep data secure and yet get access to it when needed, both for people within your organization and for outside experts? The challenge of balancing security with safe sharing of data is the topic of this book.
These suggestions for safely sharing data fall into two groups:
- How to share original data in a controlled way such that each different group using it (such as groups within your organization) sees only part of the whole dataset.
- How to employ synthetic data to let you get help from outside experts without ever showing them original data.
The book explains in a non-technical way how specific techniques for safe data sharing work. The book also reports on real-world use cases in which customized synthetic data has provided an effective solution. You can read Chapters 1-4 and get a complete sense of the story.
In Chapters 5-7, we go on to provide a technical deep-dive into these techniques and use cases and include links to open source code and tips for implementation.
Who Should Use This Book
If you work with sensitive data, personally identifiable information (PII), data of great value to your company, or any data for which you've made promises about disclosure, or if you consult for people with secure data, this book should be of interest to you. The book is intended for a mixed non-technical and technical audience that includes decision makers, group leaders, developers, and data scientists.
Our starting assumption is that you know how to build a secure system and have already done so. The question is: do you know how to safely share data without losing that security?
Chapter 1. So Secure It's Lost
What do buried 17th-century treasure, encoded messages from the Siege of Vicksburg in the US Civil War, tree squirrels, and big data have in common?
Someone buried a massive cache of gemstones, coins, jewelry, and ornate objects under the floor of a cellar in the City of London, and it remained undiscovered and undisturbed there for about 300 years. The date of the burying of this treasure is fixed with considerable confidence over a fairly narrow range of time, between 1640 and 1666. The latter was the year of the Great Fire of London, and the treasure appeared to have been buried before that destructive event. The reason to conclude that the cache was buried after 1640 is the presence of a small, chipped, red intaglio with the emblem of the newly appointed 1st Viscount Stafford, an aristocratic title that had only just been established that year. Many of the contents of the cache appear to be from approximately that time period, late in the time of Shakespeare and Queen Elizabeth I. Others, such as a cameo carving from Egypt, were probably already quite ancient when the owner buried the collection of treasure in the early 17th century.
What this treasure represents and the reason for hiding it in the ground in the heart of the City of London are much less certain than its age. The items were of great value even at the time they were hidden (and are of much greater value today). The location where the treasure was buried was beneath a cellar at what was then 30-32 Cheapside. This spot was in a street of goldsmiths, silversmiths, and other jewelers. Because the collection contains a combination of set and unset jewels and because the location of the hiding place was under a building owned at the time by the Goldsmiths' Company, the most likely explanation is that it was the stock-in-trade of a jeweler operating at that location in London in the early 1600s.
Why did the owner hide it? The owner may have buried it as a part of his normal work, as perhaps many of his fellow jewelers may have done from time to time with their own stock, in order to keep it secure during the regular course of business. In other words, the hidden location may have been functioning as a very inconvenient, primitive safe when something happened to the owner.
Most likely the security that the owner sought by burying his stock was in response to something unusual, a necessity that arose from upheavals such as civil war, plague, or an elevated level of activity by thieves. Perhaps the owner was going to be away for an extended time, and he buried the collection of jewelry to keep it safe for his return. Even if the owner left in order to escape the Great Fire, it's unlikely that that conflagration prevented him from returning to recover the treasure. Very few people died in the fire. In any event, something went wrong with the plan. One assumes that if the location of the valuables were known, someone would have claimed them.
Another possible but less likely explanation is that the hidden valuables were stolen goods, held by a fence who was looking for a buyer. Or these precious items might have been secreted away and hoarded a few at a time by someone employed by (and stealing from) the jeweler, or by someone hiding stock to obscure shady dealings or to evade paying off a debt or taxes. That idea isn't so far-fetched. The collection is known to contain two counterfeit balas rubies that are believed to have been made by the jeweler Thomas Sympson of Cheapside. By 1610, Sympson had already been investigated for alleged fraudulent activities. These counterfeit stones are composed of egg-shaped quartz treated to accept a reddish dye, making them look like a type of large and very valuable ruby that was highly desired at the time. Regardless of the reason the treasure was hidden, something apparently went wrong for it to have remained undiscovered for so many years.