DARK
DATA
DAVID J. HAND
PRINCETON UNIVERSITY PRESS
PRINCETON AND OXFORD
Copyright 2020 by David J. Hand
Requests for permission to reproduce material from this work should be sent to
Published by Princeton University Press
41 William Street, Princeton, New Jersey 08540
6 Oxford Street, Woodstock, Oxfordshire OX20 1TR
press.princeton.edu
All Rights Reserved
Library of Congress Cataloging-in-Publication Data
Names: Hand, D. J. (David J.), 1950- author.
Title: Dark data : why what you don't know matters / David J. Hand.
Description: Princeton : Princeton University Press, [2020] | Includes
bibliographical references and index.
Identifiers: LCCN 2019022971 (print) | LCCN 2019022972 (ebook) |
ISBN 9780691182377 (hardback) | ISBN 9780691198859 (ebook)
Subjects: LCSH: Missing observations (Statistics) | Big data.
Classification: LCC QA276 .H3178 2020 (print) | LCC QA276 (ebook) | DDC
519.5--dc23
LC record available at https://lccn.loc.gov/2019022971
LC ebook record available at https://lccn.loc.gov/2019022972
Version 1.0
British Library Cataloging-in-Publication Data is available
Editorial: Ingrid Gnerlich and Arthur Werneck
Production Editorial: Karen Carter
Text Design: Leslie Flis
Jacket/Cover Design: Jason Alejandro
To Shelley
PREFACE
This book is unusual. Most books about data, be they popular books about big data, open data, or data science, or technical statistical books about how to analyze data, are about the data you have. They are about the data sitting in folders on your computer, in files on your desk, or as records in your notebook. In contrast, this book is about data you don't have: perhaps data you wish you had, or hoped to have, or thought you had, but nonetheless data you don't have. I argue, and illustrate with many examples, that the missing data are at least as important as the data you do have. The data you cannot see have the potential to mislead you, sometimes even with catastrophic consequences, as we shall see. I show how and why this can happen. But I also show how it can be avoided, and what you should look for to sidestep such disasters. And then, perhaps surprisingly, once we have seen how dark data arise and can cause such problems, I show how you can use the dark data perspective to flip the conventional way of looking at data analysis on its head: how hiding data can, if you are clever enough, lead to deeper understanding, better decisions, and better choice of actions.
The question of whether the word "data" should be treated as singular or plural has been a fraught one. In the past it was typically treated as plural, but language evolves, and many people now treat it as singular. In this book I have tried to treat "data" as plural except in those instances where to do so sounded ugly to my ears. Since beauty is said to be in the eye of the beholder, it is entirely possible that my perception may not match yours.
My own understanding of dark data grew slowly throughout my career, and I owe a huge debt of gratitude to the many people who brought me challenges which I slowly realized were dark data problems and who worked with me on developing ways to cope with them. These problems ranged over medical research, the pharmaceutical industry, government and social policy, the financial sector, manufacturing, and other domains. No area is free from the risks of dark data.
Particular people who kindly sacrificed their time to read drafts of the book include Christoforos Anagnostopoulos, Neil Channon, Niall Adams, and three anonymous publisher's readers. They prevented me from making too many embarrassing mistakes. Peter Tallack, my agent, has been hugely supportive in helping me find the ideal publisher for this work, as well as graciously advising me and steering the emphasis and direction of the book. My editor at Princeton University Press, Ingrid Gnerlich, has been a wise and valuable guide in helping me beat my draft into shape. Finally, I am especially grateful to my wife, Professor Shelley Channon, for her thoughtful critique of multiple drafts. The book is significantly improved because of her input.
Imperial College, London
PART 1
DARK DATA
THEIR ORIGINS AND CONSEQUENCES
Chapter 1
DARK DATA
What We Don't See Shapes Our World
The Ghost of Data
First, a joke.
Walking along the road the other day, I came across an elderly man putting small heaps of powder at intervals of about 50 feet down the center of the road. I asked him what he was doing. "It's elephant powder," he said. "They can't stand it, so it keeps them away."
"But there are no elephants here," I said.
"Exactly!" he replied. "It's wonderfully effective."
Now, on to something much more serious.
Measles kills nearly 100,000 people each year. One in 500 people who get the disease die from complications, and others suffer permanent hearing loss or brain damage. Fortunately, it's rare in the United States; for example, only 99 cases were reported in 1999. But a measles outbreak led Washington to declare a statewide emergency in January 2019, and other states also reported dramatically increased numbers of cases. From 1 January 2016 through the end of March 2017, Romania reported more than 4,000 cases and 18 deaths from measles.
Measles is a particularly pernicious disease, spreading undetected because the symptoms do not become apparent until some weeks after you contract it. It slips under the radar, and you have it before you even know that it's around.
But the disease is also preventable. A simple vaccination can immunize you against the risk of contracting measles. And, indeed, national immunization programs of the kind carried out in the United States have been immensely successful, so successful in fact that most parents in countries which carry out such programs have never seen or experienced the terrible consequences of such preventable diseases.
So, when parents are advised to vaccinate their children against a disease they have neither seen nor heard of any of their friends or neighbors having, a disease which the Centers for Disease Control and Prevention announced was no longer endemic in the United States, they naturally take the advice with a pinch of salt.
Vaccinate against something which is not there? It's like using the elephant powder.
Except that, unlike the elephants, the risks are still there, just as real as ever. It's merely that the information and data these parents need to make decisions are missing, so that the risks have become invisible.
My general term for the various kinds of missing data is dark data. Dark data are concealed from us, and that very fact means we are at risk of misunderstanding, of drawing incorrect conclusions, and of making poor decisions. In short, our ignorance means we get things wrong.
The term dark data arises by analogy with the dark matter of physics. About 27 percent of the universe consists of this mysterious substance, which doesn't interact with light or other electromagnetic radiation and so can't be seen. Since dark matter can't be seen, astronomers were long unaware of its existence. But then observations of the rotations of galaxies revealed that the more distant stars were not moving more slowly than stars nearer the center, contradicting what we would have expected from our understanding of gravity. This rotational anomaly can be explained by supposing that galaxies have more mass than appears to be the case judging from the stars and other objects we can see through our telescopes. Since we can't see this extra mass, it has been called dark matter. And it can be significant (I almost said it can "matter"): our home galaxy, the Milky Way, is estimated to have some ten times as much dark matter as ordinary matter.