The 9 Pitfalls of Data Science
Gary Smith and Jay Cordes have a most captivating way and special talent to describe how easy it is to be fooled by the promises of spurious data and by the hype of data science.
Professor John P.A. Ioannidis, Stanford University
Smith and Cordes have produced a remarkably lucid, example-driven text that anybody working near data would do well to read. Though the book is presented as fables and pitfalls, a cogent, scientific approach reveals itself. Managers of data science teams stand to learn a great deal; seasoned data scientists will nod their heads knowingly.
D. Alex Hughes, Adjunct Assistant Professor, UC Berkeley School of Information
The current AI hype can be disorienting, but this refreshing book informs to realign expectations, and provides entertaining and relevant narrative examples that illustrate what can go wrong when you ignore the pitfalls of data science. Responsible data scientists should take heed of Smith and Cordes' guidance, especially when considering using AI in healthcare where transparency about safety, efficacy, and equity is life-saving.
Michael Abramoff, MD, PhD, Founder and CEO of IDx; Watzke Professor of Ophthalmology and Visual Sciences at the University of Iowa
In this era of big data, it's good to have a book that collects ways that big data can lie and mislead. This book provides practical advice for users of big data in a way that's easy to digest and appreciate, and will help guide them so that they can avoid its pitfalls.
Joseph Halpern, Joseph C. Ford Professor of Engineering, Computer Science Department, Cornell University
Increasingly, the world is immersed in data! Gary Smith and Jay Cordes offer up a veritable firehose of fabulous examples of the uses/misuses of all that big data in real life. You will be a more informed citizen and better-armed consumer by reading their book and, it couldn't come at a better time!
Shecky Riemann, math blogger
Great Clarendon Street, Oxford, OX2 6DP, United Kingdom
Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries
© Smith and Cordes 2019
The moral rights of the authors have been asserted
First Edition published in 2019
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above
You must not circulate this work in any other form and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2019934000
ISBN 9780198844396
ebook ISBN 9780192582768
DOI: 10.1093/oso/9780198844396.001.0001
Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY
Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.
A 2012 article in the Harvard Business Review named data scientist the "sexiest job of the 21st century." Governments and businesses are scrambling to hire data scientists, and workers are clamoring to become data scientists, or at least label themselves as such.
Many colleges and universities now offer data science degrees, but their curricula differ wildly. Many businesses have data science divisions, but few restrictions on what they do. Many people say they are data scientists, but may have simply taken some online programming courses and don't know what they don't know. The result is that the analyses produced by data scientists are sometimes spectacular and, other times, disastrous. In the rush to learn technical skills, people often neglect the crucial principles of data science.
Too many would-be data scientists have the misguided belief that we don't need theories, common sense, or wisdom. An all-too-common thought is, "We shouldn't waste time thinking about why something may or may not be true. It's enough to let computers find a pattern and assume that the pattern will persist and make useful predictions." This ill-founded belief underlies misguided projects that have attempted to use Facebook status updates to price auto insurance, Google search queries to predict flu outbreaks, and Twitter tweets to predict stock prices.
Data science is surely revolutionizing our lives, allowing decisions to be based on data rather than lazy thinking, whims, hunches, and prejudices. Unfortunately, data scientists themselves can be plagued by lazy thinking, whims, hunches, and prejudices, and end up fooling themselves and others.
One of Jay's managers understood the difference between good data science and garbage. He categorized people who knew what they were doing as "kids"; those who spouted nonsense were "clowns." One of our goals is to explain the difference between a data scientist and a data clown.
Our criticism of data clowns should not be misinterpreted as a disdain for science or scientists. The Economist once used a jigsaw puzzle and a house of cards as metaphors:
In any complex scientific picture of the world there will be gaps, misperceptions and mistakes. Whether your impression is dominated by the whole or the holes will depend on your attitude to the project at hand. You might say that some see a jigsaw where others see a house of cards. Jigsaw types have in mind an overall picture and are open to bits being taken out, moved around or abandoned should they not fit. Those who see houses of cards think that if any piece is removed, the whole lot falls down.
We are firmly in the jigsaw puzzle camp. When done right, there's no question that science works and enriches our lives immensely.
We are not going to bombard you with equations or bore you with technical tips. We want to propose enduring principles. We offer these principles as nine pitfalls to be avoided. We hope that these principles will not only help data scientists be more effective, but also help everyone distinguish between good data science and rubbish. Our book is loaded with grand successes and epic failures. It highlights winning approaches and warns of common pitfalls. We are confident that after reading it, you will recognize good data science when you see it, know how to avoid being duped by data, and make better, more informed decisions. Whether you want to be an effective creator, interpreter, user, or consumer of data, it is important to know the nine pitfalls of data science.
The 9 Pitfalls of Data Science. Gary Smith and Jay Cordes. Oxford University Press (2019).