In his earlier life as the very successful author of statistics textbooks, Gary Smith had a knack for creating creative applications that helped students learn important statistical concepts in a fun and intuitive way. In this collaboration with Jay Cordes, Smith takes this same approach mainstream with entertaining example after entertaining example highlighting their central point that not all patterns are meaningful. Smith and Cordes argue that the solution to the dilemma is not more data, but rather more intelligent theorizing about how the world works. Readers should heed their warning, and those who don't should not be surprised if they make an appearance in the next Smith and Cordes book as a cautionary tale!
Shawn Bushway, Senior Policy Researcher, Behavioral and Policy Sciences Department, RAND Corporation
A nice little antidote to big claims about big benefits of Big Data.
Marc Abrahams, Editor of the Annals of Improbable Research, founder of the Ig Nobel Prize ceremony
It's refreshing to see a book on paleo that's about distance running, pattern recognition, and jokes, rather than scarfing steaks, pumping iron, and violence. But, as Gary Smith and Jay Cordes explain and demonstrate, pattern recognition can lead to superficially appealing but ultimately misleading conclusions.
Andrew Gelman, Professor of Statistics and Computer Science, Columbia University
Gary and Jay hit the ball out of the park with The Phantom Pattern Problem: The Mirage of Big Data. Full of fun stories and spurious correlations and patterns, the book excels at its aim: Explaining the hazards of big data, how many can easily be fooled by putting too much trust in blind statistics, as well as highlighting many pitfalls such as overfitting, data mining with out-of-sample data, over-reliance on backtesting, and Hypothesizing after the Results are Known, or HARKing. The text is a home run on the importance of building models guided by human expertise, the critical process of theory before data, and is a welcome addition to any reader's library.
Brian Nelson, CFA, President, Investment Research, Valuentum Securities, Inc.
The legendary economist Ronald Coase once famously said, "If you torture the data long enough, it will confess." As Smith and Cordes demonstrate in spades, the era of Big Data has only exacerbated Coase's assertion. Packed with great examples and solid research, The Phantom Pattern Problem is a cri de coeur to those who believe in the unassailable power of data.
Phil Simon, award-winning author of Too Big to Ignore: The Business Case for Big Data
Using easily understood examples from sports, the stock market, economics, medical testing, and gambling, Smith and Cordes illustrate how data analytics and big data can be seductively misleading. I learned a lot.
Robert J. Marks II, Ph.D., Distinguished Professor of Electrical & Computer Engineering, Baylor University; Director, The Walter Bradley Center for Natural & Artificial Intelligence
Great Clarendon Street, Oxford, OX2 6DP, United Kingdom
Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries
© Gary Smith and Jay Cordes 2020
The moral rights of the authors have been asserted
First Edition published in 2020
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above
You must not circulate this work in any other form and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2020930015
ISBN 9780198864165
ebook ISBN 9780192609694
Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY
Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.
This book is dedicated to our families.
In October 2001, Apple's Steve Jobs unveiled the iPod, a revolutionary hand-held music player with a built-in hard drive: "To have your whole CD library with you at all times is a quantum leap when it comes to music. You can fit your whole music library in your pocket." Despite the $399 price tag, sales were phenomenal, hitting thirty-nine million in 2006, before being eclipsed by the 2007 introduction of the iPhone.
Figure I.1 shows that the explosion of iPod sales in 2005 and 2006 coincided with an increase in the number of murders in the United States. Were people killing each other in order to get their hands on an iPod? Were iPod listeners driven insane by the incessant music and then murdering friends and strangers?
Figure I.1 iPod sales and murders.
When we showed Figure I.1 to a friend, her immediate reaction was, "Surely you jest." We are jesting, but there is a reason why we jest. In 2007, the Urban Institute, a highly regarded Washington think-tank, released a research report on the increase in murders in 2005 and 2006:
The rise in violent offending and the explosion in the sales of iPods and other portable media devices is more than coincidental. We propose that, over the past two years, America may have experienced an iCrime wave.
Unlike us, they were not jesting.
We have all been warned over and over that correlation is not causation, but too often, we ignore the warnings. We have inherited from our distant ancestors an often-irresistible desire to seek patterns and succumb to their allure. We laugh at some obviously nutty correlations; for example, the number of lawyers in Nevada is statistically correlated with the number of people who died after tripping over their own two feet. Yet other correlations, like iPod sales and murders, have a seductive appeal. If esteemed researchers at the Urban Institute can be seduced by fanciful correlations, so can any of us.
Thousands of people didn't kill each other so that they could steal their iPods, and thousands of iPod listeners weren't driven murderously insane. Murders and iPod sales both happened to increase in 2005 and 2006, as did many other things. The serendipitous correlation between murders and iPod sales did not last long. Murders dropped in 2007 and have fallen since then, even though iPod sales continued to grow for a few more years until they were dwarfed by iPhone sales.
The correlation between murders and iPod sales is particularly laughable since it is based on a mere two years of data. Anything that increased (or decreased) steadily during this two-year span will be highly correlated with murders; for example, ice cream sales in the U.S.
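The point about tiny samples can be checked with a few lines of arithmetic. The sketch below is only an illustration: the three series are made-up numbers (not murder, iPod, or ice cream data), and the helper name pearson is our own. It computes the ordinary Pearson correlation coefficient for unrelated series that merely happen to trend over a handful of years.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical, made-up values for three consecutive years (NOT real data).
series_a = [10.0, 14.0, 19.0]     # something that rose steadily
series_b = [200.0, 230.0, 275.0]  # an unrelated thing that also rose
series_c = [50.0, 41.0, 35.0]     # an unrelated thing that fell steadily

r_up = pearson(series_a, series_b)            # two rising series
r_down = pearson(series_a, series_c)          # one rising, one falling
r_two = pearson(series_a[:2], series_c[:2])   # only two observations each

print(f"rising vs rising:  {r_up:.3f}")
print(f"rising vs falling: {r_down:.3f}")
print(f"two points only:   {r_two:.3f}")
```

With only two observations per series, the coefficient is forced to be exactly +1 or -1 whenever both series change at all, so a "perfect" correlation carries no information. Adding a third steadily trending observation barely loosens that grip.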