Cathy O'Neil earned a Ph.D. in math from Harvard, was a postdoc at the MIT math department, and was a professor at Barnard College, where she published a number of research papers in arithmetic algebraic geometry. She then chucked it and switched over to the private sector. She worked as a quant for the hedge fund D.E. Shaw in the middle of the credit crisis, and then for RiskMetrics, a risk software company that assesses risk for the holdings of hedge funds and banks. She is currently a data scientist on the New York start-up scene, writes a blog at mathbabe.org, and is involved with Occupy Wall Street.
Rachel Schutt is the Senior Vice President for Data Science at News Corp. She earned a PhD in Statistics from Columbia University, and was a statistician at Google Research for several years. She is an adjunct professor in Columbia's Department of Statistics and a founding member of the Education Committee for the Institute for Data Sciences and Engineering at Columbia. She holds several pending patents based on her work at Google, where she helped build user-facing products by prototyping algorithms and building models to understand user behavior. She has a master's degree in mathematics from NYU, and a master's degree in Engineering-Economic Systems and Operations Research from Stanford University. Her undergraduate degree is in Honors Mathematics from the University of Michigan.
Chapter 1. Introduction: What Is Data Science?
Over the past few years, there's been a lot of hype in the media about data science and Big Data. A reasonable first reaction to all of this might be some combination of skepticism and confusion; indeed we, Cathy and Rachel, had that exact reaction.
And we let ourselves indulge in our bewilderment for a while, first separately, and then, once we met, together over many Wednesday morning breakfasts. But we couldn't get rid of a nagging feeling that there was something real there, perhaps something deep and profound representing a paradigm shift in our culture around data. Perhaps, we considered, it's even a paradigm shift that plays to our strengths. Instead of ignoring it, we decided to explore it more.
But before we go into that, let's first delve into what struck us as confusing and vague; perhaps you've had similar inclinations. After that we'll explain what made us get past our own concerns, to the point where Rachel created a course on data science at Columbia University, Cathy blogged the course, and you're now reading a book based on it.
Big Data and Data Science Hype
Let's get this out of the way right off the bat, because many of you are likely skeptical of data science already for many of the reasons we were. We want to address this up front to let you know: we're right there with you. If you're a skeptic too, it probably means you have something useful to contribute to making data science into a more legitimate field that has the power to have a positive impact on society.
So, what is eyebrow-raising about Big Data and data science? Let's count the ways:
- There's a lack of definitions around the most basic terminology. What is Big Data anyway? What does data science mean? What is the relationship between Big Data and data science? Is data science the science of Big Data? Is data science only the stuff going on in companies like Google and Facebook and tech companies? Why do many people refer to Big Data as crossing disciplines (astronomy, finance, tech, etc.) and to data science as only taking place in tech? Just how big is big? Or is it just a relative term? These terms are so ambiguous, they're well-nigh meaningless.
- There's a distinct lack of respect for the researchers in academia and industry labs who have been working on this kind of stuff for years, and whose work is based on decades (in some cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types. From the way the media describes it, machine learning algorithms were just invented last week and data was never big until Google came along. This is simply not the case. Many of the methods and techniques we're using, and the challenges we're facing now, are part of the evolution of everything that's come before. This doesn't mean that there's not new and exciting stuff going on, but we think it's important to show some basic respect for everything that came before.
- The hype is crazy: people throw around tired phrases straight out of the height of the pre-financial crisis era like "Masters of the Universe" to describe data scientists, and that doesn't bode well. In general, hype masks reality and increases the noise-to-signal ratio. The longer the hype goes on, the more many of us will get turned off by it, and the harder it will be to see what's good underneath it all, if anything.
- Statisticians already feel that they are studying and working on the Science of Data. That's their bread and butter. Maybe you, dear reader, are not a statistician and don't care, but imagine that for the statistician, this feels a little bit like how identity theft might feel for you. Although we will make the case that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the media often describes data science in a way that makes it sound as if it's simply statistics or machine learning in the context of the tech industry.
- People have said to us, "Anything that has to call itself a science isn't." Although there might be truth in there, that doesn't mean that the term "data science" itself represents nothing, but of course what it represents may not be science but more of a craft.
Getting Past the Hype
Rachel's experience going from getting a PhD in statistics to working at Google is a great example to illustrate why we thought, in spite of the aforementioned reasons to be dubious, there might be some meat in the data science sandwich. In her words:
It was clear to me pretty quickly that the stuff I was working on at Google was different than anything I had learned at school when I got my PhD in statistics. This is not to say that my degree was useless; far from it. What I'd learned in school provided a framework and way of thinking that I relied on daily, and much of the actual content provided a solid theoretical and practical foundation necessary to do my work.
But there were also many skills I had to acquire on the job at Google that I hadn't learned in school. Of course, my experience is specific to me in the sense that I had a statistics background and picked up more computation, coding, and visualization skills, as well as domain expertise while at Google. Another person coming in as a computer scientist or a social scientist or a physicist would have different gaps and would fill them in accordingly. But what is important here is that, as individuals, we each had different strengths and gaps, yet we were able to solve problems by putting ourselves together into a data team well-suited to solve the data problems that came our way.
Here's a reasonable response you might have to this story. It's a general truism that, whenever you go from school to a real job, you realize there's a gap between what you learned in school and what you do on the job. In other words, you were simply facing the difference between academic statistics and industry statistics.
We have a couple of replies to this:
- Sure, there is a difference between industry and academia. But does it really have to be that way? Why do many courses in school have to be so intrinsically out of touch with reality?
- Even so, the gap doesn't represent simply a difference between industry statistics and academic statistics. The general experience of data scientists is that, at their job, they have access to a