Chapter 1. The Data Science Lifecycle
A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form (the author's raw and unedited content as they write) so you can take advantage of these technologies long before the official release of these titles.
This will be the first chapter of the final book. Please note that the GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the author at mpotter@oreilly.com.
Data science is a rapidly evolving field. At the time of this writing, people are still trying to pin down exactly what data science is, what data scientists do, and what skills data scientists should have. What we do know, though, is that data science uses a combination of methods and principles from statistics and computer science to work with and draw insights from data. And learning computer science and statistics in combination makes us better data scientists. We also know that any insights we glean need to be interpreted in the context of the problem that we are working on.
This book covers fundamental principles and skills that data scientists need to help make all sorts of important decisions. With both technical skills and conceptual understanding, we can work on data-centric problems to, say, assess whether a vaccine works, filter out fake news automatically, calibrate air quality sensors, and advise analysts on policy changes.
To help you keep track of the bigger picture, we've organized topics around a workflow that we call the data science lifecycle. In this chapter, we introduce this lifecycle. Unlike other data science books that tend to focus on one part of the lifecycle or address only computational or statistical topics, we cover the entire cycle from start to finish and consider both statistical and computational aspects together.
The Stages of the Lifecycle
Figure 1-1 shows the data science lifecycle. It's split into four stages: ask a question, obtain data, understand the data, and understand the world. We've purposefully made these stages broad. In our experience, the mechanics of the lifecycle change frequently. Computer scientists and statisticians continue to build new software packages and programming languages for working with data, and they develop new methodologies that are more specialized. Despite these changes, we've found that almost every data project follows the four steps in this lifecycle. The first step is to ask a question.
Figure 1-1. This diagram of the data science lifecycle shows four high-level steps. The arrows indicate how the steps can lead into one another.
Ask a Question. Asking good questions lies at the heart of data science, and recognizing different kinds of questions guides us in our analyses. We cover four categories of questions: descriptive, exploratory, inferential, and predictive. For example, "How have house prices changed over time?" is descriptive in nature, whereas "Which aspects of houses are related to sale price?" is exploratory. Narrowing down a broad question into one that can be answered with data is a key element of this first stage in the lifecycle. It can involve consulting the people participating in a study, figuring out how to measure something, and designing data collection protocols. A clear and focused research question helps us determine the data we need, the patterns to look for, and how to interpret results. It can also help us refine our question, recognize the type of question being asked, and plan the data collection phase of the lifecycle.
Obtain Data. When data are expensive and hard to gather, and when our aim is to generalize from the data to the world, we try to define precise protocols for collecting the data. Other times, data are cheap and easily accessed. This is especially true for online data sources. For example, Twitter lets people quickly download millions of data points. When data are plentiful, we can start an analysis by obtaining data, exploring them, and then honing a research question. In both situations, most data have missing or unusual values and other anomalies that we need to account for. No matter the source, we need to check the data quality. And, typically, we must manipulate the data before we can analyze them more formally. We may need to modify structure, clean data values, and transform measurements to prepare for analysis.
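As a rough illustration of checking data quality and preparing values for analysis, here is a minimal sketch that uses only Python's standard library. The table of house sales, the column names, and the helper functions load_rows, count_missing, and clean are all hypothetical, invented for this example. Dropping incomplete rows is only one of several reasonable ways to handle missing or implausible values.

```python
import csv
import io

# Hypothetical raw data: a tiny made-up table of house sales, with one missing
# price and one implausible size, standing in for a real data source.
RAW_CSV = """sale_date,price,sqft
2004-03-01,250000,1200
2004-03-09,,1500
2004-04-17,410000,-1
2004-05-02,325000,1800
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def count_missing(rows):
    """Count the empty values in each column."""
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column, value in row.items():
            if value.strip() == "":
                counts[column] += 1
    return counts


def clean(rows):
    """Drop rows with a missing price or a non-positive size, and convert types.

    Dropping rows is an assumption made for this sketch; other analyses
    might instead flag or impute such values.
    """
    cleaned = []
    for row in rows:
        if row["price"].strip() == "":
            continue  # missing price
        sqft = int(row["sqft"])
        if sqft <= 0:
            continue  # a size of zero or less is not plausible
        cleaned.append({
            "sale_date": row["sale_date"],
            "price_thousands": int(row["price"]) / 1000,  # dollars to thousands
            "sqft": sqft,
        })
    return cleaned


rows = load_rows(RAW_CSV)
missing = count_missing(rows)
cleaned = clean(rows)
print("Missing values per column:", missing)
print(len(rows), "rows before cleaning;", len(cleaned), "rows after")
```

Counting the missing values before cleaning keeps a record of how much data each decision removes, which matters later when we interpret results.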
Understand the Data. After obtaining and preparing data, we want to carefully examine them, and exploratory data analysis is often key. In our explorations, we make plots to uncover interesting patterns and summarize the data visually. We also continue to look for problems with the data. As we search for patterns and trends, we use summary statistics and build statistical models, like linear and logistic regression. In our experience, this stage of the lifecycle is highly iterative. Understanding the data can also lead us back to earlier stages in the data science lifecycle. We may find that we need to modify or redo the data cleaning and manipulation, acquire more data to supplement our analysis, or refine our research question given the limitations of the data. The descriptive and exploratory analyses that we carry out in this stage may adequately answer our question, or we may need to go on to the next stage in order to make generalizations beyond our data.
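To make the idea of summary statistics and a simple statistical model concrete, the following sketch computes a mean and median and fits a least-squares line by hand, using only Python's standard library. The house sizes and prices are made-up numbers, and fit_line is a hypothetical helper written for this example. In practice, a data analysis library would typically handle these computations.

```python
from statistics import mean, median

# Hypothetical data: made-up house sizes (square feet) and sale prices
# (thousands of dollars), chosen so the arithmetic is easy to follow.
sqft = [1000, 1500, 2000, 2500]
price = [200, 290, 410, 500]


def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations in x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Summary statistics describe the center of the price distribution
summary = {"mean_price": mean(price), "median_price": median(price)}

# A simple linear model relates price to size
intercept, slope = fit_line(sqft, price)

print("Summary:", summary)
print(f"Fitted line: price = {intercept:.1f} + {slope:.3f} * sqft")
```

The slope estimates how much the price changes, on average, for each additional square foot in these data. Whether that relationship generalizes beyond the data is a question for the next stage of the lifecycle.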