A. Data Science Methods
Artim: Data... haven't you ever just played for fun?
Data: Androids... don't have... fun.
Artim: Why not... ?
Data: No one's ever asked me that before.
MICHAEL WELCH AS ARTIM AND BRENT SPINER AS DATA IN Star Trek: Insurrection (1998)
Doing data science means implementing flexible, scalable, extensible systems for data preparation, analysis, visualization, and modeling. We are empowered by the growth of open source. Whatever the modeling technique or application, there is likely a relevant package, module, or library that someone has written or is thinking of writing. Doing data science means working with R and drawing on other languages as needed.
Data scientists, those working in the field of predictive analytics, speak the language of business: accounting, finance, marketing, and management. They know about information technology, including data structures, algorithms, and object-oriented programming. They understand statistical modeling, machine learning, and mathematical programming.
These are the things that data scientists do:
Finding out about. This is the first thing we do: information search, finding what others have done before, learning from the literature. We draw on the work of academics and practitioners in many fields of study, contributors to predictive analytics and data science.
Preparing text and data. Text is unstructured or partially structured. Data are often messy or missing. We extract features from text. We define measures. We prepare text and data for analysis and modeling.
Looking at data. We do exploratory data analysis, data visualization for the purpose of discovery. We look for groups in data. We find outliers. We identify common dimensions, patterns, and trends.
Predicting how much. We are often asked to predict how many units or dollars of product will be sold, the price of financial securities or real estate. Regression techniques are useful for making these predictions. Prediction is distinct from explanation. We may not know why models work, but we need to know when they work and when to show others how they work. We identify the most critical components of models and focus on the things that make a difference.
Predicting yes or no. Many business problems are classification problems. We use classification methods to predict whether or not a person will buy a product, default on a loan, or access a web page.
Testing it out. We examine models with diagnostic graphics. We see how well a model developed on one data set works on other data sets. We employ a training-and-test regimen with data partitioning, cross-validation, or bootstrap methods.
Playing what-if. We manipulate key variables to see what happens to our predictions. We play what-if games in simulated marketplaces. We employ sensitivity or stress testing of mathematical programming models. We see how values of input variables affect outcomes, payoffs, and predictions. We assess uncertainty about forecasts.
Explaining it all. Data and models help us understand the world. We turn what we have learned into an explanation that others can understand. We present project results in a clear and concise manner.
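As an illustration of "Predicting how much," "Testing it out," and "Playing what-if," the following is a minimal sketch in Python using only the standard library. The data are simulated, and the function names (fit_line, rmse, k_fold_rmse) and the price and units variables are hypothetical choices made for this example. It fits a one-predictor regression, estimates out-of-sample error with five-fold cross-validation, and then asks a simple what-if question about a price change.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(xs, ys, intercept, slope):
    """Root mean squared error of the fitted line on (xs, ys)."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return statistics.fmean(squared) ** 0.5


def k_fold_rmse(xs, ys, k=5, seed=1):
    """Average RMSE on held-out folds: fit on k-1 folds, test on the other."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        scores.append(rmse([xs[i] for i in held_out],
                           [ys[i] for i in held_out],
                           intercept, slope))
    return statistics.fmean(scores)


# Simulated data (an assumption for illustration): units sold fall with price.
rng = random.Random(42)
prices = [rng.uniform(5.0, 15.0) for _ in range(200)]
units = [100.0 - 4.0 * p + rng.gauss(0.0, 3.0) for p in prices]

# Testing it out: cross-validated error, not error on the fitting data.
cv_error = k_fold_rmse(prices, units, k=5)

# Playing what-if: predicted change in units if price moves from 10 to 11.
intercept, slope = fit_line(prices, units)
what_if_change = (intercept + slope * 11.0) - (intercept + slope * 10.0)

print(f"cross-validated RMSE: {cv_error:.2f}")
print(f"predicted change in units for a one-unit price increase: {what_if_change:.2f}")
```

The cross-validated error is the number to report when asked how well the model will work on new data. For a linear model the what-if answer is just the slope, but the same pattern of changing an input and comparing predictions applies to models of any form.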
Data scientists are methodological eclectics, drawing from many scientific disciplines and translating the results of empirical research into words and pictures that management can understand. These presentations benefit from well-constructed data visualizations. In communicating with management, data scientists need to go beyond formulas, numbers, definitions of terms, and the magic of algorithms. Data scientists convert the results of predictive models into simple, straightforward language that others can understand.
Data scientists are knowledge workers par excellence. They are communicators playing a critical role in today's data-intensive world. Data scientists turn data into models and models into plans for action.
The role of data science in business has been discussed by many.
This appendix identifies classes of methods and reviews selected methods in databases and data preparation, statistics, machine learning, data visualization, and text analytics. We provide an overview of these methods and cite relevant sources for further reading.
A.1 Databases and Data Preparation
As noted earlier, there have always been more data than we can use. What is new today is the ease of collecting data and the low cost of storing data. Data come from many sources. There are unstructured text data from systems. There are pixels from sensors and cameras. There are data from mobile phones, tablets, and computers worldwide, located in space and time. Flexible, scalable, distributed systems are needed to accommodate these data.
Relational databases have a row-and-column table structure, similar to a spreadsheet. We access and manipulate these data using structured query language (SQL). Because they are transaction-oriented with enforced data integrity, relational databases provide the foundation for sales order processing and financial accounting systems.
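To make the row-and-column structure and enforced integrity concrete, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The customers and orders tables, their columns, and the sample rows are hypothetical. SQLite stands in for a production relational system only for the purpose of illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE customers ("
    " customer_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(customer_id),"
    " amount REAL NOT NULL CHECK (amount > 0))"
)

# A transaction: both inserts commit together or not at all.
with conn:
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "Acme"), (2, "Globex")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.5)])

# SQL query: total order amount per customer.
totals = conn.execute(
    "SELECT c.name, SUM(o.amount)"
    " FROM customers AS c"
    " JOIN orders AS o ON o.customer_id = c.customer_id"
    " GROUP BY c.name ORDER BY c.name"
).fetchall()

# Enforced integrity: an order for a customer that does not exist is rejected.
try:
    with conn:
        conn.execute("INSERT INTO orders VALUES (13, 99, 20.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(totals)
print("invalid order rejected:", rejected)
```

The rejected insert is the point of the example: the database, not the application, guarantees that every order belongs to a known customer, which is what makes this kind of system suitable for order processing and accounting.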
It is easy to understand why non-relational (NoSQL) databases have received so much attention. Non-relational databases focus on availability and scalability. They may employ key-value, column-oriented, document-oriented, or graph structures. Some are designed for online or real-time applications, where fast response times are key. Others are well suited for massive storage and off-line analysis, with map-reduce providing a key data aggregation tool.
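The map-reduce idea can be sketched in a few lines of plain Python. This is a single-process toy, not a distributed implementation, and the event records and function names (map_phase, shuffle, reduce_phase) are hypothetical. It shows the three steps: emit key-value pairs from each record, group the values by key, and reduce each group to a summary.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical document-style records, such as a document store might hold.
events = [
    {"user": "u1", "page": "/home", "ms": 120},
    {"user": "u2", "page": "/home", "ms": 80},
    {"user": "u1", "page": "/cart", "ms": 200},
    {"user": "u3", "page": "/home", "ms": 100},
    {"user": "u2", "page": "/cart", "ms": 300},
]


def map_phase(record):
    """Emit (key, value) pairs: page -> (response time, count of one)."""
    yield record["page"], (record["ms"], 1)


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Combine values for one key by summing times and counts."""
    return reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)


pairs = (pair for record in events for pair in map_phase(record))
totals = {key: reduce_phase(values) for key, values in shuffle(pairs).items()}
mean_ms = {page: total / count for page, (total, count) in totals.items()}

print(mean_ms)
```

Because the reduce step is associative, partial sums computed on separate machines can be combined in any order, which is what lets this style of aggregation scale across a distributed store.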
Many firms are moving away from internally owned, centralized computing systems and toward distributed cloud-based services. Distributed hardware and software systems, including database systems, can be expanded more easily as the data management needs of organizations grow.
Doing data science means being able to gather data from the full range of database systems, relational and non-relational, commercial and open source. We employ database query and analysis tools, gathering information across distributed systems, collating information, creating contingency tables, and computing indices of relationship across variables of interest. We use information technology and database systems as far as they can take us, and then we do more, applying what we know about statistical inference and the modeling techniques of predictive analytics.
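As one example of creating a contingency table and computing an index of relationship, the sketch below cross-tabulates two categorical variables and computes the chi-square statistic and Cramér's V in plain Python. The region and purchase records are invented for illustration, and Cramér's V is only one of several indices that could be used here.

```python
from collections import Counter
from math import sqrt

# Hypothetical records gathered from a database query: (region, purchased).
records = (
    [("east", "yes")] * 30 + [("east", "no")] * 20
    + [("west", "yes")] * 10 + [("west", "no")] * 40
)

# Contingency table as cell counts keyed by (row, column).
counts = Counter(records)
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})
n = len(records)
row_totals = {r: sum(counts[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(counts[(r, c)] for r in rows) for c in cols}

# Chi-square: squared gaps between observed and expected counts,
# where expected counts assume the two variables are independent.
chi_square = 0.0
for r in rows:
    for c in cols:
        expected = row_totals[r] * col_totals[c] / n
        chi_square += (counts[(r, c)] - expected) ** 2 / expected

# Cramér's V rescales chi-square to the range 0 (no association) to 1.
cramers_v = sqrt(chi_square / (n * (min(len(rows), len(cols)) - 1)))

for r in rows:
    print(r, [counts[(r, c)] for c in cols])
print(f"chi-square = {chi_square:.2f}, Cramér's V = {cramers_v:.2f}")
```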
Regarding analytics, we acknowledge an unwritten code in data science. We do not select only the data we prefer. We do not change data to conform to what we would like to see or expect to see. A two of clubs that destroys the meld is part of the natural variability in the game and must be played with the other cards. We play the hand that is dealt. The hallmarks of science are an appreciation of variability, an understanding of sources of error, and a respect for data. Data science is science.
We are often asked to make a model out of a mess. Management needs answers, and the data are replete with miscoded and missing observations, outliers and values of dubious origin. We use our best judgement in preparing data for analysis, recognizing that many decisions we make are subjective and difficult to justify.
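One way to keep subjective preparation decisions open to review is to record each one rather than silently overwrite values. The sketch below is a minimal illustration in plain Python. The raw values, the set of missing-value codes, the plausible range, and the choice of median imputation are all assumptions made for this example, not recommendations.

```python
import statistics

# Hypothetical raw field: ages as text, with missing codes and dubious values.
raw_ages = ["34", "41", "", "999", "29", "NA", "38", "-1", "52", "45", "31", "140"]

MISSING_CODES = {"", "NA", "999", "-1"}  # assumed codebook conventions
VALID_RANGE = (0, 120)                   # assumed plausible range for age

cleaned = []  # numeric value, or None where the raw value was set aside
log = []      # (position, raw value, reason) for every judgment call

for position, text in enumerate(raw_ages):
    if text in MISSING_CODES:
        cleaned.append(None)
        log.append((position, text, "missing code"))
        continue
    value = float(text)
    if not VALID_RANGE[0] <= value <= VALID_RANGE[1]:
        cleaned.append(None)
        log.append((position, text, "out of range"))
        continue
    cleaned.append(value)

# One possible treatment: impute the median of the observed values.
observed = [v for v in cleaned if v is not None]
median_age = statistics.median(observed)
imputed = [median_age if v is None else v for v in cleaned]

print(f"median of observed ages: {median_age}")
for entry in log:
    print(entry)
```

The raw list is never modified, and the log shows exactly which observations were treated as missing and why, so a reviewer can rerun the analysis under different rules.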