1. Introduction
Every day people are faced with questions such as "What route should I take to work today?" "Should I switch to a different cell phone carrier?" "How should I invest my money?" or "Will I get cancer?" These questions indicate our desire to know future events, and we earnestly want to make the best decisions towards that future.
We usually make decisions based on information. In some cases we have tangible, objective data, such as the morning traffic or weather report. Other times we use intuition and experience: "I should avoid the bridge this morning because it usually gets bogged down when it snows," or "I should have a PSA test because my father got prostate cancer." In either case, we are predicting future events given the information and experience we currently have, and we are making decisions based on those predictions.
As information has become more readily available via the internet and media, our desire to use this information to help us make decisions has intensified. And while the human brain can consciously and subconsciously assemble a vast amount of data, it cannot process the even greater amount of easily obtainable, relevant information for the problem at hand. To aid in our decision-making processes, we now turn to tools like Google to filter billions of web pages to find the most appropriate information for our queries, WebMD to diagnose our illnesses based on our symptoms, and E*TRADE to screen thousands of stocks and identify the best investments for our portfolios.
These sites, as well as many others, use tools that take our current information, sift through data looking for patterns that are relevant to our problem, and return answers. The process of developing these kinds of tools has evolved throughout a number of fields such as chemistry, computer science, physics, and statistics and has been called machine learning, artificial intelligence, pattern recognition, data mining, predictive analytics, and knowledge discovery. While each field approaches the problem using different perspectives and tool sets, the ultimate objective is the same: to make an accurate prediction. For this book, we will pool these terms into the commonly used phrase predictive modeling.
Geisser () defines predictive modeling as "the process by which a model is created or chosen to try to best predict the probability of an outcome." We tweak this definition slightly:
Predictive modeling: the process of developing a mathematical tool or model that generates an accurate prediction
Steve Levy of Wired magazine recently wrote of the increasing presence of predictive models (Levy, ): "Examples [of artificial intelligence] can be found everywhere: The Google global machine uses AI to interpret cryptic human queries. Credit card companies use it to track fraud. Netflix uses it to recommend movies to subscribers. And the financial system uses it to handle billions of trades (with only the occasional meltdown)." Examples of the types of questions one would like to predict are:
How many copies will this book sell?
Will this customer move their business to a different company?
How much will my house sell for in the current market?
Does a patient have a specific disease?
Based on past choices, which movies will interest this viewer?
Should I sell this stock?
Which people should we match in our online dating service?
Is an e-mail spam?
Will this patient respond to this therapy?
Insurance companies, as another example, must predict the risks of potential auto, health, and life policy holders. This information is then used to determine if an individual will receive a policy, and if so, at what premium. Like insurance companies, governments also seek to predict risks, but for the purpose of protecting their citizens. Recent examples of governmental predictive models include biometric models for identifying terror suspects and models of fraud detection (Westphal, ). Each of these examples brings us into the predictive modeling world, and we're often not even aware that we've entered it. Predictive models now permeate our existence.
While predictive models guide us towards more satisfying products, better medical treatments, and more profitable investments, they regularly generate inaccurate predictions and provide the wrong answers. For example, most of us have not received an important e-mail due to a predictive model (a.k.a. e-mail filter) that incorrectly identified the message as spam. Similarly, predictive models (a.k.a. medical diagnostic models) misdiagnose diseases, and predictive models (a.k.a. financial algorithms) erroneously buy and sell stocks, predicting profits when, in reality, they find losses. This final example of predictive models gone wrong affected many investors in 2010. Those who follow the stock market are likely familiar with the "flash crash" on May 6, 2010, in which the market rapidly lost more than 600 points, then immediately regained those points. After months of investigation, the Commodity Futures Trading Commission and the Securities and Exchange Commission identified an erroneous algorithmic model as the cause of the crash (U.S. Commodity Futures Trading Commission and U.S. Securities & Exchange Commission, ).
Stemming in part from the flash crash and other failures of predictive models, Rodriguez () writes, "Predictive modeling, the process by which a model is created or chosen to try to best predict the probability of an outcome has lost credibility as a forecasting tool." He hypothesizes that predictive models regularly fail because they do not account for complex variables such as human behavior. Indeed, our abilities to predict or make decisions are constrained by our present and past knowledge and are affected by factors that we have not considered. These realities are limits of any model, yet these realities should not prevent us from seeking to improve our process and build better models.
There are a number of common reasons why predictive models fail, and we address each of these in subsequent chapters. The common culprits include (1) inadequate pre-processing of the data, (2) inadequate model validation, (3) unjustified extrapolation (e.g., application of the model to data that reside in a space the model has never seen), or, most importantly, (4) over-fitting the model to the existing data. Furthermore, predictive modelers often explore relatively few models when searching for predictive relationships. This is usually due either to the modelers' preference for, knowledge of, or expertise in only a few models, or to the lack of available software that would enable them to explore a wide range of techniques.
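To make these culprits concrete, the short sketch below illustrates (2), (3), and (4). It is our illustration, not an example from the book: the simulated data and the use of NumPy's polyfit are assumptions chosen for brevity.

```python
# A minimal sketch of culprits (2)-(4); the simulated data and the use
# of NumPy's polyfit are illustrative assumptions, not the book's code.
import numpy as np

def rmse(coef, x, y):
    """Root mean squared error of a polynomial fit evaluated on (x, y)."""
    return np.sqrt(np.mean((np.polyval(coef, x) - y) ** 2))

rng = np.random.default_rng(0)

# Simulate a simple truth, y = x + noise, with a small training set
# and a held-out test set drawn from the same range.
x_train = rng.uniform(0, 1, 12)
y_train = x_train + rng.normal(0, 0.1, 12)
x_test = rng.uniform(0, 1, 100)
y_test = x_test + rng.normal(0, 0.1, 100)

for degree in (1, 9):
    coef = np.polyfit(x_train, y_train, degree)
    # Over-fitting: the degree-9 polynomial nearly interpolates the
    # training points, yet validation on held-out data shows it
    # generalizes worse than the plain degree-1 fit.
    print(f"degree {degree}: train RMSE {rmse(coef, x_train, y_train):.3f}, "
          f"test RMSE {rmse(coef, x_test, y_test):.3f}")
    # Unjustified extrapolation: a prediction at x = 2, far outside the
    # training range, is wildly unstable for the flexible fit.
    print(f"degree {degree}: prediction at x = 2 is {np.polyval(coef, 2):+.1f}")
```

Here, validation on held-out data exposes the over-fit model that the training error alone would have endorsed.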
This book endeavors to help predictive modelers produce reliable, trustworthy models by providing a step-by-step guide to the model building process and intuitive knowledge of a wide range of common models. The objectives of this book are to provide:
Foundational principles for building predictive models
Intuitive explanations of many commonly used predictive modeling methods for both classification and regression problems
Principles and steps for validating a predictive model
Computer code to perform the necessary foundational work to build and validate predictive models (a flavor of which is sketched below)
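As a taste of that last objective, the following is a hypothetical sketch (ours, not code from the book; the simulated data and the NumPy-based least squares fit are assumptions) of the foundational steps: hold out a test set, pre-process using training-set statistics only, fit a model, and validate it on data the model has never seen.

```python
# A hypothetical sketch of foundational model building and validation;
# the simulated data and least squares model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

# Simulated data standing in for a real data set: three predictors, one outcome.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, 200)

# Hold out a test set before any modeling decisions are made.
idx = rng.permutation(200)
train, test = idx[:150], idx[150:]

# Pre-process: center and scale using training-set statistics only.
mu, sigma = X[train].mean(axis=0), X[train].std(axis=0)
Z_train = (X[train] - mu) / sigma
Z_test = (X[test] - mu) / sigma  # reuse the training statistics

# Fit an ordinary least squares model (with an intercept) on the training set.
coef, *_ = np.linalg.lstsq(np.c_[np.ones(len(train)), Z_train], y[train], rcond=None)

# Validate: estimate predictive performance on the held-out set.
pred = np.c_[np.ones(len(test)), Z_test] @ coef
print(f"held-out RMSE: {np.sqrt(np.mean((pred - y[test]) ** 2)):.3f}")
```

Note that the centering and scaling parameters come from the training set alone; computing them on all of the data would leak information about the test set into the model and undermine the validation.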