Eric Vittinghoff, David V. Glidden, Stephen C. Shiboski, and Charles E. McCulloch. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models, 2nd ed. Statistics for Biology and Health. Springer Science+Business Media, LLC, 2012. DOI: 10.1007/978-1-4614-1353-0_1
1. Introduction
Abstract
The book describes a family of statistical techniques that we call multipredictor regression modeling. This family is useful in situations where there are multiple measured factors (also called predictors, covariates, or independent variables) to be related to a single outcome (also called the response or dependent variable). The applications of these techniques are diverse, including those where we are interested in prediction, isolating the effect of a single predictor, or understanding multiple predictors. We begin with an example.
1.1 Example: Treatment of Back Pain
Korff et al. () studied the success of various approaches to treatment for back pain. Some physicians treat back pain more aggressively, with prescription pain medication and extended bed rest, while others recommend an earlier resumption of activity and manage pain with over-the-counter medications. The investigators classified the aggressiveness of a sample of 44 physicians in treating back pain as low, medium, or high, and then followed 1,071 of their back pain patients for two years. In the analysis, the classification of treatment aggressiveness was related to patient outcomes, including cost, activity limitation, pain intensity, and time to resumption of full activity.
The primary focus of the study was on a single categorical predictor, the aggressiveness of treatment. Thus for a continuous outcome like cost, we might think of an analysis of variance (ANOVA), while for a categorical outcome we might consider a contingency table analysis and a χ²-test. However, these simple analyses would be incorrect at the very least because they would fail to recognize that multiple patients were clustered within physician practice and that there were repeated outcome measures on patients.
Looking beyond the clustering and repeated measures (which are covered in ), what if physicians with more aggressive approaches to back pain also tended to have older patients? If older patients recover more slowly (regardless of treatment), then even if differences in treatment aggressiveness have no effect, the age imbalance would nonetheless make for poorer outcomes in the patients of physicians in the high-aggressiveness category. Hence, it would be misleading to judge the effect of treatment aggressiveness without correcting for the imbalances between the physician groups in patient age and, potentially, other prognostic factors; that is, to judge without controlling for confounding. This can be accomplished using a model which relates study outcomes to age and other prognostic factors as well as the aggressiveness of treatment. In a sense, multipredictor regression analysis allows us to examine the effect of treatment aggressiveness while holding the other factors constant.
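The logic of confounding can be made concrete with a small simulation. The sketch below uses invented data (not the back pain study): physicians with an aggressive style are assigned older patients, and age alone drives recovery time, so the crude treatment comparison shows a spurious "effect" that disappears once we compare patients of similar age. Age stratification is used here as a simple stand-in for regression adjustment; all numbers are illustrative assumptions.

```python
import random

random.seed(1)

# Hypothetical simulated data: aggressive-style physicians see older
# patients (confounding), and age alone determines recovery time.
patients = []
for _ in range(2000):
    aggressive = random.random() < 0.5
    mean_age = 60 if aggressive else 50          # age imbalance between groups
    age = random.gauss(mean_age, 8)
    recovery = 0.5 * age + random.gauss(0, 2)    # no true treatment effect
    patients.append((aggressive, age, recovery))

def mean(xs):
    return sum(xs) / len(xs)

# Crude comparison: confounded, so it shows a large apparent "effect".
crude = (mean([r for a, _, r in patients if a]) -
         mean([r for a, _, r in patients if not a]))

# Adjusted comparison: stratify on age and average the within-stratum
# differences, i.e., compare like with like.
diffs = []
for band in range(30, 80, 5):
    agg = [r for a, g, r in patients if a and band <= g < band + 5]
    non = [r for a, g, r in patients if not a and band <= g < band + 5]
    if len(agg) >= 10 and len(non) >= 10:        # skip sparse strata
        diffs.append(mean(agg) - mean(non))
adjusted = mean(diffs)

print(f"crude difference:    {crude:+.2f}")      # far from zero (confounded)
print(f"adjusted difference: {adjusted:+.2f}")   # near zero (the truth)
```

Multipredictor regression performs the analogous adjustment smoothly, without discarding information by coarsening age into bands.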
1.2 The Family of Multipredictor Regression Methods
Multipredictor regression modeling is a family of methods for relating multiple predictors to an outcome, with each member of the family suitable for a different type of outcome. The cost outcome, for example, is a numerical measure and for our purposes can be taken as continuous. This outcome could be analyzed using the linear regression model, though we also show in why a generalized linear model (GLM) might be a better choice.
Perhaps the simplest outcome in the back pain study is the yes/no indicator of moderate-to-severe activity limitation; a subject's activities are limited by back pain or not. Such a categorical variable is termed binary because it can only take on two values. This type of outcome is analyzed using the logistic regression model, presented in .
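To fix ideas, the logistic model represents the log-odds of a binary outcome as a linear function of the predictors, so a fitted model converts any predictor combination into a probability between 0 and 1. The sketch below illustrates this form with invented coefficients (not estimates from the back pain study):

```python
import math

def predicted_probability(coefs, intercept, x):
    """P(outcome = 1) under a logistic regression model:
    the log-odds are linear in the predictors."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical patient: age 55, high-aggressiveness indicator = 1.
# The coefficients and intercept below are illustrative assumptions.
p = predicted_probability(coefs=[0.04, 0.3], intercept=-3.0, x=[55, 1])
print(f"predicted probability of activity limitation: {p:.3f}")
```

Because the log-odds, not the probability itself, are modeled linearly, the predicted probability is guaranteed to stay between 0 and 1 for any predictor values.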
In contrast, pain intensity was measured on a scale of ten equally spaced values. The variable is numerical and could be treated as continuous, although there were many tied values. Alternatively, it could be analyzed as a categorical variable, with the different values treated as ordered categories, using the proportional-odds or continuation-ratio models, both extensions of the logistic model and briefly covered in .
Another potential outcome might be time to resumption of full activity. This variable is also continuous, but what if a patient had not yet resumed full activity at the end of the follow-up period of two years? Then the time to resumption of full activity would only be known to exceed two years. When outcomes are known only to be greater than a given value (like two years), the variable is said to be right-censored, a common feature of time-to-event data. This type of outcome can be analyzed using the Cox proportional hazards model, the primary topic of .
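The mechanics of right-censoring with a fixed follow-up period can be sketched directly: each subject contributes an observed time together with an indicator of whether the event was actually seen. The times below are invented for illustration.

```python
# Right-censoring with a fixed two-year follow-up: true resumption times
# beyond 2 years are recorded only as "> 2 years" (event not observed).
FOLLOW_UP_YEARS = 2.0

# Hypothetical true times (in years) to resumption of full activity.
true_times = [0.3, 1.1, 1.8, 2.5, 4.0]

observed = []
for t in true_times:
    time = min(t, FOLLOW_UP_YEARS)          # observation stops at follow-up end
    event_observed = t <= FOLLOW_UP_YEARS   # False => right-censored
    observed.append((time, event_observed))

print(observed)
# The true times 2.5 and 4.0 are both recorded as (2.0, False):
# indistinguishable, and known only to exceed two years.
```

Survival methods such as the Cox model use both components of each pair, so censored subjects still contribute the information that their time exceeded the follow-up period.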
Furthermore, in the back pain example, study outcomes were measured on groups, or clusters, of patients with the same physician, and on multiple occasions for each patient. To analyze such hierarchical or longitudinal outcomes, we need to use extensions of the basic family of regression modeling techniques suitable for repeated measures data, described in .
The various regression modeling approaches, while differing in important statistical details, also share important similarities. Numeric, binary, and categorical predictors are accommodated by all members of the family, and are handled in a similar way: on some scale, the systematic part of the outcome is modeled as a linear function of the predictor values and corresponding regression coefficients. The different techniques all yield estimates of these coefficients that summarize the results of the analysis and have important statistical properties in common. This leads to unified methods for selecting predictors and modeling their effects, as well as for making inferences to the population represented in the sample. Finally, all the models can be applied to the same broad classes of practical questions involving multiple predictors.
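The shared structure can be made explicit: every member of the family computes the same linear predictor from the coefficients and predictor values, and the models differ mainly in the scale on which that linear predictor describes the outcome. The coefficient values below are arbitrary illustrations.

```python
import math

def linear_predictor(intercept, coefs, x):
    """The common core of the family: a linear combination of
    predictor values and regression coefficients."""
    return intercept + sum(b * xi for b, xi in zip(coefs, x))

# Hypothetical coefficients and predictor values, for illustration only.
eta = linear_predictor(-1.0, [0.5, -0.2], [2.0, 1.0])

mean_outcome = eta                    # linear regression: the mean itself
probability = 1 / (1 + math.exp(-eta))  # logistic regression: log-odds scale
log_hazard_ratio = eta                # Cox model: log relative-hazard scale

print(f"linear predictor: {eta:.2f}")
print(f"as a probability (logistic scale): {probability:.3f}")
```

This common core is why predictor selection, coding of categorical predictors, and inference about coefficients carry over so directly from one model to the next.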
1.3 Motivation for Multipredictor Regression
Multipredictor regression can be a powerful tool for addressing three important practical questions. These questions, which provide the framework for our discussion of predictor selection in , include prediction, isolating the effect of a single predictor, and understanding multiple predictors.
1.3.1 Prediction
How can we identify which patients with back pain will have moderate-to-severe limitation of activity? Multipredictor regression is a powerful and general tool for using multiple measured predictors to make useful predictions for future observations. In this example, the outcome is binary and thus a multipredictor logistic regression model could be used to estimate the predicted probability of limitation for any possible combination of the observed predictors. These estimates could then be used to classify patients as likely to experience limitation or not. Similarly, if our interest was future costs, a continuous variable, we could use a linear regression model to predict the costs associated with new observations characterized by various values of the predictors. In developing models for this purpose, we need to avoid over-fitting, and to validate their predictiveness in actual practice.
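The prediction workflow for a continuous outcome can be sketched with the simplest member of the family: a one-predictor linear regression fit by least squares, then applied to a new observation. The data below are invented, with a made-up pain score predicting cost, purely to show the fit-then-predict pattern.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Hypothetical training data: baseline pain score vs. cost (in $100s).
pain = [1, 2, 3, 4, 5, 6]
cost = [4, 6, 7, 9, 11, 12]
a, b = fit_line(pain, cost)

def predict(x):
    """Predicted cost for a new patient with pain score x."""
    return a + b * x

print(f"predicted cost for score 7: {predict(7):.2f}")
```

In practice one would evaluate such a model on data not used for fitting; predictions that look accurate on the training data alone may reflect over-fitting rather than genuine predictive ability.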