1.1 Introduction
This book examines how a response is related to covariates using mathematical models whose unknown parameters we wish to estimate using available information; this endeavor is known as regression analysis. In this first chapter, we will begin in Sect. summarizes the overall message of this book, which is that in many instances, carefully thought out Bayesian and frequentist analyses will provide similar conclusions; however, situations in which one or the other approach may be preferred are also described.
1.2 Model Formulation
In a regression analysis, the following steps may be followed:
1. Formulate a model based on the nature of the data, the subject matter context, and the aims of the data analysis.
2. Examine the mathematical properties of the initial model with respect to candidate inference procedures. This examination will focus on whether specific methods are suited to both the particular context under consideration and the specific questions of interest in the analysis.
3. Consider the computational aspects of the model.
The examination in steps 2 and 3 may suggest that we need to change the model. Historically, the range of model forms that were available for regression modeling was severely limited by computational and, to a lesser extent, mathematical considerations. For example, though generalized linear models contain a flexible range of alternatives to the linear model, a primary motivation for their formulation was ease of fitting and mathematical tractability. Hence, step 3 in particular took precedence over step 1.
Specific aspects of the initial model formulation will now be discussed in more detail. When carrying out a regression analysis, careful consideration of the following issues is vital and in many instances will outweigh in importance the particular model chosen or estimation method used. The interpretation of parameters also depends vitally on the following issues.
1.2.1 Observational Versus Experimental Data
An important first step in data analysis is to determine whether the data are experimental or observational in nature. In an experimental study, the experimenter has control over at least some aspects of the study. For example, units (e.g., patients) may be randomly assigned to covariate groups of interest (e.g., treatment groups). If this randomization is successfully implemented, any differences in response will (in expectation) be due to group assignment only, allowing a causal interpretation of the estimated parameters. The beauty of randomization is that the groups are balanced with respect to all covariates, crucially including those that are unobserved.
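The balancing property of randomization can be illustrated with a small simulation. The following is a minimal sketch under assumed settings, not an analysis from this book: the function name `simulate_randomized_trial`, the linear data-generating model, and all numerical values are hypothetical choices made for illustration. An unobserved covariate `u` strongly affects the response, yet because assignment is independent of `u`, the simple difference in group means should be close to the assumed treatment effect.

```python
import random
import statistics


def simulate_randomized_trial(n=20000, effect=1.0, seed=1):
    """Simulate a randomized study with an unobserved covariate u.

    Assumed (hypothetical) data-generating model:
        y = effect * treated + 2 * u + noise
    """
    rng = random.Random(seed)
    y_by_group = {0: [], 1: []}
    u_by_group = {0: [], 1: []}
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)            # unobserved covariate
        treated = int(rng.random() < 0.5)  # assignment is independent of u
        y = effect * treated + 2.0 * u + rng.gauss(0.0, 1.0)
        y_by_group[treated].append(y)
        u_by_group[treated].append(u)
    return {
        # Difference in mean response between treated and control groups.
        "effect_estimate": statistics.mean(y_by_group[1])
        - statistics.mean(y_by_group[0]),
        # Difference in the mean of u between groups; near zero if balanced.
        "u_imbalance": statistics.mean(u_by_group[1])
        - statistics.mean(u_by_group[0]),
    }


trial = simulate_randomized_trial()
print("estimated effect:", round(trial["effect_estimate"], 3))
print("imbalance in u:  ", round(trial["u_imbalance"], 3))
```

In any single finite sample the balance is only approximate; it is in expectation, over repeated randomizations, that the groups have the same distribution of `u`.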
In an observational study, we never know whether observed differences between the responses of groups of interest are due, at least partially, to other confounding variables related to group membership. If the confounders are measured, then there is some hope for controlling for the variability in response that is not due to group membership, but if the confounders are unobserved variables, then such control is not possible. In the epidemiology and biostatistics literature, this type of discrepancy between the estimate and the true quantity of interest is often described as bias due to confounding. In later chapters, this issue will be examined in detail, since it is a primary motivation for regression modeling. In observational studies, estimated coefficients are traditionally described as associations, and causality is only alluded to more informally via consideration of the combined evidence of different studies and scientific plausibility. We expand upon this discussion in Sect..
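Bias due to confounding, and its removal when the confounder is measured, can be sketched in a few lines. This is an illustrative simulation under assumed settings: the function name `simulate_observational_study`, the binary confounder `c`, the assignment probabilities 0.8 and 0.2, and the effect sizes are all hypothetical. Group membership depends on `c`, which also raises the response, so the unadjusted difference in means mixes the two effects; stratifying on `c` and averaging the within-stratum differences is one simple form of adjustment.

```python
import random
import statistics


def simulate_observational_study(n=40000, effect=1.0, seed=2):
    """Compare a naive and a stratified estimate of a group effect.

    Assumed (hypothetical) model: a binary confounder c raises both the
    chance of being in group 1 and the response itself:
        P(group = 1 | c) = 0.8 if c == 1 else 0.2
        y = effect * group + 2 * c + noise
    """
    rng = random.Random(seed)
    # responses[(c, g)] holds y values for confounder level c and group g.
    responses = {(c, g): [] for c in (0, 1) for g in (0, 1)}
    for _ in range(n):
        c = int(rng.random() < 0.5)
        g = int(rng.random() < (0.8 if c else 0.2))
        y = effect * g + 2.0 * c + rng.gauss(0.0, 1.0)
        responses[(c, g)].append(y)

    # Naive estimate: ignore c and compare group means directly.
    group1 = responses[(0, 1)] + responses[(1, 1)]
    group0 = responses[(0, 0)] + responses[(1, 0)]
    naive = statistics.mean(group1) - statistics.mean(group0)

    # Adjusted estimate: within-stratum differences weighted by stratum size.
    adjusted = 0.0
    for c in (0, 1):
        stratum_size = len(responses[(c, 0)]) + len(responses[(c, 1)])
        diff = statistics.mean(responses[(c, 1)]) - statistics.mean(responses[(c, 0)])
        adjusted += diff * stratum_size / n
    return naive, adjusted


naive, adjusted = simulate_observational_study()
print("naive estimate:   ", round(naive, 3))
print("adjusted estimate:", round(adjusted, 3))
```

Under these assumptions the naive estimate is centered near 2.2 rather than the assumed effect of 1, since the proportion with `c = 1` is 0.8 in one group and 0.2 in the other. The adjustment is only available because `c` was recorded; with an unobserved confounder no such correction is possible.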
Predictive models are more straightforward to build than causal models. To quote Freedman (), "For description and prediction, the numerical values of the individual coefficients fade into the background; it is the whole linear combination on the right-hand side of the equation that matters. For causal inference, it is the individual coefficients that do the trick."
1.2.2 Study Population
Another important step is to determine the population from which the data were collected so that the individuals to whom inferential conclusions apply may be determined. Extrapolation of inference beyond the population providing the data is a risky enterprise.
Throughout this book, we will take a superpopulation view in which probability models are assumed to describe variability with respect to a hypothetical, infinite population. The study population that exists in practice consists of N units, of which n are sampled. To summarize: the superpopulation is hypothetical and infinite, the study population consists of N units, and the sample consists of n of those units.
Inference for the parameters of a superpopulation may be contrasted with a survey sampling perspective in which the focus is upon characteristics of the responses of the N units; in the latter case, a full census (n = N) will obviate the need for statistical analysis.
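The distinction between the two perspectives can be made concrete with a small simulation. This is a sketch under assumed values only: the normal superpopulation with mean `MU` and standard deviation `SIGMA`, the sizes `N = 1000` and `n = 50`, and the function name `compare_targets` are hypothetical. A census reproduces the finite-population mean exactly, but that mean is itself a random draw around the superpopulation mean, so uncertainty about the superpopulation parameter remains.

```python
import random
import statistics

MU, SIGMA = 10.0, 2.0  # hypothetical superpopulation parameters
N, n = 1000, 50        # study population size and sample size


def compare_targets(seed=3):
    """Return the finite-population, sample, and census means."""
    rng = random.Random(seed)
    # The study population: N units generated from the superpopulation.
    study_population = [rng.gauss(MU, SIGMA) for _ in range(N)]
    # A simple random sample of n units, drawn without replacement.
    sample = rng.sample(study_population, n)
    # A census observes every one of the N units.
    census = rng.sample(study_population, N)
    return {
        "finite_population_mean": statistics.mean(study_population),
        "sample_mean": statistics.mean(sample),
        "census_mean": statistics.mean(census),
    }


means = compare_targets()
for name, value in means.items():
    print(name, round(value, 4))
print("superpopulation mean (assumed):", MU)
```

The census mean agrees with the finite-population mean, which is the survey sampling target, while both generally differ from `MU`, the target under the superpopulation view.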
1.2.3 The Sampling Scheme
The data collection procedure has implications for the analysis, in terms of the models that are appropriate, the questions that may be asked, and the inferential approach that may be adopted. In the most straightforward case, the data arise through random sampling from a well-defined population. In other situations, the random samples may be drawn from within covariate-defined groups, which may improve efficiency of estimation by concentrating the sampling in informative groups but may limit the range of questions that can be answered by the data due to the restrictions on the sampling scheme.
In more complex situations, the data may result from outcome-dependent sampling. For example, a case-control study is an outcome-dependent sampling scheme in which the binary response of interest is fixed by design, and the random variables are the covariates sampled within each of the outcome categories (cases and controls). For such data, care is required because the majority of conventional approaches will not produce valid inference, and analysis is carried out most easily using logistic regression models. Similar issues are encountered in the analysis of matched case-control studies, in which cases and controls are matched upon additional (confounder) variables. Bias in parameters of interest will occur if such data are analyzed using methods for unmatched studies, again because the sampling scheme has not been acknowledged. In the case of individually matched cases and controls (in which, for example, for each case a control is picked with the same gender, age, and race), conventional likelihood-based methods are flawed because the number of parameters (including one parameter for each case-control pair) increases with the sample size. This provides an example of the importance of paying attention to the regularity conditions required for valid inference. Conditional likelihood provides a valid inferential approach in this case.
The analysis of data from case-control studies is described in Chap. 7.
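One reason logistic models are convenient for unmatched case-control data is that the odds ratio relating exposure to outcome is unchanged by sampling on the outcome, whereas the proportion of cases (and hence any estimate of baseline risk) is set by the design. The simulation below is a minimal sketch of this point under assumed settings: the function names, the binary exposure `x`, the population size, and the logistic coefficients `beta0 = -4` and `beta1 = 1` are hypothetical. For a single binary covariate, the log odds ratio from the 2 x 2 table coincides with the slope from a logistic regression fit, so no fitting routine is needed here.

```python
import math
import random


def expit(z):
    """Inverse logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratio(pairs):
    """Odds ratio of y = 1 for x = 1 versus x = 0, from (x, y) pairs."""
    counts = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
    for x, y in pairs:
        counts[(x, y)] += 1
    return (counts[(1, 1)] * counts[(0, 0)]) / (counts[(1, 0)] * counts[(0, 1)])


def simulate_case_control(pop_size=200000, beta0=-4.0, beta1=1.0, seed=4):
    """Draw a population, then a case-control sample with one control per case."""
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        x = int(rng.random() < 0.5)                       # binary exposure
        y = int(rng.random() < expit(beta0 + beta1 * x))  # rare binary outcome
        population.append((x, y))
    cases = [unit for unit in population if unit[1] == 1]
    controls = [unit for unit in population if unit[1] == 0]
    # Outcome-dependent sampling: keep all cases, sample as many controls.
    case_control = cases + rng.sample(controls, len(cases))
    return {
        "population_log_or": math.log(odds_ratio(population)),
        "case_control_log_or": math.log(odds_ratio(case_control)),
        "population_case_fraction": len(cases) / pop_size,
        "case_control_case_fraction": len(cases) / len(case_control),
    }


summary = simulate_case_control()
for name, value in summary.items():
    print(name, round(value, 3))
```

Both log odds ratios should be close to the assumed `beta1`, while the case fraction is one half in the case-control sample by construction, far above the population value. This sketch covers the unmatched design only; it does not address individually matched data, where conditional likelihood is the relevant approach.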