1.1 Statistical Methods in the Context of Scientific Studies
This book discusses statistical methods from an applied point of view. More specifically, we focus on biostatistical methods, which involve applying statistical methods to biological and health-related problems. Each section poses one or more practical problems and then presents the statistical tools related to solving these problems. The materials presented in this book cover the basic and essential steps involved in the analysis of biological and health-related data.
The overall objective of statistical methods is to use empirical evidence to improve our knowledge about the target population, which includes the entire group of individuals or objects (e.g., people, plants, cells) we want to study. As a result, statistics helps us make more informed decisions. We study the population of interest by measuring a set of characteristics (e.g., age, size, weight) that are related to our study. We refer to these characteristics, whose values can change from one member of the population to another, as variables. The objective of many scientific studies is to learn about the variation of a specific characteristic (variable) in the population of interest. For example, we might be interested in the range of normal body temperature among healthy people, or tumor size in breast cancer patients, or growth rate of walnut trees, or BMI (body mass index) in the US population. In many studies, we want to explain or predict how a variable changes with respect to some other variables. That is, we want to identify possible relationships among different variables. For example, we might want to study the effects of different diets on early growth of chicks, or ask how heart rate changes with body temperature, or whether a higher BMI is associated with higher blood pressure, or whether survival of breast cancer patients depends on the type of treatment (mastectomy vs. breast conservation therapy) they receive. We refer to the variables that are the main focus of our study as the response (or target) variables. In contrast, we call variables that explain or predict the variation in the response variable explanatory variables or predictors.
Statistical analysis begins with a scientific problem, usually presented in the form of a hypothesis testing problem or a prediction problem. Hypothesis testing refers to the process of examining a scientific statement that explains a phenomenon. In general, hypothesis testing problems can be regarded as decision problems, where we need to decide whether to accept or reject the proposed explanation for the phenomenon. For example, Mackowiak et al. (1992) asked whether the average normal body temperature is the widely accepted value of 98.6°F. Their hypothesis was that the average normal body temperature is less than the accepted value. A hypothesis might also be expressed in terms of possible relationships between two or more characteristics. For example, we might hypothesize that normal body temperature differs between men and women. This means we believe that body temperature and gender are related. For breast cancer patients, we might hypothesize that mastectomy leads to longer survival than breast conservation therapy (lumpectomy, nodal dissection, and radiation).
Statistical methods are used to evaluate a hypothesis based on empirical data. Using these methods, we can decide whether or not we should reject a hypothesis. Such conclusions in turn help us make more informed decisions with respect to the scientific problem that inspired our study. For example, at the conclusion of their study, Mackowiak et al. argue that the average normal body temperature seems to be lower than previously believed, and that a new upper limit for the range of normal body temperature should be considered. This recommendation has important consequences for deciding the body temperature set point and whether someone has a fever that requires medication. For treating breast cancer patients, several studies have shown that there is no evidence of a difference in survival between mastectomy and breast conservation therapy, at least for patients with less severe cases (e.g., small tumors, node negative). Based on these results, the US National Cancer Institute (NCI) recommended breast conservation operations, especially for the type of patients who participated in these studies (i.e., those with less severe cancer), instead of mastectomy, which was the standard treatment in the 1960s.
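To make the idea of evaluating a hypothesis with data more concrete, the following minimal sketch tests whether an average body temperature is below 98.6°F. It is only an illustration under stated assumptions: the temperatures are simulated (the mean of 98.2 and standard deviation of 0.7 are arbitrary choices, not the data of Mackowiak et al.), and the p-value uses a normal approximation rather than whatever method the original study used.

```python
import math
import random
import statistics

random.seed(1)

# Simulated oral temperatures (degrees F) for 148 hypothetical people.
# The mean 98.2 and SD 0.7 are illustrative assumptions, not study data.
temps = [random.gauss(98.2, 0.7) for _ in range(148)]

n = len(temps)
sample_mean = statistics.mean(temps)
sample_sd = statistics.stdev(temps)
standard_error = sample_sd / math.sqrt(n)

# Test statistic: how many standard errors the sample mean lies from 98.6.
reference_value = 98.6
z = (sample_mean - reference_value) / standard_error

# One-sided p-value for the alternative "population mean < 98.6",
# using a normal approximation, which is reasonable for a sample this large.
p_value = statistics.NormalDist().cdf(z)

print(f"sample mean = {sample_mean:.2f}")
print(f"test statistic = {z:.2f}, one-sided p-value = {p_value:.4g}")
```

A small p-value indicates that a sample mean this far below 98.6°F would be unlikely if the population average really were 98.6°F, which is the kind of evidence that leads us to reject that value.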
In recent years, high-throughput scientific studies without any clear hypothesis have become very common. For example, scientists may examine thousands of genes with respect to their relationship to a disease without hypothesizing that any specific gene is responsible for the disease. In these studies, the objective is to explore a large number of possible factors (e.g., genes) in order to identify a small number of them for follow-up studies, which tend to be more thorough and conducted on a much smaller scale. Therefore, the initial large-scale studies are not designed for hypothesis testing but rather for generating a small number of hypotheses, which can become the focus of follow-up studies and be tested properly in the future.
Scientific problems are sometimes presented as prediction problems. Prediction refers to the process of guessing the value of the response variable using a set of predictors. For example, we might want to predict percent body fat using abdomen circumference, or predict the survival time for cancer patients using tumor size. A large body of literature in biostatistics is devoted to developing statistical methods for predicting the risk of different diseases such as cancer, Alzheimer's disease, diabetes, and Parkinson's disease. Kahn et al. (2009) showed that statistical methods can be used to identify patients with Parkinson's disease by detecting dysphonia (an impairment in the normal production of vocal sounds). Predicting unknown outcomes and future events using statistical methods can help us make better decisions. For example, people at high risk of diabetes might decide to follow preventive measures (e.g., diet).
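As a rough illustration of a prediction problem, the sketch below fits a straight line that predicts percent body fat from abdomen circumference and then uses it for a new individual. Everything here is assumed for the sake of the example: the data are simulated, the relationship (intercept -35, slope 0.6, noise SD 4) is invented, and the helper names `fit_line` and `predict` are hypothetical. Simple least-squares regression is just one of many possible prediction methods.

```python
import random
import statistics

random.seed(2)

# Simulated data for 100 hypothetical people (assumed, not from a real study):
# abdomen circumference in cm and percent body fat.
abdomen = [random.uniform(70, 120) for _ in range(100)]
body_fat = [-35 + 0.6 * x + random.gauss(0, 4) for x in abdomen]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(abdomen, body_fat)


def predict(x_new):
    """Predict the response (percent body fat) for a new predictor value."""
    return intercept + slope * x_new


print(f"fitted line: body fat = {intercept:.1f} + {slope:.2f} * abdomen")
print(f"predicted body fat at 100 cm: {predict(100):.1f}")
```

Here abdomen circumference plays the role of the predictor and percent body fat the role of the response variable.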
1.2 Sampling
To answer our scientific questions, we would, ideally, study the entire population of interest (e.g., all breast cancer patients). However, this is usually impossible for physical, ethical, or economic reasons. For example, to test the hypothesis about the average normal body temperature, it is not feasible to record the temperature of all healthy people. Instead, a sample of representative members is selected from the population. Then, with the methods of statistical inference, the conclusions based on the sample can cautiously be generalized to the whole population. Mackowiak et al. (1992) selected n = 148 people, took their oral temperature, and then drew conclusions about the body temperature of the whole population. To compare the effects of different treatments, one of the breast cancer studies mentioned above included 74 women treated by breast conservation therapy and 67 women treated by mastectomy.
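The relationship between a population and a sample can be sketched with a small simulation. The example below assumes a hypothetical population of 100,000 body temperatures (simulated, with an arbitrary mean and standard deviation) and draws a simple random sample of 148 members. Simple random sampling is an assumption made here for illustration; the text does not say how the participants in the actual studies were selected.

```python
import random
import statistics

random.seed(3)

# Hypothetical population of body temperatures (degrees F).
# In a real study the full population is not observable; it is simulated here
# only so that the sample can be compared against it.
population = [random.gauss(98.2, 0.7) for _ in range(100_000)]

# Simple random sample without replacement; n = 148 matches the sample size
# mentioned in the text, but the values themselves are simulated.
sample = random.sample(population, 148)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
print(f"difference      = {sample_mean - population_mean:+.2f}")
```

The sample mean is typically close to, but not exactly equal to, the population mean, which is why conclusions drawn from a sample are extended to the population only cautiously.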