ISBN 9783110694406
e-ISBN (PDF) 9783110693348
e-ISBN (EPUB) 9783110693379
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.
© 2021 Walter de Gruyter GmbH, Berlin/Boston
Basic statistical concepts
The ideas and concepts behind bootstrapping are highly accessible to beginners and require astonishingly little knowledge of mathematics and statistics. The main goal of this book is to make it as understandable as possible to a wide audience from different disciplines of the social sciences. While this might disappoint some readers who expected a more rigorous presentation, it should be spelled out clearly that this book is dedicated to people who are not experts in statistics. Therefore, although some math and formulas are unavoidable, they will be introduced in a reader-friendly manner. Realistically, the only requirements to understand the following topics are introductory statistics courses (statistics I and II). For a more thorough, yet accessible, introduction to statistics, I refer you to the work of
1.1 Arithmetic mean
The arithmetic mean (also known as average), shortened to mean for the rest of the book, might be the most central concept of statistics. The mean is the sum of a collection of numbers divided by the count of numbers in the collection. This statistic is often denoted by $\bar{x}$ (read: x bar), or $\mu$ (mu) when we talk about the unknown mean of a population.
(1.1) $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
For example, the mean of the eyes printed on a regular six-sided die equals 3.5.
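The die example can be reproduced in a few lines of Python (the language used for illustrative snippets here; the variable names are my own):

```python
# Mean of the pips on a regular six-sided die.
faces = [1, 2, 3, 4, 5, 6]

# Sum of the collection divided by the count of numbers, as in equation (1.1).
mean = sum(faces) / len(faces)

print(mean)  # 3.5
```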
1.2 Standard deviation
The standard deviation $\sigma$ (sigma) is the most important statistic to characterize how the numbers in a collection are distributed around the mean of that collection. Note that the standard deviation is the square root of the variance and has the same metric as the original variable.
(1.2) $\sigma = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
As an example, consider the following two vectors which have the same mean (5), but different standard deviations.
(1.3) $a = [2, 4, 5, 6, 8]$
(1.4) $b = [1, 3, 5, 7, 9]$
The standard deviation of $a$ amounts to 2.24, while the standard deviation of $b$ is larger and equals 3.16. This makes sense intuitively, as the values of $b$ are spread farther around the mean than the values of $a$. Therefore, the standard deviation is a statistic that gives us a rough impression of how closely the numbers in a set are distributed around the mean. The smaller the standard deviation, the closer the values are to the mean on average.
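The two standard deviations can be verified with Python's standard library, whose `statistics.stdev` divides by $n-1$, just like equation (1.2):

```python
from statistics import mean, stdev

a = [2, 4, 5, 6, 8]
b = [1, 3, 5, 7, 9]

# Both vectors share the same mean ...
print(mean(a), mean(b))    # 5 5

# ... but b is spread farther around it, so its standard deviation is larger.
print(round(stdev(a), 2))  # 2.24
print(round(stdev(b), 2))  # 3.16
```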
1.3 Standard error
While the first two statistics mentioned here make sense to most people, things get a little trickier when one introduces the standard error. The reason for this is probably that many beginners tend to confuse standard deviations and standard errors. Admittedly, both are related; however, they are used to describe different statistical concepts. Keep in mind that the standard deviation is part of descriptive statistics and tells us how single data points within a sample are distributed around the mean. In contrast to that, the standard error is part of inferential statistics. To make matters worse, the standard error is used to describe another statistic and not the sample itself, which means that it has to be computed separately for each statistic of interest. For example, the standard error of the mean (SEM) is commonly used to describe the sampling distribution of the arithmetic mean (don't worry, we will discuss the concept of sampling distributions in greater detail later). However, even if used only rarely, one could easily think of the standard error of the standard deviation or the standard error of the kurtosis as a way to describe these statistics and their respective sampling distributions. Now, to the definition of the SEM:
(1.5) $SE_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
Here, $\sigma$ is the standard deviation of the sample and $n$ is the total number of cases. To understand the SEM, keep in mind that we will work with samples of a larger population most of the time. While we want to make inferences about the entire population, we only have a sample available to us. Therefore, our best point estimate for the mean of the population is the mean of the sample, and the SEM can help us assess how close the mean of the sample is to the mean of the entire population. If you still don't find this explanation satisfactory, the example below should make things clearer.
But first, lets take another moment to appreciate how the standard error relates to our sample, as it is influenced by two things: the standard deviation and the sample size. Firstly, the larger the standard deviation of the sample, the larger the SEM. Secondly, the larger the sample size, the smaller the SEM. This makes sense intuitively: the more data we collect, the more information we have to make an inference and, consequently, the more certain we can be about the estimate of the population mean. Consider the most extreme case, in which the sample includes the entire population: the standard error becomes zero, as one can perfectly estimate the true value of the mean and there is no uncertainty left.
1.4 Confidence intervals
Confidence intervals (CIs) are widely used in science to illustrate the uncertainty of point estimates. CIs can be quite easily calculated using the statistics that we have already discussed. However, there is often some confusion about their real meaning. Over time, most researchers develop an intuitive feeling for them, as they are omnipresent in research papers. However, when these intuitions are put to a test, even experienced scientists are often unable to provide a textbook definition of CIs. Wrong interpretations of the concept can even be found in research papers, as others have pointed out. Personally, I think that the source of all this confusion is that the textbook definition of CIs is quite rigorous and requires several lines of text to spell out their complete meaning. Consequently, many people abandon the strict definition of CIs and expect more of them than they are actually capable of delivering, which sometimes results in incorrect interpretations.
Before calculating CIs, one must set a desired level of confidence. Most common are 95%-CIs and 99%-CIs. The higher the level of confidence, the broader the CI. Therefore, a 99%-CI of a certain point estimate is always broader than a 95%-CI of the same point estimate. To calculate the CI for your statistic of interest $\theta$ (read: theta), the following formula is used:
(1.6) $CI_{95} = \theta \pm 1.96 \cdot SE$,
where SE is the standard error of the statistic of interest. Naturally, people ask where the value of 1.96 comes from. The complete answer is long and refers to the standard normal distribution. For example, the value for a 90%-CI is 1.65, and for a 99%-CI it is 2.58. While the topic is a bit dry, it is worth recapping the assumptions behind this technique, since it also relates to numerous aspects of bootstrapping later on. We discuss this in more detail at the end of this section, see page 14.
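Putting equations (1.5) and (1.6) together, a 95%-CI for a sample mean can be sketched in a few lines of Python (the data are invented purely for illustration):

```python
from math import sqrt
from statistics import mean, stdev

sample = [4, 7, 5, 6, 8, 5, 7, 6, 4, 8]  # hypothetical data

m = mean(sample)                         # point estimate of the population mean
se = stdev(sample) / sqrt(len(sample))   # standard error of the mean, eq. (1.5)

# 95%-CI: the point estimate plus/minus 1.96 standard errors, eq. (1.6)
lower, upper = m - 1.96 * se, m + 1.96 * se
print(round(lower, 2), round(upper, 2))
```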
Table 1.1 Critical values for two-sided confidence intervals.
| Alpha (%) | Confidence level | Critical value |
|-----------|------------------|----------------|
| 10        | 0.90             | 1.644854       |
| 5         | 0.95             | 1.959964       |
| 1         | 0.99             | 2.575829       |
| 0.1       | 0.999            | 3.290527       |
| 0.01      | 0.9999           | 3.890592       |
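The critical values in Table 1.1 need not be memorized: each is a quantile of the standard normal distribution and can be recomputed with Python's standard library (`statistics.NormalDist`, available from Python 3.8):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution (mean 0, sd 1)

for alpha in (0.10, 0.05, 0.01, 0.001):
    # Two-sided interval: put alpha/2 of the probability mass in each tail.
    critical = z.inv_cdf(1 - alpha / 2)
    print(f"{alpha:>6}: {critical:.6f}")
```

For alpha = 0.05, this reproduces the familiar 1.959964, usually rounded to 1.96 as in equation (1.6).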