1.1 Introduction
When we first meet Statistics, we encounter random quantities (random variables, in probability language, or variates, in statistical language) one at a time. This suffices for a first course. Soon, however, we need to handle more than one random quantity at a time, and we then have to think about how they are related to each other.
Let us take the simplest case first, that of two variables, and consider the two extreme cases.
At one extreme, the two variables may be independent (unrelated). For instance, one might result from laboratory data taken last week, the other might come from old trade statistics. The two are unrelated. Each is uninformative about the other. They are best looked at separately. What we have here are really two one-dimensional problems, rather than one two-dimensional problem, and it is best to consider matters in these terms.
At the other extreme, the two variables may be essentially the same, in that each is completely informative about the other. For example, in the Centigrade (Celsius) temperature scale, the freezing point of water is 0 and the boiling point is 100, while in the Fahrenheit scale, freezing point is 32 and boiling point is 212 (these bizarre choices are a result of Fahrenheit choosing as his origin of temperature the lowest temperature he could achieve in the laboratory, and recognising that the body is so sensitive to temperature that a hundredth of the freezing-boiling range is inconveniently large as a unit for everyday, non-scientific use, unless one resorts to decimals). The transformation formulae are accordingly
$$F = \tfrac{9}{5}C + 32, \qquad C = \tfrac{5}{9}(F - 32).$$
While both scales remain in use, this is purely for convenience. To look at temperature in both Centigrade and Fahrenheit together for scientific purposes would be silly. Each is completely informative about the other. A plot of one against the other would lie exactly on a straight line. While apparently a two-dimensional problem, this would really be only one one-dimensional problem, and so best considered as such.
We are left with the typical and important case: two-dimensional data, $(x_1, y_1), \ldots, (x_n, y_n)$ say, where each of the $x$ and $y$ variables is partially but not completely informative about the other.
Usually, our interest is in one variable, $y$ say, and we are interested in what knowledge of the other, $x$, tells us about $y$. We then call $y$ the response variable, and $x$ the explanatory variable. We know more about $y$ knowing $x$ than not knowing $x$; thus knowledge of $x$ explains, or accounts for, part but not all of the variability we see in $y$. Another name for $x$ is the predictor variable: we may wish to use $x$ to predict $y$ (the prediction will be an uncertain one, to be sure, but better than nothing: there is information content in $x$ about $y$, and we want to use this information). A third name for $x$ is the regressor, or regressor variable; we will turn to the reason for this name below. It accounts for why the whole subject is called regression.
The first thing to do with any data set is to look at it. We subject it to exploratory data analysis (EDA); in particular, we plot the graph of the $n$ data points $(x_i, y_i)$. We can do this by hand, or by using a statistical package: Minitab, for instance, using the command Regression, or S-Plus/R using the command lm (for linear model; see below).
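For orientation, a minimal sketch of this first step in R (the data here are simulated purely for illustration; they are not taken from the text):

```r
## Illustrative only: simulate roughly linear data, plot it, and fit a line.
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 2)   # linear trend perturbed by random error
plot(x, y)                           # exploratory plot of the n data points
fit <- lm(y ~ x)                     # linear model fitted by least squares
abline(fit)                          # superimpose the fitted line on the plot
```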
Suppose that what we observe is a scatter plot that seems roughly linear. That is, there seems to be a systematic component, which is linear (or roughly so, to a first approximation, say), and an error component, which we think of as perturbing this in a random or unpredictable way. Our job is to fit a line through the data, that is, to estimate the systematic linear component.
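In symbols (anticipating notation made precise below), the model being entertained is of the form
$$y_i = a + b x_i + \epsilon_i \qquad (i = 1, \ldots, n),$$
with $a + b x_i$ the systematic linear component and $\epsilon_i$ the random error in the $i$th observation.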
For illustration, we recall the first case in which most of us meet such a task: experimental verification of Ohm's Law (G. S. Ohm (1787-1854), in 1826). When electric current is passed through a conducting wire, the current (in amps) is proportional to the applied potential difference or voltage (in volts), the constant of proportionality being the inverse of the resistance of the wire (in ohms). One measures the current observed for a variety of voltages (the more the better). One then attempts to fit a line through the data, observing with dismay that, because of experimental error, no three of the data points are exactly collinear. A typical schoolboy solution is to use a perspex ruler and fit by eye. Clearly a more systematic procedure is needed. We note in passing that, as no current flows when no voltage is applied, one may restrict to lines through the origin (that is, lines with zero intercept); this is by no means the typical case.
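Such a zero-intercept fit can be requested directly in R; in the sketch below the voltage and current readings are made up purely for illustration:

```r
## Illustrative only: fit a line through the origin, current ~ voltage,
## so the single coefficient estimates the conductance 1/R.
volts <- c(1, 2, 3, 4, 5, 6)
amps  <- c(0.21, 0.39, 0.62, 0.78, 1.03, 1.19)   # noisy current readings
fit0  <- lm(amps ~ volts - 1)                    # "- 1" suppresses the intercept
coef(fit0)                                       # estimated slope (= 1/resistance)
```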
1.2 The Method of Least Squares
The required general method, the Method of Least Squares, arose in a rather different context. We know from Newton's Principia (Sir Isaac Newton (1642-1727), in 1687) that planets, the Earth included, go round the Sun in elliptical orbits, with the Sun at one focus of the ellipse. By Cartesian geometry, we may represent the ellipse by an algebraic equation of the second degree. This equation, though quadratic in the variables, is linear in the coefficients. How many coefficients $p$ we need depends on the choice of coordinate system; it lies in the range from two to six. We may make as many astronomical observations of the planet whose orbit is to be determined as we wish (the more the better): $n$ say, where $n$ is large, much larger than $p$. This makes the system of equations for the coefficients grossly over-determined, except that all the observations are polluted by experimental error. We need to tap the information content of the large number $n$ of readings to make the best estimate we can of the small number $p$ of parameters.
Write the equation of the ellipse as
$$a_1 x_1 + a_2 x_2 + \cdots + a_p x_p = 0.$$
Here the $a_j$ are the coefficients, to be found or estimated, and the $x_j$ are those of $x^2$, $xy$, $y^2$, $x$, $y$, $1$ that we need in the equation of the ellipse (we will always need 1, unless the ellipse degenerates to a point, which is not the case here). For the $i$th point, the left-hand side above will be 0 if the fit is exact, but $\epsilon_i$ say (denoting the $i$th error) in view of the observational errors. We wish to keep the errors $\epsilon_i$ small; we wish also to put positive and negative $\epsilon_i$ on the same footing, which we may do by looking at the squared errors $\epsilon_i^2$. A measure of the discrepancy of the fit is the sum of these squared errors, $\sum_{i=1}^{n} \epsilon_i^2$. The Method of Least Squares is to choose the coefficients $a_j$ so as to minimise this sum of squares,
$$SS := \sum_{i=1}^{n} \epsilon_i^2.$$
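As a concrete sketch of such an over-determined fit (everything here is a constructed illustration: the simulated orbit points, and the choice to normalise the constant coefficient to $-1$ so as to rule out the trivial all-zero solution, are assumptions of this example, not part of the text):

```r
## Illustrative only: simulate noisy points on an ellipse and estimate the
## remaining conic coefficients by least squares, having normalised the
## constant term to -1 (moved to the right-hand side as 1).
set.seed(1)
t <- seq(0, 2 * pi, length.out = 50)
x <- 2 + 5 * cos(t) + rnorm(50, sd = 0.05)   # noisy ellipse, centre (2, 1)
y <- 1 + 3 * sin(t) + rnorm(50, sd = 0.05)
## I() is needed so that ^ and * are taken arithmetically inside the formula.
fit <- lm(rep(1, 50) ~ 0 + I(x^2) + I(x * y) + I(y^2) + x + y)
coef(fit)   # least-squares estimates of a1, ..., a5 (with a6 fixed at -1)
```

The point of the sketch is that the coefficients enter linearly, so lm handles this quadratic-in-the-variables equation just as it handles a straight line.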