The single most important question for a working scientist (perhaps the single most useful question anyone can ask) is: what's going on here? Answering this question requires creative use of different ways to make pictures of datasets, to summarize them, and to expose whatever structure might be there. This activity is sometimes known as Descriptive Statistics. There isn't any fixed recipe for understanding a dataset, but there is a rich variety of tools we can use to get insights.
1.1 Datasets
A dataset is a collection of descriptions of different instances of the same phenomenon. These descriptions could take a variety of forms, but it is important that they are descriptions of the same thing. For example, my grandfather collected the daily rainfall in his garden for many years; we could collect the height of each person in a room; or the number of children in each family on a block; or whether 10 classmates would prefer to be rich or famous. There could be more than one description recorded for each item. For example, when he recorded the contents of the rain gauge each morning, my grandfather could have recorded (say) the temperature and barometric pressure. As another example, one might record the height, weight, blood pressure and body temperature of every patient visiting a doctor's office.
The descriptions in a dataset can take a variety of forms. A description could be categorical, meaning that each data item can take a small set of prescribed values. For example, we might record whether each of 100 passers-by preferred to be Rich or Famous. As another example, we could record whether the passers-by are Male or Female. Categorical data could be ordinal, meaning that we can tell whether one data item is larger than another. For example, a dataset giving the number of children in a family for some set of families is categorical, because it uses only non-negative integers, but it is also ordinal, because we can tell whether one family is larger than another.
Some ordinal categorical data appears not to be numerical, but can be assigned a number in a reasonably sensible fashion. For example, many readers will recall being asked by a doctor to rate their pain on a scale of 1 to 10, a question that is usually relatively easy to answer, but is quite strange when you think about it carefully. As another example, we could ask a set of users to rate the usability of an interface in a range from "very bad" to "very good", and then record that using -2 for "very bad", -1 for "bad", 0 for "neutral", 1 for "good", and 2 for "very good".
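As a minimal sketch of the usability example above, the mapping from rating labels to numbers could be written as a small lookup table in Python. The names `RATING_SCALE` and `encode_ratings`, and the sample responses, are invented for illustration and are not part of any dataset used in this book.

```python
# Hypothetical encoding of an ordinal rating scale; names and data are illustrative.
RATING_SCALE = {
    "very bad": -2,
    "bad": -1,
    "neutral": 0,
    "good": 1,
    "very good": 2,
}


def encode_ratings(responses):
    """Map each rating label to its number, keeping the responses in order."""
    return [RATING_SCALE[label] for label in responses]


# Made-up responses from five users.
responses = ["good", "very bad", "neutral", "very good", "bad"]
codes = encode_ratings(responses)

# Because the data is ordinal, comparing the codes compares the ratings.
best = max(responses, key=RATING_SCALE.get)
```

The choice of numbers is a convention: any assignment that preserves the ordering of the labels would record the same ordinal information.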
Many interesting datasets involve continuous variables (for example, height, weight, or body temperature), where you could reasonably expect to encounter any value in a particular range. For example, we might have the heights of all people in a particular room, or the rainfall at a particular place for each day of the year.
You should think of a dataset as a collection of d-tuples (a d-tuple is an ordered list of d elements). Tuples differ from vectors, because we can always add and subtract vectors, but we cannot necessarily add or subtract tuples. We will always write N for the number of tuples in the dataset, and d for the number of elements in each tuple. The number of elements will be the same for every tuple, though sometimes we may not know the value of some elements in some tuples (which means we must figure out how to predict their values, which we will do much later).
Each element of a tuple has its own type. Some elements might be categorical. For example, one dataset we shall see several times has entries for Gender; Grade; Age; Race; Urban/Rural; School; Goals; Grades; Sports; Looks; and Money for 478 children, so d = 11 and N = 478. In this dataset, each entry is categorical data. Clearly, these tuples are not vectors because one cannot add or subtract (say) Gender, or add Age to Grades.
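To make the idea of a dataset of d-tuples concrete, here is a small sketch in Python. The values are invented for illustration (they are not drawn from the 478-child dataset), and the use of `None` for an unknown element is just one possible convention.

```python
# Hypothetical dataset of d-tuples; the values are invented for illustration.
# Each tuple is (height in cm, weight in kg, preference).
dataset = [
    (172.0, 68.5, "Rich"),
    (158.5, 54.0, "Famous"),
    (181.0, None, "Rich"),   # None marks an element whose value we do not know
    (165.5, 61.0, "Famous"),
]

N = len(dataset)       # number of tuples in the dataset
d = len(dataset[0])    # number of elements in each tuple

# Every tuple must have the same number of elements.
assert all(len(item) == d for item in dataset)

# Collect the values seen at each position to inspect that element's type.
values_by_position = [[item[r] for item in dataset] for r in range(d)]
preferences = set(values_by_position[2])   # a small set of prescribed values

# Python's + on tuples concatenates; it does not add element by element.
joined = dataset[0] + dataset[1]
```

Here the first two elements are continuous and the third is categorical, so adding two of these tuples element by element would not be meaningful. Python's own `+` reflects this by concatenating the tuples into a longer one rather than summing them.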
Most of our data will be vectors. We use the same notation for a tuple and for a vector. We write a vector in bold, so x could represent a vector or a tuple (the context will make it obvious which is intended).
The entire dataset is {x}. When we need to refer to the ith data item, we write x_i. Assume we have N data items, and we wish to make a new dataset out of them; we write the dataset made out of these items as {x_i} (the i is to suggest you are taking a set of items and making a dataset out of them).
In this chapter, we will work mainly with continuous data. We will see a variety of methods for plotting and summarizing 1-tuples. We can build these plots from a dataset of d-tuples by extracting the rth element of each d-tuple. All through the book, we will see many datasets downloaded from various web sources, because people are so generous about publishing interesting datasets on the web. In the next chapter, we will look at two-dimensional data, and we look at high-dimensional data in Chap.