1. Introduction
The advances in the field of genetics over the past two generations have been astounding. The double helix structure of DNA, the genetic basis of life and reproduction in humans and many other species, was first described in print in 1953 [].
Such problems are by their nature interdisciplinary. Communication and cooperation between scientists are required even to start answering many of these questions. Insights from geneticists and bioinformaticians continue to be necessary to develop the technological software which is now available. None of these advances would have been possible without the incredible acceleration in computing speed and memory. Insights from geneticists and statisticians are necessary to build models. Bioinformaticians and statisticians are needed to analyze data, but require geneticists and biologists to explain the mechanisms underlying the patterns seen in the data. Many recent advances in statistical theory have been in response to the emergence of big data, i.e., huge data sets, in particular genetic data. However, as always, these advances are part of the continuing journey that underlies scientific progress and we are far from understanding many of the issues presented by such data sets.
This book aims to be a guidebook to part of this journey. Specifically, we look at developments in the studies of genetic association. The title of the book, Phenotypes and Genotypes, reflects this. Studies of genetic association aim to elucidate how our genetic code (genotypes) influences the traits we possess (phenotypes). This is a relatively new and rapidly expanding field. Overall, the aims of the book are to present the theoretical background to studies of genetic association (both genetic and statistical), indicate how the field has advanced in recent years, give a snapshot of the most commonly used methods at present, together with their advantages and shortcomings, and finally indicate some of the problems that remain to be solved in the future. Since the authors are statisticians, stress will naturally be placed on the statistical models and methods involved. However, in order to understand these statistical models, one must first understand the biological concepts underlying them.
More specifically, Chap. gives an overview of the concepts from genetics required to be able to interpret and develop statistical models of genetic association. The ideas of phenotype and genotype are fundamental. A phenotype is any observable trait of an organism. In particular, here we will be interested in dichotomous traits (i.e., only two states are possible, for example, the presence or absence of a disease) and continuous traits (these are traits which are measured according to some scale, e.g., height, weight, milk yield).
We then give an overview of the genome. This is the genetic information which is found in each cell of an organism. The theory will concentrate on diploid organisms. Genetic information in such organisms is contained in pairs of chromosomes; humans have 23 such pairs. One chromosome of each pair is inherited from an individual's mother and the other comes from the father. In many organisms, including humans, one of these pairs is associated with the sex of an individual. The other pairs of chromosomes are called homologous, since the genetic information found at a pair of corresponding loci on such chromosomes combines to form an individual's genotype. In practice, we observe the genotype of an individual at a given locus, but we do not know which information came from the mother and which from the father.
Suppose for simplicity that two simple traits, say eye color and blood group, are each coded by a single gene. If these genes are located on different chromosome pairs, then the information passed on by a parent regarding one trait is independent of the information passed on regarding the second trait (in each case the information comes from the maternal chromosome with probability 0.5 and otherwise comes from the paternal chromosome). However, if the genes for these two traits are located close to each other on the same chromosome pair, then the information passed on by a parent very likely comes from the same chromosome (either the maternal or the paternal). In this case, the genes for these traits are said to be linked, or equivalently, the two corresponding loci are said to be linked. We consider the genetic distance between two loci, whose definition is based on the probability that the information passed on by a parent at two loci comes from the same chromosome. This is one minus the probability of a so-called crossover, which occurs when the information passed on at two loci on the same chromosome originally came from different chromosomes. The possibility of crossover results from the recombination of genetic material on homologous chromosomes before it is passed on to offspring. The closer two loci are on a chromosome, the less likely crossover is. Some probabilistic models linking genetic distance to the actual physical distance between loci are presented.
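As a concrete illustration of how the probability of crossover can be tied to a distance along the chromosome, the following minimal sketch uses Haldane's map function, a standard model which assumes that crossovers occur independently along the chromosome (no interference). Whether this is one of the models presented later in the book is an assumption on our part, and the function names are purely illustrative.

```python
import math


def haldane_recombination_fraction(distance_morgans: float) -> float:
    """Probability of crossover between two loci under Haldane's model.

    distance_morgans is the map distance in Morgans. The model assumes
    crossovers arise independently (no interference); this is an
    illustrative assumption, not a claim about any particular organism.
    """
    if distance_morgans < 0:
        raise ValueError("distance must be non-negative")
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


def haldane_distance(recombination_fraction: float) -> float:
    """Inverse map: distance in Morgans for a crossover probability below 0.5."""
    if not 0.0 <= recombination_fraction < 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * recombination_fraction)


if __name__ == "__main__":
    for d in (0.01, 0.1, 0.5, 2.0):
        r = haldane_recombination_fraction(d)
        # 1 - r is the probability that both loci are passed on
        # from the same parental chromosome.
        print(f"d = {d:4.2f} M  crossover prob = {r:.4f}  same chromosome = {1 - r:.4f}")
```

For loci that are very close, the crossover probability is approximately equal to the distance itself, while for distant loci it approaches 0.5, which corresponds to the independent inheritance described above for genes on different chromosome pairs.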
In general, the relation between traits (phenotypes) and genotypes is far more complex than the determination of eye color. For example, sex obviously has an influence on the height of humans, but the height of individuals of a particular sex follows a normal distribution. From the Central Limit Theorem, it would seem that height is affected by a large number of factors. Studies have shown that height depends on both environmental factors and various genetic loci [.
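The appeal to the Central Limit Theorem can be made concrete with a small simulation. The sketch below is a deliberately simplified toy model and not a model of human height: it assumes a trait formed by adding equal, independent contributions from many loci, each allele being of the "increasing" type with probability 0.5, and it ignores environmental effects. All names and parameter values are hypothetical.

```python
import random
import statistics


def simulate_trait(n_loci: int, rng: random.Random) -> int:
    """Toy additive trait: number of 'increasing' alleles over n_loci diploid loci.

    Each of the 2 * n_loci alleles independently contributes 1 with
    probability 0.5 (an illustrative assumption).
    """
    return sum(rng.random() < 0.5 for _ in range(2 * n_loci))


rng = random.Random(0)  # fixed seed so that the run is reproducible
traits = [simulate_trait(100, rng) for _ in range(2000)]

mean = statistics.mean(traits)
sd = statistics.pstdev(traits)
# For a normal distribution, roughly 68 % of values lie within one
# standard deviation of the mean.
within_one_sd = sum(abs(t - mean) <= sd for t in traits) / len(traits)

if __name__ == "__main__":
    print(f"mean = {mean:.2f}, sd = {sd:.2f}, share within one sd = {within_one_sd:.3f}")
```

With 100 loci the simulated values already cluster around their mean in the bell-shaped manner expected of a sum of many small independent effects, whereas a trait governed by a single locus could take only three values.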
Obviously, in the case of many species, particularly humans, it is impossible to create such experimental populations. However, the emergence of genome-wide sequencers has led to the possibility of carrying out so-called Genome-Wide Association Studies (GWAS). The general concepts behind the design of such studies are outlined in Sect.. In such studies, the number of genetic variables considered is generally much greater than in QTL mapping, and so the statistical problem of multiple testing becomes much more serious. This problem arises from the fact that applying classical procedures of hypothesis testing, i.e., using a fixed significance level, very often leads to a large number of false discoveries.
It should be noted that the classical probability and statistical theory, which form the basis for Chaps.. Other readers should use the Appendix as a source of reference when necessary.
Chapter deals with methods of model selection.
Consider a simple situation in which we have m markers on one chromosome and we wish to test whether there is a QTL on the same chromosome. In order to do this, we might carry out a set of m tests where the null hypothesis of the i-th test states that the i-th marker is not associated with the quantitative trait in question and the alternative is that the i-th marker is associated with the quantitative trait. One might carry out all these tests at a significance level of 5 % and conclude that there exists a QTL on the same chromosome if and only if the null hypothesis is rejected at least once. One obvious problem with this approach is that as m increases, the probability of concluding that there is a QTL on the chromosome also increases, even when no QTL is present. In such a case, controlling the familywise error rate (FWER) is an appropriate criterion to ensure that the probability of any false detection (i.e., concluding there is a QTL on that chromosome, when there is none) remains low, regardless of the number of markers used. The classical approach to this problem would be to use the Bonferroni procedure, which involves dividing the nominal significance level (here 5 %) by the number of markers. This ensures that the FWER does not exceed 5 %. Refinements of the Bonferroni procedure are also considered.
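The effect of the number of markers on the probability of a false detection can be sketched numerically. The following minimal example assumes, purely for illustration, that the m tests are independent; this is unrealistic for linked markers on a single chromosome, but it gives a simple closed form. The Bonferroni bound itself does not rely on independence. The function names are hypothetical.

```python
def fwer_independent(m: int, level: float) -> float:
    """Probability of at least one false rejection among m true null hypotheses.

    Assumes the m tests are independent and each is carried out at the
    given significance level (an illustrative simplification).
    """
    return 1.0 - (1.0 - level) ** m


def bonferroni_level(m: int, alpha: float = 0.05) -> float:
    """Per-test significance level under the Bonferroni procedure."""
    return alpha / m


if __name__ == "__main__":
    for m in (1, 10, 100, 1000):
        uncorrected = fwer_independent(m, 0.05)
        corrected = fwer_independent(m, bonferroni_level(m))
        # With m = 10 the uncorrected FWER is already about 0.40.
        print(f"m = {m:5d}  uncorrected FWER = {uncorrected:.4f}  Bonferroni FWER = {corrected:.4f}")
```

Under these assumptions the uncorrected FWER tends rapidly to one as m grows, while the Bonferroni-corrected FWER stays just below the nominal 5 % for every m.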