Introduction
Before beginning
This book is designed as a companion to the Regression Models Coursera class (https://www.coursera.org/course/regmods), part of the Data Science Specialization (https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop), a ten-course program offered by three faculty, Jeff Leek, Roger Peng and Brian Caffo, at the Johns Hopkins University Department of Biostatistics.
The videos associated with this book can be watched in full here: https://www.youtube.com/playlist?list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC. The relevant links to specific videos are also placed at the appropriate locations throughout.
Before beginning, we assume that you have a working knowledge of the R programming language. If not, there is a wonderful Coursera class by Roger Peng (https://www.coursera.org/course/rprog). In addition, students should know the basics of frequentist statistical inference; there is a Coursera class (https://www.coursera.org/course/statinference) and a LeanPub book (https://leanpub.com/LittleInferenceBook) covering this material.
The entirety of the book is on GitHub (https://github.com/bcaffo/regmodsbook). Please submit pull requests if you find errata! In addition, the course notes can also be found on GitHub (https://github.com/bcaffo/courses/tree/master/07_RegressionModels). While most code is in the book, all of the code for every figure and analysis in the book is in the R markdown files (.Rmd) for the respective lectures.
Finally, we should mention swirl (statistics with interactive R programming). swirl is an intelligent tutoring system developed by Nick Carchedi, with contributions by Sean Kross and Bill and Gina Croft. It offers a way to learn R in R. Download swirl at http://swirlstats.com. There's a swirl module for this course (https://github.com/swirldev/swirl_courses#swirl-courses)! Try it out; it's probably the most effective way to learn.
Regression models
Watch this video before beginning: https://www.youtube.com/watch?v=58ZPhK32sU8&index=1&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC
Regression models are the workhorse of data science. They are the best described, most practical, and most theoretically understood models in statistics. A data scientist well versed in regression models will be able to solve an incredible array of problems.
Perhaps the key insight for regression models is that they produce highly interpretable model fits. This is unlike machine learning algorithms, which often sacrifice interpretability for improved prediction performance or automation. These are, of course, valuable attributes in their own right. However, the benefit of simplicity, parsimony and interpretability offered by regression models (and their close generalizations) should make them a first tool of choice for any practical problem.
Motivating examples
Francis Galton's height data
Francis Galton, the 19th century polymath, can be credited with discovering regression. In his landmark paper Regression Toward Mediocrity in Hereditary Stature (http://galton.org/essays/1880-1889/galton-1886-jaigi-regression-stature.pdf), he compared the heights of parents and their children. He was particularly interested in the idea that the children of tall parents tended to be tall also, but a little shorter than their parents. Children of short parents tended to be short, but not quite as short as their parents. He referred to this as regression to mediocrity (or regression to the mean). In quantifying regression to the mean, he invented what we would call regression.
It is perhaps surprising that Galton's specific work on height is still relevant today. In fact, this European Journal of Human Genetics manuscript (http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html) compares Galton's prediction models with those using modern high-throughput genomic technology (spoiler alert: Galton wins).
Some questions from Galton's data come to mind. How would one fit a model that relates parent and child heights? How would one predict a child's height based on their parents'? How would we quantify regression to the mean? In this class, we'll answer all of these questions, plus many more.
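As a preview of what's to come, here is a minimal sketch of fitting such a model in R. It assumes the UsingR package, whose galton dataset (paired parent and child heights) is used later in this book:

```r
# A minimal sketch, assuming the UsingR package is installed; its galton
# data frame contains paired parent and child heights in inches.
library(UsingR)
data(galton)

# Regress child height on parent height
fit <- lm(child ~ parent, data = galton)
summary(fit)$coefficients
```

The fitted slope is less than one, which is Galton's regression to the mean in action: children of unusually tall or short parents are predicted to be somewhat closer to the average height.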
Simply Statistics versus Kobe Bryant
Simply Statistics (http://simplystatistics.org/) is a blog by Jeff Leek, Roger Peng and Rafael Irizarry. It is one of the most widely read statistics blogs, written by three of the top statisticians in academia. Rafa wrote a (somewhat tongue-in-cheek) post (http://simplystatistics.org/2013/01/28/data-supports-claim-that-if-kobe-stops-ball-hogging-the-lakers-will-win-more/) regarding ball hogging among NBA basketball players. (By the way, your author has played basketball with Rafael, who is quite good, but certainly doesn't pass up shots; glass houses and whatnot.)
Here are some key sentences:
"Data supports the claim that if Kobe stops ball hogging the Lakers will win more"

"Linear regression suggests that an increase of 1% in % of shots taken by Kobe results in a drop of 1.16 points (+/- 0.22) in score differential."
In this book we will cover how to create summary statements like this using regression model building. Note the nice interpretability of the linear regression model: with it, Rafa numerically relates the impact of taking more shots to the score differential.
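To see where a statement like this comes from, consider the following hedged sketch. The data here are simulated stand-ins (the post's actual game-level data are not reproduced), and the variable names shotPct and scoreDiff are hypothetical:

```r
# Illustrative sketch only: simulated stand-in data, with a slope chosen to
# mimic the quoted result; shotPct and scoreDiff are hypothetical names.
set.seed(1)
shotPct   <- runif(100, 20, 40)                        # % of team shots taken
scoreDiff <- 28 - 1.16 * shotPct + rnorm(100, sd = 6)  # simulated differential

fit <- lm(scoreDiff ~ shotPct)
# The slope is the estimated change in score differential per 1% increase in
# shots taken; its standard error gives the "+/-" part of the statement.
summary(fit)$coefficients
```

The interpretability Rafa exploits is exactly this: the slope coefficient has a direct, plain-language meaning.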
Summary notes: questions for this book
Regression models are incredibly handy statistical tools. One can use them to answer all sorts of questions. Consider three of the most common tasks for regression models:
1. Prediction. For example, to use the parents' heights to predict children's heights.
2. Modeling. For example, to try to find a parsimonious, easily described mean relationship between parental and child heights.
3. Covariation. For example, to investigate the variation in child heights that appears unrelated to parental heights (residual variation) and to quantify what impact genotype information has beyond parental height in explaining child height.
An important aspect, especially for questions 2 and 3, is assessing modeling assumptions. For example, it is important to figure out how, whether, and under what assumptions one can generalize findings beyond the data in question. Presumably, if we find a relationship between parental and child heights, we'd like to extend that knowledge beyond the data used to build the model. This requires assumptions. In this book, we'll cover the main assumptions necessary. The three tasks above are illustrated in the short sketch below.
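Here is a minimal sketch of all three tasks in R, again assuming the UsingR package's galton data:

```r
# A minimal sketch of the three tasks, assuming the UsingR package's galton
# data (parent and child heights in inches).
library(UsingR)
data(galton)
fit <- lm(child ~ parent, data = galton)

# 1. Prediction: the predicted child height at a parental height of 70 inches
predict(fit, newdata = data.frame(parent = 70))

# 2. Modeling: the parsimonious mean relationship, an intercept and a slope
coef(fit)

# 3. Covariation: the residual variation in child height remaining after
#    accounting for parental height
sd(resid(fit))
```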