Statistical Modeling: A Fresh Approach
Daniel T. Kaplan
E-book version of Second Edition 2017
Preface to this electronic version
When Statistical Modeling: A Fresh Approach was being drafted, the Amazon Kindle had just been introduced and, at $400, was too pricey for most students. A year after the book appeared in print, the Apple iPad was released. Since then, a generation of students reads in an electronic format as a matter of course. Many have an e-book reader always at hand: a smartphone.
The e-book format has many advantages beyond portability. The books can be much cheaper and less of a drain on natural resources. The text can be searched easily. On many platforms, the display is in color. The reader can adjust the display to suit his or her preferences. All for the good.
The flip side of being able to read on devices of many different configurations is that authors cannot reliably anticipate what the reader will be looking at on the "page." Traditionally, authors and book designers have worked with a fixed page, allowing them to lay out text, graphics, mathematical notation, computer code, etc. as an integrated whole. If you put two images side by side so that the reader could easily refer back and forth, those images stayed side by side. This isn't true with e-book formats. Tables, which are all about putting things in a spatial arrangement, can become almost microscopic on a smartphone. You can mitigate some of these deficiencies by being an active reader, for instance by switching between portrait and landscape modes.
The original printed versions of this book had a "computational technique" section at the end of every chapter. That material is now online. The problems of formatting it for an e-book are just too great. Even more important, the web interface allows those materials to become interactive and thereby enhances learning.
The web-based materials are available through the project-mosaic-books.com page for this book.
Preface (to the printed 2nd edition)
The purpose of this book is to provide an introduction to statistics that gives readers a sufficient mastery of statistical concepts, methods, and computations to apply them to authentic systems. By "authentic," I mean the sort of multivariable systems often encountered when working in the natural or social sciences, commerce, government, law, or any of the many contexts in which data are collected with an eye to understanding how things work or to making predictions about what will happen.
The world is uncertain and complex. We deal with this complexity and uncertainty using a variety of strategies, including the scientific method and the discipline of statistics.
Statistics deals with uncertainty, quantifying it so that you can assess how reliable (that is, how likely to be repeatable) your findings are. The scientific method deals with complexity: reduce systems to simpler components, define and measure quantities carefully, do experiments in which some conditions are held constant but others are varied systematically.
Beyond helping to quantify uncertainty and reliability, statistics provides another great insight of which most people are unaware. When dealing with systems involving multiple influences, it is possible, and best, to deal with those influences simultaneously. By appropriate data collection and analysis, the confusing tangle of influences can sometimes be straightened out. In other words, statistics goes hand-in-hand with the scientific method when it comes to dealing with complexity and understanding how systems work.
The statistical methods that can accomplish this are often considered advanced: multiple regression, analysis of covariance, logistic regression, among others. With appropriate software, any method is accessible in the sense of being able to produce a summary report on the computer. But a method is useful only when the user has a way to understand whether the method is appropriate for the situation, what the method is telling you about the data, and what the method is not capable of revealing. Computer scientist Richard Hamming (1915-1998) said: "The purpose of computing is insight, not numbers." Without a solid understanding of the theory that underlies a method, the numbers generated by the computer may not give insight.
Advanced methods of statistics can give tremendous insight. For this reason, these methods need to be accessible both computationally and theoretically to the widest possible audience. Historically, access has been limited because few people have the algebraic skills needed to approach the methods in the way they are usually presented. But there are many paths to understanding, and I have undertaken to find one, the "fresh approach" of the title, that takes the greatest advantage of the actual skills that most people already have in abundance.
In trying to meet that challenge, I have made many unconventional choices. Theory becomes simpler when there is a unified framework for treating many aspects of statistics, so I have chosen to present just about everything in the context of models: descriptive statistics as well as inference.
Consequently, algebraic notation and formulas are strongly de-emphasized in this book. The traditional role that formulas have played in providing instructions for how to carry out a calculation is no longer essential for effective use of statistical methods. Software now implements the calculations. What's needed is not a formula-based description that allows people to reproduce what computers do, but a way to understand the methods at a high level so that the rapidity and reliability of computers in performing calculations can be used to provide insight into real-world problems.
I have been fortunate to have the assistance and support of many people. Some of the colleagues who have played important roles are David Bressoud, George Cobb, Dan Flath, Tom Halverson, Gary Krueger, Weiwen Miao, Phil Poronnik, Victor Addona, Alicia Johnson, Karen Saxe, Michael Schneider, and Libby Shoop. Critical institutional support was given by Brian Rosenberg, Jan Serie, Dan Hornbach, Helen Warren, and Diane Michelfelder at Macalester and Mercedes Talley at the Keck Foundation.
I received encouragement from many in the statistics education community, including George Cobb, Joan Garfield, Dick De Veaux, Bob delMas, Julie Legler, Milo Schield, Paul Alper, Dennis Pearl, Jean Scott, Ben Hansen, Tom Short, Andy Zieffler, Sharon Lane-Getaz, Katie Makar, Michael Bulmer, Frank Shaw, and the participants in our monthly "Stat Chat" sessions. Helpful suggestions came from Simon Blomberg, Dominic Hyde, Michael Lavine, Erik Larson, Julie Dolan, and Kendrick Brown. Michael Edwards helped with proofreading. Nick Trefethen and Dave Saville provided important insights about the geometry of fitting linear models.
It's important to recognize the role played by the developers of the R software: the "core" R team as well as the group of volunteers who have provided numerous packages that extend R's capabilities. Hadley Wickham, in particular, developed the ggplot2 package used to create many of the graphics in this Second Edition, as well as a remarkable array of other utilities for treating data in a unified way. The design of R (and its progenitor S) is not just a matter of good software design, but of a brilliant understanding and systematization of statistics that makes the underlying logic of statistics accessible to students as well as experts. Further extending the reach of R, J.J. Allaire, Joe Chang, and Joshua Paulson have created the RStudio interface to R, which makes it much easier to teach and learn with R.
Special thanks are due to Randall Pruim and Nicholas Horton who, as