Matt Wiley
Columbia City, IN, USA
Joshua F. Wiley
Columbia City, IN, USA
Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at www.apress.com/9781484228715. For more detailed information, please visit http://www.apress.com/source-code.
ISBN 978-1-4842-2871-5 e-ISBN 978-1-4842-2872-2
https://doi.org/10.1007/978-1-4842-2872-2
Library of Congress Control Number: 2019932986
© Matt Wiley and Joshua F. Wiley 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
Introduction
This book shows how to conduct data analysis using the popular R language. Our goal is to provide a practical resource for conducting advanced statistical analyses using R. As this is an advanced book, the reader is assumed to have some background in using R, including familiarity with general data management and the use of functions.
Because the book is primarily practical, we do not provide in-depth theoretical or conceptual introductions to the various statistical models discussed. However, to aid understanding and their appropriate application, we do provide some conceptual background on each analytic technique discussed.
Conventions
Bold lowercase letters are used to refer to a vector, for example, x. Bold uppercase letters are used to refer to a matrix, for example, X. Generally, the Latin alphabet is used for data and the Greek alphabet is used for parameters. Mathematical functions are indicated with parentheses, for example, f().
In the text, references to R code or functions appear in a monospaced font, like this. R function names include parentheses to indicate that they are functions, such as mean() for the mean function in R.
Package Setup
Throughout the book, we will make use of many different R packages that make tasks easier or provide more robust or sophisticated graphing and analysis options.
Although not required for readers, we make use of the checkpoint package to help ensure the book is reproducible [23]. If you do not care about reproducibility and are happy to take your chances that code that worked with our versions of R and packages also works with whatever versions you have, then you can skip this section. If you want reproducibility but do not care why or how it works, then create R scripts for the code for each chapter, save them, and run the checkpoint package at the beginning. If you care and want to know why and how it all works, read on for the next few paragraphs.
Details on Reproducibility
The many additional packages available for R are one of its greatest strengths. However, they also create some challenges. For example, as a reader, suppose that on your computer you have R v3.4.3 installed and that in January you installed the ggplot2 package for graphs. By default, you will have whatever version of ggplot2 was available in January when you installed it. Now in one chapter, we tell you that you need both the ggplot2 and cowplot packages. Because you already have ggplot2 installed, you do not need to install it again. However, suppose that you do not have the cowplot package installed. So, whenever you happen to be reading that chapter, you attempt to install the cowplot package; let's say it's in April. You will now by default get the latest version of cowplot available for that version of R as of April.
Now imagine a second reader comes along who also has R v3.4.3 but has neither the ggplot2 nor the cowplot package installed. They also read the chapter in April, but they install both packages in April, so they get the latest version of both packages available in April for R v3.4.3.
Even though both you and the other reader had the same version of R installed, you will end up with different package versions from each other, and likely different versions yet from whatever versions we used to write the book.
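To see where you stand in a scenario like this, you can ask R which versions are actually installed. The following is a minimal sketch, not code from the book. It uses base packages (stats, utils, graphics) as stand-ins so that it runs on any installation; substitute names such as ggplot2 or cowplot if you have them installed. The object names r_version and version_report are our own choices for illustration.

```r
# Packages to inspect. These base packages are placeholders chosen so the
# snippet runs everywhere; replace them with the packages you care about.
pkgs <- c("stats", "utils", "graphics")

# The version of R itself, as a character string.
r_version <- as.character(getRversion())

# One row per package with the version currently installed locally.
version_report <- data.frame(
  package = pkgs,
  version = vapply(pkgs,
                   function(p) as.character(packageVersion(p)),
                   character(1),
                   USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

print(r_version)
print(version_report)
```

Two readers who run this on the same day with the same version of R can still see different rows for add-on packages, which is exactly the situation described above.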
The end result is that different people, even with the same version of R, are very likely using different versions of different packages. This can pose a major challenge for reproducibility. If you are reading a book, it can be frustrating because code does not seem to work as we said it would. If you are using code in production or for scientific research or decision-making, nonreproducibility can pose an even bigger challenge.
The solution to standardize versions across people and ensure results are fully reproducible is to control not only the version of R but also the version of all packages. This requires a different approach to package installation and management than the default system, which uses the latest package versions from CRAN. The checkpoint package is designed to solve this challenge. It does require some extra steps and processes to use, and at first may seem a nuisance, but the payoff is that you can be guaranteed that you are not only using the same version of R but also the same version of all packages.
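As a rough illustration of what "the same version of all packages" means in practice, the sketch below compares installed versions against a pinned list. This is not how the checkpoint package is implemented, and it is not code from the book. The helper check_pinned() and the manifest pinned are hypothetical names, and the "0.0.0" entry is deliberately wrong so that a mismatch is shown.

```r
# Hypothetical manifest of pinned versions. Base packages are used so the
# sketch runs on any installation. The utils entry is intentionally wrong
# to demonstrate what a mismatch looks like.
pinned <- c(stats = as.character(packageVersion("stats")),
            utils = "0.0.0")

# Hypothetical helper: compare locally installed versions to the manifest.
check_pinned <- function(manifest) {
  installed <- vapply(names(manifest),
                      function(p) as.character(packageVersion(p)),
                      character(1),
                      USE.NAMES = FALSE)
  data.frame(package   = names(manifest),
             pinned    = unname(manifest),
             installed = installed,
             matches   = installed == unname(manifest),
             stringsAsFactors = FALSE)
}

result <- check_pinned(pinned)
print(result)
```

A check like this only detects drift after the fact. Controlling which versions get installed in the first place is the harder problem that the checkpoint package addresses.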
To understand how the checkpoint package works, we need a bit more background regarding how R's libraries and package system work.
Mainstream R packages are distributed through CRAN. Package authors can submit new versions of their packages to CRAN, and CRAN updates nightly. For some operating systems, such as Linux, CRAN stores only the package source code. For others, such as Windows, CRAN builds precompiled package binaries and hosts those. CRAN keeps old source code but generally does not keep old binary packages for long. On a local machine, when install.packages() is run, R goes online to a repository (by default CRAN), finds the package name, downloads it, and installs it into a local