Modern Data Science with R
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Joseph K. Blitzstein, Harvard University, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada
Recently Published Titles
Practical Multivariate Analysis, Sixth Edition
Abdelmonem Afifi, Susanne May, Robin A. Donatello, and Virginia A. Clark
Time Series: A First Course with Bootstrap Starter
Tucker S. McElroy and Dimitris N. Politis
Probability and Bayesian Modeling
Jim Albert and Jingchen Hu
Surrogates
Gaussian Process Modeling, Design, and Optimization for the Applied Sciences
Robert B. Gramacy
Statistical Analysis of Financial Data
With Examples in R
James Gentle
Statistical Rethinking
A Bayesian Course with Examples in R and STAN, Second Edition
Richard McElreath
Statistical Machine Learning
A Model-Based Approach
Richard Golden
Randomization, Bootstrap and Monte Carlo Methods in Biology
Fourth Edition
Bryan F. J. Manly, Jorge A. Navarro Alberto
Principles of Uncertainty, Second Edition
Joseph B. Kadane
Beyond Multiple Linear Regression
Applied Generalized Linear Models and Multilevel Models in R
Paul Roback, Julie Legler
Bayesian Thinking in Biostatistics
Gary L. Rosner, Purushottam W. Laud, and Wesley O. Johnson
Modern Data Science with R, Second Edition
Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton
For more information about this series, please visit: https://www.crcpress.com/ChapmanHallCRC-Texts-in-Statistical-Science/book-series/CHTEXSTASCI
Second edition published 2021
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
© 2021 Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, LLC
The right of Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton to be identified as authors of this work has been asserted by them in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
ISBN: 9780367191498 (hbk)
ISBN: 9780367745448 (pbk)
ISBN: 9780429200717 (ebk)
Typeset in Latin Modern font
by KnowledgeWorks Global Ltd.
Benjamin S. Baumer is an associate professor in the Statistical & Data Sciences program at Smith College. He has been a practicing data scientist since 2004, when he became the first full-time statistical analyst for the New York Mets. Ben is a co-author of The Sabermetric Revolution and Analyzing Baseball Data with R. He received the 2019 Waller Education Award and the 2016 Contemporary Baseball Analysis Award from the Society for American Baseball Research.
Daniel T. Kaplan is the DeWitt Wallace Professor of Mathematics and Computer Science at Macalester College. He is the author of several textbooks on statistical modeling and statistical computing. Danny received the 2006 Macalester Excellence in Teaching award and the 2017 CAUSE Lifetime Achievement Award.
Nicholas J. Horton is the Beitzel Professor of Technology and Society (Statistics and Data Science) at Amherst College. He is a Fellow of the ASA and the AAAS, co-chair of the National Academies Committee on Applied and Theoretical Statistics, recipient of a number of national teaching awards, author of a series of books on statistical computing, and actively involved in data science curriculum efforts to help students think with data.
Background and motivation
The increasing volume and sophistication of data pose new challenges for analysts, who need to be able to transform complex data sets to answer important statistical questions. A consensus report on data science for undergraduates (National Academies of Science, Engineering, and Medicine, 2018) noted that data science is revolutionizing science and the workplace. The report defined a data scientist as a knowledge worker who is principally occupied with analyzing complex and massive data resources.
Michael I. Jordan has described data science as the marriage of computational thinking and inferential (statistical) thinking. Without the skills to be able to wrangle or marshal the increasingly rich and complex data that surround us, analysts will not be able to use these data to make better decisions.
Demand is strong for graduates with these skills. According to the company ratings site Glassdoor, data scientist was the best job in America every year from 2016 to 2019 (Columbus, 2019).
New data technologies make it possible to extract data from more sources than ever before. Streamlined data processing libraries enable data scientists to express how to restructure those data into a form suitable for analysis. Database systems facilitate the storage and retrieval of ever-larger collections of data. State-of-the-art workflow tools foster well-documented and reproducible analysis. Modern statistical and machine learning methods allow the analyst to fit and assess models as well as to undertake supervised or unsupervised learning to glean information about the underlying real-world phenomena. Contemporary data science requires tight integration of these statistical, computing, data-related, and communication skills.
Intended audience
This book is intended for readers who want to develop the appropriate skills to tackle complex data science projects and "think with data" (as coined by Diane Lambert of Google). The desire to solve problems using data is at the heart of our approach.
We acknowledge that it is impossible to cover all these topics in any level of detail within a single book: Many of the chapters could productively form the basis for a course or series of courses. Instead, our goal is to lay a foundation for analysis of real-world data and to ensure that analysts see the power of statistics and data analysis. After reading this book, readers will have greatly expanded their skill set for working with these data, and should have a newfound confidence about their ability to learn new technologies on the fly.