Practical Data Science with R
Nina Zumel and John Mount
Copyright
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact
Special Sales Department, Manning Publications Co., 20 Baldwin Road, PO Box 261, Shelter Island, NY 11964. Email: orders@manning.com
©2014 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
| Manning Publications Co., 20 Baldwin Road, PO Box 261, Shelter Island, NY 11964 | Development editor: Cynthia Kane; Copyeditor: Benjamin Berg; Proofreader: Katie Tennant; Typesetter: Dottie Marsico; Cover designer: Marija Tudor |
ISBN 9781617291562
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 EBM 19 18 17 16 15 14
Dedication
To our parents
Olive and Paul Zumel
Peggy and David Mount
Foreword
If you're a beginning data scientist, or want to be one, Practical Data Science with R (PDSwR) is the place to start. If you're already doing data science, PDSwR will fill in gaps in your knowledge and even give you a fresh look at tools you use on a daily basis. It did for me.
While there are many excellent books on statistics and modeling with R, and a few good management books on applying data science in your organization, this book is unique in that it combines solid technical content with practical, down-to-earth advice on how to practice the craft. I would expect no less from Nina and John.
I first met John when he presented at an early Bay Area R Users Group about his joys and frustrations with R. Since then, Nina, John, and I have collaborated on a couple of projects for my former employer. And John has presented early ideas from PDSwR, both to the big group and our Berkeley R-Beginners meetup. Based on his experience as a practicing data scientist, John is outspoken and has strong views about how to do things. PDSwR reflects Nina and John's definite views on how to do data science: what tools to use, the process to follow, the important methods, and the importance of interpersonal communications. There are no ambiguities in PDSwR.
This, as far as I'm concerned, is perfectly fine, especially since I agree with 98% of their views. (My only quibble is around SQL, but that's more an issue of my upbringing than of disagreement.) What their unambiguous writing means is that you can focus on the craft and art of data science and not be distracted by choices of which tools and methods to use. This precision is what makes PDSwR practical. Let's look at some specifics.
Practical tool set: R is a given. In addition, RStudio is the IDE of choice; I've been using RStudio since it came out. It has evolved into a remarkable tool; integrated debugging is in the latest version. The third major tool choice in PDSwR is Hadley Wickham's ggplot2. While R has traditionally included excellent graphics and visualization tools, ggplot2 takes R visualization to the next level. (My practical hint: take a close look at any of Hadley's R packages, or those of his students.) In addition to those main tools, PDSwR introduces necessary secondary tools: a proper SQL DBMS for larger datasets; Git and GitHub for source code version control; and knitr for documentation generation.
Practical datasets: The only way to learn data science is by doing it. There's a big leap from the typical teaching datasets to the real world. PDSwR strikes a good balance between the need for a practical (simple) dataset for learning and the messiness of the real world. PDSwR walks you through how to explore a new dataset to find problems in the data, cleaning and transforming when necessary.
Practical human relations: Data science is all about solving real-world problems for your client, either as a consultant or within your organization. In either case, you'll work with a multifaceted group of people, each with their own motivations, skills, and responsibilities. As practicing consultants, Nina and John understand this well. PDSwR is unique in stressing the importance of understanding these roles while working through your data science project.
Practical modeling: The bulk of PDSwR is about modeling, starting with an excellent overview of the modeling process, including how to pick the modeling method to use and, when done, gauge the model's quality. The book walks you through the most practical modeling methods you're likely to need. The theory behind each method is intuitively explained. A specific example is worked through, and the code and data are available on the authors' GitHub site. Most importantly, tricks and traps are covered. Each section ends with practical takeaways.
In short, Practical Data Science with R is a unique and important addition to any data scientist's library.
Jim Porzak
Senior Data Scientist and Cofounder of the Bay Area R Users Group
Preface
This is the book we wish we'd had when we were teaching ourselves that collection of subjects and skills that has come to be referred to as data science. It's the book that we'd like to hand out to our clients and peers. Its purpose is to explain the relevant parts of statistics, computer science, and machine learning that are crucial to data science.
Data science draws on tools from the empirical sciences, statistics, reporting, analytics, visualization, business intelligence, expert systems, machine learning, databases, data warehousing, data mining, and big data. It's because we have so many tools that we need a discipline that covers them all. What distinguishes data science itself from the tools and techniques is the central goal of deploying effective decision-making models to a production environment.
Our goal is to present data science from a pragmatic, practice-oriented viewpoint. We've tried to achieve this by concentrating on fully worked exercises on real data; altogether, this book works through over 10 significant datasets. We feel that this approach allows us to illustrate what we really want to teach and to demonstrate all the preparatory steps necessary to any real-world project.
Throughout our text, we discuss useful statistical and machine learning concepts, include concrete code examples, and explore partnering with and presenting to nonspecialists. We hope that if you don't find one of these topics novel, we're able to shine a light on one or two other topics that you may not have thought about recently.