Hands-On Programming with R
Garrett Grolemund
Foreword
Learning to program is important if you're serious about understanding data. There's no argument that data science must be performed on a computer, but you have a choice between learning a graphical user interface (GUI) or a programming language. Both Garrett and I strongly believe that programming is a vital skill for everyone who works intensely with data. While convenient, a GUI is ultimately limiting, because it hampers three properties essential for good data analysis:
- Reproducibility: The ability to re-create a past analysis, which is crucial for good science.
- Automation: The ability to rapidly re-create an analysis when data changes (as it always does).
- Communication: Code is just text, so it is easy to communicate. When learning, this makes it easy to get help, whether it's with email, Google, Stack Overflow, or elsewhere.
Don't be afraid of programming! Anyone can learn to program with the right motivation, and this book is organized to keep you motivated. This is not a reference book; instead, it's structured around three hands-on challenges. Mastering these challenges will lead you through the basics of R programming and even into some intermediate topics, such as vectorized code, scoping, and S3 methods. Real challenges are a great way to learn, because you're not memorizing functions void of context; instead, you're learning functions as you need them to solve a real problem. You'll learn by doing, not by reading.
As you learn to program, you are going to get frustrated. You are learning a new language, and it will take time to become fluent. But frustration is not just natural, it's actually a positive sign that you should watch for. Frustration is your brain's way of being lazy; it's trying to get you to quit and go do something easy or fun. If you want to get physically fitter, you need to push your body even though it complains. If you want to get better at programming, you'll need to push your brain. Recognize when you get frustrated and see it as a good thing: you're now stretching yourself. Push yourself a little further every day, and you'll soon be a confident programmer.
Hands-On Programming with R is friendly, conversational, and active. It's the next-best thing to learning R programming from me or Garrett in person. I hope you enjoy reading it as much as I have.
Hadley Wickham
Chief Scientist, RStudio
P.S. Garrett is too modest to mention it, but his lubridate package makes working with dates or times in R much less painful. Check it out!
Preface
This book will teach you how to program in R. You'll go from loading data to writing your own functions (which will outperform the functions of other R users). But this is not a typical introduction to R. I want to help you become a data scientist, as well as a computer scientist, so this book will focus on the programming skills that are most related to data science.
The chapters in the book are arranged according to three practical projects; given that they're fairly substantial projects, they span multiple chapters. I chose these projects for two reasons. First, they cover the breadth of the R language. You will learn how to load data, assemble and disassemble data objects, navigate R's environment system, write your own functions, and use all of R's programming tools, such as if else statements, for loops, S3 classes, R's package system, and R's debugging tools. The projects will also teach you how to write vectorized R code, a style of lightning-fast code that takes advantage of all of the things R does best.
But more importantly, the projects will teach you how to solve the logistical problems of data science, and there are many logistical problems. When you work with data, you will need to store, retrieve, and manipulate large sets of values without introducing errors. As you work through the book, I will teach you not just how to program with R, but how to use the programming skills to support your work as a data scientist.
Not every programmer needs to be a data scientist, so not every programmer will find this book useful. You will find this book helpful if you're in one of the following categories:
- You already use R as a statistical tool but would like to learn how to write your own functions and simulations with R.
- You would like to teach yourself how to program, and you see the sense of learning a language related to data science.
One of the biggest surprises in this book is that I do not cover traditional applications of R, such as models and graphs; instead, I treat R purely as a programming language. Why this narrow focus? R is designed to be a tool that helps scientists analyze data. It has many excellent functions that make plots and fit models to data. As a result, many statisticians learn to use R as if it were a piece of software: they learn which functions do what they want, and they ignore the rest.
This is an understandable approach to learning R. Visualizing and modeling data are complicated skills that require a scientist's full attention. It takes expertise, judgement, and focus to extract reliable insights from a data set. I would not recommend that any data scientist distract herself with computer programming until she feels comfortable with the basic theory and practice of her craft. If you would like to learn the craft of data science, I recommend the forthcoming book Data Science with R, my companion volume to this book.
However, learning to program should be on every data scientist's to-do list. Knowing how to program will make you a more flexible analyst and augment your mastery of data science in every way. My favorite metaphor for describing this was introduced by Greg Snow on the R help mailing list in May 2006. Using the functions in R is like riding a bus. Writing programs in R is like driving a car.
Busses are very easy to use, you just need to know which bus to get on, where to get on, and where to get off (and you need to pay your fare). Cars, on the other hand, require much more work: you need to have some type of map or directions (even if the map is in your head), you need to put gas in every now and then, you need to know the rules of the road (have some type of driver's license). The big advantage of the car is that it can take you a bunch of places that the bus does not go and it is quicker for some trips that would require transferring between busses.
Using this analogy, programs like SPSS are busses, easy to use for the standard things, but very frustrating if you want to do something that is not already preprogrammed.
R is a 4-wheel drive SUV (though environmentally friendly) with a bike on the back, a kayak on top, good walking and running shoes in the passenger seat, and mountain climbing and spelunking gear in the back.
R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning where the bus stops are in SPSS.
Greg Snow
Greg compares R to SPSS, but he assumes that you use the full powers of R; in other words, that you learn how to program in R. If you only use functions that preexist in R, you are using R like SPSS: it is a bus that can only take you to certain places.
This flexibility matters to data scientists. The exact details of a method or simulation will change from problem to problem. If you cannot build a method tailored to your situation, you may find yourself tempted to make unrealistic assumptions just so you can use an ill-suited method that already exists.