Thank you for choosing this book on using R for health data analysis. Even if you're already familiar with the R language, we hope you will find some new approaches here as we make the most of the latest R tools, including some we've developed ourselves. If you are already familiar with R, we still encourage you to skim through the first few chapters to familiarise yourself with the style of R we recommend.
R can be used for all the health data science applications we can think of: from bioinformatics and computational biology, to administrative data analysis and natural language processing, through internet-of-things and wearable data, to machine learning and artificial intelligence, and even public health and epidemiology. R has it all.
Here are the main reasons we love R:
R is versatile and powerful - use it for
graphics;
all the statistical tests you can dream of;
machine learning and deep learning;
automated reports;
websites;
and even books (yes, this book was written entirely in R).
R scripts can be reused - this gives you efficiency and reproducibility.
It is free to use by anyone, anywhere.
A script is a list of instructions. It is just a text file and no special software is required to view one. An example R script is shown in Figure 1.1.
Don't panic! The only thing you need to understand at this point is that what you're looking at is a list of instructions written in the R language.
You should also notice that some parts of the script look like normal English. These are the lines that start with a # and they are called comments. We can (and should) include these comments in everything we do. These are notes of what we were doing, both for colleagues and for our future selves.
Figure 1.1 An example R script from RStudio.
Lines that do not start with # are R code. This is where the number crunching really happens. We will cover the details of this R code in the next few chapters. The purpose of this chapter is to describe some of the terminology as well as the interface and tools we use.
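To make the difference between comments and code concrete, here is a minimal sketch of a script. It is not the script from Figure 1.1: the blood pressure values and the names sbp and mean_sbp are made up purely for illustration.

```r
# Lines starting with # are comments: R ignores them when running the script.
# Hypothetical example data: systolic blood pressure (mmHg) for five patients.
sbp <- c(118, 125, 132, 141, 127)

# Calculate the mean blood pressure and store it under a name.
mean_sbp <- mean(sbp)

# Print the result to the Console.
print(mean_sbp)
```

Reading only the comments should tell you what the script does, even before you understand the code itself.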
For the impatient:
We interface with R using RStudio
We use the tidyverse packages, which are a substantial extension to base R functionality (we repeat: extension, not replacement)
Even though R is a language, don't think that after reading this book you should be able to open a blank file and just start typing in R code like an evil computer genius from a movie. This is not what real-world programming looks like.
Firstly, you should be copy-pasting and adapting existing R code examples - whether from this book, the internet, or later from your existing work. Re-writing everything from scratch is not efficient. Yes, you will understand and eventually remember a lot of it, but to spend time memorising specific functions that can easily be looked up and copied is simply not necessary.
Secondly, R is an interactive language, meaning that we run R code line by line and get immediate feedback. We do not write a whole script without trying each part out as we go along.
Thirdly, do not worry about making mistakes. Celebrate them! The whole point of R and reproducibility is that manipulations are not applied directly to a dataset, but to a copy of it. Everything is in a script, so you can't do anything wrong. If you make a mistake, like accidentally overwriting your data, you can just reload it, rerun the steps that worked well, and then work out what went wrong in the step that failed. And since all of these steps are written down in a script, R will redo everything with a single push of a button. You do not have to repeat a set of mouse clicks from dropdown menus as in other statistical packages, and this quickly becomes a blessing.
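The idea of recovering from a mistake by reloading and rerunning can be sketched as follows. This is an illustration under stated assumptions: load_data() is a hypothetical helper standing in for reading a real file, and the patients data are invented.

```r
# Hypothetical stand-in for reading the original data from a file.
# The file itself is never modified; we only ever work on a copy in R.
load_data <- function() {
  data.frame(id = 1:4, age = c(34, 51, 47, 62))
}

patients <- load_data()

# A mistake: we accidentally overwrite the age column with missing values.
patients$age <- NA

# Recovery: reload the data and rerun the steps that worked.
patients <- load_data()
patients$age_decade <- floor(patients$age / 10) * 10
```

Because the mistake only affected the copy held in R, reloading restores the original values and nothing is lost.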
RStudio is a free program that makes working with R easier. An example screenshot of RStudio is shown in Figure 1.2. We have already introduced what is in the top-left pane - the Script.
Figure 1.2 We use RStudio to work with R.
Now, look at the little Run and Source buttons at the top-right corner of the script pane. Clicking Run executes a line of R code. Clicking Source executes all lines of R code in the script (it is essentially Run all lines). When you run R code, it gets sent to the Console which is the bottom-left panel. This is where R really lives.
Keyboard Shortcuts!
Run line: Control+Enter
Run all lines (Source): Control+Shift+Enter
(On a Mac, both Control and Command work)
The Console is where R speaks to us. When we're lucky, we get results in there - in this example the results of a t-test (last line of the script). When we're less lucky, this is also where Errors or Warnings appear.
R Errors are a lot less scary than they seem! Yes, if you're using a regular computer program where all you do is click on some buttons, then getting a proper red error that stops everything is quite unusual. But in programming, Errors are just a way for R to communicate with us.
We see Errors in our own work every single day. They are very normal and do not mean that everything is wrong or that you should give up. Try to re-frame the word Error to mean feedback, as in "Hello, this is R. I can't continue, this is the feedback I am giving you." The most common Errors you'll see are along the lines of "Error: something not found". This almost always means there's a typo or you've misspelled something. Furthermore, R is case sensitive so capitalisation matters (variable name lifeExp is not the same as lifeexp).
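A small sketch of the "not found" Error caused by capitalisation is shown below. The life expectancy values are made up, and tryCatch() is used here only so that the example keeps running and we can look at the message; in everyday work you would simply read the Error in the Console and fix the typo.

```r
# Hypothetical life expectancy values stored under the name lifeExp.
lifeExp <- c(70.2, 81.5, 64.8)

# Correct spelling and capitalisation: R finds the object and this works.
mean(lifeExp)

# Wrong capitalisation: there is no object called lifeexp, so R raises an Error.
# tryCatch() captures the Error and returns its message as text instead of stopping.
message_text <- tryCatch(
  mean(lifeexp),
  error = function(e) conditionMessage(e)
)

# The feedback from R names the object it could not find.
print(message_text)
```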
The Console can only print text, so any plots you create in your script appear in the Plots pane (bottom-right).
Similarly, datasets that you've loaded or created appear in the Environment tab. When you click on a dataset, it pops up in a nice viewer that is fast even when there is a lot of data. This means you can have a look and scroll through your rows and columns, the same way you would with a spreadsheet.
To start using R, you should do these two things:
When you first open up RStudio, you'll also want to install some extra packages to extend the base R functionality. You can do this in the Packages tab (next to the Plots tab in the bottom-right in Figure 1.2).
A Package is just a collection of functions (commands) that are not included in the standard R installation, which is called base R.
A lot of the functionality introduced in this book comes from the tidyverse family of R packages (http://tidyverse.org; Wickham et al. 2019). So when you go to Packages, click Install, type in tidyverse, and a whole collection of useful and modern packages will be installed.
Even though you've installed the tidyverse packages, you'll still need to tell R when you're about to use them. We include library(tidyverse) at the top of every script we write:
library(tidyverse)
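As an alternative to clicking through the Packages tab, packages can also be installed from the Console with install.packages(). The sketch below is one possible way to combine installing and loading; the name tidyverse_installed is ours, and the check with requireNamespace() is an optional extra rather than something you need in every script.

```r
# Install once per computer (commented out so rerunning the script does not reinstall).
# install.packages("tidyverse")

# Check whether the package is installed before trying to load it.
tidyverse_installed <- requireNamespace("tidyverse", quietly = TRUE)

if (tidyverse_installed) {
  # Load the tidyverse packages for this R session.
  library(tidyverse)
} else {
  # Give a readable hint instead of an Error.
  message("tidyverse is not installed yet; run install.packages(\"tidyverse\") first.")
}
```

Installing is done once, whereas library(tidyverse) has to be run in every new R session.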