Mastering Software Development in R
Roger D. Peng, Sean Kross and Brooke Anderson
This book is for sale at http://leanpub.com/msdr
This version was published on 2017-01-05
2016 - 2017 Roger D. Peng, Sean Kross and Brooke Anderson
Introduction
NOTE: This book is under active development.
This book is designed to be used in conjunction with the course sequence Mastering Software Development in R, available on Coursera. The book covers R software development for building data science tools. As the field of data science evolves, it has become clear that software development skills are essential for producing useful data science results and products. You will obtain rigorous training in the R language, including the skills for handling complex data, building R packages and developing custom data visualizations. You will learn modern software development practices to build tools that are highly reusable, modular, and suitable for use in a team-based environment or a community of developers.
Setup
This book makes use of the following R packages, which should be installed to take full advantage of the examples.
choroplethr, choroplethrMaps, data.table, datasets, devtools, dlnm, dplyr, faraway, forcats, GGally, ggmap, ggplot2, ggthemes, ghit, GISTools, grid, gridExtra, httr, knitr, leaflet, lubridate, magrittr, methods, microbenchmark, package, pander, plotly, profvis, pryr, purrr, rappdirs, raster, RColorBrewer, readr, rmarkdown, sp, stats, stringr, testthat, tidyr, tidyverse, tigris, titanic, viridis
You can install all of these packages with the following code:
install.packages(c("choroplethr", "choroplethrMaps", "data.table", "datasets",
                   "devtools", "dlnm", "dplyr", "faraway", "forcats", "GGally",
                   "ggmap", "ggplot2", "ggthemes", "ghit", "GISTools", "grid",
                   "gridExtra", "httr", "knitr", "leaflet", "lubridate",
                   "magrittr", "methods", "microbenchmark", "package", "pander",
                   "plotly", "profvis", "pryr", "purrr", "rappdirs", "raster",
                   "RColorBrewer", "readr", "rmarkdown", "sp", "stats",
                   "stringr", "testthat", "tidyr", "tidyverse", "tigris",
                   "titanic", "viridis"))
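If some of these packages are already in your library, you may prefer to install only the ones that are missing. The following is a minimal sketch of that idea, not part of the book's own setup code: the helper name missing_packages and the short wanted vector are hypothetical and used only for illustration.

```r
# Hypothetical helper: return the subset of `pkgs` that is not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Illustrative subset of the package list; substitute the full vector if desired
wanted <- c("dplyr", "ggplot2", "stats")
to_install <- missing_packages(wanted)
to_install

# Uncomment to install whatever is missing (needs network access and a CRAN mirror)
# if (length(to_install) > 0) install.packages(to_install)
```

Packages that ship with R itself, such as stats, are always present and so never appear in the result.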
The R Programming Environment
This chapter provides a rigorous introduction to the R programming language, with a particular focus on using R for software development in a data science setting. Whether you are part of a data science team or working individually within a community of developers, this chapter will give you the knowledge of R needed to make useful contributions in those settings.
As the first chapter in this book, it provides the essential foundation in R needed for the following chapters. We cover basic R concepts and language fundamentals, key concepts like tidy data and related tidyverse tools, processing and manipulation of complex and large datasets, handling textual data, and basic data science tasks. Upon finishing this chapter, you will have fluency at the R console and will be able to create tidy datasets from a wide range of possible data sources.
The learning objectives for this chapter are to:
- Develop fluency in using R at the console
- Execute basic arithmetic operations
- Subset and index R objects
- Remove missing values from an R object
- Modify object attributes and metadata
- Describe the differences between R classes and data types
- Read tabular data into R and read in web data via web scraping tools and APIs
- Define tidy data and transform non-tidy data into tidy data
- Manipulate and transform a variety of data types, including dates, times, and text data
- Describe how memory is used in R sessions to store R objects
- Read and manipulate large datasets
- Describe how to diagnose programming problems and look up answers on the web or in forums
1.1 Crash Course on R Syntax
Note: Some of the material in this section is taken from R Programming for Data Science.
The learning objectives for this section are to:
- Develop fluency in using R at the console
- Execute basic arithmetic operations
- Subset and index R objects
- Remove missing values from an R object
- Modify object attributes and metadata
- Describe the differences between R classes and data types
At the R prompt we type expressions. The <- symbol (gets arrow) is the assignment operator.
x <- 1
print(x)
[1] 1
x
[1] 1
msg <- "hello"
The grammar of the language determines whether an expression is complete or not.
x <-  ## Incomplete expression
The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored. This is the only comment character in R. Unlike some other languages, R does not support multi-line comments or comment blocks.
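As a minimal sketch of these two points, the snippet below (the variable name total is only illustrative) places a comment after an incomplete expression. R ignores the comment and keeps reading on the next line until the expression is complete.

```r
# A full-line comment: R ignores everything from the # to the end of the line
total <- 1 +   # the expression is incomplete here, so R reads the next line
  2
total          # auto-prints the value 3
```

Because there is no block-comment syntax, commenting out several lines means starting each of them with #.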
Evaluation
When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated expression is returned. The result may be auto-printed.
x <- 5    ## nothing printed
x         ## auto-printing occurs
[1] 5
print(x)  ## explicit printing
[1] 5
The [1] shown in the output indicates that x is a vector and 5 is its first element.
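One way to check this for yourself is sketched below, using only base R functions. A single number behaves as a vector of length one, so indexing its first element returns the same object.

```r
x <- 5
length(x)           # 1: a single number is a vector of length one
is.vector(x)        # TRUE
x[1]                # the first (and only) element
identical(x, x[1])  # TRUE: indexing the first element gives back the same value
```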
Typically with interactive work, we do not explicitly print objects with the print function; it is much easier to just auto-print them by typing the name of the object and hitting return/enter. However, when writing scripts, functions, or longer programs, there is sometimes a need to explicitly print objects because auto-printing does not work in those settings.
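A for loop is one such setting. The sketch below is an illustration rather than an example from the book: it uses capture.output() from the utils package to collect whatever would have been printed, and the names silent and loud are arbitrary.

```r
x <- 5

# Inside a loop, a bare object name is evaluated but nothing is printed
silent <- capture.output(for (i in 1:2) x)

# An explicit print() call produces output on every iteration
loud <- capture.output(for (i in 1:2) print(x))

length(silent)  # 0 lines of output
length(loud)    # 2 lines of output, one per iteration
```

The same applies to a bare object name in the middle of a function body: the value is computed and discarded unless it is printed explicitly or returned.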
When an R vector is printed you will notice that an index for the vector is printed in square brackets [] on the side. For example, see this integer sequence of length 20.
x <- 11:30
x
 [1] 11 12 13 14 15 16 17 18 19 20 21 22
[13] 23 24 25 26 27 28 29 30
The numbers in the square brackets are not part of the vector itself; they are merely part of the printed output.
With R, it's important to understand that there is a difference between the actual R object and the manner in which that R object is printed to the console. Often, the printed output may have additional bells and whistles to make the output more friendly to users. However, these bells and whistles are not inherently part of the object.
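The distinction can be illustrated by printing the same vector at two different line widths. This is a small sketch, assuming the width argument of the default print method; the names narrow and wide are arbitrary. The bracketed indices and the line breaks change with the width, while the object itself does not.

```r
x <- 11:30

# Capture the printed representation at a narrow and a wide line width
narrow <- capture.output(print(x, width = 30))
wide   <- capture.output(print(x, width = 200))

length(narrow)       # several lines, each starting with its own [index]
length(wide)         # a single line
length(x)            # 20 elements in both cases
identical(x, 11:30)  # TRUE: printing did not change the object
```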