1. Programming Basics
As with most languages, more advanced usage requires delving into the underlying structure. This chapter covers such programming basics, and this first section of the book (through Chapter ) develops some advanced programming techniques. We start with R's basic building blocks, which create our foundation for programming, data management, and cloud analytics.
Before we dig too deeply into R, some general principles to follow may well be in order. First, experimentation is good. It is much more powerful to learn hands-on than it is simply to read. Download the source files that come with this text, and try new things!
Second, it can help quite a bit to become familiar with the ? function. Simply type ? immediately followed by the name of a function or topic in your R console to call up its help page. We cover more on functions later, but this is too useful to ignore until that time.
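As a brief sketch of the ? function in use, the following lines call up help in a few common ways. The topics shown (mean, the + operator, and the search phrase) are arbitrary choices for illustration.

```r
?mean                               # open the help page for the mean() function
help("mean")                        # the same request, written as an ordinary function call
?"+"                                # operators and reserved words need quotation marks
help.search("standard deviation")   # search all installed help pages for a phrase
```

The last form is handy when you know what you want to do but not the name of the function that does it.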
Finally, just before we dive into the real reason you bought this book, a word of caution: this is an applied text. There may be topics and areas of R we skip or ignore. While we, the authors, like to imagine this is due to careful pruning of ideas, it may well be due to ignorance. There are likely other ways to perform these tasks or additional good topics to learn. Our goal is to get you up and running as quickly as possible toward some useful skills. Good luck!
Advanced R Software Choices
This book is written for advanced users of the R language. We should note that for most of our examples, we continue using RStudio (www.rstudio.com/products/rstudio/download/) as in Beginning R: An Introduction to Statistical Programming (Apress, 2015). We also assume you are using a Microsoft Windows (www.microsoft.com) operating system, except for the later chapters, where we delve into using R in the cloud via Ubuntu (www.ubuntu.com). What is different is the underlying R distribution.
We are going to use Microsoft R Open (MRO), which is fully aligned with the current version(s) of R. This provides performance enhancements that happen behind the scenes. We also use Intel Math Kernel Library (Intel MKL), which is available for download at the same site as MRO (https://mran.microsoft.com/download/). In fact, as this book goes to print, these two software programs were combined in their latest release. It would be wonderful if that trend continues. These downloads are very straightforward, and we anticipate that our readers, already familiar with using R and RStudio, will find this a seamless installation. On Windows (and Linux-based operating systems), the MKL replaces the default linear algebra system with an optimized system and allows implicit parallel processing for linear algebra operations, such as the matrix multiplication and decomposition that are used in many statistical algorithms.
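To make the kind of operation concrete, here is a minimal sketch that times a matrix multiplication and a decomposition in base R. The matrix dimensions and object names are arbitrary assumptions for illustration, and the timings depend entirely on your machine and on which linear algebra library your R distribution uses. The code runs unchanged with or without the MKL.

```r
set.seed(1234)                      # make the simulated data reproducible
n <- 1000                           # number of rows (arbitrary)
p <- 200                            # number of columns (arbitrary)
x <- matrix(rnorm(n * p), nrow = n)

## matrix multiplication: crossprod(x) computes t(x) %*% x
time_mult <- system.time(xtx <- crossprod(x))

## matrix decomposition: Cholesky factorization of the p-by-p result
time_chol <- system.time(r <- chol(xtx))

time_mult["elapsed"]                # seconds spent on the multiplication
time_chol["elapsed"]                # seconds spent on the decomposition
```

Running the same script under a standard R installation and under an MKL-backed one is a simple way to see whether the optimized library makes a difference on your hardware.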
If it is not already installed, you also need Java. We used Java Version 8 Update 91 for 64 bit in this book. Java may be downloaded at www.oracle.com/technetwork/java/javase/; specifically, get the Java Development Kit (JDK).
While these choices may have minor consequences, our goal is to provide universal guidance that remains true enough regardless of environmental specifics. Nevertheless, some packages and prebuilt functions on occasion have quirks. We turn our attention to ensuring that you can readily reproduce our results.
Reproducing Results
One useful feature of R is the abundance of packages written by experts worldwide. This is also potentially the Achilles' heel of using R: from the version of R itself to the version of particular packages, lots of code specifics are in flux. Your code has the potential to not work from day to day, let alone our code written months before this book was published. To solve this, we use the Revolution Analytics checkpoint package (Microsoft Corporation, 2016), which uses server-stored snapshots from the Comprehensive R Archive Network (CRAN) to lock our code to a specific version and date. To learn the technical specifics of how this is done, visit the link in the References section at the end of this chapter. We'll get you started with the basics.
For this book, we used R version 3.3.1, Bug in Your Hair, along with Windows 10 Professional x64. As this version moves from the current version to historical, CRAN maintains an archive of past releases. Thus, the checkpoint package has ready access to previous versions of R, and indeed all packages. What you need to do is add the following code to the top of your Chapter R file in your project directory:
## uncomment to install the checkpoint package
## install.packages("checkpoint")
library(checkpoint)
checkpoint("2016-09-04", R.version = "3.3.1")
library(data.table)
We place all library calls at the start of each chapter's project file, after the call to the checkpoint library. By including the date of September 4, 2016, we ensure that the latest version of all packages up to that cutoff is installed and run by checkpoint. The first time it is run, after asking permission, checkpoint creates a folder to host the needed versions of the packages used. Thus, as long as you start each chapter's code file with the correct library calls, you use the same versions of the packages we use.
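One way to confirm which versions a script actually ran under is to record them with base R. This is only a sketch and not part of the checkpoint workflow itself. The object name r_version is a hypothetical choice, and the stats package is used simply because it ships with every R installation.

```r
## build the running R version as a single string, such as "3.3.1"
r_version <- paste(R.version$major, R.version$minor, sep = ".")
r_version

packageVersion("stats")   # version of one installed package
sessionInfo()             # R version, platform, and all attached packages
```

Saving the output of sessionInfo() alongside your results gives you a record to compare against if the code later behaves differently.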
Types of Objects
First of all, we need things to build our language, and in R, these are called objects. We start with five very common types of objects.
Logical objects take on just two values: TRUE or FALSE. Computers are binary machines, and data often may be recorded and modeled in an all-or-nothing world. These logical values can be helpful in arithmetic, where TRUE has a value of 1 and FALSE has a value of 0:
TRUE
[1] TRUE
FALSE
[1] FALSE
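Because R treats logical values as numbers in arithmetic, counting and proportions come almost for free. The following sketch uses a hypothetical vector named flags to show this.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)   # hypothetical example data

as.integer(flags)   # 1 0 1 1: the numeric value behind each logical
sum(flags)          # 3: how many values are TRUE
mean(flags)         # 0.75: the proportion of values that are TRUE
TRUE + TRUE         # 2: logicals are coerced to numbers in arithmetic
```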
As you may remember from the quickly muttered comments of your algebra professor, there are many types, or flavors, of numbers. Whole numbers, which include zero as well as negative values, are called integers. Written in set notation as {..., -2, -1, 0, 1, 2, ...}, these numbers are helpful for headcounts or other indexes (as well as other things, naturally). In R, integers have the capital L suffix. If decimal numbers are needed, then double numeric objects are in order. These are the numbers suited even to ratio-level data. Complex numbers have useful properties as well and are understood precisely as you might expect, with an i suffix on the imaginary portion. R is quite friendly in using all of these numbers, and you simply type in the desired numbers (remember to add the L or i suffix as needed):
42L
[1] 42
1.5
[1] 1.5
2+3i
[1] 2+3i
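To check how R actually stores a number, the typeof() function is useful. The short sketch below also shows why the L suffix matters: a whole number typed without it is stored as a double.

```r
typeof(42L)      # "integer"
typeof(42)       # "double": no L suffix, so not an integer
typeof(1.5)      # "double"
typeof(2+3i)     # "complex"

is.integer(42)   # FALSE, for the same reason
Re(2+3i)         # 2: the real part
Im(2+3i)         # 3: the imaginary part
```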
Nominal-level data may be stored via the character class and is designated with quotation marks.
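A brief sketch of character values follows. The object name grade and the values used are hypothetical, chosen only for illustration.

```r
grade <- "a"         # quotation marks create a character value
class(grade)         # "character"
nchar("hello")       # 5: the number of characters in the string

is.numeric("42")     # FALSE: digits inside quotation marks are still text
as.numeric("42")     # 42: explicit conversion back to a number
```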
Of course, data may have missing values. A missing value takes on the same type as the rest of the data in its set (we discuss data storage shortly). Nevertheless, it can be helpful to know how to hand-code logical, integer, double, complex, or character missing values:
NA
[1] NA
NA_integer_
[1] NA
NA_real_
[1] NA
NA_character_