Thanks for purchasing this book. If you are interested in hearing more from me about things that I'm working on (books, data science courses, podcast, etc.), you can do two things:
Preface
Exploratory data analysis is a bit difficult to describe in concrete, definitive terms, but I think most data analysts and statisticians know it when they see it. I like to think of it in terms of an analogy.
Filmmakers will shoot a lot of footage when making a movie or other film production, not all of which will be used. The footage will also typically not be shot in the order in which the storyline takes place, because of actors' schedules or other complicating factors. In some cases, it may be difficult to figure out exactly how the story should be told while shooting the footage. Rather, it's sometimes easier to see how the story flows when putting the various clips together in the editing room.
In the editing room, the director and the editor can play around a bit with different versions of different scenes to see which dialogue sounds better, which jokes are funnier, or which scenes are more dramatic. Scenes that just don't work might get dropped, and scenes that are particularly powerful might get extended or re-shot. This rough cut of the film is put together quickly so that important decisions can be made about what to pursue further and where to back off. Finer details like color correction or motion graphics might not be implemented at this point. Ultimately, this rough cut will help the director and editor create the final cut, which is what the audience will view.
Exploratory data analysis is what occurs in the editing room of a research project or any data-based investigation. EDA is the process of making the rough cut for a data analysis, the purpose of which is very similar to that in the film editing room. The goals are many, but they include identifying relationships between variables that are particularly interesting or unexpected, checking to see if there is any evidence for or against a stated hypothesis, checking for problems with the collected data (such as missing data or measurement error), or identifying certain areas where more data need to be collected. At this point, finer details of presentation of the data and evidence, important for the final product, are not necessarily the focus.
Ultimately, EDA is important because it allows the investigator to make critical decisions about what is interesting to follow up on and what probably isn't worth pursuing because the data just don't provide the evidence (and might never provide the evidence, even with follow-up). These kinds of decisions are important to make if a project is to move forward and remain within its budget.
This book covers some of the basics of visualizing data in R and summarizing high-dimensional data with statistical multivariate analysis techniques. There is less emphasis on formal statistical inference methods, as inference is typically not the focus of EDA. Rather, the goal is to show the data, summarize the evidence, and identify interesting patterns while eliminating ideas that likely won't pan out.
Throughout the book, we will focus on the R statistical programming language. We will cover the various plotting systems in R and how to use them effectively. We will also discuss how to implement dimension reduction techniques like clustering and the singular value decomposition. All of these techniques will help you visualize your data and make key decisions in any data analysis.
Getting Started with R
3.1 Installation
The first thing you need to do to get started with R is to install it on your computer. R works on pretty much every platform available, including the widely available Windows, Mac OS X, and Linux systems. If you want to watch a step-by-step tutorial on how to install R for Mac or Windows, you can watch these videos:
- Installing R on Windows
- Installing R on the Mac
There is also an integrated development environment available for R that is built by RStudio. I really like this IDE; it has a nice editor with syntax highlighting, there is an R object viewer, and there are a number of other nice features that are integrated. You can see how to install RStudio here.
The RStudio IDE is available from RStudio's website.
3.2 Getting started with the R interface
After you install R, you will need to launch it and start writing R code. Before we get to exactly how to write R code, it's useful to get a sense of how the system is organized. In these two videos I talk about where to write code and how to set your working directory, which lets R know where to find all of your files.
- Writing code and setting your working directory on the Mac
- Writing code and setting your working directory on Windows
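If you would rather check or set the working directory from the R console than through the menus, here is a minimal sketch of one way to do it. It uses only base R functions (getwd(), setwd(), list.files()); the use of tempdir() as the target folder is just a stand-in so the example runs anywhere, and you would substitute the path to your own project folder.

```r
# Ask R which directory it currently uses to resolve relative file paths
original_dir <- getwd()
print(original_dir)

# Change the working directory. tempdir() is only a placeholder here;
# in practice you would pass the path to your own project folder.
setwd(tempdir())

# List the files R can now find without a full path
list.files()

# Go back to the directory we started in
setwd(original_dir)
```

Once the working directory is set, file names passed to functions such as read.csv() are interpreted relative to that directory.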
Managing Data Frames with the dplyr package
Watch a video of this chapter
4.1 Data Frames
The data frame is a key data structure in statistics and in R. The basic structure of a data frame is that there is one observation per row and each column represents a variable, a measure, feature, or characteristic of that observation. R has an internal implementation of data frames that is likely the one you will use most often. However, there are packages on CRAN that implement data frames via things like relational databases that allow you to operate on very large data frames (but we won't discuss them here).
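To make the row-and-column structure concrete, here is a small sketch using R's built-in data.frame() function. The data set, the object name air, and its column names are all made up for illustration; they are not taken from any data used later in the book.

```r
# A small, made-up data frame: each row is one observation (a city on a day)
# and each column is a variable recorded for that observation
air <- data.frame(
  city   = c("Chicago", "Chicago", "Boston"),
  date   = as.Date(c("2005-01-01", "2005-01-02", "2005-01-01")),
  temp_f = c(31.5, 33.0, 28.2),
  stringsAsFactors = FALSE
)

nrow(air)    # number of observations (rows)
ncol(air)    # number of variables (columns)
names(air)   # the variable names
str(air)     # compact summary of each column's type and first values
```

Note that the columns can hold different types (character, date, numeric), while every value within a single column shares the same type.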