1. Introduction
1.1 What Is Data Science?
We live in the age of data. Today, data is all around us and is collected at unprecedented levels. The data can be in the form of network/graph data: the wealth of information in a billion-user social network, the web pages indexed by a search engine, the shopping transactions of an e-commerce business, or a large wireless sensor network. The amount of data that we generate is enormous: in 2012, we created 2.5 quintillion bytes (2.5 million terabytes) of data every day. The growth rate is even more staggering: 90% of the world's data was generated over the last two years [].
Data is of little use by itself until it is converted into knowledge. This knowledge takes the form of insights, which can reveal a great deal about the underlying process. Corporations are becoming increasingly data-driven: they use insights from data to drive their business decisions. A new class of applications is the data product [], which goes a step further by converting data insight into a usable consumer product.
Some of the prominent examples of data products include:
Google Flu Trends: By analyzing search engine query logs, Google is able to track the prevalence of influenza faster than the Centers for Disease Control and Prevention (CDC).
Netflix recommendation engine: By looking at the movie ratings and viewing patterns of pairs of users, the Netflix recommendation engine is able to accurately predict the ratings for movies that a user has not yet seen.
The methodology of extracting insights from data is called data science. Historically, data science has been known by different names: in the early days, it was known simply as statistics, after which it became known as data analytics. There is an important difference between data science and both statistics and data analytics. Data science is a multidisciplinary subject: it is a combination of statistical analysis, programming, and domain expertise []. Each of these aspects is important:
Statistical skills are essential in applying the right kind of statistical methodology along with interpreting the results.
Programming skills are essential to implement the analysis methodology, to combine data from multiple sources, and especially to work with large-scale datasets.
Domain expertise is essential in identifying the problems that need to be solved, forming hypotheses about the solutions, and most importantly understanding how the insights of the analysis should be applied.
Over the last few years, data science has emerged as a discipline in its own right.
However, there is no standardized set of tools for the analysis. Data scientists use a variety of programming languages and tools in their work, sometimes even combining heterogeneous tools to perform a single analysis. This steepens the learning curve for new data scientists. The R programming environment presents a homogeneous set of tools for most data science tasks.
1.2 Why R?
The R programming environment is increasingly becoming a one-stop solution to data science. R was first created in 1993 and has evolved into a stable product. It is becoming the de facto standard for data analysis in academia and industry.
The first advantage of using R is that it is open source software. It offers many of the advantages of commercial statistical platforms such as MATLAB, SAS, and SPSS. Additionally, R works on most platforms: GNU/Linux, OS X, and Windows.
R has its roots in the statistics community, having been created by statisticians for statisticians. This is reflected in the design of the programming language: many of its core language elements are geared toward statistical analysis. The second advantage of using R is that the amount of code we need to write is very small compared to other programming languages. R provides many high-level data types and functions that hide the low-level implementation details from the programmer. Although there are R systems of significant complexity in production use, most data analysis tasks require only a few lines of code.
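To give a flavor of how little code a typical analysis needs, the following is a minimal sketch rather than an example taken from a later chapter. It assumes only the mtcars dataset that ships with base R; the variable names mpg_by_cyl, fit, and slope are our own choices for illustration.

```r
# mtcars is a small dataset bundled with base R (datasets package).
data(mtcars)

# Average fuel efficiency (miles per gallon) for each cylinder count.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Linear regression of fuel efficiency on vehicle weight.
fit <- lm(mpg ~ wt, data = mtcars)

# Estimated change in mpg per additional unit of weight.
slope <- coef(fit)[["wt"]]

print(mpg_by_cyl)
print(slope)
```

A grouped summary and a fitted regression model take one statement each; the high-level functions tapply() and lm() hide the looping and the numerical details.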
R can be used as either an interactive or a noninteractive environment. We can use R as an interactive console, where we try out individual statements and observe the output directly. This is useful when exploring data, where the output of one statement can inform which step to take next. R can also run a script containing a set of statements in a noninteractive environment.
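The two modes can be illustrated with a small sketch. The statements, the temporary file path, and the variable names below are hypothetical and chosen only for illustration; source() is the base R function that runs the statements in a file.

```r
# Interactive use: statements typed one at a time at the console,
# with the value of each expression shown immediately.
x <- c(2, 4, 6, 8)
mean(x)

# Noninteractive use: the same statements saved in a script file.
# A temporary file stands in for a real script here.
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- c(2, 4, 6, 8)",
             "result <- mean(x)"),
           script_path)

# Run the whole script in a separate environment and fetch its result.
script_env <- new.env()
source(script_path, local = script_env)
result <- get("result", envir = script_env)
print(result)
```

From a system shell, a script file can also be run without opening the console by using the Rscript command that is installed with R, for example Rscript analysis.R, where analysis.R is a placeholder file name.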
The final benefit of using R is its set of packages. The single most important reason for the growing popularity of R is its vast package library, the Comprehensive R Archive Network, more commonly known as CRAN. Most statistical analysis methods have an open source implementation in the form of an R package. R is supported by a vibrant community and a growing ecosystem of package developers.
1.3 Goal of This Book
Due to its statistical focus, however, R is one of the more difficult tools to master, especially for programmers without a background in statistics. Compared to other programming languages, there are relatively few resources for learning R. All R packages come with documentation, but it is usually structured as reference material. Most documentation assumes a good understanding of the fundamentals of statistics.
The goal of this book is to introduce readers to some useful data science techniques and their implementation in the R programming language. In terms of content, the book attempts to strike a balance between the how (specific processes and methodologies) and the why (the intuition behind how a particular technique works), so that the reader can apply it to the problem at hand.
The book does not assume familiarity with statistics. We will review the prerequisite concepts from statistics as they are needed. The book assumes that the reader is familiar with programming: proficient in at least one programming language. We provide an overview of the R programming language and the development environment in the Appendix.
This book is not intended to be a replacement for a statistics textbook. We will not go into the deep theoretical details of the methods, including the mathematical formulae. The focus of the book is practical, with the goal of covering how to implement these techniques in R. To gain a deeper understanding of the underlying methodologies, we refer the reader to textbooks on statistics [].
The scope of this book is not encyclopedic: there are hundreds of data science methodologies used in practice. In this book we cover only some of the important ones that will help the reader get started with data science. Each methodology that we cover is also a fairly detailed subject by itself, worthy of a separate volume. We aim to cover the fundamentals and some of the most useful techniques, with the goal of giving the reader a good understanding of each methodology and the steps to implement it in R. The best way to learn data analysis is by trying it out on a dataset and interpreting the results. In each chapter of this book, we apply a set of methodologies to a real-world dataset.