Chapter 1 Machine Learning Overview
Machine learning can be thought of as a set of tools and methods that attempt to infer patterns and extract insight from observations made of the physical world. For example, if you wanted to predict the price of a house based on the number of rooms, number of bathrooms, square footage, and lot size, you could use a simple machine learning algorithm (e.g., linear regression) to learn from an existing real estate sales data set where the price of each house is known, and then, based on what you've learned, predict the price of other houses where the price is unknown. In practice, this sort of prediction requires data, and in contemporary applications this often means a high volume of data (frequently in the terabyte range and beyond). The quantity of data is important to the predictive power of machine learning; as the old adage in data science goes, more data always trumps a clever algorithm.
The subject of machine learning is one that has matured considerably over the past several years. Machine learning has grown to be the facilitator of the field of Data Science, which is, in turn, the facilitator of Big Data. Machine learning, however, is not a totally new discipline; its general principles have been around for quite some time, just under different names: data mining, knowledge discovery in databases, and business intelligence. These terms have all been used to describe what is now called machine learning. Prior to that, statistics and data analysis were the terms used to describe the process of gleaning knowledge from data. I believe machine learning is the best term used to describe my field to date, and the hashtag #MachineLearning has certainly heated up the Twitter-verse with an impressive number of references. Machine learning is also considered to be a branch of artificial intelligence concerned with the construction and study of systems that can learn from data. Much of machine learning's current embodiment depends on new hardware capabilities: cloud storage solutions and high-performance distributed computing frameworks such as Apache Hadoop and Spark.
Officially, the first use of the term machine learning was in 1959 by Arthur Samuel, then working at IBM, who described it as "the field of study that gives computers the ability to learn without being explicitly programmed." Fast-forward to 1998, when Tom Mitchell, Chair of the Machine Learning Department at Carnegie Mellon University, described a learning program this way:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Mitchell's widely quoted formal definition is broad enough to include most tasks that we would conventionally call learning tasks. As an example of a machine learning problem under this definition, consider task T: classifying spam e-mails; performance measure P: the percentage of e-mails properly classified as spam; and training experience E: a data set of e-mails with given classifications (i.e., spam or ham). The spam classifier is one of the first modern applications of machine learning to solve a real-life business problem, and it is incorporated into most of today's e-mail applications.
Another very important maxim to remember when starting a new machine learning project is offered by American mathematician John Tukey, who is revered in statistics circles for his many contributions to statistical methods as well as his seminal 1977 book Exploratory Data Analysis:
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
This maxim implies that a machine learning practitioner needs to know when to give up: when the data you have are simply not sufficient to answer the question you're trying to answer. The familiar garbage in, garbage out axiom still applies to machine learning.
Types of Machine Learning
This book will introduce you to the essential tenets of machine learning. As the main enabler of data science and big data, machine learning has garnered much interest from a broad range of industries as a way to increase the value of enterprise data assets. In this book, we'll examine the principles underlying the two primary types of machine learning algorithms, supervised and unsupervised, using the R statistical environment.
Supervised machine learning is typically associated with prediction: for each observation of the predictor measurements (also known as feature variables), there is an associated response variable value. In supervised learning, a model that relates the response to the predictors is trained with the aim of accurately predicting the response for future observations. Many classical learning algorithms, such as linear regression and logistic regression, operate in the supervised domain.
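As a minimal sketch of this supervised workflow in R, echoing the house-price example that opened the chapter, we might use the Boston housing data shipped with the MASS package (an assumption of convenience here; any labeled data set would serve):

```r
# Supervised learning sketch: train a linear regression on labeled
# observations, then predict the response for held-out observations.
library(MASS)  # provides the Boston housing data set

set.seed(123)  # make the random train/test split reproducible
train_idx <- sample(nrow(Boston), floor(0.8 * nrow(Boston)))
train <- Boston[train_idx, ]
test  <- Boston[-train_idx, ]

# Relate the response (medv, median home value) to two predictors:
# rm (average number of rooms) and lstat (percent lower-status population)
fit <- lm(medv ~ rm + lstat, data = train)

# Predict the response for observations the model has never seen
preds <- predict(fit, newdata = test)
head(preds)
```

The same train-then-predict pattern carries over to classification algorithms such as logistic regression; only the form of the response variable changes.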
Unsupervised machine learning is a more open-ended style of statistical learning. Rather than working from labeled data sets, unsupervised learning comprises a set of statistical tools intended for applications in which there is only a set of feature variables measured across a number of observations. In this case, prediction is not the goal, because the data set is unlabeled, i.e., there is no associated response variable that can supervise the analysis. Rather, the goal is to discover interesting things about the measurements on the feature variables. For example, you might find an informative way to visualize the data or discover subgroups among the variables or the observations.
One commonly used unsupervised learning technique is k-means clustering, which discovers clusters of similar data points. Another technique, principal component analysis (PCA), is used for dimensionality reduction, i.e., reducing the number of feature variables while retaining most of the variation in the data, in order to simplify the data fed to other learning algorithms, speed up processing, and reduce the required memory footprint.
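A minimal sketch of both techniques in R, using the built-in iris measurements with the species labels deliberately dropped (an arbitrary illustrative choice), might look like this:

```r
# Unsupervised learning sketch: only feature variables, no response.
features <- scale(iris[, 1:4])  # four numeric measurements, standardized

# k-means clustering: look for k = 3 subgroups among the observations
km <- kmeans(features, centers = 3, nstart = 25)
table(km$cluster)  # how many observations fell into each cluster

# Principal component analysis: project the four features onto a few
# components that retain most of the variation in the data
pca <- prcomp(features)
summary(pca)        # proportion of variance explained per component
head(pca$x[, 1:2])  # the observations in the reduced two-dimensional space
```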
Use Case Examples of Machine Learning
In this section, I present a few examples of real-life business problems with machine learning solutions. For each example, it is useful to see the original requirements of the project, review the data sets and each feature variable, and understand how a solution is judged in terms of a specific metric for success. You might even decide to attempt a solution of your own after you finish reading this book. To that end, I'll highlight a few Kaggle (www.kaggle.com) data challenges that have attracted thousands of data scientists from around the world to compete for monetary awards.
Competitors in these data science challenges were asked to consider the following characteristics when working toward a winning solution:
- What problem does it solve and for whom?
- How is the problem being solved today (if at all)?
- What are the data sets available for the problem and where do they come from?
- How are the results of the problem solution to be exposed (e.g., BI dashboard, algorithm integrated into an online application, a static management report, etc.)?
- What type of problem is this: revenue leakage (saves us money) or revenue growth (makes us money)?
Algorithm evaluation methods varied across the competitions. The most commonly used method was to minimize the root mean square error (RMSE), calculated on predictions made for a supplied test set. The RMSE evaluation method will be explained in a later chapter. Another evaluation method was the area under the ROC curve, also known as AUC.
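Although RMSE is covered in detail later, a minimal sketch in R shows how the metric is computed; the rmse function name here is just an illustrative choice, not a base R function:

```r
# Root mean square error: the square root of the mean squared
# difference between actual and predicted response values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse(c(1, 2, 3), c(1, 2, 3))  # perfect predictions give 0
rmse(c(1, 2, 3), c(2, 2, 2))  # imperfect predictions give ~0.816
```

A lower RMSE means predictions that sit closer, on average, to the true values, which is why minimizing it on the supplied test set served as the ranking criterion in those competitions.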