A collection of Data Science Interview Questions Solved in Python and Spark
Hands-on Big Data
and Machine Learning
(volume I)
Antonio Gulli
Copyright 2015 Antonio Gulli
All rights reserved.
ISBN: 1517216710
ISBN-13: 978-1517216719
Data Science is the sixth in a series of 25 books devoted to algorithms, problem solving, machine learning, big data and C++/Python programming.
DEDICATION
To Lorenzo, Leonardo, Aurora and Francesca
I heard there was a secret chord
That David played and it pleased the Lord
But you don't really care for music, do ya?
[Leonard Cohen, 1984]
Your mouth opens to a smile, and your hand to help others
ACKNOWLEDGMENTS
Thanks to Eric, Francesco, Michele, Dario, Domenico, Carla, Antonio, Ettore, Federica, Laura, Antonella, Susana, and Antonello for their friendship.
What are the most important machine learning techniques?
Solution
In his famous essay Computing Machinery and Intelligence, Alan Turing asked a fundamental question: "Can machines do what we (as thinking entities) can do?" Machine learning is not about thinking but about a related activity: learning, or better, in Arthur Samuel's words, the "field of study that gives computers the ability to learn without being explicitly programmed".
Machine learning techniques are typically classified into two categories:
In supervised learning, pairs of examples made up of (input, desired output) are available, and the computer learns a model that, given a new input, predicts the desired output with minimal error. Classification, Neural Networks and Regression are all examples of supervised learning. For all these techniques we assume there is an oracle, or teacher, that can teach computers what to do, so that they can apply the learned lessons to new, unseen data.
In unsupervised learning, computers have no teacher and are left alone to search for structures, patterns and anomalies in the data. Clustering and Density Estimation are typical examples of unsupervised machine learning.
Let us now review the main machine learning techniques:
In Classification the teacher presents pairs of (inputs, target classes) and the computer learns to attribute classes to new, unseen data. Naïve Bayes, SVM, Decision Trees and Neural Networks are all classification methodologies. The first two are discussed in this volume, while the remaining ones will be part of the next volume.
In Regression the teacher presents pairs of (inputs, continuous targets) and the computer learns how to predict continuous values for new, unseen data. Linear and Logistic Regression are examples discussed in the present volume. Decision Trees, SVM and Neural Networks can also be used for Regression.
In Association rule learning, computers are presented with a large set of observations, each made up of multiple variables. The task is then to learn relations between variables, such as A ∧ B → C (if A and B happen, then C will also happen).
In Clustering computers learn how to partition observations into subsets, so that each partition is made up of similar observations according to some well-defined metric. Algorithms like K-Means and DBSCAN also belong to this class.
In Density Estimation computers learn how to find statistical values that describe data. Algorithms like Expectation Maximization also belong to this class.
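To make the supervised/unsupervised distinction above concrete, the following minimal sketch fits an SVM classifier and a K-Means clusterer on the same synthetic data with scikit-learn; the dataset and the model choices are illustrative, not prescribed by the text.
Code
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
from sklearn.cluster import KMeans

# Synthetic 2-d points drawn around three centers (illustrative data).
X, y = make_blobs(n_samples=150, centers=3, random_state=42)

# Supervised: the SVM sees (input, target class) pairs during training.
clf = SVC().fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: K-Means partitions the same points without any labels.
km = KMeans(n_clusters=3, random_state=42).fit(X)
print(km.labels_[:5])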
Why is it important to have a robust set of metrics for machine learning?
Solution
Any machine learning technique should be evaluated by using metrics for analytically assessing the quality of results. For instance: if we need to categorize objects such as people, movies or songs into different classes, precision and recall might be suitable metrics.
Precision is the ratio P = Tp / (Tp + Fp), where Tp is the number of true positives and Fp is the number of false positives. Recall is the ratio R = Tp / (Tp + Fn), where Fn is the number of false negatives. True and false positives/negatives are determined against manually labeled data. Precision and Recall are typically reported in a 2-d graph known as a P/R Curve, where different algorithms can be compared by reporting the Precision achieved at fixed values of Recall.
In addition, F1 is another frequently used metric, which combines Precision and Recall into a single value: F1 = 2 * (Precision * Recall) / (Precision + Recall).
Scikit-learn provides a comprehensive set of metrics for classification, clustering, regression, ranking and pairwise judgment. As an example, the code below computes Precision and Recall on a small array of illustrative labels and scores.
Code
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative ground-truth binary labels and classifier scores.
y_true = np.array([0, 1, 1, 0, 1])
y_scores = np.array([0.5, 0.6, 0.38, 0.9, 0.7])

# Precision and recall computed at every distinct score threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print(precision)
print(recall)
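Since F1 combines the two metrics above, it can also be computed directly once hard class predictions are available; the sketch below reuses the illustrative arrays and thresholds the scores at 0.5, an arbitrary cut-off chosen for illustration.
Code
import numpy as np
from sklearn.metrics import f1_score

# The same illustrative labels and scores as in the previous listing.
y_true = np.array([0, 1, 1, 0, 1])
y_scores = np.array([0.5, 0.6, 0.38, 0.9, 0.7])

# f1_score expects hard 0/1 predictions, so threshold the scores first.
y_pred = (y_scores >= 0.5).astype(int)
print(f1_score(y_true, y_pred))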
Why are Features extraction and engineering so important in machine learning?
Solution
Features are the variables selected for making predictions. For instance, suppose you'd like to forecast whether tomorrow will be a sunny day. You will probably pick features like humidity (a numerical value), wind speed (another numerical value), some historical information (what happened during the last few years), whether or not it is sunny today (a categorical yes/no value) and a few others. Your choice can dramatically impact your model: for the same algorithm, you need to run multiple experiments to find the right amount of data and the right features to forecast with minimal error. It is not unusual to deal with problems represented by thousands of features, and combinations of them, and a good feature engineer will use tools for stack ranking features according to their contribution to reducing prediction error.
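As one sketch of such stack ranking, a tree ensemble's feature importances can order features by how much they contribute to the prediction. The weather-like data below is synthetic, and the use of a Random Forest is an illustrative choice, not one made by the text.
Code
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic stand-ins for the weather features discussed above.
humidity = rng.uniform(0, 100, 500)
wind_speed = rng.uniform(0, 40, 500)
sunny_today = rng.randint(0, 2, 500)
X = np.column_stack([humidity, wind_speed, sunny_today])

# Synthetic target: "sunny tomorrow", driven by humidity and today's weather.
y = ((humidity < 60) & (sunny_today == 1)).astype(int)

# Rank the features by their contribution to the model's predictions.
model = RandomForestClassifier(random_state=0).fit(X, y)
ranking = sorted(zip(["humidity", "wind_speed", "sunny_today"],
                     model.feature_importances_),
                 key=lambda pair: -pair[1])
for name, score in ranking:
    print(name, round(score, 3))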
Different authors use different names for features, including attributes, variables and predictors. In this book we consistently use features.
Features can be categorical, such as marital status, gender, state of residence or place of birth, or numerical, such as age, income, height and weight. This distinction is important because certain algorithms, such as linear regression, work only with numerical attributes; if categorical features are present, they need to be somehow encoded into numerical values.
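As an example of such encoding, a categorical column can be expanded into 0/1 indicator columns (one-hot encoding); the sketch below uses pandas and purely illustrative values.
Code
import pandas as pd

# A toy frame with one categorical and one numerical feature.
df = pd.DataFrame({
    "marital_status": ["single", "married", "single", "divorced"],
    "age": [25, 40, 31, 52],
})

# Expand the categorical column into indicator columns so that
# algorithms like linear regression can consume numerical inputs only.
encoded = pd.get_dummies(df, columns=["marital_status"])
print(encoded)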
In other words, feature engineering is the art of extracting, selecting and transforming the essential characteristics that represent data. It is sometimes considered less glamorous than machine learning algorithms, but in reality any experienced Data Scientist knows that a simple algorithm on a well-chosen set of features performs better than a sophisticated algorithm on a poorly chosen one.