Introduction
Machine learning fits mathematical models to data in order to derive insights or make predictions. These models take features as input. A feature is a numeric representation of an aspect of raw data. Features sit between data and models in the machine learning pipeline. Feature engineering is the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model. It is a crucial step in the machine learning pipeline, because the right features can ease the difficulty of modeling, and therefore enable the pipeline to output results of higher quality. Practitioners agree that the vast majority of time in building a machine learning pipeline is spent on feature engineering and data cleaning. Yet, despite its importance, the topic is rarely discussed on its own. Perhaps this is because the right features can only be defined in the context of both the model and the data; since data and models are so diverse, it's difficult to generalize the practice of feature engineering across projects.
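To make the idea concrete, here is a minimal sketch (our own illustration, not one of the book's examples, and assuming a recent version of scikit-learn) that turns raw text into numeric count features:

```python
# A minimal sketch of feature engineering: turning raw text into numeric
# features. The documents and variable names are illustrative.
from sklearn.feature_extraction.text import CountVectorizer

raw_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Each document becomes a vector of word counts: a numeric representation
# that a model can consume as input.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(raw_docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(features.toarray())                  # one feature vector per document
```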
Nevertheless, feature engineering is not just an ad hoc practice. There are deeper principles at work, and they are best illustrated in situ. Each chapter of this book addresses one data problem: how to represent text data or image data, how to reduce the dimensionality of autogenerated features, when and how to normalize, etc. Think of this as a collection of interconnected short stories, as opposed to a single long novel. Each chapter provides a vignette into the vast array of existing feature engineering techniques. Together, they illustrate the overarching principles.
Mastering a subject is not just about knowing the definitions and being able to derive the formulas. It is not enough to know how the mechanism works and what it can do; one must also understand why it is designed that way, how it relates to other techniques, and what the pros and cons of each approach are. Mastery is about knowing precisely how something is done, having an intuition for the underlying principles, and integrating it into one's existing web of knowledge. One does not become a master of something by simply reading a book, though a good book can open new doors. It has to involve practice: putting the ideas to use, which is an iterative process. With every iteration, we know the ideas better and become increasingly adept and creative at applying them. The goal of this book is to facilitate the application of its ideas.
This book tries to teach the reason first, and the mathematics second. Instead of only discussing how something is done, we try to teach why. Our goal is to provide the intuition behind the ideas, so that the reader may understand how and when to apply them. There are tons of descriptions and pictures for folks who learn in different ways. Mathematical formulas are presented in order to make the intuition precise, and also to bridge this book with other existing offerings.
Code examples in this book are given in Python, using a variety of free and open source packages. The NumPy library provides numeric vector and matrix operations. Pandas provides the DataFrame that is the building block of data science in Python. Scikit-learn is a general-purpose machine learning package with extensive coverage of models and feature transformers. Matplotlib and the styling library Seaborn provide plotting and visualization support. You can find these examples as Jupyter notebooks in our GitHub repo.
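To give a flavor of how these pieces fit together, here is a small sketch of our own (not taken from the repo) that builds a synthetic Pandas DataFrame, standardizes a feature with a scikit-learn transformer, and plots the before-and-after distributions with Seaborn:

```python
# An illustrative sketch of the toolchain working together: NumPy for
# synthetic data, Pandas for the DataFrame, scikit-learn for a feature
# transformer, and Seaborn/Matplotlib for plotting.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Synthetic, skewed "income" data in a DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=1, size=1000)})

# Standardize the feature to zero mean and unit variance.
df["income_scaled"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Plot the raw and scaled distributions side by side.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
sns.histplot(df["income"], ax=axes[0])
sns.histplot(df["income_scaled"], ax=axes[1])
plt.tight_layout()
plt.show()
```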
The first few chapters start out slow in order to provide a bridge for folks who are just getting started with data science and machine learning. The book then brings the ideas together by showing a few different techniques in an end-to-end example: building a recommender for a dataset of academic papers.
In Living Color
The illustrations in this book are best viewed in color. Really, you should print out the color versions of the Swiss roll figures and paste them into your book. Your aesthetic sense will thank us.
Feature engineering is a vast topic, and more methods are being invented every day, particularly in the area of automatic feature learning. In order to limit the book to a manageable size, we've had to make some cuts. This book does not discuss Fourier analysis for audio data, though it is a beautiful subject that is closely related to eigen analysis in linear algebra (which we touch upon later in the book). We also skip a discussion of random features, which are intimately related to Fourier analysis. We provide an introduction to feature learning via deep learning for image data, but do not go into depth on the numerous deep learning models under active development. Also out of scope are advanced research ideas like random projections, complex text featurization models such as word2vec and Brown clustering, and latent-space models like latent Dirichlet allocation and matrix factorization. If those words mean nothing to you, then you are in luck. If the frontiers of feature learning are where your interest lies, then this is probably not the book for you.
The book assumes knowledge of basic machine learning concepts, such as what a model is and what a vector is, though a refresher is provided so we're all on the same page. Experience with linear algebra, probability distributions, and optimization is helpful, but not necessary.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.