Introduction to Machine Learning with Python
by Andreas C. Müller and Sarah Guido
Copyright 2017 Sarah Guido, Andreas Müller. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editor: Dawn Schanafelt
- Production Editor: Kristen Brown
- Copyeditor: Rachel Head
- Proofreader: Jasmine Kwityn
- Indexer: Judy McConville
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- October 2016: First Edition
Revision History for the First Edition
- 2016-09-22: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781449369415 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Introduction to Machine Learning with Python, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-449-36941-5
[LSI]
Preface
Machine learning is an integral part of many commercial applications and research projects today, in areas ranging from medical diagnosis and treatment to finding your friends on social networks. Many people think that machine learning can only be applied by large companies with extensive research teams. In this book, we want to show you how easy it can be to build machine learning solutions yourself, and how to best go about it. With the knowledge in this book, you can build your own system for finding out how people feel on Twitter, or making predictions about global warming. The applications of machine learning are endless and, with the amount of data available today, mostly limited by your imagination.
Who Should Read This Book
This book is for current and aspiring machine learning practitioners looking to implement solutions to real-world machine learning problems. This is an introductory book requiring no previous knowledge of machine learning or artificial intelligence (AI). We focus on using Python and the scikit-learn library, and work through all the steps to create a successful machine learning application. The methods we introduce will be helpful for scientists and researchers, as well as data scientists working on commercial applications. You will get the most out of the book if you are somewhat familiar with Python and the NumPy and matplotlib libraries.
We made a conscious effort not to focus too much on the math, but rather on the practical aspects of using machine learning algorithms. Although mathematics (probability theory, in particular) is the foundation upon which machine learning is built, we won't go into the analysis of the algorithms in great detail. If you are interested in the mathematics of machine learning algorithms, we recommend the book The Elements of Statistical Learning (Springer) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, which is available for free at the authors' website. We will also not describe how to write machine learning algorithms from scratch, and will instead focus on how to use the large array of models already implemented in scikit-learn and other libraries.
Why We Wrote This Book
There are many books on machine learning and AI. However, all of them are meant for graduate students or PhD students in computer science, and they're full of advanced mathematics. This is in stark contrast with how machine learning is being used, as a commodity tool in research and commercial applications. Today, applying machine learning does not require a PhD. However, there are few resources out there that fully cover all the important aspects of implementing machine learning in practice, without requiring you to take advanced math courses. We hope this book will help people who want to apply machine learning without reading up on years' worth of calculus, linear algebra, and probability theory.
Navigating This Book
This book is organized roughly as follows:
introduces the fundamental concepts of machine learning and its applications, and describes the setup we will be using throughout the book.
Chapters describe the actual machine learning algorithms that are most widely used in practice, and discuss their advantages and shortcomings.
discusses the importance of how we represent data that is processed by machine learning, and what aspects of the data to pay attention to.
covers advanced methods for model evaluation and parameter tuning, with a particular focus on cross-validation and grid search.
explains the concept of pipelines for chaining models and encapsulating your workflow.
shows how to apply the methods described in earlier chapters to text data, and introduces some text-specific processing techniques.
offers a high-level overview, and includes references to more advanced topics.
While Chapters to evaluate and tune your model.
Online Resources
While studying this book, definitely refer to the scikit-learn website for more in-depth documentation of the classes and functions, and many examples. There is also a video course created by Andreas Müller, Advanced Machine Learning with scikit-learn, that supplements this book. You can find it at http://bit.ly/advanced_machine_learning_scikit-learn.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords. Also used for commands and module and package names.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
Tip
This element signifies a tip or suggestion.
Note
This element signifies a general note.
Warning
This icon indicates a warning or caution.
Using Code Examples
Supplemental material (code examples, IPython notebooks, etc.) is available for download at https://github.com/amueller/introduction_to_ml_with_python.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.