Python Data Science Handbook
by Jake VanderPlas
Copyright 2017 Jake VanderPlas. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editor: Dawn Schanafelt
- Production Editor: Kristen Brown
- Copyeditor: Jasmine Kwityn
- Proofreader: Rachel Monaghan
- Indexer: WordCo Indexing Services, Inc.
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- December 2016: First Edition
Revision History for the First Edition
- 2016-11-17: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491912058 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Python Data Science Handbook, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-91205-8
[LSI]
Preface
What Is Data Science?
This is a book about doing data science with Python, which immediately begs the question: what is data science? It's a surprisingly hard definition to nail down, especially given how ubiquitous the term has become. Vocal critics have variously dismissed the term as a superfluous label (after all, what science doesn't involve data?) or a simple buzzword that only exists to salt résumés and catch the eye of overzealous tech recruiters.
In my mind, these critiques miss something important. Data science, despite its hype-laden veneer, is perhaps the best label we have for the cross-disciplinary set of skills that are becoming increasingly important in many applications across industry and academia. This cross-disciplinary piece is key: in my mind, the best existing definition of data science is illustrated by Drew Conway's Data Science Venn Diagram (see Figure P-1).
Figure P-1. Drew Conway's Data Science Venn Diagram
While some of the intersection labels are a bit tongue-in-cheek, this diagram captures the essence of what I think people mean when they say "data science": it is fundamentally an interdisciplinary subject. Data science comprises three distinct and overlapping areas: the skills of a statistician who knows how to model and summarize datasets (which are growing ever larger); the skills of a computer scientist who can design and use algorithms to efficiently store, process, and visualize this data; and the domain expertise (what we might think of as "classical" training in a subject) necessary both to formulate the right questions and to put their answers in context.
With this in mind, I would encourage you to think of data science not as a new domain of knowledge to learn, but as a new set of skills that you can apply within your current area of expertise. Whether you are reporting election results, forecasting stock returns, optimizing online ad clicks, identifying microorganisms in microscope photos, seeking new classes of astronomical objects, or working with data in any other field, the goal of this book is to give you the ability to ask and answer new questions about your chosen subject area.
Who Is This Book For?
In my teaching both at the University of Washington and at various tech-focused conferences and meetups, one of the most common questions I have heard is this: "how should I learn Python?" The people asking are generally technically minded students, developers, or researchers, often with an already strong background in writing code and using computational and numerical tools. Most of these folks don't want to learn Python per se, but want to learn the language with the aim of using it as a tool for data-intensive and computational science. While a large patchwork of videos, blog posts, and tutorials for this audience is available online, I've long been frustrated by the lack of a single good answer to this question; that is what inspired this book.
The book is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks. Instead, it is meant to help Python users learn to use Python's data science stack (libraries such as IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related tools) to effectively store, manipulate, and gain insight from data.
Why Python?
Python has emerged over the last couple decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets. This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind. The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages: NumPy for manipulation of homogeneous array-based data, Pandas for manipulation of heterogeneous and labeled data, SciPy for common scientific computing tasks, Matplotlib for publication-quality visualizations, IPython for interactive execution and sharing of code, Scikit-Learn for machine learning, and many more tools that will be mentioned in the following pages.
If you are looking for a guide to the Python language itself, I would suggest the sister project to this book, A Whirlwind Tour of the Python Language. This short report provides a tour of the essential features of the Python language, aimed at data scientists who already are familiar with one or more other programming languages.
Python 2 Versus Python 3
This book uses the syntax of Python 3, which contains language enhancements that are not compatible with the 2.x series of Python. Though Python 3.0 was first released in 2008, adoption has been relatively slow, particularly in the scientific and web development communities. This is primarily because it took some time for many of the essential third-party packages and toolkits to be made compatible with the new language internals. Since early 2014, however, stable releases of the most important tools in the data science ecosystem have been fully compatible with both Python 2 and 3, and so this book will use the newer Python 3 syntax. However, the vast majority of code snippets in this book will also work without modification in Python 2: in cases where a Py2-incompatible syntax is used, I will make every effort to note it explicitly.
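To make the idea of Py2-incompatible syntax concrete, here is a brief sketch of a few well-known differences. The three cases are chosen here purely for illustration and are not an exhaustive list; the variable names are arbitrary.

```python
# A few Python 3 behaviors that differ from Python 2 (illustrative only).

# 1. print is a function. In Python 2, print was a statement
#    (print "hello"), and the sep keyword was unavailable by default.
print("hello", "world", sep=", ")

# 2. The / operator performs true division on integers.
#    In Python 2, 3 / 2 evaluated to 1 unless
#    "from __future__ import division" was used.
true_quotient = 3 / 2     # 1.5 in Python 3
floor_quotient = 3 // 2   # 1 in both Python 2 and Python 3

# 3. Extended iterable unpacking is a syntax error in Python 2.
first, *rest = [1, 2, 3, 4]

print(true_quotient, floor_quotient, first, rest)
```

Run under Python 3, the first and third cases would fail to parse in Python 2, while the second would run but silently produce a different number, which is the kind of difference worth flagging explicitly.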