Python Companion to Data Science
Collect Organize Explore Predict Value
by Dmitry Zinoviev
Version: P1.0 (August 2016)
Copyright 2016 The Pragmatic Programmers, LLC. This book is licensed to the individual who purchased it. We don't copy-protect it because that would limit your ability to use it for your own purposes. Please don't break this trust: you can use this across all of your devices, but please do not share this copy with other members of your team, with friends, or via file sharing services. Thanks.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and The Pragmatic Programmers, LLC was aware of a trademark claim, the designations have been printed in initial capital letters or in all capitals. The Pragmatic Starter Kit, The Pragmatic Programmer, Pragmatic Programming, Pragmatic Bookshelf and the linking g device are trademarks of The Pragmatic Programmers, LLC.
Every precaution was taken in the preparation of this book. However, the publisher assumes no responsibility for errors or omissions, or for damages that may result from the use of information (including program listings) contained herein.
About the Pragmatic Bookshelf
The Pragmatic Bookshelf is an agile publishing company. We're here because we want to improve the lives of developers. We do this by creating timely, practical titles, written by programmers for programmers.
Our Pragmatic courses, workshops, and other products can help you and your team create better software and have more fun. For more information, as well as the latest Pragmatic titles, please visit us at http://pragprog.com.
Our ebooks do not contain any Digital Restrictions Management, and have always been DRM-free. We pioneered the beta book concept, where you can purchase and read a book while it's still being written, and provide feedback to the author to help make a better book for everyone. Free resources for all purchasers include source code downloads (if applicable), errata and discussion forums, all available on the book's home page at pragprog.com. We're here to make your life easier.
New Book Announcements
Want to keep up on our latest titles and announcements, and occasional special offers? Just create an account on pragprog.com (an email address and a password are all it takes) and select the checkbox to receive newsletters. You can also follow us on Twitter as @pragprog.
About Ebook Formats
If you buy directly from pragprog.com, you get ebooks in all available formats for one price. You can sync your ebooks among all your devices (including iPhone/iPad, Android, laptops, etc.) via Dropbox. You get free updates for the life of the edition. And, of course, you can always come back and re-download your books when needed. Ebooks bought from the Amazon Kindle store are subject to Amazon's policies. Limitations in Amazon's file format may cause ebooks to display differently on different devices. For more information, please see our FAQ at pragprog.com/frequently-asked-questions/ebooks. To learn more about this book and access the free resources, go to https://pragprog.com/book/dzpyds, the book's homepage.
Thanks for your continued support,
Dave Thomas and Andy Hunt
The Pragmatic Programmers
The team that produced this book includes: Katharine Dvorak (editor); Potomac Indexing, LLC (indexer); Nicole Abramowitz (copyeditor); Gilson Graphics (layout); Janet Furlow (producer).
For customer support, please contact .
For international rights, please contact .
To my beautiful and most intelligent wife Anna; to our children: graceful ballerina Eugenia and romantic gamer Roman; and to my first data science class of summer 2015.
Copyright 2016, The Pragmatic Bookshelf.
Early praise for Data Science Essentials in Python
This book does a fantastic job at summarizing the various activities when wrangling data with Python. Each exercise serves an interesting challenge that is fun to pursue. This book should no doubt be on the reading list of every aspiring data scientist.
Peter Hampton
Ulster University
Data Science Essentials in Python gets you up to speed with the most common tasks and tools in the data science field. It's a quick introduction to many different techniques for fetching, cleaning, analyzing, and storing your data. This book helps you stay productive so you can spend less time on technology research and more on your intended research.
Jason Montojo
Coauthor of Practical Programming: An Introduction to Computer Science Using Python 3
For those who are highly curious and passionate about problem solving and making data discoveries, Data Science Essentials in Python provides deep insights and the right set of tools and techniques to start with. Well-drafted examples and exercises make it practical and highly readable.
Lokesh Kumar Makani
CASB expert, Skyhigh Networks
Acknowledgments
I am grateful to Professor Xinxin Jiang (Suffolk University) for his valuable comments on the statistics section of the book, and to Jason Montojo (one of the authors of Practical Programming: An Introduction to Computer Science Using Python 3), Amirali Sanatinia (Northeastern University), Peter Hampton (Ulster University), Anuja Kelkar (Carnegie Mellon University), and Lokesh Kumar Makani (Skyhigh Networks) for their indispensable reviews.
I must instruct you in a little science by-and-by, to distract your thoughts.
Marie Corelli, British novelist
Preface
This book was inspired by an introductory data science course in Python that I taught in summer 2015 to a small group of select undergraduate students of Suffolk University in Boston. The course was expected to be the first in a two-course sequence, with an emphasis on obtaining, cleaning, organizing, and visualizing data, sprinkled with some elements of statistics, machine learning, and network analysis.
I quickly came to realize that the abundance of systems and Python modules involved in these operations (databases, natural language processing frameworks, JSON and HTML parsers, and high-performance numerical data structures, to name a few) could easily overwhelm not only an undergraduate student, but also a seasoned professional. In fact, I have to confess that while working on my own research projects in the fields of data science and network analysis, I had to spend more time calling the help function and browsing scores of online Python discussion boards than I was comfortable with. In addition, I must admit to some embarrassing moments in the classroom when I seemed to have hopelessly forgotten the name of some function or some optional parameter.
As a part of teaching the course, I compiled a set of cheat sheets on various topics that turned out to be a useful reference. The cheat sheets eventually evolved into this book. Hopefully, having it on your desk will make you think more about data science and data analysis than about function names and optional parameters.
About This Book
This book covers data acquisition, cleaning, storing, retrieval, transformation, visualization, elements of advanced data analysis (network analysis), statistics, and machine learning. It is not an introduction to data science or a general data science reference, although you'll find a quick overview of how to do data science in Chapter 1. I assume that you have learned the methods of data science, including statistics, elsewhere. The subject index at the end of the book refers to the Python implementations of the key concepts, but in most cases you will already be familiar with the concepts.