Treading on Python Series
Learning Pandas
Python Tools for Data Munging, Data Analysis, and Visualization
Matt Harrison
Technical Editor:
Copyright 2016
While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
Table of Contents
From the Author
Python is easy to learn. You can learn the basics in a day and be productive with it. With only an understanding of Python, moving to pandas can be difficult or confusing. This book is meant to aid you in mastering pandas.
I have taught Python and pandas to many people over the years, in large corporate environments, small startups, and at Python and Data Science conferences. I have seen what hangs people up and confuses them. With the correct background, an attitude of acceptance, and a deep breath, much of this confusion evaporates.
Having said this, pandas is an excellent tool. Many are using it around the world to great success. I hope you do as well.
Cheers!
Matt
Introduction
I have been using Python in some professional capacity since the turn of the century. One of the trends that I have seen in that time is the uptake of Python for various aspects of "data science": gathering data, cleaning data, analysis, machine learning, and visualization. The pandas library has seen much uptake in this area.
pandas is a data analysis library for Python that has exploded in popularity over the past years. The website describes it thusly:
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
-pandas.pydata.org
My description of pandas is this: pandas is an in-memory NoSQL database that has SQL-like constructs, basic statistical and analytic support, and graphing capability. Because it is built on top of Cython, it has less memory overhead and runs quicker than equivalent pure Python code. Many people are using pandas to replace Excel, perform ETL, process tabular data, load CSV or JSON files, and more. Though it grew out of the financial sector (for analysis of time series data), it is now a general purpose data manipulation library.
Because pandas has some lineage back to NumPy, it adopts some NumPy-isms that normal Python programmers may not be aware of or familiar with. Certainly, one could go out and use Cython to perform fast typed data analysis with a Python-like dialect, but with pandas, you don't need to. This work is done for you. If you are using pandas and its vectorized operations, you are getting close to C-level speeds while writing Python.
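To make the idea of a vectorized operation concrete, here is a minimal sketch. The temperature values and variable names are made up for illustration and do not come from the datasets used later in the book. It converts Fahrenheit to Celsius twice: once with an ordinary Python loop, and once with a single expression applied to the whole Series.

```python
import pandas as pd

# Hypothetical sample data: a few daily high temperatures in Fahrenheit.
temps_f = pd.Series([70.0, 75.5, 68.2, 81.3], name="temp_f")

# Plain Python: the interpreter visits every value one at a time.
looped_c = [(f - 32) * 5 / 9 for f in temps_f]

# Vectorized: one expression over the whole Series. The per-element
# work happens in compiled code instead of the Python interpreter.
vectorized_c = (temps_f - 32) * 5 / 9

print(vectorized_c.round(1).tolist())
```

Both approaches give the same numbers. On four values the speed difference is invisible, but on large Series the vectorized form is usually much faster.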
Who this book is for
This guide is intended to introduce pandas to Python programmers. It covers many (but not all) aspects, as well as some gotchas or details that may be counter-intuitive or even non-pythonic to longtime users of Python.
This book assumes basic knowledge of Python. The author has written Treading on Python Vol 1, which provides all the background necessary.
Data in this Book
Some might complain that the datasets in this book are small. That is true, and in some cases (as in plotting a histogram), that is a drawback. On the other hand, every attempt has been made to have real data that illustrates using pandas and the features found in it. As a visual learner, I appreciate seeing where data is coming and going. As such, I try to shy away from just showing tables of random numbers that have no meaning.
Hints, Tables, and Images
The hints, tables, and graphics found in this book have been collected over almost five years of using pandas. They are derived from hangups, notes, and cheatsheets that I have developed after using pandas and teaching others how to use it. Hopefully, they are useful to you as well.
The physical version of this book includes an index that has also been battle-tested during development. Inevitably, when I was doing analysis not related to the book, I would check that the index had the information I needed. If it didn't, I added it. Let me know if you find any omissions!
Finally, having been around the publishing block and releasing content to the world, I realize that I probably have many omissions that others might consider required knowledge. Many will enjoy the content; others might have the opposite reaction. If you have feedback or suggestions for improvement, please reach out to me. I love to hear back from readers! Your comments will improve future versions.
The pandas library refers to itself in lowercase, so this book will follow suit.
Installation
Python 3 has been out for a while now, and people claim it is the future. As an attempt to be modern, this book will use Python 3 throughout! Do not despair, the code will run in Python 2 as well. In fact, review versions of the book neglected to list the Python version, and there was a single complaint about a superfluous list(range(10)) call. The lone line of (Python 2) code required for compatibility is:
>>> from __future__ import print_function
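As a small illustration of why list(range(10)) shows up in code written for both versions, consider the sketch below. The variable names are only for illustration. In Python 3, range returns a lazy object, so wrapping it in list produces an actual list; in Python 2, range already returns a list and the extra call is harmless.

```python
# Python 2 users would put `from __future__ import print_function`
# on the first line of the file; Python 3 needs nothing extra.

lazy = range(10)      # Python 3: a lazy range object; Python 2: already a list
numbers = list(lazy)  # a real list on either version

print(numbers)  # prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```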
Having gotten that out of the way, let's address installation of pandas. The easiest and least painful way to install pandas on most platforms is to use the Anaconda distribution. Anaconda is a meta-distribution of Python that contains many additional packages that have traditionally been annoying to install unless you have toolchains to compile Fortran and C code. Anaconda allows you to skip the compile step and provides binaries for most platforms. The Anaconda distribution itself is freely available, though commercial support is available as well.
After installing the Anaconda package, you should have a conda executable. Running:
$ conda install pandas
will install pandas and any dependencies. To verify that this works, simply try to import the pandas package:
$ python
>>> import pandas
>>> pandas.__version__
'0.18.0'
If the library successfully imports, you should be good to go.
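The same check can be written as a small script, which is handy when you want to confirm an install without opening the interpreter. This is only a sketch: the installed_version helper is a made-up name, not part of pandas, and the version it prints will depend on what you installed.

```python
import importlib


def installed_version(package_name):
    """Return the package's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    # Not every package defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


print("pandas:", installed_version("pandas"))
```

If this prints None for pandas, the import failed and the installation needs another look.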
Other Installation Options
The pandas library will install on Windows, Mac, and Linux via pip.
Mac and Windows users wishing to install binaries may download them from the pandas website. Most Linux distributions also have native packages pre-built and available in their repos. On Ubuntu and Debian, apt-get will install the library:
$ sudo apt-get install python-pandas
You can also install pandas from source. I feel the need to advise you that you might spend a bit of time going down this rabbit hole if you are not familiar with getting compiler toolchains installed on your system.
It may be necessary to prep the environment for building pandas from source by installing dependencies and the proper header files for Python. On Ubuntu this is straightforward; other environments may be different:
$ sudo apt-get install build-essential python-all-dev
Using virtualenv will alleviate the need for superuser access during installation. Because virtualenv uses pip, it can download and install newer releases of pandas if the version packaged by your distribution is lagging.
On Mac and Linux platforms, the following creates a virtualenv sandbox and installs the latest pandas in it (assuming that the prerequisite files are also installed):
$ virtualenv pandas-env
$ source pandas-env/bin/activate
$ pip install pandas
After a while, pandas should be ready for use. Try to import the library and check the version:
$ source pandas-env/bin/activate
$ python
>>> import pandas
>>> pandas.__version__
'0.18.0'
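If you are unsure whether the interpreter you just started belongs to the sandbox, a quick check along the following lines can help. It is a sketch with a made-up helper name, and it relies on two attributes of the sys module: older virtualenv releases set sys.real_prefix, while newer tools leave sys.base_prefix pointing at the system Python.

```python
import sys


def in_virtual_environment():
    """Best-effort guess at whether this interpreter runs inside a sandbox."""
    # Older virtualenv releases record the system Python in sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer tools keep the system location in sys.base_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return base_prefix != sys.prefix


print("Inside a virtual environment:", in_virtual_environment())
print("Interpreter prefix:", sys.prefix)
```

After activating pandas-env, the prefix printed should point inside the pandas-env directory.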
scipy.stats
Some nicer plotting features require scipy.stats. This library is not required, but pandas will complain if the user tries to perform an action that has this dependency. scipy.stats has many non-Python dependencies and in practice turns out to be a little more involved to install. For Ubuntu, the following packages are required before a pip install scipy will work: