Master Data Science and Data Analysis With Pandas
By Arun
Copyright 2020 Arun Kumar.
All rights reserved.
No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other non-commercial uses permitted by copyright law.
First printing edition 2020.
ACKNOWLEDGEMENT
First and foremost, praise and thanks to God, the Almighty, for His showers of blessings throughout my research work, which allowed me to complete this book successfully.
I would like to express my deep and sincere gratitude to my friend and colleague Rita Vishwakarma for her motivation. She has been a great support throughout the journey and encouraged me whenever needed.
I would like to extend my thanks to my colleague Nidhi Srivastava for her faith in me. She always wanted me to share my knowledge in the form of books and contribute to society.
I am extremely grateful to my parents for their love, prayers, care and sacrifices in educating me and preparing me for my future. I am very thankful to them for their understanding and continuing support while I completed this book. I also express my thanks to my sisters and brother for their support and valuable prayers. My special thanks go to my teachers, who not only educated me but also prepared me for the future. They are the lamps that burn themselves to give light to society.
Finally, my thanks go to all the people who have supported me to complete the research work directly or indirectly.
Arun
Introduction
Today, data is the biggest wealth (after health), as it acts as fuel for many algorithms. Artificial intelligence and data science are the biggest examples of this.
To use this data in an algorithm, we first need to handle and manage it properly. Data analysis solves a large part of this problem. But how do we handle the data? The answer lies in Pandas, one of the most widely used Python libraries for data analysis.
Pandas is mainly used to manage, analyze and manipulate sequential and tabular data in a convenient and efficient manner. It is built on top of the NumPy library and has two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional).
Pandas generally converts your data (from CSV, HTML, Excel, etc.) into a two-dimensional data structure, the DataFrame. A DataFrame is a size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). The data in the DataFrame is then manipulated as needed, analyzed, and stored back in some format (such as CSV or Excel).
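The two structures and the load, manipulate, store cycle described above can be sketched as follows. This is only an illustration: the column names and values are made up, and an in-memory text buffer stands in for a real file on disk.

```python
import io

import pandas as pd

# A Series is a 1-dimensional labeled array (labels "a", "b", "c" are made up).
prices = pd.Series([250, 300, 180], index=["a", "b", "c"], name="price")

# A DataFrame is a 2-dimensional labeled table; columns may hold different types.
df = pd.DataFrame({"item": ["pen", "book", "cup"], "price": [250, 300, 180]})

# Manipulate the data: add a derived column.
df["double_price"] = df["price"] * 2

# Store the result as CSV (a StringIO buffer is used here instead of a file).
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(prices.ndim, df.ndim)
print(restored)
```

With a real file, the buffer would simply be replaced by a path passed to to_csv and read_csv.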
Let's look at the advantages of using Pandas as a data analysis tool in the next chapter.
Advantages
Pandas is a widely used library for data analysis. The main reasons are:
2.1 Speed: Pandas operations are vectorized, so they usually execute faster than equivalent hand-written Python loops.
2.2 Short code: what would take multiple lines of code without Pandas can often be done in one or two lines.
2.3 Saves time: since less code needs to be written, less time is spent on programming, which leaves more time for other work.
2.4 Easy: the DataFrame generated with Pandas is easy to analyze and manipulate.
2.5 Versatile: Pandas is a powerful tool that can be used for various tasks such as analyzing, visualizing, filtering data by condition, segmentation, etc.
2.6 Efficiency: the library is built to handle large datasets efficiently, as long as they fit in memory. Loading and manipulating data is also fast.
2.7 Customizable: Pandas contains a rich feature set that lets you customize, edit and pivot the data according to your requirements.
2.8 Supports multiple formats for I/O: Pandas can read and write data in many formats, such as CSV, TSV, Excel, etc.
2.9 Python support: Python is one of the most used languages for data analysis and artificial intelligence, which adds to the importance and value of Pandas. In turn, Pandas makes Python more useful for data analysis.
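To make the "short code" point concrete, here is a small comparison. It is a sketch on invented sample records (the city and sales fields are hypothetical), computing the average sales per city first by hand and then with Pandas.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample records, invented for this illustration.
records = [
    {"city": "Delhi", "sales": 100},
    {"city": "Mumbai", "sales": 150},
    {"city": "Delhi", "sales": 200},
    {"city": "Mumbai", "sales": 50},
]

# Plain Python: accumulate totals and counts per city by hand.
totals = defaultdict(int)
counts = defaultdict(int)
for row in records:
    totals[row["city"]] += row["sales"]
    counts[row["city"]] += 1
manual_mean = {city: totals[city] / counts[city] for city in totals}

# Pandas: the same grouping and averaging in a single expression.
pandas_mean = pd.DataFrame(records).groupby("city")["sales"].mean().to_dict()

print(manual_mean)
print(pandas_mean)
```

Both approaches give the same averages; the Pandas version just expresses the grouping and the mean in one line.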
In the coming chapter, we'll learn how to install Pandas.
Installation
3.1 Install Pandas
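Since Pandas is a Python library, it is platform independent and can be installed on any machine where Python is available.
Older Pandas releases (up to the 0.24.x series) officially supported Python 2.7 and above; newer releases require Python 3, so check the Pandas documentation for the minimum Python version of the release you install.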
3.1.1 Installing with Anaconda
If you have Anaconda, then Pandas can easily be installed by:
conda install pandas
OR for a specific version
conda install pandas=0.20.3
In case you need to install Pandas in a virtual environment:
Create the virtual environment (here venv is the environment name):
conda create -n venv
Activate the virtual environment:
source activate venv
(with conda 4.4 or later, conda activate venv is the preferred form)
Install Pandas:
conda install pandas
3.1.2 Installing with PyPI
pip install pandas
Note: you can create a virtual environment here as well and then install Pandas.
Create the virtual environment (here venv is the directory name):
python3 -m venv venv
Activate the virtual environment (Linux/macOS; on Windows run venv\Scripts\activate instead):
source venv/bin/activate
Install Pandas:
pip3 install pandas
or
pip3 install pandas==0.20.3 (for a specific version)
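Whichever installation route is used, a quick check from Python can confirm that the library is importable. This is a minimal sketch; the version string printed depends on what was installed.

```python
import pandas as pd

# Print the installed version to confirm that the import works.
print(pd.__version__)
```

If the import fails with ModuleNotFoundError, Pandas was most likely installed into a different environment than the one currently active.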
3.2 Install Jupyter Notebook
Any program which uses Pandas can be run as a traditional Python program, but for better understanding and clarity we prefer Jupyter Notebook for data science problems.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. In very simple words, Jupyter notebooks make it easy to visualize the data.
Installation:
pip install jupyterlab
pip install notebook (needed only if the jupyter notebook command is not available after the step above)
Run Jupyter:
Once the installation has finished, we can run jupyter notebook (or jupyter lab for the JupyterLab interface) in the terminal and a web notebook will open in the browser. A server will start and will keep running while you work on the notebook. If you kill or close the server, the notebook will stop working as well.
Let's learn about DataFrames and the various ways to create them in the next chapter.
Creating DataFrames
The DataFrame is the main structure we will be working with. Most manipulations or operations on the data are applied through a DataFrame. So let's now learn to create a DataFrame in various ways.
4.1 Creating DataFrame using dictionary data
This is a simple process in which we just need to pass the dictionary to the DataFrame constructor.
df = pd.DataFrame(cars)
Here, cars is a Python dictionary (structured much like JSON data), and pd is the usual alias from import pandas as pd.
We have created a DataFrame from the dictionary data we have.
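A minimal runnable sketch of this step follows. The contents of the cars dictionary are not shown in the text, so the keys and values below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical dictionary: each key becomes a column name,
# and each list holds the values of that column.
cars = {
    "brand": ["Honda", "Toyota", "Ford"],
    "price": [22000, 25000, 27000],
}

df = pd.DataFrame(cars)
print(df)
```

Each list in the dictionary must have the same length, because every list fills one column of the resulting table.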
4.2 Creating DataFrame using list data
This is also a simple process: we just pass the list to the DataFrame constructor.
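As a sketch of what passing a list can look like, assuming made-up data and column names (none of these values come from the text):

```python
import pandas as pd

# Hypothetical list of rows; each inner list is one row of the table.
rows = [["Honda", 22000], ["Toyota", 25000], ["Ford", 27000]]

# Column names are optional; without them Pandas numbers the columns 0, 1, ...
df = pd.DataFrame(rows, columns=["brand", "price"])

# A flat list becomes a DataFrame with a single column.
single = pd.DataFrame(["Honda", "Toyota", "Ford"], columns=["brand"])

print(df)
print(single)
```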