José Unpingco
Python Programming for Data Analysis
1st ed. 2021
José Unpingco
University of California, San Diego, CA, USA
ISBN 978-3-030-68951-3 e-ISBN 978-3-030-68952-0
https://doi.org/10.1007/978-3-030-68952-0
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Irene, Nicholas, and Daniella, for all their patient support.
Preface
This book grew out of notes for the ECE143 Programming for Data Analysis class that I have been teaching at the University of California, San Diego, a class that is required for both graduate and undergraduate degrees in Machine Learning and Data Science. The reader is assumed to have some basic programming knowledge and experience using another language, such as Matlab or Java. The Python idioms and methods we discuss here focus on data analysis, notwithstanding Python's usage across many other topics. Specifically, because raw data is typically a mess and requires much work to prepare, this text focuses on the Python language features that facilitate such cleanup, rather than only on particular Python modules built for it.
As with ECE143, here we discuss why things are the way they are in Python, not just that they are this way. I have found that providing this kind of context helps students make better engineering design choices in their code, which is especially helpful for newcomers to both Python and data analysis. The text is sprinkled with little tricks of the trade to make it easier to create readable and maintainable code suitable for use in both production and development.
The text focuses on using the Python language itself effectively and then moves on to key third-party modules. This approach improves effectiveness in different environments, which may or may not have access to such third-party modules. The Numpy numerical array module is covered in depth because it is the foundation of all data science and machine learning in Python. We discuss the Numpy array data structure in detail, especially its memory aspects. Next, we move on to Pandas and develop its many features for effective and fluid data processing. Because data visualization is key to data science and machine learning, third-party modules such as Matplotlib are developed in depth, as well as web-based modules such as Bokeh, Holoviews, Plotly, and Altair.
On the other hand, I would not recommend this book to someone with no programming experience at all, but if you can do a little Python already and want to improve by understanding how and why Python works the way it does, then this is a good book for you.
To get the most out of this book, open a Python interpreter and type along with the many code samples. I worked hard to ensure that all of the given code samples work as advertised.
Acknowledgements
I would like to acknowledge the help of Brian Granger and Fernando Perez, two of the originators of the Jupyter Notebook, for all their great work, as well as the Python community as a whole, for all their contributions that made this book possible. Hans Petter Langtangen was the author of the DocOnce [1] document preparation system that was used to write this text. Thanks to Geoffrey Poore [2] for his work with PythonTeX and LaTeX, both key technologies used to produce this book.
References
[1] H.P. Langtangen, DocOnce markup language. https://github.com/hplgit/doconce
[2] G.M. Poore, PythonTeX: reproducible documents with LaTeX, Python, and more. Comput. Sci. Discov. (1), 014010 (2015)
José Unpingco
San Diego, CA, USA
February, 2020
J. Unpingco Python Programming for Data Analysis https://doi.org/10.1007/978-3-030-68952-0_1
1. Basic Programming
José Unpingco (1)
(1) University of California, San Diego, CA, USA
Keywords
Python, List, Dictionary, Functions, Asyncio, Asynchronous, Decorators, Generators
1.1 Basic Language
Before we get into the details, it is a good idea to get a high-level orientation to Python. This will improve your decision-making later regarding software development for your own projects, especially as these get bigger and more complex. Python grew out of a language called ABC, which was developed in the Netherlands in the 1980s as an alternative to BASIC to get scientists to utilize microcomputers, which were new at the time. The important impulse was to enable non-specialist scientists to use these new computers productively. Indeed, this pragmatic approach continues to this day in Python, which is a direct descendant of the ABC language.
There is a saying in Python: "come for the language, stay for the community." Python is an open source project that is community driven, so there is no corporate business entity making top-down decisions for the language. It would seem that such an approach would lead to chaos, but Python has benefited over many years from the patient and pragmatic leadership of Guido van Rossum, the originator of the language. Nowadays, a separate governance committee has taken over this role following Guido's retirement. The open design of the language and the quality of the source code have made it possible for Python to enjoy many contributions from talented developers all over the world over many years, as embodied by the richness of the standard library. Python is also legendary for having a welcoming community for newcomers, so it is easy to find help online for getting started with Python.