Randy Betancourt
Chadds Ford, PA, USA
Sarah Chen
Livingston, NJ, USA
Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at www.apress.com/9781484250006. For more detailed information, please visit http://www.apress.com/source-code.
ISBN 978-1-4842-5000-6 e-ISBN 978-1-4842-5001-3
https://doi.org/10.1007/978-1-4842-5001-3
© Randy Betancourt, Sarah Chen 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
Introduction
For decades, Base SAS software has been the gold standard for data manipulation and analysis. The software can read any data source and is superb at transforming and shaping data for analysis. It has been the beneficiary of enormous resource investments over its lifetime. The company has one of the industry's most innovative R&D staffs, and its products are backed by outstanding technical support and well documented by very capable technical writers. SAS Institute Inc. has remained focused on gathering customer input and building desired features. All of these characteristics help explain its popularity.
Since the beginning of this millennium, the accelerated growth of open source software has produced outstanding projects offering data scientists enormous capabilities to tackle problems that were previously considered outside the realm of feasibility. Chief among these is Python. Python has its heritage in scientific and technical computing domains and has a very compact syntax. It is a full-featured language that is relatively easy to learn and scales well, offering good performance with large data volumes. This is one of the reasons why firms like Netflix use it so extensively.
By nature, SAS users are intrepid and are constantly trying to find new ways to expand the use of the software in pursuit of meeting business objectives. And given the extensive role of SAS within organizations, it only makes sense to find ways to combine the capabilities of these two languages to complement one another.
We have four main goals for our readers. The first is to provide a quick start to learning Python for users already familiar with the SAS language.
Both languages have advantages and disadvantages when it comes to a particular task. And since they are programming languages, their designers had to make certain trade-offs which can manifest themselves as features or quirks, depending on one's perspective. This is our second goal: to help readers compare and contrast common tasks, taking into account differences in their default behaviors. For example, SAS names are case-insensitive, while Python names are case-sensitive. Or the default sort sequence for the pandas library is the opposite of the SAS default sort sequence, and so on.
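As a small illustration of the case-sensitivity difference mentioned above, the following sketch shows how Python treats names that differ only by case. The variable names total, Total, and TOTAL are hypothetical and chosen only for this example; it uses nothing beyond the core language.

```python
# Python treats names that differ only by case as distinct identifiers.
total = 10
Total = 20

# The two assignments above created two separate variables.
print(total == Total)

# A spelling that was never assigned raises NameError.
try:
    TOTAL
except NameError as err:
    case_error = str(err)
    print(case_error)
```

A SAS user accustomed to referring to the same variable with different capitalization will want to keep this in mind, since in Python a change in case silently creates or refers to a different name.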
Rather than attempting to promote one language over the other, our third goal is to point out the integration points between the two languages. The choice of which tool to utilize for a given task typically comes down to a combination of what you as a user are familiar with and the context of the problem being solved. Knowing both languages enlarges the set of tools you can apply for the task at hand.
And finally, our fourth goal is to provide working examples for all of the topics in both Python and SAS, giving you the opportunity not just to execute the examples but to extend them to suit your own needs.
We assume you already have some basic knowledge of Python; for example, you already know how to import modules and execute Python scripts. If you don't, then you will want to spend more time with Chapter , Introduction, covering topics such as Python installation, executing Python in a Windows environment, and executing Python in a Linux environment.
In Chapter , Python Types and Formatting, we cover topics related to the Python Standard Library such as data types, Booleans with a focus on truth testing, numerical and string manipulations, and basic formatting. If you are new to Python, then it is worthwhile to spend time on this chapter practicing execution of the Python and SAS examples.
If you have a solid grasp of the Python Standard Library, you can skip to Chapter , pandas Library. Beyond introducing you to DataFrames, we deal with the missing data problem endemic to any analysis task. An understanding of the pandas library underpins the remainder of the book.
Chapter , Indexing and Grouping, extends your knowledge of the pandas library by focusing on DataFrame indexing and GroupBy operations. A detailed understanding of these operations is essential for shaping data. We end this chapter by introducing techniques you can use for report production.
Data manipulation tasks such as merging, concatenating, subsetting, updating, appending, sorting, finding duplicates, drawing samples, and transposing are covered in Chapter . We have developed scores of examples in both Python and SAS to address and illustrate the range of problems you commonly face in preparing data for analysis.
In Chapter , pandas Readers, we cover many of the popular readers and writers used to read and write data from a range of different sources including Excel, .csv files, relational databases, JSON, web APIs, and more. And while we offer detailed explanations, it is the numerous working examples you can use in your own work that make this chapter so valuable.
Working with date, datetime, time, and time zone values is the focus in Chapter . In this increasingly instrumented world we live in, we are faced with processing time-based data from literally trillions of sensors. Forming and appropriately handling time series data is no longer just the domain of time-based forecasting. Once again, we rely on the breadth of the provided examples to help you improve your skills.