Data Science at the Command Line
by Jeroen Janssens
Copyright 2015 Jeroen H.M. Janssens. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editors: Mike Loukides, Ann Spencer, and Marie Beaugureau
- Production Editor: Matthew Hacker
- Copyeditor: Kiel Van Horn
- Proofreader: Jasmine Kwityn
- Indexer: Wendy Catalano
- Interior Designer: David Futato
- Cover Designer: Ellie Volckhausen
- Illustrator: Rebecca Demarest
- October 2014: First Edition
Revision History for the First Edition
- 2014-09-23: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491947852 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Data Science at the Command Line, the cover image of a wreathed hornbill, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-94785-2
[LSI]
To my wife, Esther. Without her encouragement, support,
and patience, this book would surely have ended up in /dev/null.
Preface
Data science is an exciting field to work in. It's also still very young. Unfortunately, many people, and especially companies, believe that you need new technology in order to tackle the problems posed by data science. However, as this book demonstrates, many things can be accomplished by using the command line instead, and sometimes in a much more efficient way.
Around five years ago, during my PhD program, I gradually switched from using Microsoft Windows to GNU/Linux. Because it was a bit scary at first, I started by having both operating systems installed next to each other (known as dual-boot). The urge to switch back and forth between the two faded, and at some point I was even tinkering around with Arch Linux, which allows you to build up your own custom operating system from scratch. All you're given is the command line, and it's up to you what you want to make of it. Out of necessity I quickly became comfortable using the command line. Eventually, as spare time got more precious, I settled down with a GNU/Linux distribution known as Ubuntu because of its ease of use and large community. Nevertheless, the command line is still where I'm getting most of my work done.
It actually hasn't been too long ago that I realized that the command line is not just for installing software, system configuration, and searching files. I started learning about command-line tools such as cut, sort, and sed. These are examples of command-line tools that take data as input, do something to it, and print the result. Ubuntu comes with quite a few of them. Once I understood the potential of combining these small tools, I was hooked.
After my PhD, when I became a data scientist, I wanted to use this approach to do data science as much as possible. Thanks to a couple of new, open source command-line tools including scrape, jq, and json2csv, I was even able to use the command line for tasks such as scraping websites and processing lots of JSON data. In September 2013, I decided to write a blog post titled below), I was able to do just that.
I'm sharing this personal story not so much because I think you should know how this book came about, but more because I want you to know that I had to learn about the command line as well. Because the command line is so different from using a graphical user interface, it can be intimidating at first. But if I can learn it, then you can as well. No matter what your current operating system is and no matter how you currently do data science, by the end of this book you will also be able to leverage the power of the command line. If you're already familiar with the command line, or even if you're already dreaming in shell scripts, chances are that you'll still discover a few interesting tricks or command-line tools to use for your next data science project.
What to Expect from This Book
In this book, we're going to obtain, scrub, explore, and model data, and a lot of it. This book is not so much about how to become better at those data science tasks. There are already great resources available that discuss, for example, when to apply which statistical test or how data can be best visualized. Instead, this practical book aims to make you more efficient and more productive by teaching you how to perform those data science tasks at the command line.
While this book discusses over 80 command-line tools, it's not the tools themselves that matter most. Some command-line tools have been around for a very long time, while others are fairly new and might eventually be replaced by better ones. There are even command-line tools that are being created as you're reading this. In the past 10 months, I have discovered many amazing command-line tools. Unfortunately, some of them were discovered too late to be included in the book. In short, command-line tools come and go, and that's OK.
What matters most are the underlying ideas of working with tools, pipes, and data. Most of the command-line tools do one thing and do it well. This is part of the Unix philosophy, which makes several appearances throughout the book. Once you become familiar with the command line, and learn how to combine command-line tools, you will have developed an invaluable skill, and if you can create new tools, you'll be a cut above.
How to Read This Book
In general, you're advised to read this book in a linear fashion. Once a concept or command-line tool has been introduced, chances are that we employ it in a later chapter. For example, in .
Data science is a broad field that intersects with many other fields, such as programming, data visualization, and machine learning. As a result, this book touches on many interesting topics that unfortunately cannot be discussed at full length. Throughout the book, there are suggestions for additional reading. It's not required to read this material in order to follow along with the book, but if you are interested, you can turn to these suggested readings as jumping-off points.
Who This Book Is For
This book makes just one assumption about you: that you work with data. It doesn't matter which programming language or statistical computing environment you're currently using. The book explains all the necessary concepts from the beginning.
It also doesn't matter whether your operating system is Microsoft Windows, Mac OS X, or some other form of Unix. The book comes with the Data Science Toolbox, which is an easy-to-install virtual environment. It allows you to run the command-line tools and follow along with the code examples in the same environment in which this book was written. You don't have to waste time figuring out how to install all the command-line tools and their dependencies.