ALA Neal-Schuman purchases fund advocacy, awareness, and accreditation programs for library professionals worldwide.
Kyle Banerjee has wrangled data for diverse purposes in academic, government, and nonprofit environments since 1996. A firm believer that understanding people is the key to building services of the future from the systems and data of the past, his professional interests revolve around understanding workflows and identifying opportunities in data previously thought inconsistent or incomplete. Kyle has published four other books and numerous articles on a variety of topics related to applying technology in library settings.
© 2019 by Kyle Banerjee
Extensive effort has gone into ensuring the reliability of the information in this book; however, the publisher makes no warranty, express or implied, with respect to the material contained herein.
ISBNs
978-0-8389-1909-5 (paper)
978-0-8389-1913-2 (PDF)
978-0-8389-1910-1 (ePub)
978-0-8389-1911-8 (Kindle)
Library of Congress Control Number: 2019024258
Cover image © zaie/Adobe Stock; icons © vasabii/Adobe Stock.
The subject of making technology more accessible is one that's been near and dear to me for many years, but this book never would have been written had it not been for a number of individuals. Bonnie Parks has been willing to share experiences that remind me why most people, especially women, still get left behind when it comes to technology despite initiatives to help people become more proficient. As I began work on this book, Julie Swierczek, Erica Findley, and Rosie Le Faive also made suggestions about what sorts of topics should be covered. I hope they and others will continue to raise their voices to draw attention to the areas we need to work on.
Making technology accessible is all about identifying the simple and powerful ideas and tools from an overwhelming sea of choices. David Forero is one of the technologists who's especially good at doing that. He excels at describing complex ideas in plain English, which you'll benefit from directly when you learn about formats, but his influence can also be seen throughout the book.
I also would like to thank ALA Neal-Schuman for their support over the years. Many people have asked why I don't simply post what I write on the web. The reason is simple: the result you see wouldn't be nearly as good without their hard work.
Finally, I would like to thank Mark Dahl. He convinced me that things are much easier than they look and that obstacles people find blocking their paths exist more in their minds than in reality. It never would have occurred to me that writing a book or hiking miles through wilderness to climb a mountain and ski off the top were reasonable things for mere mortals to do had he not approached me years ago to do both with him.
My hope is that this book will help others realize how achievable many intimidating-sounding data wrangling tasks are, and that people will help others know these things are much easier than they appear.
Virtually everyone needs to wrangle data. Your spreadsheet software might not offer a way to select the rows you want. If you can get the rows you want, you might not be able to sort them the way you need because the names or other fields you wanted to sort on weren't entered in a consistent format. When you export your spreadsheet in delimited format so you can load it into other programs, you might experience difficulties if fields contain line breaks, HTML (HyperText Markup Language), or other data that won't load properly. You might have to restructure and remap data so a new system can understand it. In addition to spreadsheets, people often need automated ways to process large numbers of files in a wide variety of formats in complex hierarchies or interact with online systems.
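The line-break problem mentioned above can be made concrete with a small sketch. This book relies on command-line tools rather than programming, so the Python below is only an illustration, and the sample export (its column names and values) is invented for the example. It shows why a field containing a line break confuses anything that assumes one line equals one record, while a CSV-aware reader keeps the record intact.

```python
import csv
import io

# Hypothetical spreadsheet export: the "notes" field of the first data
# row contains a line break inside its quotation marks.
raw = 'title,notes\r\n"Moby Dick","Water damage\nneeds rebinding"\r\n"Emma","Good"\r\n'

# Naive approach: assume every line of text is one record.
naive_rows = raw.splitlines()

# CSV-aware approach: a quoted line break stays inside its field.
parsed_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_rows))   # counts lines of text
print(len(parsed_rows))  # counts actual records, header included
```

The naive count is one higher than the real number of records, because the line break inside the quoted field is mistaken for the end of a row.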
If these tasks sound intimidating, this book is for you. You will understand everything in this book even if you have no special technical knowledge or programming experience. You'll see how easy it is to do things that previously sounded difficult or even impossible using tools that are already on your computer, and that a rudimentary knowledge of a few basic but powerful concepts and tools can solve the vast majority of the data challenges you'll ever face.
Data wrangling is relatively easy for two simple reasons. The first is that the vast majority of the information we need to manage and analyze is text. When you hear people talk about manipulating delimited, XML (eXtensible Markup Language), JSON (JavaScript Object Notation), Linked Open Data, RDF (Resource Description Framework) resources, or internal metadata, they are talking about manipulating text. When librarians manipulate MARC (MAchine Readable Cataloging), they convert it to text before making transformations. When you hear people talk about using a REST (Representational State Transfer) API (Application Programming Interface) to interact with a service, they are usually just talking about sending text across a network. When people manipulate internal metadata in binary files, that too is text. When you strip away the confusing jargon, the vast majority of data wrangling librarians do is ultimately about manipulating text you could view in any word processor.
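As a rough illustration of the claim that these formats are ultimately text, the sketch below treats a JSON record as an ordinary string, parses it, and writes it back out as a tab-delimited line. The record and its field names are hypothetical, and Python is used only for demonstration; it is not the approach this book takes.

```python
import json

# Hypothetical record, written as JSON. To the computer it is just text.
json_text = '{"title": "Emma", "author": "Austen, Jane", "year": 1815}'

# Ordinary text operations work on it directly.
mentions_austen = "Austen" in json_text

# A parser turns the same text into structured data.
record = json.loads(json_text)

# Writing the record out in delimited form produces text again.
fields = ("title", "author", "year")
tab_line = "\t".join(str(record[name]) for name in fields)

print(mentions_austen)
print(tab_line)
```

Nothing in either form is hidden from view: both the JSON string and the tab-delimited line could be opened and read in any text editor.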
The second reason data wrangling is easy is that the tools you need are already part of modern operating systems. Armed with just rudimentary knowledge of how to use these tools, you can do amazing things with files that are millions of lines long, talk to other systems, and perform tasks in seconds that may have previously seemed impossible. This book will show you how to use simple methods that work on any computer to manage text-based data.
The tools discussed in this book won't solve all your problems, but they will help you solve the vast majority of them fairly easily. To use them, you'll have to access the Command Line Interface (CLI) on your computer. Most people find that little black box intimidating, but here's an analogy that may help. For most people, using a computer is a lot like eating at a restaurant: they use a Graphical User Interface (GUI) to point at what they want, much like you'd select your dinner from a menu. GUIs are great for selecting from limited choices of actions. But sometimes you need something that's not on the menu. You need language to tell the server about food allergies or other special requirements. Or maybe you'd just rather make something at home yourself because it's easier to get what you really wanted. That's what using the CLI is all about.
Every example used in this book will look like something you've seen before or are very likely to see. Think about it like this: to cook for yourself, you need a few basic skills and recipes. Once you understand these fundamental ingredients and methods, you can make simple dishes. Making complicated dishes is mostly a matter of combining these same ingredients and techniques in new ways; a little knowledge goes a long way.
This book is like a basic cookbook for librarians who work with data. To keep things digestible so you can remember what you read and do useful things, the book presents only the most essential information, focusing on what has proven most useful to the author over more than twenty years of wrangling data in a library environment. It will not teach you how to program. Programming is a useful skill, but it requires a lot of extra knowledge and effort that are not necessary for most problems.