Principles of Data Wrangling
by Tye Rattenbury, Joseph M. Hellerstein, Jeffrey Heer, Sean Kandel, and Connor Carreras
Copyright 2017 Trifacta, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editor: Shannon Cutt
- Production Editor: Kristen Brown
- Copyeditor: Bob Russell, Octal Publishing, Inc.
- Proofreader: Christina Edwards
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
Revision History for the First Edition
- 2017-04-25: First Release
- 2017-06-27: Second Release
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Principles of Data Wrangling, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-93892-8
[LSI]
Foreword
Through the last decades of the twentieth century and into the twenty-first, data was largely a medium for bottom-line accounting: making sure that the books were balanced, the rules were followed, and the right numbers could be rolled up for executive decision-making. It was an era focused on a select group of IT staff engineering the "golden master" of organizational data; an era in which mantras like "garbage in, garbage out" captured the attitude that only carefully engineered data was useful.
Attitudes toward data have changed radically in the past decade, as new people, processes, and technologies have come forward to define the hallmarks of a data-driven organization. In this context, data is a medium for top-line value generation, providing evidence and content for the design of new products, new processes, and ever more efficient operation. Today's data-driven organizations have analysts working broadly across departments to find methods to use data creatively. It is an era in which new mantras like "extracting signal from the noise" capture a different attitude of agile experimentation and exploitation of large, diverse sources of data.
Of course, accounting still needs to get done in the twenty-first century, and the need remains to curate select datasets. But the data sources and processes for accountancy are relatively small and slow to change. The data that drives creative and exploratory analyses represents an (exponentially!) growing fraction of the data in most organizations, driving widespread rethinking of processes for data and computing, including the way that IT organizations approach their traditional tasks.
The phrase "data wrangling," born in the modern context of agile analytics, is meant to describe the lion's share of the time people spend working with data. There is a common misperception that data analysis is mostly a process of running statistical algorithms on high-performance data engines. In practice, this is just the final step of a longer and more complex process; 50 to 80 percent of an analyst's time is spent wrangling data to get it to the point at which this kind of analysis is possible. Not only does data wrangling consume most of an analyst's workday, it also represents much of the analyst's professional process: it captures activities like understanding what data is available; choosing what data to use and at what level of detail; understanding how to meaningfully combine multiple sources of data; and deciding how to distill the results to a size and shape that can drive downstream analysis. These activities represent the hard work that goes into both traditional data curation and modern data analysis. And in the context of agile analytics, these activities also capture the creative and scientific intuition of the analyst, which can dictate different decisions for each use case and data source.
We have been working on these issues with data-centric folks of various stripes, from the IT professionals who fuel data infrastructure in large organizations, to professional data analysts, to data-savvy enthusiasts in roles from marketing to journalism to science and social causes. Much is changing across the board here. This book is our effort to wrangle the lessons we have learned in this context into a coherent overview, with a specific focus on the more recent and quickly growing agile analytic processes in data-driven organizations. Hopefully, some of these lessons will help to clarify the importance (and yes, the satisfaction) of data wrangling done well.
Chapter 1. Introduction
Let's begin with the most important question: why should you read this book? The answer is simple: you want more value from your data. To put a little more meat on that statement, our objective in writing this book is to help the variety of people who manage the analysis or application of data in their organizations. The data might or might not be "yours," in the strict sense of ownership. But the pains in extracting value from this data are.
We're focused on two kinds of readers. First are people who manage the analysis and application of data indirectly: the managers of teams or directors of data projects. Second are people who work with data directly: the analysts, engineers, architects, statisticians, and scientists.
If you're reading this book, you're interested in extracting value from data. We can categorize this value into two types along a temporal dimension: near-term value and long-term value. In the near term, you likely have a sizable list of questions that you want to answer using your data. Some of these questions might be vague; for example, "Are people really shifting toward interacting with us through their mobile devices?" Other questions might be more specific: "When will our customers' interactions primarily originate from mobile devices instead of from desktops or laptops?"
What is stopping you from answering these questions? The most common answer we hear is "time." You know the questions, you know how to answer them, but you just don't have enough hours in the day to wrangle your data into the right form.
Beyond the list of known questions related to the near-term value of your data is the optimism that your data has greater potential long-term value. Can you use it to forecast important seasonal changes? What about risks in your supply chain due to weather or geopolitical shifts? Can you understand how the move to mobile is affecting your customers' purchasing patterns? Organizations generally hire data scientists to take on these longer-term, exploratory analyses. But even if you have the requisite skills to tackle these kinds of analyses, you might still struggle to be allocated sufficient time and resources. After all, exploratory analytics projects can take months, and often contain a nontrivial risk of producing primarily negative or ambiguous results.