Straight to the Point
The Straight to the Point e-books are designed to thoroughly cover one targeted aspect of Excel.
Excel Data Cleansing
Straight to the Point
Oz du Soleil
Holy Macro! Books
PO Box 541731, Merritt Island FL 32953
Excel Data Cleansing Straight to the Point
2019 by Tickling Keys, Inc.
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information or storage retrieval system without written permission from the publisher.
All terms known in this book known to be trademarks have been appropriately capitalized. Trademarks are the property of their respective owners and are not affiliated with Holy Macro! Books
Every effort has been made to make this book as complete and accurate as possible, but no warranty or fitness is implied. The information is provided on an as is basis. The authors and the publisher shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book.
First Published: July 2019
Author: Oz du Soleil
Copyeditor: Kitty Wilson
Cover Design: Suat M. Ozgur
Cover Illustration: Shannon Mattiza, 6'4 Productions
Indexer: Nellie Jay
Tech Editor: Bill Jelen
Screen Reader Captions: Deb Govern
Compositor: Jill Cabot
Published by: Holy Macro! Books, PO Box 541731, Merritt Island, FL 32953
Distributed by Independent Publishers Group, Chicago, IL
ISBN 978-1-61547-150-8
Table of Contents
This page intentionally left blank. The Introduction starts on the next page.
About the Author
Oz du Soleil is an Excel MVP whos been working with Excel since 2001. Hes co-author, along with Bill Jelen, of Guerrilla Data Analysis , 2nd edition. Oz has several Excel courses on the LinkedIn Learning platform. Hes best known for the dramatic and colorful Excel tutorials on his YouTube channel Excel on Fire .
Data cleansing is Ozs area of specialization in Excel. From his earliest days with Excel, he has found himself constantly needing to fix names that are ALLCAPS, peel addresses away from phone numbers, clean up the messes that result from extracting data from PDF files, and fix all the many other things that prevent data from being useful.
Oz has presented Excel topics and master classes at conferences in Amsterdam; Sofia, Bulgaria; So Paulo, Brazil; Toronto, Canada; and cities around the United States.
When Oz isnt elbows-deep in the guts of a spreadsheet, he does storytelling around Portland, Oregon. He has told stories onstage for Risk! , Pants on Fire , Seven Deadly Sins , Pickathon , The Moth , and other storytelling shows.
Acknowledgments
Thanks to MrExcel, Bill Jelen, for the opportunity to share one of my favorite Excel topics.
INTRODUCTION
A newspaper reporter once described to me the problem she had in getting the public school system to turn over data that it was legally required to turn over. In the face of school closings and outraged parents, the mayor and school administrators kept repeating things like the data say and the data tell usand the reporter wanted to see this omniscient data.
She eventually did get the data, and it was a mess. The law didnt require the data to be free of duplicates and ready to be sorted, filtered, and analyzed. And thats where the investigation ended because the reporter didnt have the skill or resources to cleanse the data.
Lawyers have complained about similar problems. The opposing side is required to provide databut that data doesnt have to be immediately useful. Someone has to take emails, PDFs, and reports from various banks, PayPal, QuickBooks, etc.; get the formats all the same; and then merge them together before any analysis can happen.
Without clean data, no analysis can happen.
This Straight to the Point guide provides an introduction to data cleansing , which also goes by names such as data munging and data wrangling . Whatever the name, it basically means doing what needs to be done to make data useful and trustworthy. Data cleansing can include the following tasks:
- Deleting unnecessary headers
- Deleting summary rows
- Filling in gaps
- Flattening a report
- Merging and appending data from multiple sources
- Pulling data from source X to complete data in source Y
- Splitting names from addresses
- Identifying and deleting duplicate records
- Converting units of measurement in multiple sources
- Transposing data so it appears in columns rather than rows
A Data Cleansing Example
A program coordinator is all set to analyze some survey results and oh, no! the information looks like this:
The program coordinator would like to do a lot of things with this data, including the following:
- Make graphs
- Sort it
- Get summaries about all 12 surveys
- Get a summary for each day to see if one day fared better than the others
- Extract all responses that are 1 or 2 and see if theres anything interesting
But in its current format, this data cant be used in any of these ways. It would be possible to write delicate complex formulas to analyze the data in its current shape, but not a lot of people have the skills to do that. In any case, theres a much more powerful way to make this data usable: Cleanse it by turning it into a flat file of plain rows and columns, without extra headers and titles. (Youll learn more about flattening a report in the section ) Once youve turned this report into a flat file, the world is wide open: You can generate a variety of reports, easily merge this data with other flattened data, share it with someone who needs this source data for purposes different from yours, upload it to a database, ANYTHING!
Data Cleansing as a Skill
The sad fact is that many people who are taught to analyze data are not taught about data cleansing, and many peoplelike the reporter mentioned earlierare stumped when they encounter messy data. A marketing student is likely to be taught analysis on clean data. When he gets out of school, if hes handed data thats locked in a PDF and that turns into an atrocious mess when its extracted from the PDF, will he know what to do with it? Probably not. If he learns data cleansing as a skill, however, he wont panic when he encounters such data messes. He will be able to develop a confident strategy for extracting data gold from the mud, rocks, and even permafrost.
The Straight to the Point Ethos
Data cleansing is the topic of this short book, and Im excited to share it with you because this book doesnt focus on a specific Excel feature. Instead, it is about strategy and getting things done with whatever is available in all of Excel. This book digs into some of the most common data cleansing needs and focuses on strategy and context. (Sometimes, for example, a down-and-dirty one-time solution is most efficient, and other times a Power Query solution thats more suited for repeated tasks is the best approach.) In the section youll see multiple solutions for the same problem, followed by a summary of how to pick the best strategy for claiming victory over uncooperative data in a particular set of circumstances.