SQL for Data Analysis
by Cathy Tanimura
Copyright 2021 Cathy Tanimura. All rights reserved.
Printed in the United States of America.
Published by OReilly Media, Inc. , 1005 Gravenstein Highway North, Sebastopol, CA 95472.
OReilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com .
- Acquisitions Editor: Andy Kwan
- Development Editors Amelia Blevins and Shira Evans
- Production Editor: Kristen Brown
- Copyeditor: Arthur Johnson
- Proofreader: Paula L. Fleming
- Indexer: Ellen Troutman-Zaig
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Kate Dullea
- September 2021: First Edition
Revision History for the First Edition
- 2021-09-09: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492088783 for release details.
The OReilly logo is a registered trademark of OReilly Media, Inc. SQL for Data Analysis, the cover image, and related trade dress are trademarks of OReilly Media, Inc.
The views expressed in this work are those of the author, and do not represent the publishers views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-08878-3
[LSI]
Preface
Over the past 20 years, Ive spent many of my working hours manipulating data with SQL. For most of those years, Ive worked in technology companies spanning a wide range of consumer and business-to-business industries. In that time, volumes of data have increased dramatically, and the technology I get to use has improved by leaps and bounds. Databases are faster than ever, and the reporting and visualization tools used to communicate the meaning in the data are more powerful than ever. One thing that has remained remarkably constant, however, is SQL being a key part of my toolbox.
I remember when I first learned SQL. I started my career in finance, where spreadsheets rule, and Id gotten pretty good at writing formulas and memorizing all those keyboard shortcuts. One day I totally geeked out and Ctrl- and Alt-clicked every key on my keyboard just to see what would happen (and then created a cheat sheet for my peers). That was part fun and part survival: the faster I was with my spreadsheets, the more likely I would be to finish my work before midnight so I could go home and get some sleep. Spreadsheet mastery got me in the door at my next role, a startup where I was first introduced to databases and SQL.
Part of my role involved crunching inventory data in spreadsheets, and thanks to early internet scale, the data sets were sometimes tens of thousands of rows. This was big data at the time, at least for me. I got in the habit of going for a cup of coffee or for lunch while my computers CPU was occupied with running its vlookup magic. One day my manager went on vacation and asked me to tend to the data warehouse hed built on his laptop using Access. Refreshing the data involved a series of steps: running SQL queries in a portal, loading the resulting csv files into the database, and then refreshing the spreadsheet reports. After the first successful load, I started tinkering, trying to understand how it worked, and pestering the engineers to show me how to modify the SQL queries.
I was hooked, and even when I thought I might change directions with my career, Ive kept coming back to data. Manipulating data, answering questions, helping my colleagues work better and smarter, and learning about businesses and the world through sets of data have never stopped feeling fun and exciting.
When I started working with SQL, there werent many learning resources. I got a book on basic syntax, read it in a night, and from there mostly learned through trial and error. Back in the days when I was learning, I queried production databases directly and brought the website down more than once with my overly ambitious (or more likely just poorly written) SQL. Fortunately my skills improved, and over the years I learned to work forward from the data in tables, and backward from the output needed, solving technical and logic challenges and puzzles to write queries that returned the right data. I ended up designing and building data warehouses to gather data from different sources and avoid bringing down critical production databases. Ive learned a lot about when and how to aggregate data before writing the SQL query and when to leave data in a more raw form.
Ive compared notes with others who got into data around the same time, and its clear we mostly learned in the same ad hoc way. The lucky among us had peers with whom to share techniques. Most SQL texts are either introductory and basic (theres definitely a place for these!) or else aimed at database developers. There are few resources for advanced SQL users who are focused on analysis work. Knowledge tends to be locked up in individuals or small teams. A goal of this book is to change that, giving practitioners a reference for how to solve common analysis problems with SQL, and I hope inspiring new inquiries into data using techniques you might not have seen before.
Conventions Used in This Book
The following typographical conventions are used in this book:
ItalicIndicates new terms, URLs, email addresses, filenames, file extensions, and keywords .
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, environment variables, and statements.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
Tip
This element signifies a tip or suggestion.
Note
This element signifies a general note.
Warning
This element indicates a warning or caution.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/cathytanimura/sql_book.
If you have a technical question or a problem using the code examples, please send email to .
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless youre reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from OReilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your products documentation does require permission .