MINING SOCIAL MEDIA
Finding Stories in Internet Data
by Lam Thuy Vo
San Francisco
MINING SOCIAL MEDIA. Copyright 2020 by Lam Thuy Vo.
Some rights reserved. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
ISBN-10: 1-59327-916-7
ISBN-13: 978-1-59327-916-5
Publisher: William Pollock
Production Editor: Meg Sneeringer
Cover Illustration: Gina Redman
Developmental Editors: Jan Cash and Alex Freed
Technical Reviewer: Melissa Lewis
Copyeditor: Rachel Monaghan
Compositor: Danielle Foster
Proofreader: Emelie Burnette
Indexer: Beth Nauman-Montana
For information on distribution, translations, or bulk sales, please contact No Starch Press, Inc. directly:
No Starch Press, Inc.
245 8th Street, San Francisco, CA 94103
phone: 1.415.863.9900
Library of Congress Cataloging-in-Publication Data:
Names: Vo, Lam Thuy, author.
Title: Mining social media : finding stories in Internet data / Lam Thuy
Vo.
Description: San Francisco : No Starch Press, Inc., 2019. | Includes
bibliographical references and index.
Identifiers: LCCN 2019030568 (print) | LCCN 2019030569 (ebook) | ISBN
9781593279165 (paperback) | ISBN 9781593279172 (ebook)
Subjects: LCSH: Social sciences--Research--Methodology. | Internet
research. | Data mining. | Social media--Research. | Quantitative
research. | Qualitative research.
Classification: LCC H61.95 .V63 2019 (print) | LCC H61.95 (ebook) | DDC
302.23/1072--dc23
LC record available at https://lccn.loc.gov/2019030568
LC ebook record available at https://lccn.loc.gov/2019030569
No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The information in this book is distributed on an As Is basis, without warranty. While every precaution has been taken in the preparation of this work, neither the author nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.
To mẹ Lua, ba Liem, and anh Luan
About the Author
Lam Thuy Vo is a senior reporter at BuzzFeed News, where her area of expertise is the intersection of technology, society, and social media data, and where she covers the spread of misinformation, hatred online, and platform-related accountability. Previously, she led teams and reported for The Wall Street Journal, Al Jazeera America, and NPR's Planet Money, telling economic stories across the US and throughout Asia. She has also worked as an educator for a decade, developing newsroom-wide training programs, workshops for journalists around the world, and semester-long courses for the Craig Newmark CUNY Graduate School of Journalism. She has also spoken at Pop-Up Magazine, the Tribeca Film Festival's Interactive Day, and TEDxNYC, among other larger events.
About the Technical Reviewer
Melissa Lewis is a data reporter for Reveal from The Center for Investigative Reporting. Prior to joining Reveal, she was a data editor at The Oregonian, a data engineer at Simple, a data analyst at Periscopic, and a neuroscience research assistant at Oregon Health & Science University. She is an organizer for PyLadies Portland and the Portland chapter of the Asian American Journalists Association.
CONTENTS IN DETAIL
1
THE PROGRAMMING LANGUAGES YOU'LL NEED TO KNOW
2
WHERE TO GET YOUR DATA
3
GETTING DATA WITH CODE
4
SCRAPING YOUR OWN FACEBOOK DATA
5
SCRAPING A LIVE SITE
6
INTRODUCTION TO DATA ANALYSIS
7
VISUALIZING YOUR DATA
8
ADVANCED TOOLS FOR DATA ANALYSIS
9
FINDING TRENDS IN REDDIT DATA
10
MEASURING THE TWITTER ACTIVITY OF POLITICAL ACTORS
11
WHERE TO GO FROM HERE
ACKNOWLEDGMENTS
This is perhaps not a thank you but an acknowledgment of the people who once were part of my timeline and in one way or another broke my heart: without the pain there would not have been Quantified Breakup, a Tumblr of data visualizations about emotional resiliency as captured through one's digital footprint. It was this project that essentially propelled my work into new directions: the exploration of social media data and quantified selfies. It was also during a talk about this project that the wonderful Jan Cash, an editor at No Starch Press at the time, approached me to write this book.
More importantly, there are those who remain in my timeline and who have been exceptionally supportive of me. Thanks to mẹ Lua and ba Liem for making me an empathetic and curious tinkerer, to my brother Luan Vo Nguyen Quang and my sister-in-law Tiffany Talsma for their steadfast and years-long support from across all continents, to Cathy Deng and Jamica El for constant encouragement during my early Python days in the Bay Area, to Julia B. Chan, Lo Benichou, Aaron Williams, Ted Han and Andrew Tran for the camaraderie in an industry full of competitors, to John Wingenter, Adrienne Lopes, Vita Ayala, Mariru Kojima and Toyin Ojih Odutola for providing family far away from family, and to my niece Elynna Quynh Vo, who's the future.
INTRODUCTION
We experience the social web in brief moments that flash by, often without ever coming back to them. Liking a photo on Instagram, sharing a post that someone published on Facebook, or messaging a friend on WhatsApp: whatever the specific interaction, we do it once and likely don't think about it after.
But from swipes to clicks to status updates, our online lives are being captured by social media companies and used to fill some of the largest data servers in the world. We are producing more data than ever before. By looking at these data points as a whole, we can gain tremendous insight into human behavior. We can also investigate the harm done by these systems, from detecting false online actors (for example, automated bot accounts or fake profiles that seed misinformation) to understanding how algorithms surface questionable content to viewers over time.
If we look at these data points collectively, we can find patterns, trends, or anomalies and, hopefully, better understand the ways in which we consume and shape the human experience online. This book aims to help those who want to go from simply observing the social web one post or tweet at a time to understanding it on a larger, more meaningful scale.
What Is Data Analysis?
The main goal for any data analyst is to gain useful insights from large quantities of information. We can think of data analysis as a way to interview a vast number of records: we may ask about unusual single events, or we may be looking into long-term trends. Interviewing a data set can be a lengthy process with various twists and turns: it might take a few different approaches to find the answers to our questions, the same way it might take a few different meetings to get a good sense of an interviewee.
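To make the interview metaphor concrete, here is a minimal sketch in Python of asking a tiny data set the two kinds of questions mentioned above: one about an unusual single event and one about a longer-term trend. The daily post counts, the dates, and the function names `busiest_day` and `half_medians` are all made up for illustration; they are assumptions, not data or code from any real platform.

```python
from statistics import median

# Hypothetical daily post counts for one account (invented numbers).
daily_posts = [
    ("2020-01-01", 4),
    ("2020-01-02", 5),
    ("2020-01-03", 3),
    ("2020-01-04", 40),  # an unusually busy day
    ("2020-01-05", 6),
    ("2020-01-06", 7),
    ("2020-01-07", 8),
    ("2020-01-08", 9),
]


def busiest_day(records):
    """Question about a single event: which day had the most posts?"""
    return max(records, key=lambda record: record[1])


def half_medians(records):
    """Question about a trend: compare the typical day in the first
    half of the period with the typical day in the second half."""
    mid = len(records) // 2
    first = median(count for _, count in records[:mid])
    second = median(count for _, count in records[mid:])
    return first, second


print(busiest_day(daily_posts))
print(half_medians(daily_posts))
```

The trend question uses the median rather than the mean so that the single busy day does not dominate the comparison. This is one small instance of how a first approach may need adjusting before the data gives a useful answer.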