Web Scraping with Python
by Ryan Mitchell
Copyright 2018 Ryan Mitchell. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles.
- Editor: Allyson MacDonald
- Production Editor: Justin Billing
- Copyeditor: Sharon Wilkey
- Proofreader: Christina Edwards
- Indexer: Judith McConville
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- April 2018: Second Edition
Revision History for the Second Edition
- 2018-03-20: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491985571 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Web Scraping with Python, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-98557-1
[LSI]
Preface
To those who have not developed the skill, computer programming can seem like a kind of magic. If programming is magic, web scraping is wizardry: the application of magic for particularly impressive and useful, yet surprisingly effortless, feats.
In my years as a software engineer, I've found that few programming practices capture the excitement of both programmers and laymen alike quite like web scraping. The ability to write a simple bot that collects data and streams it down a terminal or stores it in a database, while not difficult, never fails to provide a certain thrill and sense of possibility, no matter how many times you might have done it before.
Unfortunately, when I speak to other programmers about web scraping, there's a lot of misunderstanding and confusion about the practice. Some people aren't sure it's legal (it is), or how to handle problems like JavaScript-heavy pages or required logins. Many are confused about how to start a large web scraping project, or even where to find the data they're looking for. This book seeks to put an end to many of these common questions and misconceptions about web scraping, while providing a comprehensive guide to most common web scraping tasks.
Web scraping is a diverse and fast-changing field, and I've tried to provide both high-level concepts and concrete examples to cover just about any data collection project you're likely to encounter. Throughout the book, code samples are provided to demonstrate these concepts and allow you to try them out. The code samples themselves can be used and modified with or without attribution (although acknowledgment is always appreciated). All code samples are available on GitHub for viewing and downloading.
What Is Web Scraping?
The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar variations. General consensus today seems to favor web scraping, so that is the term I use throughout the book, although I also refer to programs that specifically traverse multiple pages as