Applied Text Analysis with Python
by Benjamin Bengfort, Tony Ojeda, and Rebecca Bilbro
Copyright 2016 Benjamin Bengfort, Tony Ojeda, Rebecca Bilbro. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editor: Nicole Tache
- Production Editor: FILL IN PRODUCTION EDITOR
- Copyeditor: FILL IN COPYEDITOR
- Proofreader: FILL IN PROOFREADER
- Indexer: FILL IN INDEXER
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- First Edition
Revision History for the First Edition
- 2016-12-19: First Early Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491962978 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Applied Text Analysis with Python, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-96297-8
[FILL IN]
Chapter 1. Text Ingestion and Wrangling
As we explored the architecture of language in the previous chapter, we began to see that it is possible to model natural language in spite of its complexity and flexibility. And yet, the best language models are often highly constrained and application-specific. Why would models trained in a specific field or domain of language perform better than ones trained on general language? Consider that the term *bank* very likely refers to an institution that produces fiscal and monetary tools in an economic, financial, or political domain, whereas in an aviation or vehicular domain it more likely refers to a form of motion that results in a change of direction of an aircraft. By fitting models in a narrower context, we make the prediction space smaller and more specific, and therefore better able to handle the flexible aspects of language.
The bulk of our work in the subsequent chapters will be in feature extraction and knowledge engineering, where we'll be concerned with the identification of unique vocabulary words, sets of synonyms, interrelationships between entities, and semantic contexts. However, all of these techniques will revolve around a central text dataset: the corpus.
Corpora are collections of related documents that contain natural language. A corpus can be large or small, though generally it consists of hundreds of gigabytes of data spread across thousands of documents. For instance, considering that the average email inbox is 2 GB, a moderately sized company of 200 employees would have around a half-terabyte email corpus. The documents contained in a corpus can also vary in size, from tweets to books. Corpora can be annotated, meaning that the text or documents are labeled with the correct responses for supervised learning algorithms, or unannotated, making them candidates for topic modeling and document clustering.
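The half-terabyte figure follows directly from the stated assumptions; the short sketch below simply spells out the arithmetic, using the 2 GB inbox and 200 employee figures from the text.

```python
# Rough corpus size estimate using the assumptions stated above:
# 200 employees, each with a roughly 2 GB email inbox.
employees = 200
inbox_gb = 2
total_gb = employees * inbox_gb      # 400 GB
print(total_gb / 1000, "TB")         # ~0.4 TB, roughly half a terabyte
```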
Note
No two corpora are exactly alike and there are many opportunities to customize the approach taken in this chapter. This chapter presents a general method for ingesting HTML data from the internet, a ubiquitous text markup that is easily parsed and available in a variety of domains. The HTML data is cleaned, parsed, segmented, tokenized, and tagged into a preprocessed data structure that will be used for the rest of the book.
Naturally, the next question is: how do we construct a dataset with which to build a language model? In order to equip you for the rest of the book, this chapter will explore the preliminaries of constructing and organizing a domain-specific corpus. Working with text data is substantially different from working with purely numeric data, and there are a number of unique considerations that we will need to take into account. Whether it is done via scraping, RSS ingestion, or an API, ingesting a raw text corpus in a form that will support the construction of a data product is no trivial task. Moreover, when dealing with a text corpus, we must consider not only how the data is acquired, but also how it is organized on disk. Since these will be very large, often unpredictable datasets, we will need to anticipate potential performance problems and ensure memory safety through streaming data loading and multiprocessing. Finally, we must establish a systematic preprocessing method to transform our raw ingested text into a corpus that is ready for computation and modeling. By the end of this chapter, you should be able to organize your data and establish a reader that knows how to access the text on disk and present it in a standardized fashion for downstream analyses.
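To make these steps concrete, here is a minimal sketch (not the book's implementation) of two of the ideas just described: a preprocessing function that strips HTML markup with BeautifulSoup and then segments, tokenizes, and part-of-speech tags the text with NLTK, and a generator that streams raw documents from disk one at a time for memory safety. The directory layout and the `preprocess()` and `documents()` helper names are illustrative assumptions; the NLTK `punkt` and `averaged_perceptron_tagger` data packages are required.

```python
# A minimal sketch of HTML preprocessing and streaming document loading.
# Assumes: pip install beautifulsoup4 nltk, plus the NLTK "punkt" and
# "averaged_perceptron_tagger" data packages.
import os
import nltk
from bs4 import BeautifulSoup

def preprocess(html):
    """Return a list of paragraphs, each a list of POS-tagged sentences."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text() for p in soup.find_all("p")]
    return [
        [nltk.pos_tag(nltk.word_tokenize(sent))
         for sent in nltk.sent_tokenize(para)]
        for para in paragraphs
    ]

def documents(root):
    """Yield preprocessed documents from a directory of raw HTML files,
    one document at a time, rather than loading the whole corpus at once."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".html"):
                with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                    yield preprocess(f.read())

if __name__ == "__main__":
    html = "<p>The aircraft banked sharply. The bank raised interest rates.</p>"
    for paragraph in preprocess(html):
        for sentence in paragraph:
            print(sentence)
```

Because `documents()` is a generator, only one document's worth of text is held in memory at a time, which is what makes streaming data loading memory safe for large corpora.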
Acquiring a Domain-Specific Corpus
Acquiring a domain-specific corpus will be essential to producing a language-aware data product. Fortunately, the internet offers us a seemingly infinite resource with which to construct domain-specific corpora. Below are some examples of domains, along with corresponding web text data sources.
| Category | Sources |
|---|---|
| Politics | http://www.politico.com, http://www.cnn.com/politics, https://www.washingtonpost.com/politics, http://www.foxnews.com/politics.html, http://www.huffingtonpost.com/section/politics |
| Business | http://www.bloomberg.com, http://www.inc.com, https://www.entrepreneur.com, https://hbr.org, http://fortune.com |
| Sports | http://espn.go.com, http://sports.yahoo.com, http://bleacherreport.com, http://www.nytimes.com/pages/sports |
| Technology | http://www.wired.com, https://techcrunch.com, http://radar.oreilly.com, https://gigaom.com, http://gizmodo.com |
| Cooking | http://blog.foodnetwork.com, http://www.delish.com, http://www.epicurious.com, http://www.skinnytaste.com |
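As a rough illustration of the scraping approach, the sketch below downloads a handful of pages from sources like those in the table and writes the raw HTML to a category-organized directory on disk. The `SOURCES` mapping, `CORPUS_ROOT` path, and `ingest()` helper are hypothetical names for this example only, not code from the book.

```python
# A minimal corpus ingestion sketch: fetch each page in a small mapping of
# categories to URLs and save the raw HTML under a per-category directory.
import os
import time
import requests
from urllib.parse import urlparse

SOURCES = {
    "politics": ["http://www.politico.com", "http://www.cnn.com/politics"],
    "sports":   ["http://espn.go.com", "http://sports.yahoo.com"],
}
CORPUS_ROOT = "corpus"  # raw HTML documents, organized by category

def ingest(sources=SOURCES, root=CORPUS_ROOT):
    for category, urls in sources.items():
        catdir = os.path.join(root, category)
        os.makedirs(catdir, exist_ok=True)
        for url in urls:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            # Derive a simple filename from the hostname of the source.
            name = urlparse(url).netloc.replace(".", "-") + ".html"
            with open(os.path.join(catdir, name), "w", encoding="utf-8") as f:
                f.write(response.text)
            time.sleep(1)  # be polite to the servers we are scraping

if __name__ == "__main__":
    ingest()
```

In practice, an RSS feed or a site's API is often a better-behaved ingestion channel than scraping landing pages, but the on-disk result, raw documents grouped by category, is the same starting point for preprocessing.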
One important question to address is the degree of specificity required of a corpus for effective language modeling: how specific is specific enough? As we increase the specificity of the domain, we necessarily reduce the volume of our corpus. For instance, it would be easier to produce a large dataset about the general category of sports, but that corpus would still contain a large degree of ambiguity. By specifically targeting text data about baseball or basketball, we reduce this ambiguity, but we also reduce the overall size of our corpus. This is a significant tradeoff: we need a very large corpus in order to provide sufficient training examples to our language models, so we must find a balance between domain specificity and corpus size.