
Benjamin Bengfort - Applied Text Analysis with Python: Enabling Language Aware Data Products with Machine Learning

Year: 2017. Publisher: O'Reilly Media. Genre: Computer.


The programming landscape of natural language processing has changed dramatically in the past few years. Machine learning approaches now require mature tools like Python's scikit-learn to apply models to text at scale. This practical guide shows programmers and data scientists who have an intermediate-level understanding of Python and a basic understanding of machine learning and natural language processing how to become more proficient in these two exciting areas of data science.

This book presents a concise, focused, and applied approach to text analysis with Python, and covers topics including text ingestion and wrangling, basic machine learning on text, classification for text analysis, entity resolution, and text visualization. Applied Text Analysis with Python will enable you to design and develop language-aware data products.

You'll learn how and why machine learning algorithms make decisions about language to analyze text; how to ingest, wrangle, and preprocess language data; and how the three primary text analysis libraries in Python work in concert. Ultimately, this book will enable you to design and develop language-aware data products.

Applied Text Analysis with Python

by Benjamin Bengfort, Tony Ojeda, and Rebecca Bilbro

Copyright 2016 Benjamin Bengfort, Tony Ojeda, Rebecca Bilbro. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Editor: Nicole Tache
  • Production Editor: FILL IN PRODUCTION EDITOR
  • Copyeditor: FILL IN COPYEDITOR
  • Proofreader: FILL IN PROOFREADER
  • Indexer: FILL IN INDEXER
  • Interior Designer: David Futato
  • Cover Designer: Karen Montgomery
  • Illustrator: Rebecca Demarest
  • January -4712: First Edition
Revision History for the First Edition
  • 2016-12-19: First Early Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491962978 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Applied Text Analysis with Python, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96297-8

[FILL IN]

Chapter 1. Text Ingestion and Wrangling

As we explored the architecture of language in the previous chapter, we began to see that it is possible to model natural language in spite of its complexity and flexibility. And yet, the best language models are often highly constrained and application-specific. Why is it that models trained in a specific field or domain of the language would perform better than ones trained on general language? Consider that the term "bank" is very likely to be an institution that produces fiscal and monetary tools in an economics, financial, or political domain, whereas in an aviation or vehicular domain it is more likely to be a form of motion that results in the change of direction of an aircraft. By fitting models in a narrower context, the prediction space is smaller and more specific, and therefore better able to handle the flexible aspects of language.

The bulk of our work in the subsequent chapters will be in feature extraction and knowledge engineering, where we'll be concerned with the identification of unique vocabulary words, sets of synonyms, interrelationships between entities, and semantic contexts. However, all of these techniques will revolve around a central text dataset: the corpus.

Corpora are collections of related documents that contain natural language. A corpus can be large or small, though corpora generally consist of hundreds of gigabytes of data spread across thousands of documents. For instance, considering that the average email inbox is 2GB, a moderately sized company of 200 employees would have an email corpus of around half a terabyte. The documents in a corpus can also vary in size, from tweets to books. Corpora can be annotated, meaning that the text or documents are labeled with the correct responses for supervised learning algorithms, or unannotated, making them candidates for topic modeling and document clustering.
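The size estimate and the annotated/unannotated distinction can be made concrete with a minimal sketch. The function name `estimate_corpus_gb`, the decimal conversion (1 TB = 1000 GB), and the three example sentences with their labels are assumptions made up for illustration; they do not come from a real dataset.

```python
def estimate_corpus_gb(employees, gb_per_inbox=2):
    """Rough corpus size in gigabytes, assuming one inbox per employee."""
    return employees * gb_per_inbox


size_gb = estimate_corpus_gb(200)
# Assumes decimal units: 1 TB = 1000 GB.
print(f"{size_gb} GB is roughly {size_gb / 1000:.1f} TB")

# An annotated corpus pairs each document with a label for supervised learning.
# These sentences and labels are invented examples.
annotated = [
    ("The central bank cut interest rates.", "business"),
    ("The pilot began to bank to the left.", "aviation"),
    ("The senate passed the spending bill.", "politics"),
]

# An unannotated corpus is just the documents, suitable for clustering
# or topic modeling.
unannotated = [text for text, _label in annotated]
```

Under these assumptions 200 inboxes come to 400 GB, which is the "around half a terabyte" figure quoted above.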

Note

No two corpora are exactly alike and there are many opportunities to customize the approach taken in this chapter. This chapter presents a general method for ingesting HTML from the internet; HTML is a ubiquitous text markup that is easily parsed and available in a variety of domains. The HTML data is cleaned, parsed, segmented, tokenized, and tagged into a preprocessed data structure that will be used for the rest of the book.
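As a rough illustration of the shape such a preprocessed structure might take, the sketch below uses only the Python standard library. It is not the method developed in this book: the class `ParagraphExtractor` and the function `preprocess` are hypothetical names, the regular expressions are deliberately naive stand-ins for real sentence segmentation and tokenization, and part-of-speech tagging is left out because it needs a trained tagger.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text of each <p> element, dropping all markup."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def preprocess(html):
    """Return paragraphs, each a list of sentences, each a list of tokens."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    document = []
    for paragraph in parser.paragraphs:
        # Naive segmentation: split after ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        # Naive tokenization: runs of word characters or single punctuation marks.
        document.append([re.findall(r"\w+|[^\w\s]", s) for s in sentences if s])
    return document
```

Calling `preprocess` on a page returns a nested list (document, then paragraphs, then sentences, then tokens). A tagging step would replace each token with a (token, tag) pair.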

Naturally, the next question is: how do we construct a dataset with which to build a language model? In order to equip you for the rest of the book, this chapter will explore the preliminaries of constructing and organizing a domain-specific corpus. Working with text data is substantially different from working with purely numeric data, and there are a number of unique considerations that we will need to take into account. Whether it is done via scraping, RSS ingestion, or an API, ingesting a raw text corpus in a form that will support the construction of a data product is no trivial task. Moreover, when dealing with a text corpus, we must consider not only how the data is acquired, but also how it is organized on disk. Since these will be very large, often unpredictable datasets, we will need to anticipate potential performance problems and ensure memory safety through streaming data loading and multiprocessing. Finally, we must establish a systematic preprocessing method to transform our raw ingested text into a corpus that is ready for computation and modeling. By the end of this chapter, you should be able to organize your data and establish a reader that knows how to access the text on disk and present it in a standardized fashion for downstream analyses.
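The idea of a reader that streams text from disk can be sketched with a generator. This is a minimal illustration, not the reader built later in the book: `stream_documents` and `count_words` are hypothetical names, and the sketch assumes the corpus is a directory tree of UTF-8 `.txt` files. Multiprocessing is not shown.

```python
from pathlib import Path


def stream_documents(root, pattern="*.txt", encoding="utf-8"):
    """Yield (relative_path, text) pairs, reading one file at a time."""
    root = Path(root)
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            yield path.relative_to(root).as_posix(), path.read_text(encoding=encoding)


def count_words(root):
    """Consume the stream while holding only one document in memory at once."""
    return sum(len(text.split()) for _name, text in stream_documents(root))
```

Because `stream_documents` is a generator, a downstream step such as `count_words` never needs the whole corpus in memory, only the document currently being processed.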

Acquiring a Domain-Specific Corpus

Acquiring a domain-specific corpus will be essential to producing a language-aware data product. Fortunately, the internet offers us a seemingly infinite resource with which to construct domain-specific corpora. Below are some examples of domains, along with corresponding web text data sources.

Category: Sources

Politics:

  • http://www.politico.com
  • http://www.cnn.com/politics
  • https://www.washingtonpost.com/politics
  • http://www.foxnews.com/politics.html
  • http://www.huffingtonpost.com/section/politics

Business:

  • http://www.bloomberg.com
  • http://www.inc.com
  • https://www.entrepreneur.com
  • https://hbr.org
  • http://fortune.com

Sports:

  • http://espn.go.com
  • http://sports.yahoo.com
  • http://bleacherreport.com
  • http://www.nytimes.com/pages/sports

Technology:

  • http://www.wired.com
  • https://techcrunch.com
  • http://radar.oreilly.com
  • https://gigaom.com
  • http://gizmodo.com

Cooking:

  • http://blog.foodnetwork.com
  • http://www.delish.com
  • http://www.epicurious.com
  • http://www.skinnytaste.com
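One way to keep such sources organized is to record them per category and derive a predictable location on disk for each page. The sketch below is only an assumption about how this could look: the `SOURCES` dictionary holds a subset of the URLs listed above, and `storage_path` with its `corpus/<category>/<host>/<page>.html` layout is a hypothetical convention, not one prescribed by the book. No network access is performed.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# A subset of the sources listed above, keyed by category.
SOURCES = {
    "politics": ["http://www.politico.com", "http://www.cnn.com/politics"],
    "sports": ["http://espn.go.com", "http://sports.yahoo.com"],
}


def storage_path(category, url, root="corpus"):
    """Map a URL to root/<category>/<host>/<page>.html (hypothetical layout)."""
    parsed = urlparse(url)
    # Flatten the URL path into one file name; fall back to "index" for a bare host.
    slug = parsed.path.strip("/").replace("/", "_") or "index"
    if not slug.endswith(".html"):
        slug += ".html"
    return PurePosixPath(root, category, parsed.netloc, slug)


for category, urls in SOURCES.items():
    for url in urls:
        print(storage_path(category, url))
```

Grouping files by category on disk means the directory name can later serve as a document label.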

One important question to address is the degree of specificity required of a corpus for effective language modeling: how specific is specific enough? As we increase the specificity of the domain, we will necessarily reduce the volume of our corpus. For instance, it would be easier to produce a large dataset about the general category "sports", but that corpus would still contain a large degree of ambiguity. By specifically targeting text data about baseball or basketball, we reduce this ambiguity, but we also reduce the overall size of our corpus. This is a significant tradeoff, because we will need a very large corpus in order to provide sufficient training examples to our language models. We must therefore find a balance between domain specificity and corpus size.
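The tradeoff can be seen on a toy scale. In the sketch below, the four sentences are invented for illustration and the keyword filter `narrow` is a hypothetical helper, far cruder than any real domain selection; the point is only that narrowing the domain shrinks the corpus.

```python
def narrow(corpus, keywords):
    """Keep documents that mention at least one keyword (case-insensitive)."""
    keywords = [k.lower() for k in keywords]
    return [doc for doc in corpus if any(k in doc.lower() for k in keywords)]


# Invented example documents standing in for a general "sports" corpus.
sports = [
    "The pitcher threw a no-hitter in the baseball final.",
    "The point guard led the basketball team in assists.",
    "The striker scored twice in the second half.",
    "A walk-off home run ended the baseball game.",
]

baseball = narrow(sports, ["baseball"])
print(f"general: {len(sports)} documents, baseball only: {len(baseball)} documents")
```

Here the narrower corpus keeps two of the four documents: less ambiguity, but also half the training examples.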
