Natural Language Processing in Action: Understanding, analyzing, and generating text with Python
Copyright
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact
Special Sales Department, Manning Publications Co., 20 Baldwin Road, PO Box 761, Shelter Island, NY 11964. Email: orders@manning.com
©2019 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
Manning Publications Co., 20 Baldwin Road, PO Box 761, Shelter Island, NY 11964
Acquisitions editor: Brian Sawyer
Development editor: Karen Miller
Technical development editor: René van den Berg
Review editor: Ivan Martinović
Production editor: Anthony Calcara
Copy editor: Darren Meiss
Proofreader: Alyson Brener
Technical proofreader: Davide Cadamuro
Typesetter and cover designer: Marija Tudor
ISBN 9781617294631
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 SP 24 23 22 21 20 19
Foreword
I first met Hannes in 2006 when we started different post-graduate degrees in the same department. He quickly became known for his work leveraging the union of machine learning and electrical engineering and, in particular, a strong commitment to having a positive world impact. Throughout his career, this commitment has guided each company and project he has touched, and it was by following this internal compass that he connected with Hobson and Cole, who share a similar passion for projects with a strong positive impact.
When approached to write this foreword, it was this passion for the application of machine learning (ML) for good that persuaded me. My personal journey in machine learning research was similarly guided by a strong desire to have a positive impact on the world. My path led me to develop algorithms for multi-resolution modeling of ecological data for species distributions in order to optimize conservation and survey goals. I have since been determined to continue working in areas where I can improve lives and experiences through the application of machine learning.
With great power comes great responsibility.
Voltaire?
Whether you attribute these words to Voltaire or Uncle Ben, they hold as true today as ever, though perhaps in this age we could rephrase to say, "With great access to data comes great responsibility." We trust companies with our data in the hope that it is used to improve our lives. We allow our emails to be scanned to help us compose more grammatically correct emails; snippets of our daily lives on social media are studied and used to inject advertisements into our feeds. Our phones and homes respond to our words, sometimes when we are not even talking to them. Even our news preferences are monitored so that our interests, opinions, and beliefs are indulged. What is at the heart of all these powerful technologies?
The answer is natural language processing. In this book you will learn both the theory and practical skills needed to go beyond merely understanding the inner workings of these systems, and start creating your own algorithms or models. Fundamental computer science concepts are seamlessly translated into a solid foundation for the approaches and practices that follow. Taking the reader on a clear and well-narrated tour through the core methodologies of natural language processing, the authors begin with tried and true methods, such as TF-IDF, before taking a shallow but deep (yes, I made a pun) dive into deep neural networks for NLP.
Language is the foundation upon which we build our shared sense of humanity. We communicate not just facts, but emotions; through language we acquire knowledge outside of our realm of experience, and build understanding through sharing those experiences. You have the opportunity to develop a solid understanding, not just of the mechanics of NLP, but the opportunities to generate impactful systems that may one day understand humankind through our language. The technology of NLP has great potential for misuse, but also great potential for good. Through sharing their knowledge, via this book, the authors hope to tip us towards a brighter future.
Dr. Arwen Griffioen
Senior Data Scientist - Research
Zendesk
Preface
Around 2013, natural language processing and chatbots began dominating our lives. At first Google Search had seemed more like an index, a tool that required a little skill in order to find what you were looking for. But it soon got smarter and would accept more and more natural language searches. Then smart phone autocomplete began to get sophisticated. The middle button was often exactly the word you were looking for.
Hit the middle button (https://www.reddit.com/r/ftm/comments/2zkwrs/middle_button_game/) repeatedly on a smart phone predictive text keyboard to learn what Google thinks you want to say next. It was first introduced on Reddit as the SwiftKey game (https://blog.swiftkey.com/swiftkey-game-winning-is/) in 2013.
In late 2014, Thunder Shiviah and I were collaborating on a Hack Oregon project to mine natural language campaign finance data. We were trying to find connections between political donors. It seemed politicians were hiding their donors' identities behind obfuscating language in their campaign finance filings. The interesting thing wasn't that we were able to use simple natural language processing techniques to uncover these connections. What surprised me the most was that Thunder would often respond to my rambling emails with a succinct but apt reply seconds after I hit send on my email. He was using Smart Reply, a Gmail Inbox assistant that composes replies faster than you can read your email.
So I dug deeper, to learn the tricks behind the magic. The more I learned, the more these impressive natural language processing feats seemed doable, understandable. And nearly every machine learning project I took on seemed to involve natural language processing.
Perhaps this was because of my fondness for words and fascination with their role in human intelligence. I would spend hours debating whether words even have meaning with John Kowalski, my information theorist boss at Sharp Labs. As I gained confidence, and learned more and more from my mentors and mentees, it seemed like I might be able to build something new and magical myself.
One of the tricks I learned was to iterate through a collection of documents and count how often words like "War" and "Hunger" are followed by words like "Games" or "III." If you do that for a large collection of texts, you can get pretty good at guessing the right word in a chain of words, a phrase, or sentence. This classical approach to language processing was intuitive to me.
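The counting trick described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used later in the book: the function names `count_followers` and `guess_next`, the whitespace tokenizer, and the four-sentence toy corpus are all assumptions made up for this example.

```python
from collections import Counter, defaultdict


def count_followers(documents):
    """Count, for each word, how often every other word directly follows it."""
    followers = defaultdict(Counter)
    for doc in documents:
        # Naive tokenizer (an assumption): lowercase and split on whitespace
        tokens = doc.lower().split()
        for current_word, next_word in zip(tokens, tokens[1:]):
            followers[current_word][next_word] += 1
    return followers


def guess_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy corpus invented purely for illustration
docs = [
    "The Hunger Games is a novel",
    "She watched The Hunger Games twice",
    "World War III is a recurring theme",
    "A hunger strike began today",
]

followers = count_followers(docs)
print(guess_next(followers, "Hunger"))  # games
print(guess_next(followers, "War"))     # iii
```

With only four sentences the guesses are trivial, but the same counts gathered over a large collection of texts are what make the guesses "pretty good."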