This edition first published 2017
© 2017 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Ted Kwartler to be identified as the author of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Offices
111 River Street, Hoboken, NJ 07030, USA
9600 Garsington Road, Oxford, OX4 2DQ, UK
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
The publisher and the authors make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of fitness for a particular purpose. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or website is referred to in this work as a citation and/or potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.
Library of Congress Cataloging-in-Publication Data
Names: Kwartler, Ted, 1978- author.
Title: Text mining in practice with R / Ted Kwartler.
Description: Hoboken, NJ : John Wiley & Sons, 2017. | Includes bibliographical references and index.
Identifiers: LCCN 2017006983 (print) | LCCN 2017010584 (ebook) | ISBN 9781119282013 (cloth) | ISBN 9781119282099 (pdf) | ISBN 9781119282082 (epub)
Subjects: LCSH: Data mining. | Text processing (Computer science)
Classification: LCC QA76.9.D343 K94 2017 (print) | LCC QA76.9.D343 (ebook) | DDC 006.3/12-dc23
LC record available at https://lccn.loc.gov/2017006983
Cover Design: Wiley
Cover Image: ChrisPole/Gettyimages
It's the math of talking...your two favorite things!
This book is dedicated to my beautiful wife and best friend, Meghan. Your patience, support and assurance cannot be quantified.
Additionally, Nora and Brenna are my motivation, teaching me to be a better person.
Foreword
This book has been a long labor of love. When I agreed to write a book, I had no idea of the amount of work and research needed. Looking back, it was pure hubris on my part to accept a writing contract from the great people at Wiley. The six-month project stretched to more than a year! From the outset I decided to write a book that was less technical or academic and instead focused on code explanations and case studies. I wanted to distill my years of work experience, blog reading and textbook research into a succinct and more approachable format. It is easy to copy a blog's code or state a textbook's explanation verbatim, but it is far more difficult to be original, to explain technical attributes in an easy-to-understand manner and, hopefully, to make the journey more fun for the reader.
Each chapter demonstrates a text mining method in the context of a real case study. Generally, mathematical explanations are brief and set apart from the code snippets and visualizations. While it is still important to understand the underlying mathematical attributes of a method, this book merely gives you a glimpse. I believe it is easier to become an impassioned text miner if you get to explore and create first. Applying algorithms to interesting data should embolden you to undertake and learn more. Many of the topics covered could be expanded into a standalone book, but here they are presented as a single section or chapter. This is on purpose, so you get a quick but effective glimpse at the text mining universe! My hope is that this book will serve as a foundation as you continually add to your data science skillset.
As a writer or instructor, I have always leaned on common sense and non-academic explanations. The reason for this is simple: I do not have a computer science or math degree. Instead, my MBA gives me a unique perspective on data science. It has been my observation that data scientists often enjoy the modeling and data wrangling, but very often fail to completely understand the needs of the business. Thus, many data science business applications take months to implement or miss a crucial aspect. This book strives to have original and thought-provoking case studies with truly messy data. Other text mining or data science books illustrate a method with data that fits it perfectly so the concept can be understood. In this book, I reverse that approach and attempt to use real data in context so you can learn how typical text mining data is modeled and what to expect. The results are less pretty but more indicative of what you should expect as a text mining practitioner.
It takes a village to write a book.
Throughout this journey I have had the help of many people. Thankfully, family and friends have been accommodating and understanding when I chose writing ahead of social gatherings. First and foremost, thanks to my mother, Trish, who gave me the gift of gab and qualitative understanding, and to my father, Yitz, who gave me quantitative and technical writing acumen. Additional thanks to Paul, MaryAnn, Holly, Rob, K, and Maureen for understanding when I had to steal away and write during visits.
Thank you to Barry Keating, Sarv Devaraj and Timothy Gilbride. The Notre Dame family, with its supportive, entertaining professors, put me onto this path. Their guidance, dialogue and instruction opened my eyes to machine learning, data science and ultimately text mining. My time at Notre Dame has positively affected my life and those around me. I am forever grateful.
Multiple data scientists have helped me along the way; in fact, too many to list. Particular thanks to Greg, Zach, Hamel, Jeremy, Tom, Dalin, Sergey, Owen, Peter, Dan, Hugo and Nick for their explanations at different points in my personal data science journey.
This book would not have been possible if it weren't for Kathy Powers. She has been a lifelong friend and supporter and amazingly stepped up to make revisions when asked. When I changed publishers and thought of giving up on the book, her support and patience with my poor grammar helped me continue. My entire family owes her a debt of gratitude that can never be repaid.