Text Processing with Ruby
Extract Value from the Data That Surrounds You
by Rob Miller
Version: P1.0 (September 2015)
Copyright 2015 The Pragmatic Programmers, LLC. This book is licensed to the individual who purchased it. We don't copy-protect it because that would limit your ability to use it for your own purposes. Please don't break this trust: you can use this across all of your devices, but please do not share this copy with other members of your team, with friends, or via file sharing services. Thanks.
Dave & Andy.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and The Pragmatic Programmers, LLC was aware of a trademark claim, the designations have been printed in initial capital letters or in all capitals. The Pragmatic Starter Kit, The Pragmatic Programmer, Pragmatic Programming, Pragmatic Bookshelf and the linking g device are trademarks of The Pragmatic Programmers, LLC.
Every precaution was taken in the preparation of this book. However, the publisher assumes no responsibility for errors or omissions, or for damages that may result from the use of information (including program listings) contained herein.
Our Pragmatic courses, workshops, and other products can help you and your team create better software and have more fun. For more information, as well as the latest Pragmatic titles, please visit us at http://pragprog.com.
The team that produced this book includes:
Jacquelyn Carter (editor)
Potomac Indexing, LLC (indexer)
Cathleen Small; Liz Welch (copyeditor)
Dave Thomas (typesetter)
Janet Furlow (producer)
Ellie Callahan (support)
For international rights, please contact .
For the Best Reading Experience...
We strongly recommend that you read this book with the publisher defaults setting enabled for your reading device or application. Certain formats and characters may not display correctly without this setting. Please refer to the instructions for your reader on how to enable the publisher defaults setting.
Copyright 2015, The Pragmatic Bookshelf.
Early praise for Text Processing with Ruby
It is rare that a programming language can be unequivocally stated to be the right tool for a job. But when it comes to scanning, extracting, and transforming text, Ruby is that tool, and Rob Miller is the right guide to instruct you in the most effective and efficient application of it.
Avdi Grimm
Author, Confident Ruby; Head Chef, RubyTapas.com
This is a fun, readable, and very useful book. I'd recommend it to anyone who needs to deal with text, which is probably everyone.
Paul Battley
Developer, maintainer of text gem
While Ruby has become established as a Web development language, thanks to Rails, it's an excellent language for working with text as well. Text Processing with Ruby covers the nuts and bolts of what I believe is a natural domain for Ruby, all the way from bringing text into the environment via files, the Web, and other means through to parsing what it says and sending it back out again.
Peter Cooper
Editor of Ruby Weekly, Cooper Press
I'd recommend this book to anyone who wants to get started with text processing. Ruby has powerful tools and libraries for the whole ETL workflow, and this book describes everything you need to get started and succeed in learning.
A lot of people get into Ruby via Rails. This book is really well suited to anyone who knows Rails, but wants to know more Ruby.
Drew Neil
Director, Studio Nelstrom, and author of Practical Vim
Acknowledgments
Thanks to my long-suffering partner, Gillian, for enduring a year of lost weekends, late nights, and generally having a sullen and distracted boyfriend who woke up in the middle of the night in a cold sweat, having had another nightmare about character encodings. Who knew writing a book could be so stressful?
Many thanks to Alessandro Bahgat, Paul Battley, Jacob Chae, Peter Cooper, Iris Faraway, Kevin Gisi, Derek Graham, James Edward Gray II, Avdi Grimm, Hajba Gábor László, Jeremy Hinegardner, Kerri Miller, and Drew Neil for their helpful technical review comments, questions, and suggestions, all of which shaped this book for the better.
Thanks to Rob Griffiths, Mark Rogerson, Samuel Ryzycki, David Webb, Lewis Wilkinson, Alex Windett, and Mike Wright for ensuring there was no chance I got too big for my football boots.
Finally, the amazing folks at Pragmatic. Thanks to Susannah Davidson Pfalzer for taking a chance on me and my daft idea. Thanks to Jackie Carter for her incredible patience in guiding a first-time author through the editing process, and for contributing much to the structure and readability of the book. And thanks to Andy and Dave for creating a truly brilliant publisher that I'm proud to be even a tiny part of.
Introduction
Text is everywhere. Newspaper articles, database dumps, spreadsheets, the output of shell commands, keyboard input; it's all text, and it can all be processed in the same fundamental way. Text has been called the universal interface, and since the early days of Unix in the 1960s this universal interface has survived and flourished, and with good reason.
Unlike binary formats, text has the pleasing quality of being readable by humans as well as computers, making it easy to debug and requiring no distinction between output that's for human consumption and output that's to be used as the input for another step in a process.
Processing text, then, is a valuable skill for any programmer today, just as it was fifty years ago, and just as it's likely to be fifty years hence. In this book I hope to provide a practical guide to all the major aspects of working with text, viewed through the lens of the Ruby programming language, a language that I think is ideally suited to this task.
About This Book
Processing text is generally concerned with three things. The first concern is acquiring the text to be processed and getting it into your program. This is the subject of Part I of this book, which deals with reading from plain text files, standard input, delimited files, and binary files such as PDFs and Word documents.
This first part is fundamentally an exploration of Ruby's core and standard library, and what's possible with IO and its derived classes like File. Ruby's history and design, and the high-level nature of these tasks, mean that we don't need to dip into third-party libraries much, but we'll use one in particular, Nokogiri, when looking at scraping data from web pages.
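As a small taste of what that shared IO interface makes possible, here is a minimal sketch rather than an example taken from later chapters. The method name count_lines_and_words and the sample text are invented for illustration, and StringIO stands in for a real file so that the snippet is self-contained.

```ruby
require "stringio"

# Count the lines and words in any IO-like object. File, $stdin, and
# StringIO all respond to each_line, so the same method works for each.
# (count_lines_and_words is a hypothetical helper, not a library method.)
def count_lines_and_words(io)
  lines = 0
  words = 0
  io.each_line do |line|
    lines += 1
    words += line.split.length
  end
  { lines: lines, words: words }
end

# StringIO is used here in place of a file on disk (an assumption made
# purely to keep the example runnable anywhere).
sample = StringIO.new("the quick brown fox\njumps over\nthe lazy dog\n")
counts = count_lines_and_words(sample)
puts "#{counts[:lines]} lines, #{counts[:words]} words"
```

Passing File.open("some-file.txt") or $stdin instead of the StringIO object should work in the same way, since the method relies only on each_line.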
The second concern is with actually processing the text once we've got it into the program. This usually means either extracting data from within the text, parsing it into a Ruby data structure, or transforming it into another format.
The most important subject in this second stage is, without a doubt, regular expressions. We'll look at regular expression syntax, how Ruby uses regular expressions in particular, and, importantly, when not to use them and instead reach for solutions such as parsers.
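To make the idea of extracting data into a Ruby data structure concrete, the following is a rough sketch under stated assumptions. The log line format, the LOG_LINE constant, and the parse_log_line method are all hypothetical, made up for this illustration; only the regular expression features themselves (named captures, anchors) are standard Ruby.

```ruby
# A hypothetical log format: an ISO-style date, an upper-case level,
# then a free-form message. Named captures label each part.
LOG_LINE = /\A(?<date>\d{4}-\d{2}-\d{2}) (?<level>[A-Z]+) (?<message>.*)\z/

# Turn one line of text into a hash, or return nil if the line does not
# match the assumed format. (parse_log_line is an illustrative helper.)
def parse_log_line(line)
  match = LOG_LINE.match(line.chomp)
  return nil unless match

  { date: match[:date], level: match[:level], message: match[:message] }
end

entry = parse_log_line("2015-09-01 ERROR disk almost full\n")
puts entry[:level]
puts entry[:message]
```

Returning nil for lines that don't match keeps the caller in charge of deciding what to do with unexpected input; for formats with nesting or more elaborate structure, a parser is likely the better tool.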