Introduction
Credit: Fred L. Drake, Jr., PythonLabs
Text-processing applications form a substantial part of the application space for any scripting language, if only because everyone can agree that text processing is useful. Everyone has bits of text that need to be reformatted or transformed in various ways. The catch, of course, is that every application is just a little bit different from every other application, so it can be difficult to find just the right reusable code to work with different file formats, no matter how similar they are.
What Is Text?
Sounds like an easy question, doesn't it? After all, we know it when we see it, don't we? Text is a sequence of characters, and it is distinguished from binary data by that very fact. Binary data, after all, is a sequence of bytes.
A recipe in this chapter shows a heuristic for deciding whether a given sequence of bytes can safely be handled as text.
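One possible heuristic of this kind can be sketched as follows. This is a minimal Python 3 sketch, not necessarily the approach the text refers to: the function name `looks_like_text`, the set of "text-like" bytes, and the 30% threshold are all assumptions chosen for illustration.

```python
# Hypothetical helper: a rough heuristic, not a definitive test.
# Bytes we assume are "text-like": printable ASCII plus common whitespace
# and control characters.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\b\f"


def looks_like_text(data: bytes, threshold: float = 0.30) -> bool:
    """Guess whether `data` can safely be handled as text."""
    if b"\0" in data:
        return False  # assumption: NUL bytes mark the data as binary
    if not data:
        return True   # assumption: empty input is trivially text
    # Delete every text-like byte; whatever remains is suspicious.
    suspicious = data.translate(None, TEXT_BYTES)
    return len(suspicious) / len(data) <= threshold
```

A heuristic like this can only say that data is safe to treat as text, not that doing so is correct: text in a non-ASCII encoding may well be classified as binary.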
Python strings are immutable sequences of bytes or characters. Most of the ways we create and process strings treat them as sequences of characters, but many are just as applicable to sequences of bytes. Unicode strings are immutable sequences of Unicode characters: transformations of Unicode strings into and from plain strings use codec (coder-decoder) objects that embody knowledge about the many standard ways in which sequences of characters can be represented by sequences of bytes (also known as encodings and character sets). Note that Unicode strings do not serve double duty as sequences of bytes. Several recipes in this chapter illustrate the fundamentals of Unicode in Python.
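The relationship between characters, bytes, and codecs can be illustrated with a short sketch. It assumes Python 3, where `str` plays the role of the Unicode string described above and `bytes` plays the role of the plain byte string; the sample word is arbitrary.

```python
import codecs

text = "caf\u00e9"                       # four Unicode characters

# The same characters map to different byte sequences per encoding.
utf8_bytes = text.encode("utf-8")        # the accented letter takes two bytes
latin1_bytes = text.encode("latin-1")    # the accented letter takes one byte

# Decoding with the matching codec recovers the original characters.
from_utf8 = utf8_bytes.decode("utf-8")
from_latin1 = latin1_bytes.decode("latin-1")

# The codec machinery can also be looked up and used directly.
codec_info = codecs.lookup("utf-8")
decoded, consumed = codec_info.decode(utf8_bytes)
```

Decoding bytes with the wrong codec either raises an error or silently yields the wrong characters, which is why the encoding has to be known from context.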
Okay, let's assume that our application knows from the context that it's looking at text. That's usually the best approach because that's where external input comes into play. We're looking at a file either because it has a well-known name and defined format (common in the "Unix" world) or because it has a well-known filename extension that indicates the format of the contents (common on Windows). But now we have a problem: we had to use the word format to make the previous paragraph meaningful. Wasn't text supposed to be simple?
Let's face it: there's no such thing as "pure" text, and if there were, we probably wouldn't care about it (with the possible exception of applications in the field of computational linguistics, where pure text may indeed sometimes be studied for its own sake). What we want to deal with in our applications is information contained in text. The text we care about may contain configuration data, commands to control or define processes, documents for human consumption, or even tabular data. Text that contains configuration data or a series of commands usually can be expected to conform to a fairly strict syntax that can be checked before relying on the information in the text. Informing the user of an error in the input text is typically sufficient to deal with things that aren't what we were expecting.
Documents intended for humans tend to be simple, but they vary widely in detail. Since they are usually written in a natural language, their syntax and grammar can be difficult to check, at best. Different texts may use different character sets or encodings, and it can be difficult or even impossible to tell which character set or encoding was used to create a text if that information is not available in addition to the text itself. It is, however, necessary to support proper representation of natural-language documents. Natural-language text has structure as well, but the structures are often less explicit in the text and require at least some understanding of the language in which the text was written. Characters make up words, which make up sentences, which make up paragraphs, and still larger structures may be present as well. Paragraphs alone can be particularly difficult to locate unless you know what typographical conventions were used for a document: is each line a paragraph, or can multiple lines make up a paragraph? If the latter, how do we tell which lines are grouped together to make a paragraph? Paragraphs may be separated by blank lines, indentation, or some other special mark. One of the book's recipes gives an example of reading a text file as a sequence of paragraphs separated by blank lines.
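Under the simplest of these conventions, where paragraphs are separated by one or more blank lines, the grouping can be sketched as a small generator. This is a minimal Python 3 illustration rather than the recipe the text mentions; the name `paragraphs` and the sample text are assumptions.

```python
import io


def paragraphs(lines):
    """Yield paragraphs from an iterable of lines, split on blank lines."""
    current = []
    for line in lines:
        if line.strip():
            current.append(line)        # non-blank line: part of a paragraph
        elif current:
            yield "".join(current)      # blank line closes the paragraph
            current = []
    if current:
        yield "".join(current)          # final paragraph with no trailing blank


# Any iterable of lines works: an open file, a list, or an in-memory buffer.
sample = io.StringIO("First line\nsecond line\n\n\nNext paragraph\n")
result = list(paragraphs(sample))
```

Because the generator accepts any iterable of lines, the same code serves files on disk and strings already in memory. A document that marks paragraphs by indentation would need a different test in place of `line.strip()`.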
Tabular data has many issues that are similar to the problems associated with natural-language text, but it adds a second dimension to the input format: the text is no longer linear. It is no longer a sequence of characters, but rather a matrix of characters from which individual blocks of text must be identified and organized.
Basic Textual Operations
As with any other data format, we need to do different things with text at different times. However, there are still three basic operations:
Parsing the data into a structure internal to our application
Transforming the input into something similar in some way, but with changes of some kind
Generating completely new data
Transforming text from one format to another is more interesting when viewed as text processing, which is what we usually think of first when we talk about text. In this chapter, we'll take a look at some ways to approach transformations that can be applied for different purposes. Sometimes we'll work with text stored in external files, and other times we'll simply work with it as strings in memory.
The generation of textual data from application-specific data structures is most easily performed using Python's print statement or the write method of a file or file-like object. This is often done using a method of the application object or a function, which takes the output file as a parameter. The function can then use statements such as these:
    print >>thefile, sometext
    thefile.write(sometext)
which generate output to the appropriate file. However, this isn't generally thought of as text processing, as here there is no input text to be processed. Examples of using both print and write can of course be found throughout this book.
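The pattern of passing the output file as a parameter can be sketched as follows. The statements above use Python 2 syntax; this sketch assumes Python 3, where `print` is a function that takes a `file` argument. The function name `write_report` and the record layout are hypothetical.

```python
import io


def write_report(records, outfile):
    """Write one 'name: value' line per record to the given file object."""
    for name, value in records:
        # print appends the newline for us.
        print(name, value, sep=": ", file=outfile)
    # write emits exactly what it is given, so the newline is explicit.
    outfile.write("-- end of report --\n")


# Any file-like object will do; an in-memory buffer stands in for a file.
buffer = io.StringIO()
write_report([("spam", 3), ("eggs", 12)], buffer)
output = buffer.getvalue()
```

Taking the file as a parameter keeps the generating code independent of the destination: the same function can write to `sys.stdout`, a file on disk, or a buffer used in a test.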
Sources of Text
Working with text stored as a string in memory can be easy when the text is not too large. Operations that search the text can operate over multiple lines very easily and quickly, and there's no need to worry about searching for something that might cross a buffer boundary. Being able to keep the text in memory as a simple string makes it very easy to take advantage of the built-in string operations available as methods of the string object.
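As a brief illustration of searching across line boundaries, the sketch below holds a small multi-line text in one string and searches it with both string methods and a regular expression. It assumes Python 3, and the sample text and pattern are made up for the example.

```python
import re

# A small multi-line text held entirely in memory.
text = "name = spam\nvalue = (1,\n  2)\ncolor = blue\n"

# String methods see the whole text, line breaks included.
line_count = text.count("\n")
color_offset = text.find("color")

# With re.DOTALL, "." also matches newlines, so a match can span lines.
spanning = re.search(r"\(.*?\)", text, re.DOTALL)

# Without the flag, the same pattern cannot cross the line break.
single_line = re.search(r"\(.*?\)", text)
```

Reading the same text line by line would require extra bookkeeping to find the parenthesized value, because it starts on one line and ends on the next.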
Another interesting source for textual data comes to light when we consider the network. Text is often retrieved from the network using a socket. While we can always view a socket as a file (using the makefile method of the socket object), the data that is retrieved over a socket may come in chunks, or we may have to wait for more data to arrive. The textual data may not consist of all data until the end of the data stream, so a file object created with makefile may not be entirely appropriate to pass to text-processing code. When working with text from a network connection, we often need to read the data from the connection before passing it along for further processing. If the data is large, it can be handled by saving it to a file as it arrives and then using that file when performing text-processing operations. More elaborate solutions can be built when the text processing needs to be started before all the data is available. Examples of parsers that are useful in such situations may be found in the htmllib and HTMLParser modules in the standard library.
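Processing that starts before all the data is available can be sketched with an incremental parser. This sketch assumes Python 3, where the `HTMLParser` class lives in the `html.parser` module and `htmllib` is no longer part of the standard library. The `LinkCollector` class is hypothetical, and the list of chunks stands in for data arriving from a socket in arbitrary pieces.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags as data is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Simulated network chunks; note that the first <a> tag is split in two.
chunks = [
    '<p>See <a hre',
    'f="/first">this</a> and ',
    '<a href="/second">that</a></p>',
]

parser = LinkCollector()
for chunk in chunks:
    parser.feed(chunk)   # the parser buffers incomplete markup between calls
parser.close()           # signal that no more data will arrive
links = parser.links
```

The parser keeps any incomplete tag in its internal buffer until the rest arrives, so the calling code does not have to align chunk boundaries with the markup.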