Chapter 1. Mapping Foreclosures
Messy Address Parsing
To illustrate how to combine data from disparate sources for statistical analysis and visualization, let's focus on one of the messiest sources of data around: web pages.
The Philadelphia sheriff's office posts foreclosure auctions on its website each month. How do we collect this data, massage it into a reasonable form, and work with it? First, create a new folder (for example, ~/Rmashup) to contain our project files. It is helpful to change the R working directory to your newly created folder.
#In Unix/MacOS
> setwd("~/Documents/Rmashup/")
#In Windows
> setwd("C:/Rmashup/")
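To confirm that the change took effect, getwd() reports the current working directory:

#Output will show your own path
> getwd()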
We can download this foreclosure listings web page from within R (or you may instead choose to save the raw HTML from your web browser):
> download.file(url="http://www.phillysheriff.com/properties.html",
destfile="properties.html")
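Before going further, it is worth a quick sanity check that the download succeeded; if the file arrived, file.exists() returns TRUE:

> file.exists("properties.html")
[1] TRUE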
Here is some of this web page's source HTML, with the street addresses each on a line of their own:
6321 Farnsworth St.
62nd Ward 1,379.88 sq. ft. BRT# 621533500 Improvements: Residential Property
HOMER SIMPSON C.P. January Term, 2006 No. 002619 $27,537.87 Phelan Hallinan & Schmieg, L.L.P.
243-467
1402 E. Mt. Pleasant Ave.
50th Ward approximately 1,416 sq. ft. more or less BRT# 502440300 ...
The sheriff's raw HTML listings are inconsistently formatted, but with the right regular expression we can identify street addresses: notice how they appear alone on a line. Our goal is to submit viable addresses to the geocoder. Here are some typical addresses that our regular expression should match:
3509 N. Lee St.
2120-2128 E. Allegheny Ave.
7601 Crittenden St., #E-10
370 Tomlinson Place
2311 N. 33rd St.
6822-24 Old York Rd.
335 W. School House Lane
These are not addresses and should not be matched:
2,700 sq. ft. BRT# 124077100 Improvements: Residential Property
C.P. June Term, 2009 No. 00575
R has built-in functions that allow the use of Perl-style regular expressions. For more on regular expressions, see Mastering Regular Expressions (O'Reilly) and Regular Expression Pocket Reference (O'Reilly).
With some minor deletions to clean up address idiosyncrasies, we should be able to correctly identify street addresses amid the mess of other data contained in properties.html. We'll use a single regular expression pattern to do the cleanup. For clarity, we can break the pattern into the familiar elements of an address (number, name, suffix):
> stNum<-"^[0-9]{2,5}(\\-[0-9]+)?"
> stName<-"([NSEW]\\. )?[0-9A-Z ]+"
> stSuf<-"(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
> myStPat<-paste(stNum,stName,stSuf,sep=" ")
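Printing the assembled pattern with cat() shows the single-backslash form that the regular expression engine actually sees:

> cat(myStPat)
^[0-9]{2,5}(\-[0-9]+)? ([NSEW]\. )?[0-9A-Z ]+ (St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\.?)$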
Note that the backslash characters themselves must be escaped with a backslash to avoid conflicts with R's string syntax. Let's test this pattern against our examples using R's grep() function:
> grep(myStPat,"6822-24 Old York Rd.",perl=TRUE,value=FALSE,ignore.case=TRUE)
[1] 1
> grep(myStPat,"2,700 sq. ft. BRT# 124077100 Improvements: Residential Property",
perl=TRUE,value=FALSE,ignore.case=TRUE)
integer(0)
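Because grep() is vectorized, we can also test several candidate strings in one call; it returns the indices of the elements that match. Here is a quick check against a few of the samples above (testAddrs is just a throwaway vector for this test):

> testAddrs<-c("3509 N. Lee St.","2120-2128 E. Allegheny Ave.",
"370 Tomlinson Place","2,700 sq. ft.")
> grep(myStPat,testAddrs,perl=TRUE,ignore.case=TRUE)
[1] 1 2 3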
The result, [1] 1, shows that the first of our target address strings matched; we tested only one string at a time. We also have to omit strings that we don't want with our address, such as extra punctuation (like quotes or commas) and sheriff's office designations that follow street names:
> badStrings<-"(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|, Unit.+|Apt\\..+| #.+$|[,\"]|\\s+$)"
Test this against some examples using R's gsub() function:
> gsub(badStrings,'',"119 Hagy's Mill Rd. a/k/a 119 Spring Lane",
perl=TRUE)
[1] "119 Hagy's Mill Rd."
> gsub(badStrings,'',"3229 Hurley St. - Premise A",perl=TRUE)
[1] "3229 Hurley St."
Let's encapsulate this address parsing into a function that will accept an HTML file and return a vector, a one-dimensional ordered collection with a specific data type, in this case character. Copy and paste this entire block into your R console:
#input: HTML filename
#returns: character vector of street addresses
getAddressesFromHTML<-function(myHTMLDoc){
myStreets<-vector(mode="character",0)
stNum<-"^[0-9]{2,5}(\\-[0-9]+)?"
stName<-"([NSEW]\\. )?([0-9A-Z ]+)"