
Li - Data Mashups in R

Year: 2011. Publisher: O'Reilly Media, Inc.


How do you use R to import, manage, visualize, and analyze real-world data? With this short, hands-on tutorial, you learn how to collect online data, massage it into a reasonable form, and work with it using R facilities to interact with web servers, parse HTML and XML, and more. Rather than use canned sample data, you'll plot and analyze current home foreclosure auctions in Philadelphia.

Data Mashups in R
Jeremy Leipzig
Xiao-Yi Li
Published by O'Reilly Media

Beijing Cambridge Farnham Köln Sebastopol Tokyo


Introduction

Programmers may spend a good part of their careers scripting code to conform to commercial statistics packages, visualization tools, and domain-specific third-party software. The same tasks can force end users to spend countless hours in copy-paste purgatory, each minor change necessitating another grueling round of formatting tabs and screenshots. Luckily, R scripting offers some reprieve. Because this open source project garners the support of a large community of package developers, the R statistical programming environment provides an amazing level of extensibility. Data from a multitude of sources can be imported into R and processed using R packages to aid statistical analysis and visualization. R scripts can also be configured to produce high-quality reports in an automated fashion, saving time, energy, and frustration.

This book will demonstrate how real-world data is imported, managed, visualized, and analyzed within R. Spatial mashups provide an excellent way to explore the capabilities of R, encompassing R packages, R syntax, and data structures. Instead of canned sample data, we will be plotting and analyzing actual current home foreclosure auctions. Through this exercise, we hope to provide a general idea of how the R environment works with R packages as well as its own capabilities in statistical analysis. We will be accessing spatial data in several formats (HTML, XML, shapefiles, and text), both locally and over the web, to produce a map of home foreclosures and perform statistical analysis on these events.

Chapter 1. Mapping Foreclosures
Messy Address Parsing

To illustrate how to combine data from disparate sources for statistical analysis and visualization, let's focus on one of the messiest sources of data around: web pages.

The Philadelphia sheriff's office posts foreclosure auctions on its website each month. How do we collect this data, massage it into a reasonable form, and work with it? First, create a new folder (for example, ~/Rmashup) to contain our project files. It is helpful to change the R working directory to your newly created folder.

#In Unix/MacOS
> setwd("~/Documents/Rmashup/")
#In Windows
> setwd("C:/~/Rmashup/")
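The folder has to exist before setwd() can switch to it. A minimal sketch of creating it from within R follows; the variable names projectDir and oldDir are illustrative, and a temporary directory stands in for the real project location so the snippet has no lasting side effects.

```r
# Hypothetical project location: a temporary directory keeps this sketch harmless.
# For real work, substitute a path such as "~/Documents/Rmashup".
projectDir <- file.path(tempdir(), "Rmashup")

# recursive=TRUE also creates missing parent folders;
# showWarnings=FALSE keeps reruns quiet when the folder already exists.
dir.create(projectDir, recursive = TRUE, showWarnings = FALSE)

# setwd() invisibly returns the previous working directory,
# which makes it easy to switch back later with setwd(oldDir).
oldDir <- setwd(projectDir)

# Confirm where R will now read and write files.
getwd()
```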

We can download this foreclosure listings web page from within R (or you may instead choose to save the raw HTML from your web browser):

> download.file(url="http://www.phillysheriff.com/properties.html",destfile="properties.html")
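Because the listings page changes every month and may not always be reachable, it can help to keep a local copy and download only when none exists. The helper below is a sketch under that assumption; fetchListings is a hypothetical name, not a function from the book or from any package.

```r
# Hypothetical helper: reuse a saved copy if present, otherwise try to download.
fetchListings <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(invisible(destfile))          # a saved page is already available
  }
  # download.file() returns 0 on success; failures may also raise an error,
  # which is mapped to a non-zero status here.
  status <- tryCatch(
    download.file(url = url, destfile = destfile, quiet = TRUE),
    error = function(e) 1L
  )
  if (!identical(as.integer(status), 0L)) {
    unlink(destfile)                     # drop any partial file
    stop("could not download ", url, "; save the page from a browser instead")
  }
  invisible(destfile)
}

# Intended use (commented out because it needs network access):
# fetchListings("http://www.phillysheriff.com/properties.html", "properties.html")
```

Calling the helper a second time returns immediately, so the rest of the analysis always works from the same snapshot of the page.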

Here is some of this web page's source HTML, with addresses highlighted:

6321 Farnsworth St. 62nd Ward 1,379.88 sq. ft. BRT# 621533500 Improvements: Residential Property
HOMER SIMPSON C.P. January Term, 2006 No. 002619 $27,537.87 Phelan Hallinan & Schmieg, L.L.P.
243-467
1402 E. Mt. Pleasant Ave. 50th Ward approximately 1,416 sq. ft. more or less BRT# 502440300 ...

The sheriff's raw HTML listings are inconsistently formatted, but with the right regular expression we can identify street addresses: notice how they appear alone on a line. Our goal is to submit viable addresses to the geocoder. Here are some typical addresses that our regular expression should match:

3509 N. Lee St.
2120-2128 E. Allegheny Ave.
7601 Crittenden St., #E-10
370 Tomlinson Place
2311 N. 33rd St.
6822-24 Old York Rd.
335 W. School House Lane

These are not addresses and should not be matched:

2,700 sq. ft. BRT# 124077100 Improvements: Residential Property
C.P. June Term, 2009 No. 00575

R has built-in functions that allow the use of Perl-type regular expressions. For more info on regular expressions, see Mastering Regular Expressions (O'Reilly) and Regular Expression Pocket Reference (O'Reilly).

With some minor deletions to clean up address idiosyncrasies, we should be able to correctly identify street addresses from the mess of other data contained in properties.html. We'll use a single regular expression pattern to do the cleanup. For clarity, we can break the pattern into the familiar elements of an address (number, name, suffix):

> stNum<-"^[0-9]{2,5}(\\-[0-9]+)?"
> stName<-"([NSEW]\\. )?[0-9A-Z ]+"
> stSuf<-"(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
> myStPat<-paste(stNum,stName,stSuf,sep=" ")
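To see what the three pieces look like once paste() has joined them with single spaces, the assembled pattern can be printed two ways. This is only an inspection sketch: print() shows the R string literal, while cat() shows the characters the regular expression engine actually receives.

```r
stNum  <- "^[0-9]{2,5}(\\-[0-9]+)?"
stName <- "([NSEW]\\. )?[0-9A-Z ]+"
stSuf  <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
myStPat <- paste(stNum, stName, stSuf, sep = " ")

# print() displays the string as an R literal, with every backslash doubled.
print(myStPat)

# cat() writes the raw characters, i.e. the pattern as the regex engine sees it.
cat(myStPat, "\n")

# Each "\\" typed in the source is stored as a single backslash character.
nchar("\\-")   # 2: one backslash plus one hyphen
```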

Note that the backslash characters themselves must be escaped with a backslash to avoid conflict with R syntax. Let's test this pattern against our examples using R's grep() function:

> grep(myStPat,"6822-24 Old York Rd.",perl=TRUE,value=FALSE,ignore.case=TRUE)
[1] 1
> grep(myStPat,"2,700 sq. ft. BRT# 124077100 Improvements: Residential Property",perl=TRUE,value=FALSE,ignore.case=TRUE)
integer(0)

With value=FALSE, grep() returns the indices of the matching elements. Because we tested only one string at a time, the result [1] 1 shows that our target address matched, while integer(0) shows that the non-address did not. We also have to omit strings that we don't want with our address, such as extra punctuation (like quotes or commas), or sheriff's office designations that follow street names:

> badStrings<-"(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|,Unit.+|Apt\\..+| #.+$|[,\"]|\\s+$)"

Test this against some examples using R's gsub() function:

> gsub(badStrings,'',"119 Hagy's Mill Rd. a/k/a 119 Spring Lane",perl=TRUE)
[1] "119 Hagy's Mill Rd."
> gsub(badStrings,'',"3229 Hurley St. - Premise A",perl=TRUE)
[1] "3229 Hurley St."
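Both gsub() and grepl() are vectorized, so the cleanup and the match can be checked against all of the sample strings in one pass. The sketch below uses the listed examples; the names candidates, cleaned, and isAddress are illustrative. It assumes a cleanup pattern with no empty alternative, since an empty alternative in the middle of the group would match first at every position and keep the later alternatives (commas, quotes, unit numbers) from ever firing.

```r
stNum  <- "^[0-9]{2,5}(\\-[0-9]+)?"
stName <- "([NSEW]\\. )?[0-9A-Z ]+"
stSuf  <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
myStPat <- paste(stNum, stName, stSuf, sep = " ")

# Assumption: cleanup pattern written without an empty alternative.
badStrings <- "(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|,Unit.+|Apt\\..+| #.+$|[,\"]|\\s+$)"

candidates <- c(
  "3509 N. Lee St.",
  "2120-2128 E. Allegheny Ave.",
  "7601 Crittenden St., #E-10",
  "370 Tomlinson Place",
  "2311 N. 33rd St.",
  "6822-24 Old York Rd.",
  "335 W. School House Lane",
  "2,700 sq. ft. BRT# 124077100 Improvements: Residential Property",
  "C.P. June Term, 2009 No. 00575"
)

# Strip unwanted fragments first, then test what is left against the address pattern.
cleaned   <- gsub(badStrings, "", candidates, perl = TRUE)
isAddress <- grepl(myStPat, cleaned, perl = TRUE, ignore.case = TRUE)

# One row per input string: the cleaned text and whether it looks like an address.
data.frame(cleaned, isAddress)
```

The first seven strings come back TRUE and the last two FALSE. Note that the street-name class allows only digits, letters, and spaces, so a name containing an apostrophe or an internal period would not match even after cleanup.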

Let's encapsulate this address parsing into a function that will accept an HTML file and return a vector, a one-dimensional ordered collection with a specific data type, in this case character. Copy and paste this entire block into your R console:

#input: HTML filename
#returns: character vector of cleaned street addresses
getAddressesFromHTML<-function(myHTMLDoc){
  myStreets<-vector(mode="character",0)
  stNum<-"^[0-9]{2,5}(\\-[0-9]+)?"
  stName<-"([NSEW]\\. )?([0-9A-Z ]+)"
  stSuf<-"(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
  badStrings<-paste("(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|,",
                    "Unit.+|Apt\\..+| #.+$|[,\"]|\\s+$)")
  myStPat<-paste(stNum,stName,stSuf,sep=" ")
  for(line in readLines(myHTMLDoc)){
    #strip unwanted fragments, then keep the line if it looks like an address
    line<-gsub(badStrings,'',line,perl=TRUE)
    matches<-grep(myStPat,line,perl=TRUE,value=FALSE,ignore.case=TRUE)
    if(length(matches)>0){
      myStreets<-append(myStreets,line)
    }
  }
  #return the collected addresses
  myStreets
}
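The same idea can also be written without an explicit loop, since readLines() returns a character vector that gsub() and grepl() process in one call. The following is a sketch of that alternative, not the book's function: getAddressesVectorized is a hypothetical name, the cleanup pattern is assumed to contain no empty alternative, and the sample file is made up from lines resembling the listing excerpt.

```r
# Hypothetical vectorized variant of the address extraction.
getAddressesVectorized <- function(htmlFile) {
  stNum  <- "^[0-9]{2,5}(\\-[0-9]+)?"
  stName <- "([NSEW]\\. )?[0-9A-Z ]+"
  stSuf  <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
  myStPat <- paste(stNum, stName, stSuf, sep = " ")
  # Assumption: cleanup pattern written without an empty alternative.
  badStrings <- "(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|,Unit.+|Apt\\..+| #.+$|[,\"]|\\s+$)"

  lines   <- readLines(htmlFile, warn = FALSE)
  cleaned <- gsub(badStrings, "", lines, perl = TRUE)
  # Keep only the cleaned lines that look like street addresses.
  cleaned[grepl(myStPat, cleaned, perl = TRUE, ignore.case = TRUE)]
}

# Made-up sample input written to a temporary file.
sampleFile <- tempfile(fileext = ".html")
writeLines(c(
  "6321 Farnsworth St.",
  "62nd Ward 1,379.88 sq. ft. BRT# 621533500",
  "HOMER SIMPSON C.P. January Term, 2006 No. 002619",
  "3229 Hurley St. - Premise A"
), sampleFile)

addresses <- getAddressesVectorized(sampleFile)
addresses
```

Filtering with a logical index avoids growing the result one element at a time with append(), which matters little for a single page but keeps the function short.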