
Book:
Data Mashups in R
Author:
Jeremy Leipzig, Xiao-Yi Li
Publisher:
O'Reilly Media, Inc.
Year:
2011

How do you use R to import, manage, visualize, and analyze real-world data? With this short, hands-on tutorial, you learn how to collect online data, massage it into a reasonable form, and work with it using R facilities to interact with web servers, parse HTML and XML, and more. Rather than use canned sample data, you'll plot and analyze current home foreclosure auctions in Philadelphia.

Data Mashups in R

Jeremy Leipzig

Xiao-Yi Li

Published by O'Reilly Media

Beijing Cambridge Farnham Köln Sebastopol Tokyo

Introduction

Programmers may spend a good part of their careers scripting code to conform to commercial statistics packages, visualization tools, and domain-specific third-party software. The same tasks can force end users to spend countless hours in copy-paste purgatory, each minor change necessitating another grueling round of formatting tabs and screenshots. Luckily, R scripting offers some reprieve. Because this open source project garners the support of a large community of package developers, the R statistical programming environment provides an amazing level of extensibility. Data from a multitude of sources can be imported into R and processed using R packages to aid statistical analysis and visualization. R scripts can also be configured to produce high-quality reports in an automated fashion, saving time, energy, and frustration.

This book will demonstrate how real-world data is imported, managed, visualized, and analyzed within R. Spatial mashups provide an excellent way to explore the capabilities of R, encompassing R packages, R syntax, and data structures. Instead of canned sample data, we will be plotting and analyzing actual current home foreclosure auctions. Through this exercise, we hope to provide a general idea of how the R environment works with R packages as well as its own capabilities in statistical analysis. We will be accessing spatial data in several formats (HTML, XML, shapefiles, and text), both locally and over the web, to produce a map of home foreclosures and perform statistical analysis on these events.

Chapter 1. Mapping Foreclosures

Messy Address Parsing

To illustrate how to combine data from disparate sources for statistical analysis and visualization, let's focus on one of the messiest sources of data around: web pages.

The Philadelphia sheriff's office posts foreclosure auctions on its website each month. How do we collect this data, massage it into a reasonable form, and work with it? First, create a new folder (for example, ~/Rmashup) to contain our project files. It is helpful to change the R working directory to your newly created folder.

#In Unix/MacOS
> setwd("~/Documents/Rmashup/")
#In Windows
> setwd("C:/~/Rmashup/")
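The folder has to exist before setwd() can switch to it. The following is a minimal sketch of creating it from within R, assuming a throwaway location under tempdir() as a stand-in for ~/Rmashup; the names projDir and oldDir are illustrative and not part of the book's code.

```r
# Assumption: a temporary folder stands in for the real project folder.
projDir <- file.path(tempdir(), "Rmashup")

# Create the folder only if it is not already there.
if (!dir.exists(projDir)) {
  dir.create(projDir, recursive = TRUE)
}

# setwd() invisibly returns the previous working directory,
# which makes it easy to switch back later.
oldDir <- setwd(projDir)
print(getwd())

# Restore the original working directory.
setwd(oldDir)
```

Keeping the previous directory in a variable is optional, but it avoids leaving an R session pointed at the project folder after the work is done.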

We can download this foreclosure listings web page from within R (or you may instead choose to save the raw HTML from your web browser):

> download.file(url="http://www.phillysheriff.com/properties.html",destfile="properties.html")
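Because the listings page changes each month, it can be useful to keep a saved copy and avoid fetching it again on every run. The sketch below wraps download.file() in a small helper that skips the download when the destination file already exists. The helper name fetchListing and its overwrite argument are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: download a page unless a local copy already exists.
fetchListing <- function(url, destfile, overwrite = FALSE) {
  if (file.exists(destfile) && !overwrite) {
    # Reuse the saved copy; no network access is attempted.
    return(invisible(destfile))
  }
  # download.file() returns 0 on success.
  status <- download.file(url = url, destfile = destfile, quiet = TRUE)
  if (status != 0) {
    stop("download failed with status ", status)
  }
  invisible(destfile)
}

# Demonstration without network access: the destination file already
# exists, so the helper returns its path and downloads nothing.
demoFile <- tempfile(fileext = ".html")
writeLines("<html></html>", demoFile)
result <- fetchListing("http://www.phillysheriff.com/properties.html", demoFile)
print(result)
```

Calling fetchListing() with overwrite=TRUE would force a fresh download, which is what you would want at the start of a new month.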

Here is some of this web page's source HTML, with addresses highlighted:

6321 Farnsworth St. 62nd Ward 1,379.88 sq. ft. BRT# 621533500 Improvements: Residential Property
HOMER SIMPSON C.P. January Term, 2006 No. 002619 $27,537.87 Phelan Hallinan & Schmieg, L.L.P.

243-467

1402 E. Mt. Pleasant Ave. 50th Ward approximately 1,416 sq. ft. more or less BRT# 502440300 ...

The sheriff's raw HTML listings are inconsistently formatted, but with the right regular expression we can identify street addresses: notice how they appear alone on a line. Our goal is to submit viable addresses to the geocoder. Here are some typical addresses that our regular expression should match:

3509 N. Lee St.
2120-2128 E. Allegheny Ave.
7601 Crittenden St., #E-10
370 Tomlinson Place
2311 N. 33rd St.
6822-24 Old York Rd.
335 W. School House Lane

These are not addresses and should not be matched:

2,700 sq. ft. BRT# 124077100 Improvements: Residential Property
C.P. June Term, 2009 No. 00575

R has built-in functions that allow the use of Perl-type regular expressions. For more info on regular expressions, see Mastering Regular Expressions (O'Reilly) and Regular Expression Pocket Reference (O'Reilly).

With some minor deletions to clean up address idiosyncrasies, we should be able to correctly identify street addresses from the mess of other data contained in properties.html. We'll use a single regular expression pattern to do the cleanup. For clarity, we can break the pattern into the familiar elements of an address (number, name, suffix):

> stNum<-"^[0-9]{2,5}(\\-[0-9]+)?"
> stName<-"([NSEW]\\. )?[0-9A-Z ]+"
> stSuf<-"(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
> myStPat<-paste(stNum,stName,stSuf,sep=" ")

Note that the backslash characters themselves must be escaped with a backslash to avoid conflict with R syntax. Let's test this pattern against our examples using R's grep() function:

> grep(myStPat,"6822-24 Old York Rd.",perl=TRUE,value=FALSE,ignore.case=TRUE)
[1] 1
> grep(myStPat,"2,700 sq. ft. BRT# 124077100 Improvements: Residential Property",perl=TRUE,value=FALSE,ignore.case=TRUE)
integer(0)

With value=FALSE, grep() returns the indices of the elements that match. Because we tested only one string at a time, the result [1] 1 shows that our target address string matched, while integer(0) shows that the second string did not. We also have to omit strings that we don't want with our address, such as extra punctuation (like quotes or commas), or sheriff's office designations that follow street names:

> badStrings<-"(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|,Unit.+|Apt\\..+| #.+$|[,\"]|\\s+$)"

Test this against some examples using R's gsub() function:

> gsub(badStrings,'',"119 Hagy's Mill Rd. a/k/a 119 Spring Lane",perl=TRUE)
[1] "119 Hagy's Mill Rd."
> gsub(badStrings,'',"3229 Hurley St. - Premise A",perl=TRUE)
[1] "3229 Hurley St."
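Both gsub() and grepl() accept a whole character vector, so the cleanup and the match can be checked against all of the example strings in one pass. The sketch below does that. The street pattern is assembled as in the text; the cleanup pattern shortBadStrings is a shortened version kept only for this illustration (an assumption, not the full pattern), and the names candidates, cleaned, and isAddress are illustrative.

```r
# Street-address pattern, assembled from its three parts as in the text.
stNum   <- "^[0-9]{2,5}(\\-[0-9]+)?"
stName  <- "([NSEW]\\. )?[0-9A-Z ]+"
stSuf   <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
myStPat <- paste(stNum, stName, stSuf, sep = " ")

# Assumption: a shortened cleanup pattern covering only the cases below.
shortBadStrings <- "( a\\/?[kd]\\/?a.+$| - Premise.+$| #.+$|[,\"]|\\s+$)"

# The seven example addresses, followed by the two non-address lines.
candidates <- c(
  "3509 N. Lee St.",
  "2120-2128 E. Allegheny Ave.",
  "7601 Crittenden St., #E-10",
  "370 Tomlinson Place",
  "2311 N. 33rd St.",
  "6822-24 Old York Rd.",
  "335 W. School House Lane",
  "2,700 sq. ft. BRT# 124077100 Improvements: Residential Property",
  "C.P. June Term, 2009 No. 00575"
)

# Clean every string first, then test every cleaned string.
cleaned   <- gsub(shortBadStrings, "", candidates, perl = TRUE)
isAddress <- grepl(myStPat, cleaned, perl = TRUE, ignore.case = TRUE)
print(data.frame(cleaned, isAddress))
```

The order matters: "7601 Crittenden St., #E-10" only matches the street pattern after the comma and the unit designation have been stripped, because the pattern requires the street suffix to end the line.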

Let's encapsulate this address parsing into a function that will accept an HTML file and return a vector, a one-dimensional ordered collection with a specific data type, in this case character. Copy and paste this entire block into your R console:

#input:html filename
#returns:character vector of cleaned street addresses
getAddressesFromHTML<-function(myHTMLDoc){
  myStreets<-vector(mode="character",0)
  stNum<-"^[0-9]{2,5}(\\-[0-9]+)?"
  stName<-"([NSEW]\\. )?([0-9A-Z ]+)"
  stSuf<-"(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
  badStrings<-paste("(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|,",
    "Unit.+|Apt\\..+| #.+$|[,\"]|\\s+$)")
  myStPat<-paste(stNum,stName,stSuf,sep=" ")
  for(line in readLines(myHTMLDoc)){
    #strip unwanted text, then keep the line if it looks like an address
    line<-gsub(badStrings,'',line,perl=TRUE)
    matches<-grep(myStPat,line,perl=TRUE,value=FALSE,ignore.case=TRUE)
    if(length(matches)>0){
      myStreets<-append(myStreets,line)
    }
  }
  myStreets
}
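The same parsing idea can also be written without an explicit loop, since readLines(), gsub(), and grepl() all work on whole character vectors. The following is a sketch of that variant, not the book's function: the name getAddressesVectorized, its stPat and badPat parameters, the shortened cleanup pattern, and the five-line sample file are all assumptions made for illustration.

```r
# Hypothetical vectorized variant: read all lines, clean them, and keep
# the ones that match the street pattern.
getAddressesVectorized <- function(myHTMLDoc, stPat, badPat) {
  rawLines <- readLines(myHTMLDoc, warn = FALSE)
  cleaned  <- gsub(badPat, "", rawLines, perl = TRUE)
  cleaned[grepl(stPat, cleaned, perl = TRUE, ignore.case = TRUE)]
}

# Street pattern assembled as in the text.
stNum   <- "^[0-9]{2,5}(\\-[0-9]+)?"
stName  <- "([NSEW]\\. )?[0-9A-Z ]+"
stSuf   <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
myStPat <- paste(stNum, stName, stSuf, sep = " ")

# Assumption: a shortened cleanup pattern, enough for the sample below.
shortBadStrings <- "( a\\/?[kd]\\/?a.+$| - Premise.+$| #.+$|[,\"]|\\s+$)"

# Assumption: a small stand-in file, one listing fragment per line.
sampleFile <- tempfile(fileext = ".html")
writeLines(c(
  "3509 N. Lee St.",
  "2,700 sq. ft. BRT# 124077100 Improvements: Residential Property",
  "C.P. June Term, 2009 No. 00575",
  "7601 Crittenden St., #E-10",
  "3229 Hurley St. - Premise A"
), sampleFile)

streets <- getAddressesVectorized(sampleFile, myStPat, shortBadStrings)
print(streets)
```

Growing a vector with append() inside a loop copies it on every iteration, so the vectorized form tends to scale better on long pages, though for a single monthly listing either approach should be fast enough.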
