How do you use R to import, manage, visualize, and analyze real-world data? With this short, hands-on tutorial, you learn how to collect online data, massage it into a reasonable form, and work with it using R facilities to interact with web servers, parse HTML and XML, and more. Rather than use canned sample data, you'll plot and analyze current home foreclosure auctions in Philadelphia. This practical mashup exercise shows you how to access spatial data in several formats locally and over the Web to produce a map of home foreclosures. It's an excellent way to explore how the R environment works with R packages and performs statistical analysis.

Parse messy data from public foreclosure auction postings
Plot the data using R's PBSmapping package
Import US Census data to add context to foreclosure data
Use R's lattice and latticeExtra packages for data visualization
Create multidimensional correlation graphs with the pairs() scatterplot matrix function

Data Mashups in R
Jeremy Leipzig
Xiao-Yi Li
Editor
Mike Loukides

Copyright 2011 Jeremy Leipzig and Xiao-Yi Li

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. Data Mashups in R, the image of a black-billed Australian bustard, and related trade dress are trademarks of O'Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

O'Reilly Media

Introduction

Programmers may spend a good part of their careers scripting code to conform to commercial statistics packages, visualization tools, and domain-specific third-party software. The same tasks can force end users to spend countless hours in copy-paste purgatory, each minor change necessitating another grueling round of formatting tabs and screenshots. Luckily, R scripting offers some reprieve. Because this open source project garners the support of a large community of package developers, the R statistical programming environment provides an amazing level of extensibility. Data from a multitude of sources can be imported into R and processed using R packages to aid statistical analysis and visualization. R scripts can also be configured to produce high-quality reports in an automated fashion, saving time, energy, and frustration.

This book will demonstrate how real-world data is imported, managed, visualized, and analyzed within R. Spatial mashups provide an excellent way to explore the capabilities of R, encompassing R packages, R syntax, and data structures. Instead of canned sample data, we will be plotting and analyzing actual current home foreclosure auctions. Through this exercise, we hope to provide a general idea of how the R environment works with R packages as well as its own capabilities in statistical analysis. We will be accessing spatial data in several formats (HTML, XML, shapefiles, and text), both locally and over the web, to produce a map of home foreclosures and perform statistical analysis on these events.

Chapter 1. Mapping Foreclosures
Messy Address Parsing

To illustrate how to combine data from disparate sources for statistical analysis and visualization, let's focus on one of the messiest sources of data around: web pages.

The Philadelphia sheriff's office posts foreclosure auctions on its website each month. How do we collect this data, massage it into a reasonable form, and work with it? First, create a new folder (for example, ~/Documents/Rmashup) to contain our project files. It is helpful to change the R working directory to your newly created folder.

# In Unix/MacOS
> setwd("~/Documents/Rmashup/")
# In Windows
> setwd("C:/~/Rmashup/")
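The folder can also be created from within R itself. The following is a minimal sketch rather than part of the original listing: `useProjectDir` is a hypothetical helper name, and the path in the example call is only an illustration.

```r
# Hypothetical helper: create the project folder if it is missing, then
# make it the working directory. Returns the previous directory invisibly.
useProjectDir <- function(path) {
  path <- path.expand(path)              # resolve a leading "~"
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)   # also creates missing parent folders
  }
  invisible(setwd(path))                 # setwd() returns the old directory
}

# Example call (the path is an assumption; adjust it to your system):
# useProjectDir("~/Documents/Rmashup")
```

Keeping the previous directory as the return value makes it easy to switch back afterwards with setwd().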

We can download this foreclosure listings web page from within R (or you may instead choose to save the raw HTML from your web browser):

> download.file(url="http://www.phillysheriff.com/properties.html",destfile="properties.html")
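Because the listings are reposted each month, it may be worth keeping the first copy you download so that later steps always run against the same file. A small sketch under that assumption follows; `fetchListings` is a hypothetical wrapper name, not something defined in this book.

```r
# Hypothetical wrapper: download the page only if no local copy exists yet.
fetchListings <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("Using existing copy: ", destfile)
    return(invisible(destfile))
  }
  # download.file() returns 0 on success and a non-zero code otherwise
  status <- download.file(url = url, destfile = destfile)
  if (status != 0) {
    stop("Download of ", url, " returned status ", status)
  }
  invisible(destfile)
}

# Example call, using the URL from the text:
# fetchListings("http://www.phillysheriff.com/properties.html", "properties.html")
```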

Here is some of this web page's source HTML, with addresses highlighted:

6321 Farnsworth St. 62nd Ward 1,379.88 sq. ft. BRT# 621533500 Improvements: Residential Property
HOMER SIMPSON C.P. January Term, 2006 No. 002619 $27,537.87 Phelan Hallinan & Schmieg, L.L.P.
243-467
1402 E. Mt. Pleasant Ave. 50th Ward approximately 1,416 sq. ft. more or less BRT# 502440300 ...

The sheriff's raw HTML listings are inconsistently formatted, but with the right regular expression we can identify street addresses: notice how they appear alone on a line. Our goal is to submit viable addresses to the geocoder. Here are some typical addresses that our regular expression should match:

3509 N. Lee St.
2120-2128 E. Allegheny Ave.
7601 Crittenden St., #E-10
370 Tomlinson Place
2311 N. 33rd St.
6822-24 Old York Rd.
335 W. School House Lane

These are not addresses and should not be matched:

2,700 sq. ft.
BRT# 124077100
Improvements: Residential Property
C.P. June Term, 2009 No. 00575

R has built-in functions that allow the use of Perl-type regular expressions. For more info on regular expressions, see Mastering Regular Expressions (O'Reilly) and Regular Expression Pocket Reference (O'Reilly).

With some minor deletions to clean up address idiosyncrasies, we should be able to correctly identify street addresses from the mess of other data contained in properties.html. We'll use a single regular expression pattern to do the matching. For clarity, we can break the pattern into the familiar elements of an address (number, name, suffix):

> stNum<-"^[0-9]{2,5}(\\-[0-9]+)?"
> stName<-"([NSEW]\\. )?[0-9A-Z ]+"
> stSuf<-"(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
> myStPat<-paste(stNum,stName,stSuf,sep=" ")
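To check what paste() actually assembles, it can help to print the combined pattern. The short sketch below repeats the three definitions so it runs on its own. cat() writes the string as the regular expression engine will receive it, so each doubled backslash in the R source appears as a single backslash.

```r
stNum  <- "^[0-9]{2,5}(\\-[0-9]+)?"
stName <- "([NSEW]\\. )?[0-9A-Z ]+"
stSuf  <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
myStPat <- paste(stNum, stName, stSuf, sep = " ")

# Show the pattern exactly as the regex engine sees it
cat(myStPat, "\n")
# ^[0-9]{2,5}(\-[0-9]+)? ([NSEW]\. )?[0-9A-Z ]+ (St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\.?)$

# "\\-" in R source is two characters: a backslash and a hyphen
nchar("\\-")
# [1] 2
```

The sep=" " argument is what inserts the single spaces between the number, name, and suffix parts of the pattern.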

Note that the backslash characters themselves must be escaped with a backslash to avoid conflict with R syntax. Let's test this pattern against our examples using R's grep() function:

> grep(myStPat,"6822-24 Old York Rd.",perl=TRUE,value=FALSE,ignore.case=TRUE)
[1] 1
> grep(myStPat,"2,700 sq. ft. BRT# 124077100 Improvements: Residential Property",perl=TRUE,value=FALSE,ignore.case=TRUE)
integer(0)

With value=FALSE, grep() returns the positions of the elements that match. Because we tested only one string at a time, the result [1] 1 means the first (and only) string matched, while integer(0) means nothing matched. We also have to strip strings that we don't want in our addresses, such as extra punctuation (like quotes or commas), or sheriff's office designations that follow street names:

> badStrings<-"(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|,Unit.+|Apt\\..+| #.+$|[,\"]|\\s+$)"

Test this against some examples using R's gsub() function:

> gsub(badStrings,'',"119 Hagy's Mill Rd. a/k/a 119 Spring Lane",perl=TRUE)
[1] "119 Hagy's Mill Rd."
> gsub(badStrings,'',"3229 Hurley St. - Premise A",perl=TRUE)
[1] "3229 Hurley St."
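Both gsub() and the grep family are vectorized, so the cleanup and the match can be tried on several strings at once. The sketch below is an illustration, not the book's own listing. It repeats the patterns so it runs on its own, it uses grepl() (which returns TRUE/FALSE per element) in place of grep(), and the `candidates` vector is a made-up mix of the sample strings shown earlier.

```r
stNum  <- "^[0-9]{2,5}(\\-[0-9]+)?"
stName <- "([NSEW]\\. )?[0-9A-Z ]+"
stSuf  <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
myStPat <- paste(stNum, stName, stSuf, sep = " ")
badStrings <- "(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|,Unit.+|Apt\\..+| #.+$|[,\"]|\\s+$)"

# Illustrative input: three addresses with clutter, two non-addresses
candidates <- c("3509 N. Lee St.",
                "7601 Crittenden St., #E-10",
                "3229 Hurley St. - Premise A",
                "2,700 sq. ft.",
                "BRT# 124077100")

# Step 1: strip unwanted fragments from every string
cleaned <- gsub(badStrings, "", candidates, perl = TRUE)

# Step 2: flag the strings that now look like street addresses
isAddress <- grepl(myStPat, cleaned, perl = TRUE, ignore.case = TRUE)

cleaned[isAddress]
# [1] "3509 N. Lee St."     "7601 Crittenden St." "3229 Hurley St."
```

One limitation is worth knowing: a cleaned string such as "119 Hagy's Mill Rd." does not pass this pattern as written, because the apostrophe falls outside the [0-9A-Z ] character class used for the street name.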

Let's encapsulate this address parsing into a function that will accept an HTML file and return a vector, a one-dimensional ordered collection with a specific data type, in this case character. Copy and paste this entire block into your R console:

# input: HTML filename
# returns: character vector of street addresses found in the file
getAddressesFromHTML<-function(myHTMLDoc){
  myStreets<-vector(mode="character",0)
  stNum<-"^[0-9]{2,5}(\\-[0-9]+)?"
  stName<-"([NSEW]\\. )?([0-9A-Z ]+)"
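The listing above breaks off at the end of the page. As a rough sketch of how the pieces introduced so far could fit together, the function below reads a file line by line, cleans each line, and keeps the lines that match. It is an assumption-laden illustration, not the book's listing: `sketchGetAddresses` is a hypothetical name, the suffix and cleanup patterns are carried over from the earlier examples, and it assumes each address sits alone on its own line without surrounding markup.

```r
# Hypothetical sketch, not the original getAddressesFromHTML()
# input:   path to a text/HTML file
# returns: character vector of cleaned lines that look like street addresses
sketchGetAddresses <- function(myHTMLDoc) {
  stNum  <- "^[0-9]{2,5}(\\-[0-9]+)?"
  stName <- "([NSEW]\\. )?([0-9A-Z ]+)"
  stSuf  <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"   # assumed, from earlier
  myStPat <- paste(stNum, stName, stSuf, sep = " ")
  badStrings <- "(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|,Unit.+|Apt\\..+| #.+$|[,\"]|\\s+$)"

  htmlLines <- readLines(myHTMLDoc, warn = FALSE)       # one element per line
  cleaned <- gsub(badStrings, "", htmlLines, perl = TRUE)
  cleaned[grepl(myStPat, cleaned, perl = TRUE, ignore.case = TRUE)]
}

# Example call, assuming properties.html is in the working directory:
# myStreets <- sketchGetAddresses("properties.html")
```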