Seppe vanden Broucke - Practical Web Scraping for Data Science: Best Practices and Examples with Python

Year: 2018. Publisher: Apress.


This book provides a complete and modern guide to web scraping, using Python as the programming language, without glossing over important details or best practices. Written with a data science audience in mind, the book explores both scraping and the larger context of web technologies in which it operates, to ensure full understanding. The authors recommend web scraping as a powerful tool for any data scientist's arsenal, as many data science projects start by obtaining an appropriate data set.

Starting with a brief overview of scraping and real-life use cases, the authors explore the core concepts of HTTP, HTML, and CSS to provide a solid foundation. Along with a quick Python primer, they cover Selenium for JavaScript-heavy sites, and web crawling in detail. The book finishes with a recap of best practices and a collection of examples that bring together everything you've learned and illustrate various data science use cases.

What You'll Learn

  • Leverage well-established best practices and commonly-used Python packages
  • Handle today's web, including JavaScript, cookies, and common web scraping mitigation techniques
  • Understand the managerial and legal concerns regarding web scraping
Who This Book is For
A data science-oriented audience that is probably already familiar with Python or another programming language or analytical toolkit (R, SAS, SPSS, etc.). Students or instructors in university courses may also benefit. Readers unfamiliar with Python will appreciate the quick Python primer in chapter 1, which covers the basics and provides pointers to other guides as well.

Part I
Web Scraping Basics
© Seppe vanden Broucke and Bart Baesens 2018
1. Introduction
Seppe vanden Broucke (1) and Bart Baesens (2)
(1) KU Leuven, Leuven, Belgium
(2) Dept of Decision Sci & Info Managem, KU Leuven, Leuven, Belgium
In this chapter, we introduce you to the concept of web scraping and highlight why the practice is useful to data scientists. After illustrating some interesting recent use cases of web scraping across various fields and industry sectors, we make sure you're ready to get started with web scraping by preparing your programming environment.
1.1 What Is Web Scraping?
Web scraping (also called web harvesting, web data extraction, or even web data mining) can be defined as the construction of an agent to download, parse, and organize data from the web in an automated manner. In other words, instead of a human end user clicking away in a web browser and copy-pasting interesting parts into, say, a spreadsheet, web scraping offloads this task to a computer program that can execute it much faster, and more correctly, than a human can.
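To make the three steps in this definition concrete, here is a minimal sketch using only Python's standard library. It is an illustration, not the approach the book develops later: the page content, the URL, and the names `SAMPLE_HTML`, `TableParser`, `download`, `parse`, and `organize` are all hypothetical, and the download step is stubbed out with a literal string so that the example runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded web page.
# The values are placeholders, not real data.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Count</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def download(url):
    # Step 1 (stubbed): a real agent would fetch the page over HTTP here.
    # The url argument is ignored in this sketch.
    return SAMPLE_HTML


def parse(html):
    # Step 2: turn the raw HTML text into rows of cell strings.
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def organize(rows):
    # Step 3: use the first row as column names for the remaining rows.
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


records = organize(parse(download("https://example.com/table")))
print(records)
```

In a real scraper, the stub in `download` would be replaced by an actual HTTP request, and the parsing step would typically rely on a dedicated HTML parsing library rather than a hand-written parser class.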
The automated gathering of data from the Internet is probably as old as the Internet itself, and the term "scraping" has been around for much longer than the web. Before "web scraping" became popularized as a term, a practice known as "screen scraping" was already well established as a way to extract data from a visual representation, which in the early days of computing (think 1960s-80s) often boiled down to simple, text-based terminals. Just as today, people in those days were also interested in scraping large amounts of text from such terminals and storing this data for later use.
1.1.1 Why Web Scraping for Data Science?
When surfing the web using a normal web browser, you've probably encountered multiple sites where you considered the possibility of gathering, storing, and analyzing the data presented on the site's pages. Especially for data scientists, whose raw material is data, the web exposes a lot of interesting opportunities:
  • There might be an interesting table on a Wikipedia page (or pages) you want to retrieve to perform some statistical analysis.
  • Perhaps you want to get a list of reviews from a movie site to perform text mining, create a recommendation engine, or build a predictive model to spot fake reviews.
  • You might wish to get a listing of properties on a real-estate site to build an appealing geo-visualization.
  • You'd like to gather additional features to enrich your data set based on information found on the web, say, weather information to forecast, for example, soft drink sales.
  • You might be wondering about doing social network analytics using profile data found on a web forum.
  • It might be interesting to monitor a news site for trending new stories on a particular topic of interest.
The web contains lots of interesting data sources that provide a treasure trove for all sorts of interesting things. Sadly, the current unstructured nature of the web does not always make it easy to gather or export this data. Web browsers are very good at showing images, displaying animations, and laying out websites in a way that is visually appealing to humans, but they do not expose a simple way to export their data, at least not in most cases. Instead of viewing the web page by page through your web browser's window, wouldn't it be nice to be able to automatically gather a rich data set? This is exactly where web scraping enters the picture.
If you know your way around the web a bit, you'll probably be wondering: "Isn't this exactly what Application Programming Interfaces (APIs) are for?" Indeed, many websites nowadays offer such an API, which gives the outside world a means to access their data repository in a structured way, meant to be consumed by computer programs rather than humans (although the programs are written by humans, of course). Twitter, Facebook, LinkedIn, and Google, for instance, all provide such APIs in order to search and post tweets, get a list of your friends and their likes, see who you're connected with, and so on. So why, then, would we still need web scraping? The point is that APIs are a great means to access data sources, provided the website at hand offers one to begin with and that the API exposes the functionality you want. The general rule of thumb is to look for an API first and use that if you can, before setting off to build a web scraper to gather the data. For instance, you can easily use Twitter's API to get a list of recent tweets, instead of reinventing the wheel yourself. Nevertheless, there are still various reasons why web scraping might be preferable to the use of an API:
  • The website you want to extract data from does not provide an API.
  • The API provided is not free (whereas the website is).
  • The API provided is rate limited, meaning you can only access it a certain number of times per second, per day, and so on.
  • The API does not expose all the data you wish to obtain (whereas the website does).
In all of these cases, the usage of web scraping might come in handy. The fact remains that if you can view some data in your web browser, you will be able to access and retrieve it through a program. If you can access it through a program, the data can be stored, cleaned, and used in any way.
1.1.2 Who Is Using Web Scraping?
There are many practical applications of having access to and gathering data on the web, many of which fall in the realm of data science. The following list outlines some interesting real-life use cases:
  • Many of Google's products have benefited from Google's core business of crawling the web. Google Translate, for instance, utilizes text stored on the web to train and improve itself.
  • Scraping is being applied a lot in HR and employee analytics. The San Francisco-based startup hiQ specializes in selling employee analyses by collecting and examining public profile information, for instance, from LinkedIn (which was not happy about this but was so far unable to prevent the practice following a court case; see https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-data-and-sell-it-to-your-boss).
  • Digital marketeers and digital artists often use data from the web for all sorts of interesting and creative projects. "We Feel Fine" by Jonathan Harris and Sep Kamvar, for instance, scraped various blog sites for phrases starting with "I feel," the results of which were then used to visualize how the world was feeling throughout the day.
  • In another study, messages from Twitter, blogs, and other social media were scraped to construct a data set that was used to build a predictive model for identifying patterns of depression and suicidal thoughts. This might be an invaluable tool for aid providers, though of course it warrants a thorough consideration of privacy-related issues as well (see https://www.sas.com/en_ca/insights/articles/analytics/using-big-data-to-predictsuicide-risk-canada.html).
  • Emmanuel Sales also scraped Twitter, though here with the goal of making sense of his own social circle and timeline of posts (see https://emsal.me/blog/4). An interesting observation here is that the author first considered using Twitter's API, but found that Twitter heavily rate limits it: if you want to get a user's follow list, then you can only do so 15 times every 15 minutes, which is pretty unwieldy to work with.
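To get a feel for how restrictive the limit in the last example is, the following sketch works out the minimum time spent waiting when follow lists are requested for many users. It rests on simplifying assumptions that are not from the text: one request per user, fixed (not sliding) windows, and all requests allowed in a window being sent at its start. The function name `minutes_waiting` is hypothetical.

```python
import math


def minutes_waiting(n_requests, limit=15, window_minutes=15):
    """Minimum minutes spent waiting for new rate-limit windows to open.

    Assumes fixed windows and that every request allowed in a window
    is sent right at its start (a simplification).
    """
    if n_requests <= 0:
        return 0
    windows = math.ceil(n_requests / limit)
    # The first window opens immediately; each later one costs a full wait.
    return (windows - 1) * window_minutes


print(minutes_waiting(15))    # 0: everything fits in the first window
print(minutes_waiting(16))    # 15: one extra window is needed
print(minutes_waiting(1000))  # 990 minutes, i.e., 16.5 hours
```

Under these assumptions, fetching the follow lists of 1,000 users takes at least 16.5 hours, which illustrates why a rate-limited API can become a reason to consider scraping instead.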