1.1 Introduction
The degree of structure of Web content is the determining factor for the types of functionality that search engines can provide. The better structured Web content is, the easier it is for search engines to understand it and, based on this understanding, to provide advanced functionality such as faceted filtering or the aggregation of content from multiple Web sites.
Today, most Web sites are generated from structured data that is stored in relational databases. Publishing this structured data directly on the Web in addition to HTML pages therefore requires little extra effort from Web sites, and it helps search engines to understand Web content and provide improved functionality.
An early approach to realizing this idea and helping search engines to understand Web content is Microformats. RDFa was later introduced as an alternative, more generic language for embedding any type of data into HTML pages.
Today, major search engines such as Google, Yahoo!, and Bing extract Microformat and RDFa data describing products, reviews, persons, events, and recipes from Web pages and use the extracted data to improve the users' search experience. The search engines have started to aggregate structured data from different Web sites and augment their search results with these aggregated information units in the form of rich snippets, which combine, for instance, data describing a product with reviews of the product from different sites.
The support of Microformats and RDFa by major data consumers, such as Google, Yahoo!, Microsoft, and Facebook, has led to a sharp increase in the number of Web sites that embed structured data into HTML pages. According to statistics presented by Yahoo!, the number of Web pages containing RDFa data increased by 510% between 2009 and 2010. As of October 2010, 430 million Web pages contained RDFa markup, while over 300 million pages contained Microformat data. [...] will be equally understood by all three search engines. This move toward standardization is likely to further increase the amount of structured data being published on the Web.
Parallel to the different techniques for embedding structured data into HTML pages, a set of best practices for publishing structured data directly on the Web has gained considerable traction: Linked Data.
This chapter gives an overview of the topology of the Web of Data that has been created by publishing data on the Web using the Microformats, RDFa, Microdata, and Linked Data publishing techniques. Section discusses Linked Data and gives an overview of the Linked Data deployment on the Web. For each of the four techniques, we:
Summarize the main features and give an overview of the history of the technique
Provide a syntax example which shows how data describing a person is published on the Web using the technique
Present deployment statistics showing the amounts and types of data currently published using the specific technique
The syntax examples highlight how the different techniques handle (1) the identification of entities; (2) the representation of type information, e.g., that an entity is a person; (3) the representation of literal property values, such as the name of the person; and (4) the representation of relationships between entities, such as that Peter Smith knows Paula Jones.
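To make these four aspects concrete independently of any particular syntax, the following minimal sketch models them as plain Python values. The `Entity` and `Statement` types, the example.org identifiers, and the property names `type`, `name`, and `knows` are illustrative assumptions and are not taken from any of the four techniques.

```python
from typing import List, NamedTuple, Union


class Entity(NamedTuple):
    """An entity identified by a URI (hypothetical identifiers below)."""
    uri: str


class Statement(NamedTuple):
    """One fact about an entity; the value is a literal or another entity."""
    subject: Entity
    predicate: str
    value: Union[Entity, str]


# (1) Identification of entities: each person gets its own identifier.
peter = Entity("http://example.org/peter#me")
paula = Entity("http://example.org/paula#me")

statements = [
    Statement(peter, "type", "Person"),       # (2) type information
    Statement(peter, "name", "Peter Smith"),  # (3) literal property value
    Statement(peter, "knows", paula),         # (4) relationship between entities
]


def relationships(stmts: List[Statement]) -> List[Statement]:
    """Return only the statements whose value is another entity."""
    return [s for s in stmts if isinstance(s.value, Entity)]


for s in relationships(statements):
    print(s.subject.uri, s.predicate, s.value.uri)
```

For brevity, the sketch stores the type as a plain string; a publishing technique may instead identify types by identifiers of their own.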
In order to provide an entry point for experimentation as well as for the evaluation of search engines that facilitate Web data, Sect. gives an overview of large-scale datasets that have been crawled from the Web of Data and are publicly available for download.
1.2 Microformats
Microformats (also referred to as μF) are community-driven vocabulary agreements for providing semantic markup on Web pages. The motto of the Microformats community is "designed for humans first, machines second." Each Microformat defines a vocabulary and a syntax for applying the vocabulary to describe the content on Web pages. A Microformat's syntax commonly specifies which properties are required or optional and which classes should be nested under one another.
Microformats emerged as a community effort, in contrast to other semantic markup technologies, which sought the route of a standardization body. Early contributors to Microformats include Kevin Marks, Tantek Çelik, and Mark Pilgrim, among others. The first implementations of Microformats date back to 2003.
It is argued that the simple approach offered by Microformats eases the learning curve and therefore lowers the entry barrier for newcomers. On the other hand, because there is no unified syntax for all Microformats, consuming structured data from Microformats requires the development of a specialized parser for each format. This reflects the Microformats approach of addressing specific use cases, in contrast to RDFa and Microdata (presented in the following sections), which support the representation of any kind of data.
1.2.1 Microformats Syntax
Microformats consist of a definition of a vocabulary (names for classes and properties), as well as a set of rules (e.g., required properties, correct nesting of elements). These rules largely rely on existing HTML/XHTML attributes for inserting markup. One example is the HTML attribute class, commonly used as a style sheet selector, which is reused in Microformats for describing properties and types of entities.
Figure shows the properties url (Peter's home page) and fn (Peter's full name). The markup also states that Peter knows Paula through the use of the properties met and acquaintance defined in the XFN microformat. An hCard parser should be aware that the url property refers to the value of the href attribute, while fn refers to the value of the child text of the HTML element. Such parsing instructions are described within each microformat specification.
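The parsing rules just described can be illustrated with a minimal sketch built on Python's standard `html.parser` module. The markup in `SAMPLE` is a hypothetical stand-in and not the figure from this chapter: the example.org URLs, the element layout, and the names `HCardSketchParser` and `parse_hcard` are assumptions made for illustration. The sketch relies on the facts that hCard uses `vcard` as its root class name and that XFN carries its values in the `rel` attribute of hyperlinks; it handles only the properties discussed here and is not a complete hCard or XFN parser.

```python
from html.parser import HTMLParser

# Hypothetical markup; names, URLs, and layout are illustrative assumptions.
SAMPLE = """
<div class="vcard">
  <a class="url fn" href="http://example.org/peter">Peter Smith</a>
  knows
  <a rel="met acquaintance" href="http://example.org/paula">Paula Jones</a>
</div>
"""

# Only the two XFN values mentioned in the text are recognized here.
XFN_VALUES = {"met", "acquaintance"}


class HCardSketchParser(HTMLParser):
    """Collects the hCard url/fn values and XFN links found on <a> elements.

    Simplification: the parser does not check that the properties are
    nested inside an element with class "vcard".
    """

    def __init__(self):
        super().__init__()
        self.card = {}            # hCard properties found so far
        self.relations = []       # (XFN value, target URL) pairs
        self._capture_fn = False  # True while inside an element with class "fn"

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        rels = (attributes.get("rel") or "").split()
        href = attributes.get("href")
        if "url" in classes and href:
            # url is read from the href attribute, not from the element text.
            self.card["url"] = href
        if "fn" in classes:
            # fn is read from the child text of the element.
            self._capture_fn = True
            self.card["fn"] = ""
        for rel in rels:
            if rel in XFN_VALUES and href:
                self.relations.append((rel, href))

    def handle_data(self, data):
        if self._capture_fn:
            self.card["fn"] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._capture_fn = False


def parse_hcard(html):
    """Return (card properties, XFN relations) extracted from the markup."""
    parser = HCardSketchParser()
    parser.feed(html)
    parser.close()
    if "fn" in parser.card:
        parser.card["fn"] = parser.card["fn"].strip()
    return parser.card, parser.relations


card, relations = parse_hcard(SAMPLE)
print(card)
print(relations)
```

The format-specific knowledge (which class names matter, and whether a value comes from an attribute or from the element text) is hard-coded in the parser, which is why each Microformat needs a parser of its own.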