1.1 Information Search in Times of Big Data
Information search constitutes an integral part of almost everybody's everyday life. Today's web search engines manage to rank the most relevant result highest for a large fraction of the information needs implied by search queries. Following Manning et al. (2008), an information need can be seen as a topic about which a user desires to know more. A result is relevant if it yields information that helps to fulfill the information need at hand.
Fig. 1.1
Screenshot of Pentaho Big Data Analytics as an example of enterprise software. The heat grid shown visualizes the vehicle sales of a company.
Instead of directly providing relevant information, however, state-of-the-art web search engines mostly return only links to web pages that may contain relevant information, often thousands or millions of them. This can make search time-consuming or even unsuccessful for queries where relevant information has to be derived (e.g. for the query "locations of search companies"), should be aggregated (e.g. "user opinions on bing"), or seems like a needle in a haystack (e.g. "if it isn't on google it doesn't exist"), and so forth.
For enterprise environments, big data analytics applications aim to infer such high-quality information, in the sense of relations, patterns, and hidden facts, from vast amounts of data (Davenport 2012). Figure 1.1 shows an example of such an application. As with this software, big data analytics is still only on the verge of including unstructured texts in its analyses, although such texts are assumed to make up 95% of all enterprise-relevant data (HP Labs 2010). To provide answers to a wide spectrum of information needs, relevant texts must be filtered and relevant information must be identified in these texts. We hence argue that search engines and big data analytics applications need to perform more text mining.
1.1.1 Text Mining to the Rescue
Text mining brings together techniques from the research fields of information retrieval, data mining, and natural language processing in order to infer structured high-quality information from usually large numbers of unstructured texts (Ananiadou and McNaught 2005). While information retrieval deals, at its heart, with indexing and searching unstructured texts, data mining targets the discovery of patterns in structured data. Natural language processing, finally, is concerned with algorithms and engineering issues for the understanding and generation of speech and human-readable text (Tsujii 2011). It bridges the gap between the other two fields by converting unstructured into structured information. Text mining is studied within the broad interdisciplinary field of computational linguistics, as it addresses computational approaches from computer science to the processing of data and information while operationalizing findings from linguistics.
According to Sarawagi (2008), the most important text mining techniques for identifying and filtering relevant texts and information within the three fields come from the areas of information extraction and text classification. The former aims at extracting entities, relations between entities, and events the entities participate in from mostly unstructured text. The latter denotes the task of assigning a text to one or more predefined categories, such as topics, genres, or sentiment polarities. Information extraction, text classification, and similar tasks are considered in both natural language processing and information retrieval. In this book, we summarize these tasks under the term text analysis. All text analyses have in common that they can significantly increase the velocity of information search in many situations.
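To make the two tasks concrete, the following sketch implements a toy version of each in Python. The lexicons and rules are our own illustrative assumptions, not resources from the cited literature; real systems use learned models or curated resources instead.

```python
import re

# Toy lexicons; in practice these would be learned or curated resources.
ORGANIZATIONS = {"Google", "Bing"}
POSITIVE = {"great", "helpful", "relevant"}
NEGATIVE = {"slow", "useless", "irrelevant"}

def extract_organizations(text):
    """Information extraction: spot known organization entities in a text."""
    return [t for t in re.findall(r"\w+", text) if t in ORGANIZATIONS]

def classify_polarity(text):
    """Text classification: assign one of three predefined polarity categories."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(extract_organizations("Google and Bing rank results differently."))
print(classify_polarity("The answers were great and helpful, not useless."))
```

Even this crude sketch shows the structural difference between the tasks: extraction returns spans of the input, whereas classification maps the whole text to a category.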
In our past research project ArguAna, the goal was to classify and summarize opinions on products and their features found in large numbers of review texts. To this end, we analyzed the sequence of local sentiments on certain product features in each review in order to account for the argumentation of the texts.
Fig. 1.2
Google result page for the query "Charles Babbage", showing an example of directly providing relevant information instead of returning only web links.
Of course, major search engines already use text analysis when addressing information needs (Pasca 2011). For example, a Google search in late 2014 for Charles Babbage, the author of this chapter's introductory quote, led to the results in Fig. 1.2. But, in accordance with Babbage's quote, the benefit of text mining arising from the increase of velocity becomes more striking when turning from predefined text analyses in frequent use to arbitrary and more complex text analysis processes.
1.2 A Need for Efficient and Robust Text Analysis Pipelines
Text mining deals with tasks that often entail complex text analysis processes, consisting of several interdependent steps that aim to infer sophisticated information types from collections and streams of natural language input texts (cf. Chap. for details). In the research project InfexBA, different entity types (e.g. organization names) and event types (e.g. forecasts) had to be extracted from input texts and correctly brought into relation before they could be normalized and aggregated. Such steps require syntactic annotations of texts, e.g. part-of-speech tags and parse tree labels (Sarawagi 2008). These in turn can only be added to a text that is segmented into lexical units, e.g. into tokens and sentences. Similarly, text classification often relies on so-called features (Manning et al. 2008) that are derived from lexical and syntactic annotations or even from entities, as in ArguAna.
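The dependency chain just described can be sketched as follows, with each step consuming the output of the previous one. The naive sentence splitter, tokenizer, and suffix-based tagger are illustrative stand-ins for trained components, not the projects' actual algorithms.

```python
import re

def split_sentences(text):
    # Segmentation step 1: sentence boundaries (naive, punctuation-based).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Segmentation step 2: lexical units within one sentence.
    return re.findall(r"\w+|[^\w\s]", sentence)

def pos_tag(tokens):
    # Syntactic annotation: only possible once tokens exist.
    # Crude heuristic tags; a real pipeline uses a trained tagger.
    def tag(t):
        if not t[0].isalnum():
            return "PUNCT"
        if t[0].isupper():
            return "NNP"
        if t.endswith("ly"):
            return "RB"
        return "X"
    return [(t, tag(t)) for t in tokens]

text = "Google announced a forecast. Revenues rose sharply."
for sentence in split_sentences(text):
    print(pos_tag(tokenize(sentence)))
```

The point is not the quality of the individual steps but their ordering constraint: the tagger cannot run before tokenization, and tokenization cannot run before sentence splitting.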
To realize the steps of a text analysis process, text analysis algorithms are employed that annotate new information types in a text or that classify, relate, normalize, or filter previously annotated information. Such algorithms perform analyses of different computational cost, ranging from the typically cheap evaluation of single rules and regular expressions, over the matching of lexicon terms and the statistical classification of text fragments, to complex syntactic analyses like dependency parsing (Bohnet 2010). Because of the interdependencies between analyses, the standard way to realize a text analysis process is in the form of a text analysis pipeline, which sequentially applies each employed text analysis algorithm to its input.
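A minimal version of such a pipeline can be sketched as below: each algorithm reads a shared document representation and adds its own annotations, so the order of application matters. The algorithm names and the dictionary-based annotation scheme are illustrative assumptions; frameworks such as Apache UIMA realize the same idea on an industrial scale.

```python
import re

def tokenizer(doc):
    # Must run first: later algorithms depend on the token annotations.
    doc["tokens"] = re.findall(r"\w+", doc["text"])

def organization_tagger(doc):
    # Depends on tokens; uses a toy lexicon (an illustrative assumption).
    lexicon = {"Google", "Pentaho"}
    doc["organizations"] = [t for t in doc["tokens"] if t in lexicon]

class TextAnalysisPipeline:
    """Sequentially applies each employed text analysis algorithm."""

    def __init__(self, algorithms):
        self.algorithms = algorithms

    def process(self, text):
        doc = {"text": text}
        for algorithm in self.algorithms:
            algorithm(doc)  # each algorithm may rely on earlier annotations
        return doc

pipeline = TextAnalysisPipeline([tokenizer, organization_tagger])
doc = pipeline.process("Google acquired a stake in Pentaho.")
print(doc["organizations"])  # -> ['Google', 'Pentaho']
```

Reversing the algorithm list would make the tagger fail for lack of tokens, which is exactly the interdependency that motivates the pipeline architecture.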