The first forty years of life give us the text; the next thirty supply the commentary on it.
Arthur Schopenhauer
1.1 Introduction
The extraction of useful insights from text with various types of statistical algorithms is referred to as text mining, text analytics, or machine learning from text. The choice of terminology largely depends on the base community of the practitioner. This book will use these terms interchangeably. Text analytics has become increasingly popular in recent years because of the ubiquity of text data on the Web, social networks, emails, digital libraries, and chat sites. Some common examples of sources of text are as follows:
Digital libraries: Electronic content has outstripped the production of printed books and research papers in recent years. This phenomenon has led to the proliferation of digital libraries, which can be mined for useful insights. Some areas of research such as biomedical text mining specifically leverage the content of such libraries.
Electronic news: An increasing trend in recent years has been the de-emphasis of printed newspapers and a move towards electronic news dissemination. This trend creates a massive stream of news documents that can be analyzed for important events and insights. In some cases, such as Google News, the articles are indexed by topic and recommended to readers based on past behavior or specified interests.
Web and Web-enabled applications: The Web is a vast repository of documents that is further enriched with links and other types of side information. Web documents are also referred to as hypertext. The additional side information available with hypertext can be useful in the knowledge discovery process. In addition, many Web-enabled applications, such as social networks, chat boards, and bulletin boards, are a significant source of text for analysis.
Numerous applications exist, which differ in the types of insights one is trying to discover from a text collection. Some examples are as follows:
Search engines are used to index the Web and enable users to discover Web pages of interest. A significant amount of work has been done on crawling, indexing, and ranking tools for text data.
Text mining tools are often used to filter spam or identify interests of users in particular topics. In some cases, email providers might use the information mined from text data for advertising purposes.
Text mining is used by news portals to organize news items into relevant categories. Large collections of documents are often analyzed to discover relevant topics of interest. The learned categories are then used to classify incoming streams of documents.
Recommender systems use text mining techniques to infer interests of users in specific items, news articles, or other content. These learned interests are used to recommend news articles or other content to users.
The Web enables users to express their interests, opinions, and sentiments in various ways. This has led to the important area of opinion mining and sentiment analysis. Such opinion mining and sentiment analysis techniques are used by marketing companies to make business decisions.
The area of text mining is closely related to that of information retrieval, although the latter topic focuses on the database management issues rather than the mining issues. Because of the close relationship between the two areas, this book will also discuss some of the information retrieval aspects that are either considered seminal or are closely related to text mining.
The ordering of words in a document provides a semantic meaning that cannot be inferred from a representation based on only the frequencies of words in that document. Nevertheless, it is still possible to make many types of useful predictions without inferring the semantic meaning. There are two feature representations that are popularly used in mining applications:
Text as a bag-of-words: This is the most commonly used representation for text mining. In this case, the ordering of the words is not used in the mining process. The set of words in a document is converted into a sparse multidimensional representation, which is leveraged for mining purposes. Therefore, the universe of words (or terms) corresponds to the dimensions (or features) in this representation. For many applications such as classification, topic-modeling, and recommender systems, this type of representation is sufficient.
Text as a set of sequences: In this case, the individual sentences in a document are extracted as strings or sequences. Therefore, the ordering of words matters in this representation, although the ordering is often localized within sentence or paragraph boundaries. A document is often treated as a set of independent and smaller units (e.g., sentences or paragraphs). This approach is used by applications that require greater semantic interpretation of the document content. This area is closely related to that of language modeling and natural language processing. The latter is often treated as a distinct field in its own right.
Text mining has traditionally focused on the first type of representation, although recent years have seen increasing attention paid to the second. This is primarily because of the increasing importance of artificial intelligence applications in which language semantics, reasoning, and understanding are required. For example, question-answering systems, which require a greater degree of understanding and reasoning, have become increasingly popular in recent years.
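The difference between the two representations can be made concrete with a small sketch. The code below is only an illustration under simplifying assumptions: the helper names (tokenize, bag_of_words, sentence_sequences) are hypothetical, tokens are taken to be lowercase alphabetic runs, and sentences are split naively on end punctuation. It is not the preprocessing pipeline of any particular system.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens in order.
    (Assumption: a deliberately crude tokenizer for illustration.)"""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, rows). Each row is a sparse mapping from
    term index to term frequency; word order is discarded."""
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # Only nonzero frequencies are stored.
        rows.append({index[term]: freq for term, freq in counts.items()})
    return vocabulary, rows


def sentence_sequences(document):
    """Return one ordered token sequence per sentence.
    (Assumption: sentences end with '.', '!' or '?'.)"""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    sequences = [tokenize(s) for s in sentences]
    return [seq for seq in sequences if seq]


docs = ["The dog bit the man.", "The man bit the dog."]
vocab, rows = bag_of_words(docs)

# Same bag of words, although the two sentences mean different things.
print(vocab)
print(rows[0] == rows[1])

# The sequence representation keeps the ordering, so the documents differ.
print(sentence_sequences(docs[0]))
print(sentence_sequences(docs[1]))
```

The two toy documents are indistinguishable as bags of words but distinct as sequences, which is the loss of ordering information described above.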
It is important to be cognizant of the sparse and high-dimensional characteristics of text when treating it as a multidimensional data set. This is because the dimensionality of the data depends on the number of words, which is typically large. Furthermore, most of the word frequencies (i.e., feature values) are zero because each document contains only a small subset of the vocabulary. Therefore, multidimensional mining methods need to account for the sparse and high-dimensional nature of the text representation for best results. The sparsity is not always a disadvantage. In fact, some models, such as the linear support vector machines discussed in Chap., are inherently suited to sparse and high-dimensional data.
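As a rough illustration of this sparsity, the following sketch measures the fraction of zero entries in the document-term matrix implied by a tiny, made-up corpus. The function names and the corpus are hypothetical, and the tokenizer (lowercase alphabetic runs) is an assumption chosen for brevity.

```python
import re
from collections import Counter


def term_counts(document):
    """Term frequencies of one document (crude alphabetic tokenizer)."""
    return Counter(re.findall(r"[a-z]+", document.lower()))


def sparsity(documents):
    """Fraction of zero entries in the implied document-term matrix."""
    rows = [term_counts(doc) for doc in documents]
    vocabulary = set().union(*rows)  # iterating a Counter yields its terms
    total_cells = len(rows) * len(vocabulary)
    if total_cells == 0:
        return 0.0
    # Each row stores only its nonzero frequencies.
    nonzero_cells = sum(len(row) for row in rows)
    return 1.0 - nonzero_cells / total_cells


# Hypothetical three-document corpus with no shared terms.
corpus = [
    "search engines index the web",
    "spam filters classify email",
    "recommender systems infer user interests",
]

print(f"fraction of zero entries: {sparsity(corpus):.3f}")
```

Even in this toy corpus, two thirds of the entries are zero. Storing only the nonzero frequencies, as the per-document counters do here, is the usual way to keep such data manageable.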
This book will cover a wide variety of text mining algorithms, such as latent factor modeling, clustering, classification, retrieval, and various Web applications. The discussion in most of the chapters is self-sufficient, and it does not assume a background in data mining or machine learning other than a basic understanding of linear algebra and probability. In this chapter, we will provide an overview of the various topics covered in this book, and also provide a mapping of these topics to the different chapters.