Extracting informative features and inferring knowledge from big data is an essential objective of data analytics. Learning feature representations plays a particularly important role, as it bridges the gap between low-level observed data and high-level semantic knowledge. To deal with high-dimensional and large-scale data, intelligent and efficient representation learning models are highly desired. In the past decades, data mining and machine learning researchers have made significant progress toward modeling data for different analytics tasks. Two critical issues should be carefully addressed when designing such models. The first is scalability. Data collected in real-world applications keep growing in both size and dimensionality. For instance, according to statistics on social networks, every minute over 400 hours of video are uploaded to YouTube and more than 50,000 photos are posted to Instagram. A successful data mining model should be able to handle large amounts of high-dimensional data efficiently. The second issue is model robustness. Many traditional models, especially those based on statistical learning, impose strong assumptions on the underlying distribution of the data. However, data captured in the real world may be corrupted or contaminated with severe noise, which violates these assumptions. It is therefore of great importance to develop robust data mining models that can learn reliable representations from data with uncertainty. This book presents the concepts, models, and applications of robust data representations.
1.1 What Are Robust Data Representations?
To understand the robustness of data representations, we first discuss the types of uncertainty that might be found in real-world data. In particular, we consider uncertain data observations in a general sense. The uncertainty might be:
Gaussian noise;
Random corruptions;
Missing values, e.g., due to data loss during transmission;
Outliers or anomalies;
Uncertainty within one modality;
Uncertainty across multiple modalities.
The first four types align well with traditional interpretations of data uncertainty in the literature, while the last two are special cases of uncertainty in a more general sense. In common data analytics settings, one object usually has multiple instances, such as multiple face images of the same person. If the instances come from the same modality, variations in appearance may introduce uncertain information; for example, face images of the same person may exhibit expression variations or illumination changes. For the last type, if the object is captured in multiple modalities using different types of sensors, the variations across modalities introduce another level of uncertainty.
It has been extensively demonstrated that exploiting the low-dimensional structure of high-dimensional data greatly benefits data analytics tasks. In particular, recent advances in low-rank and sparse modeling have shown promising performance in recovering clean data from noisy observations by discovering low-dimensional subspace structures. This observation motivates us to develop new models for extracting robust data representations. The research objectives are twofold: (1) learning robust data representations from data with uncertainty by exploiting low-dimensional subspace structures; and (2) evaluating the performance of the learned representations on a wide range of real-world data analytics tasks.
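The intuition behind low-rank recovery can be illustrated with a minimal sketch, assuming the clean data lie near a low-dimensional subspace. The toy example below uses truncated SVD as a simple stand-in for the low-rank models discussed in this book (methods such as robust PCA instead solve a convex nuclear-norm formulation); all data, dimensions, and noise levels are synthetic choices for illustration only.

```python
import numpy as np

def low_rank_denoise(X, rank):
    """Project X onto its top-`rank` singular directions (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(0)
# Synthetic data: 200 samples in 50 dimensions with true rank 5,
# contaminated by dense Gaussian noise.
clean = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
noisy = clean + 0.3 * rng.standard_normal((200, 50))

recovered = low_rank_denoise(noisy, rank=5)
err_noisy = np.linalg.norm(noisy - clean)
err_recovered = np.linalg.norm(recovered - clean)
# The low-rank estimate discards the noise energy outside the subspace,
# so it lies much closer to the clean data than the raw observation does.
print(err_recovered < err_noisy)
```

Because the noise spreads its energy over all 50 dimensions while the signal occupies only 5, truncating to rank 5 removes most of the noise at little cost to the signal.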
Four categories of data representations are studied in this book: graph, subspace, dictionary, and latent factor. Robust data representations have been developed under each of the four categories. First, two novel graph construction schemes are introduced that integrate low-rank modeling with graph sparsification strategies. Each sample is represented in a low-rank coding space, and it is shown that similarity measured in the low-rank coding space is more robust than similarity measured in the original sample space. The robust graphs greatly enhance the performance of graph-based clustering and semi-supervised classification. Second, low-dimensional discriminative subspaces are learned in single-view and multi-view scenarios, respectively. A single-view robust subspace discovery model, motivated by low-rank modeling and the Fisher criterion, is able to accurately classify noisy images. In addition, a multi-view subspace learning model is designed for extracting compact features from multimodal time series data; it leverages a shared latent space and fuses information from multiple data views. Third, dictionaries serve as expressive bases for characterizing visual data. A robust dictionary learning method is designed to transfer knowledge from a source domain to a target domain with limited training samples, and a cross-view dictionary learning framework is presented to model view consistency and extract robust features for images from two camera views. Fourth, latent factors, as compact representations of high-dimensional features, are extracted for the tasks of response prediction and collaborative filtering.
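The graph sparsification step mentioned above can be sketched as follows. This is a hedged, simplified illustration: it builds a dense cosine-similarity graph and keeps only each node's k strongest edges, whereas the methods in this book measure similarity in a learned low-rank coding space rather than the raw feature space used here; the data and the helper name `knn_sparsify` are purely illustrative.

```python
import numpy as np

def knn_sparsify(X, k=2):
    """Return a symmetric adjacency matrix keeping the top-k similarities per row."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                       # dense cosine-similarity graph
    np.fill_diagonal(S, -np.inf)        # exclude self-loops
    A = np.zeros_like(S)
    for i in range(len(S)):
        nbrs = np.argsort(S[i])[-k:]    # indices of the k most similar samples
        A[i, nbrs] = S[i, nbrs]
    return np.maximum(A, A.T)           # symmetrize the sparsified graph

# Two tight pairs of points; sparsification should link each point
# only to its nearest cluster-mate.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A = knn_sparsify(X, k=1)
print((A > 0).sum(axis=1))
```

Dropping weak edges in this way suppresses spurious connections between unrelated samples, which is the property the graph-based clustering and semi-supervised methods rely on.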
From the perspective of machine learning paradigms, this book covers clustering, semi-supervised learning, classification, multi-view learning, time-series modeling, graph mining, subspace learning, dictionary learning, transfer learning, and deep learning. The proposed models have obtained remarkable improvements on many real-world data analytics tasks, including image clustering, object recognition, face recognition, kinship verification, recommender systems, outlier detection, person re-identification, and response prediction.
1.2 Organization of the Book
The rest of this book is organized as follows.
Part I focuses on developing robust representation models by learning robust graphs, robust subspaces, and robust dictionaries. It consists of the following five chapters. The first of these presents the fundamentals of robust representations, covering an overview of existing representation learning and robust representation methods; the advantages and disadvantages of these existing methods are also discussed.