Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García and Francisco Herrera
Big Data Preprocessing
Enabling Smart Data
Julián Luengo
Department of Computer Science and AI, University of Granada, Granada, Spain
Diego García-Gil
Department of Computer Science and AI, University of Granada, Granada, Spain
Sergio Ramírez-Gallego
DOCOMO Digital España, Madrid, Madrid, Spain
Salvador García
Department of Computer Science and AI, University of Granada, Granada, Spain
Francisco Herrera
Department of Computer Science and AI, University of Granada, Granada, Spain
ISBN 978-3-030-39104-1 e-ISBN 978-3-030-39105-8
https://doi.org/10.1007/978-3-030-39105-8
© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
This book is dedicated to all the people with whom we have worked over the years and who have made it possible to reach this moment. Thanks to the members of the Andalusian Research Institute in Data Science and Computational Intelligence.
To our families.
Preface
A massive growth in the scale of data has been observed in recent years, and it is a key factor of the Big Data scenario. Big Data can be defined as data of high volume, velocity, and variety that require new high-performance processing. Addressing Big Data is a challenging and time-demanding task that requires a large computational infrastructure to ensure successful data processing and analysis. Since this is a very common scenario in real-life applications, the interest of researchers and practitioners in the topic has grown significantly in recent years. Among Big Data disciplines, data mining is a key topic, enabling the user to extract knowledge from enormous amounts of raw data. However, this raw data is not always in the best condition to be treated, analyzed, and surveyed. The application of preprocessing techniques is a must in real-world applications to ensure quality data, Smart Data, suitable for proper treatment and analysis. The term Smart Data refers to the challenge of transforming raw data into quality data that can be appropriately exploited to obtain valuable insights.
This book aims at offering a general and comprehensible overview of data preprocessing in Big Data, enabling Smart Data. It contains a comprehensive description of the topic and focuses on its main features and the most relevant proposed solutions. Additionally, it considers the different Big Data scenarios in which the application of data preprocessing techniques can pose a real challenge. Data preprocessing is a multifaceted discipline that includes data preparation, comprising the integration, cleaning, normalization, and transformation of data; data reduction tasks such as feature selection, instance selection, and discretization; and resampling techniques to deal with imbalanced data.
This book stresses the gap between standard data preprocessing techniques and their Big Data equivalents, showing the challenging difficulties that arise when developing the latter. It also covers the different approaches that have traditionally been applied and the latest proposals in Big Data preprocessing. Specifically, it reviews data reduction methods, imperfect data approaches, discretization techniques, and imbalanced data preprocessing solutions. Finally, it describes the most popular Big Data libraries for machine learning, focusing on their data preprocessing algorithms and utilities.
Julián Luengo, Granada, Spain
Diego García-Gil, Granada, Spain
Sergio Ramírez-Gallego, Madrid, Spain
Salvador García, Granada, Spain
Francisco Herrera, Granada, Spain
June 2019
Acronyms
BSP   Bulk Synchronous Parallel
DAG   Directed Acyclic Graph
DM    Data Mining
FS    Feature Selection
HDFS  Hadoop Distributed File System
HPC   High-Performance Computing
IG    Instance Generation
IS    Instance Selection
KNN   K-Nearest Neighbors
ML    Machine Learning
MPI   Message Passing Interface
MV    Missing Values
PCA   Principal Components Analysis
PG    Prototype Generation
PR    Prototype Reduction
PS    Prototype Selection
RDD   Resilient Distributed Dataset
SVM   Support Vector Machine
UCI   UC Irvine Machine Learning Repository
YARN  Yet Another Resource Negotiator
1. Introduction
Julián Luengo
(1) Department of Computer Science and AI, University of Granada, Granada, Spain
(2) DOCOMO Digital España, Madrid, Madrid, Spain
1.1 Big Data
We are immersed in the Information Age, where vast amounts of data are available. Petabytes of data are generated and stored every day, resulting in a humongous volume of information; this information arrives at high velocity and often requires real-time processing; it comes in many formats, such as structured, semi-structured, or unstructured data, implying variety; it also needs to be cleaned in order to maintain veracity; finally, this information must provide value to the organization. These five concepts form one of the most extended definitions of Big Data. While the volume, velocity, and variety aspects refer to the data generation process and to how the data is captured and stored, the veracity and value aspects deal with the quality and the usefulness of the data. These last two aspects become crucial in any Big Data process, where the extraction of useful and valuable knowledge is strongly influenced by the quality of the data used.
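As a simple illustration of the veracity aspect, the following sketch shows how raw data might be cleaned before analysis by removing duplicate records and rows with missing values. It is only an assumed example, not taken from the book: the file name raw_events.csv, the local Spark session, and the choice of Apache Spark's DataFrame API are hypothetical, chosen here because Spark is one of the Big Data frameworks covered later.

import org.apache.spark.sql.SparkSession

object VeracityExample {
  def main(args: Array[String]): Unit = {
    // Start a local Spark session for this small illustration
    val spark = SparkSession.builder()
      .appName("RawDataToSmartData")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical raw CSV file with a header row
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("raw_events.csv")

    // Basic cleaning toward veracity: drop exact duplicates
    // and rows that contain missing values
    val cleaned = raw.dropDuplicates().na.drop()

    println(s"Raw rows: ${raw.count()}, cleaned rows: ${cleaned.count()}")

    spark.stop()
  }
}

Real preprocessing pipelines are, of course, far richer than this sketch; the following chapters discuss the techniques (data reduction, imperfect data handling, discretization, imbalanced data treatment) that turn such raw inputs into Smart Data at scale.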