Big Data, Artificial Intelligence and Data Analysis Set
coordinated by
Jacques Janssen
Volume 4
Advances in Data Science
Symbolic, Complex and Network Data
Edited by
Edwin Diday
Rong Guan
Gilbert Saporta
Huiwen Wang
First published 2020 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St George's Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
© ISTE Ltd 2020
The rights of Edwin Diday, Rong Guan, Gilbert Saporta and Huiwen Wang to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2019951813
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78630-576-3
Preface
This book contains a selection of papers presented at two recent international workshops devoted to progress in the analysis of complex data.
The first workshop, ADS16, short for Advances in Data Science, was held in October 2016 at Beihang University, Beijing, China, at the initiative of Professor Huiwen Wang.
The second workshop, entitled Data Science: New Data and Classes, was held a few months later in January 2017 at Paris-Dauphine University, Paris, France, at the invitation of Professor Edwin Diday.
The two workshops shared several Scientific Committee members and participants. Each gathered about 50 participants, by invitation only.
After the workshops, we decided that some of the papers presented deserved to be made available to a wider audience, and we asked the authors to prepare revised versions. Most of them agreed, and the 10 papers collected in this volume underwent a blind review by referees, were revised and finally edited.
The papers are grouped into four sections: symbolic data, complex data, network data, and clustering.
For their dedication, we thank Paula Brito, Francisco de A.T. de Carvalho, Jie Gu, Georges Hébrail, Yves Lechevallier, Wen Long, Monique Noirhomme, Francesco Palumbo, Ming Ye, and Jichang Zhao.
We would also like to thank the sponsors of both meetings:
- ADS16, Beijing: the School of Economics and Management and the Complex Data Analysis Research Center of Beihang University, and the School of Statistics and Mathematics of the Central University of Finance and Economics. The Beijing workshop Advances in Data Science was financially supported by the NSFC Major International Joint Research Project (Grant number 71420107025) and co-organized by Professor Huiwen Wang and Professor Gilbert Saporta.
- Data Science: New Data and Classes, Paris: the LAMSADE and CEREMADE labs of Paris-Dauphine University, the French Statistical Society (SFdS), the French-speaking Society for Classification (SFC), and the Society for Knowledge Discovery (EGC).
Edwin DIDAY
Rong GUAN
Gilbert SAPORTA
Huiwen WANG
October 2019
Part 1
Symbolic Data
Explanatory Tools for Machine Learning in the Symbolic Data Analysis Framework
The aim of this chapter is mainly to give explanatory tools for the understanding of standard, complex and big data. First, we recall some basic notions of Data Science: what are complex data? What are classes, and classes of complex data? Which kinds of internal class variability can be considered? Then, we define symbolic data and symbolic data tables, which express the within-class variability, and we give some advantages of this kind of class description. Often, in practice, the classes are given. When they are not, clustering can be used to build them with the Dynamic Clustering Method (DCM), from which DCM regression, DCM canonical analysis, DCM mixture decomposition and the like can be obtained. The description of these classes yields, by aggregation, a symbolic data table. We argue that the description of a class is much more explanatory when it is given by symbolic variables (closer to the natural language of users) than by its usual analytical multidimensional description. The explanatory and characteristic power of classes can then be measured by criteria based on the symbolic data description of these classes, which induces a way of comparing clustering methods by their explanatory power. These criteria are defined, in a Symbolic Data Analysis framework for categorical variables, from three random variables defined on the ground population. Tools are then given for ranking individuals, classes and their symbolic descriptive variables from the most to the least characteristic. These characteristics are not only explanatory but can also express the concordance or discordance of a class with the other classes. We suggest several directions of research, mainly on the parametric aspects of these criteria and on improving the explanatory power of Machine Learning tools. We finally present the conclusion and the wide domain of potential applications in socio-demography, medicine, web security and so on.
1.1. Introduction
A Data Scientist is someone who is able to extract new knowledge from standard, big and complex data. Here, we consider complex data to be data that cannot be expressed in terms of a standard data table in which units are described by quantitative and qualitative variables. Complex data arise in the case of unstructured data, unpaired samples and multi-source data (such as mixtures of numerical, textual, image and social network data). The aggregation, fusion and summarization of such data can be done into classes of row units that are considered as new units. Classes can be obtained by unsupervised learning, giving a concise and structured view of the data. In supervised learning, classes are used in order to provide efficient rules for the allocation of new units to a class. A third way is to consider classes as new units described by symbolic variables whose values are symbols such as intervals, probability distributions, weighted sequences of numbers or categories, functions and the like, in order to express their within-class variability. For example, regions express the variability of their inhabitants, companies express the variability of their web intrusions, and species express the variability of their specimens. One of the advantages of this approach is that data that are unstructured and unpaired at the level of the row units become structured and paired at the level of the classes.
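To make the aggregation step concrete, here is a minimal sketch, assuming Python with pandas, of how a standard individuals-by-variables table can be summarized into a symbolic data table whose cells hold an interval and a bar chart. The column names and data are purely illustrative and are not taken from the chapter.

```python
import pandas as pd

# Illustrative standard data table: individuals described by one
# quantitative and one categorical variable, grouped into classes (regions).
individuals = pd.DataFrame({
    "region":  ["North", "North", "North", "South", "South", "South"],
    "income":  [21.0, 35.5, 28.2, 44.0, 39.5, 51.3],
    "housing": ["rent", "own", "rent", "own", "own", "rent"],
})

def to_symbolic(df, class_col, num_col, cat_col):
    """Aggregate row units into classes described by symbolic values:
    a [min, max] interval for the numerical variable and a bar chart
    (relative frequencies of categories) for the categorical one."""
    rows = []
    for cls, grp in df.groupby(class_col):
        interval = (grp[num_col].min(), grp[num_col].max())
        bar_chart = grp[cat_col].value_counts(normalize=True).to_dict()
        rows.append({class_col: cls, num_col: interval, cat_col: bar_chart})
    return pd.DataFrame(rows).set_index(class_col)

symbolic_table = to_symbolic(individuals, "region", "income", "housing")
print(symbolic_table)
# Each class (region) is now a unit whose cells express the within-class
# variability of its individuals, rather than a single value.
```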
Three principles guide this chapter, in conformity with the Data Science framework. First, new tools are needed to transform huge databases intended for management into databases usable by Data Science tools. This transformation leads to the construction of new statistical units described by aggregated data in terms of symbols, since single-valued data are not suitable because they cannot incorporate the additional information on data structure available in symbolic data. Second, we work on the symbolic data as they are given in the databases and not as we wish they were given. For example, if the data contain intervals, we work on them even if the within-interval uniformity assumption is statistically not satisfied. Moreover, by considering Min-Max intervals, we can obtain useful knowledge complementary to that obtained without the uniformity assumption. Hence, we take the Min-Max or interquartile intervals as given, since the aim is to extract useful knowledge from the data and not only to infer models (even if inferring models, as in standard statistics, can certainly give complementary knowledge). Third, we use marginal descriptions of classes by vectors of univariate symbols rather than joint symbolic descriptions by multivariate symbols: most users would say that a joint distribution describing a class often contains too many low or zero values and so has a poor explanatory power in comparison with the marginal distributions describing the same class. For example, with 10 variables of 5 categories each, the joint multivariate distribution leads to a sparse symbolic data table where each class is described by a single bar-chart symbolic variable value containing 5¹⁰ categories, most of whose 5¹⁰ values are low or zero. On the other hand, the 10 marginal bar-chart symbolic variables describe each class by a vector of 10 bar charts of 5 categories each, which are easy to interpret and to compare between classes. Nevertheless, a compromise can be obtained by considering joint distributions, instead of marginals, for the most dependent variables.
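The sparsity argument behind the third principle can be checked by simple counting. The short sketch below (illustrative only, not from the chapter) contrasts the number of cells needed for a joint bar-chart description with the number needed for the marginal descriptions, for 10 categorical variables of 5 categories each.

```python
# Cells needed to describe one class by bar charts.
p, m = 10, 5            # 10 categorical variables, 5 categories each

joint_cells = m ** p    # one multivariate bar chart over all category combinations
marginal_cells = p * m  # ten univariate bar charts of five categories each

print(joint_cells)      # 9765625 -> mostly low or zero frequencies per class
print(marginal_cells)   # 50      -> easy to read and to compare between classes
```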