Big Data, Artificial Intelligence and Data Analysis Set
coordinated by
Jacques Janssen
Volume 10
Data Analysis and Related Applications 2
Multivariate, Health and Demographic Data Analysis
Edited by
Konstantinos N. Zafeiris
Christos H. Skiadas
Yiannis Dimotikalis
Alex Karagrigoriou
Christiana Karagrigoriou-Vonta
First published 2022 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St Georges Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
© ISTE Ltd 2022
The rights of Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis, Alex Karagrigoriou and Christiana Karagrigoriou-Vonta to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s), contributor(s) or editor(s) and do not necessarily reflect the views of ISTE Group.
Library of Congress Control Number: 2022938776
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78630-772-9
Preface
This book is a collective work with contributions by leading experts on Data Analysis and Related Applications: Theory and Practice.
The field of data analysis has grown enormously over recent decades due to the rapid growth of the computer industry, the continuous development of innovative algorithmic techniques and recent advances in statistical tools and methods. Due to the wide applicability of data analysis, a collective work is always needed to bring all recent developments in the field, from all areas of science and engineering, under a single umbrella.
The contributions to this collective work are by a number of leading scientists, analysts, engineers, demographers, health experts, mathematicians and statisticians who have been working at the forefront of data analysis. The chapters included in this collective volume represent a cross-section of current concerns and research interests in the scientific areas mentioned. The material is divided into three parts and 24 chapters in a form that will provide the reader with both methodological and practical information on data analytic methods, models and techniques, together with a wide range of appropriate applications.
Part 1 focuses mainly on multivariate data analysis and related fields, with eight chapters covering clustering techniques, regression modeling, contingency tables, stochastic and financial analysis, classification methods, employment patterns and job insecurity, and dynamic optimization.
Part 2 focuses mainly on health data analysis and related fields, with six chapters covering service quality, statistical quality control, bibliometric quality, lean management, prediction methods and ecosystem-based practices.
Part 3 focuses mainly on demographic data analysis and related fields, with 10 chapters covering retirement age, population aging, pension reform, force of mortality, demographic policies, excess mortality, neonatal mortality, alcohol-related mortality, prediction and forecasting methods, statistics of extremes and life expectancy.
Konstantinos N. ZAFEIRIS
Yiannis DIMOTIKALIS
Christos H. SKIADAS
Alex KARAGRIGORIOU
Christiana KARAGRIGORIOU-VONTA
April 2022
PART 1
A Topological Clustering of Variables
The clustering of objects (individuals or variables) is one of the most used approaches to exploring multivariate data. The two most common unsupervised clustering strategies are hierarchical ascending clustering (HAC) and k-means partitioning, both of which identify groups of similar objects in a dataset in order to divide it into homogeneous groups.
The proposed topological clustering of variables, called TCV, studies a homogeneous set of variables defined on the same set of individuals, based on the notion of neighborhood graphs. Some of these variables are more or less correlated or linked, depending on their type (quantitative or qualitative). This topological data analysis approach can then be useful for dimension reduction and variable selection. It is a topological hierarchical clustering analysis of a set of variables which can be quantitative, qualitative or a mixture of both. It arranges variables into homogeneous groups according to their correlations or associations studied in a topological context of principal component analysis (PCA) or multiple correspondence analysis (MCA). The proposed TCV is adapted to the type of data considered; its principle is presented and illustrated using simple real datasets with quantitative, qualitative and mixed variables. The results of these illustrative examples are compared to those of other variable clustering approaches.
1.1. Introduction
The objective of this chapter is to propose a new approach for classifying variables. This topological approach differs from existing approaches, with which it is compared.
Besides the classical and well-known methods devoted to the clustering of objects, there are some approaches specifically devoted to the clustering of variables: the Varclus classification procedure (SAS Institute Inc. 2011) implemented in the SAS software, the ClustOfVar approach (Chavent et al. 2012), the CVLC approach (Vigneau and Qannari 2003; Vigneau et al. 2006) for clustering variables around latent components and the Clustatis approach (Llobell and Qannari 2019). However, as far as we know, no approach has been proposed in a topological context.
A clustering of variables can also be considered as a dimension reduction approach, like a factor analysis. The purpose of the classification of variables is to group together the variables strongly related to each other, i.e. to separate the variables into classes of variables. It will be possible to summarize each class of variables by a single quantitative synthetic variable.
The interest here is to understand the structures underlying the data, to constitute a summary of the information carried by the data or to detect redundancies, for example with a view to reducing the number of variables in another process.
The objective of the clustering of variables is to obtain classes of linked, redundant variables. Specific algorithms have thus been developed for the clustering of variables. To create profiles from variables grouped in a questionnaire, two main types of methods can be used: non-hierarchical clustering, such as k-means or dynamic clusters, and hierarchical clustering of the ascending or descending type.
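To make the idea of a hierarchical ascending clustering of variables concrete, the following minimal sketch groups quantitative variables by their squared correlation. It is not the TCV approach proposed in this chapter, nor any of the cited procedures: the dissimilarity 1 - r², the average-linkage rule, the function names and the toy data are all assumptions chosen purely for illustration.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cluster_variables(data, n_clusters):
    """Average-linkage ascending clustering of variables.

    data maps each variable name to its values on the same individuals.
    The dissimilarity 1 - r**2 is an illustrative assumption, not TCV.
    """
    names = list(data)
    # Pairwise dissimilarities between variables.
    d = {frozenset((a, b)): 1 - pearson(data[a], data[b]) ** 2
         for a, b in combinations(names, 2)}
    clusters = [[name] for name in names]  # start with one class per variable

    def linkage(c1, c2):
        # Average dissimilarity between the members of two classes.
        return mean(d[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Merge the two closest classes (i < j, so deleting j is safe).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Invented toy data: four variables measured on five individuals.
toy = {
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 88],
    "income": [30, 12, 45, 20, 33],
    "spending": [28, 15, 40, 22, 30],
}
clusters = cluster_variables(toy, 2)
print(clusters)  # [['height', 'weight'], ['income', 'spending']]
```

Each resulting class contains strongly correlated, hence largely redundant, variables. Stopping the merges at a different number of classes corresponds to cutting the hierarchy at a different level.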
Similarity measures play an important role in many areas of data analysis. The results of any operation involving structuring, clustering or classifying objects are strongly dependent on the proximity measure chosen.
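The dependence on the proximity measure can be seen on a very small example. In the sketch below, the three profiles, the helper names and the choice of the two measures (Euclidean distance and a correlation-based dissimilarity 1 - r) are assumptions made for illustration only. They are not taken from the chapter.

```python
from math import dist
from statistics import mean


def corr_dissimilarity(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1 - sxy / (sxx * syy) ** 0.5


# Invented profiles: "b" has the same shape as "a" on a different scale,
# while "c" has the same scale as "a" but the opposite trend.
profiles = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}


def nearest(target, measure):
    """Name of the profile closest to `target` under the given measure."""
    others = [k for k in profiles if k != target]
    return min(others, key=lambda k: measure(profiles[target], profiles[k]))


print(nearest("a", dist))                # c (closest in Euclidean distance)
print(nearest("a", corr_dissimilarity))  # b (closest in correlation)
```

The same object thus has a different nearest neighbor under each measure, so any grouping built on these proximities would differ as well.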