Data Analysis and Data Mining
An Introduction
ADELCHI AZZALINI
AND
BRUNO SCARPA
Oxford University Press, Inc., publishes works that further
Oxford University's objective of excellence
in research, scholarship, and education.
Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi
Kuala Lumpur Madrid Melbourne Mexico City Nairobi
New Delhi Shanghai Taipei Toronto
With offices in
Argentina Austria Brazil Chile Czech Republic France Greece
Guatemala Hungary Italy Japan Poland Portugal Singapore
South Korea Switzerland Thailand Turkey Ukraine Vietnam
Copyright 2012 by Oxford University Press
Published by Oxford University Press, Inc.
198 Madison Avenue, New York, New York 10016
www.oup.com
Oxford is a registered trademark of Oxford University Press
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, recording, or otherwise,
without the prior permission of Oxford University Press.
Library of Congress Cataloging-in-Publication Data
Azzalini, Adelchi. [Analisi dei dati e data mining. English]
Data analysis and data mining: an Introduction /
Adelchi Azzalini, Bruno Scarpa; [text revised by Gabriel Walton].
p. cm.
Includes bibliographical references and index.
ISBN 978-0-19-976710-6
1. Data mining. I. Scarpa, Bruno.
II. Walton, Gabriel. III. Title.
QA76.9.D343A9913 2012
006.312dc23 2011026997
9780199767106
English translation by Adelchi Azzalini, Bruno Scarpa and Anne Coghlan.
Text revised by Gabriel Walton.
First published in Italian as Analisi dei dati e data mining, 2004, Springer-Verlag
Italia (ITALY)
9 8 7 6 5 4 3 2 1
Printed in the United States of America
on acid-free paper
PREFACE
When well-meaning university professors start out with the laudable aim of writing up their lecture notes for their students, they run the risk of embarking on a whole volume.
We followed this classic pattern when we started jointly to teach a course entitled "Data analysis and data mining" at the School of Statistical Sciences, University of Padua, Italy.
Our interest in this field had started long before the course was launched, while both of us were following different professional paths: academia for one of us (A. A.) and the business and professional fields for the other (B. S.). In these two environments, we faced the rapid development of a field connected with data analysis according to at least two features: the size of available data sets, as both number of units and number of variables recorded; and the problem that data are often collected without respect for the procedures required by statistical science. Thanks to the growing popularity of large databases with low marginal costs for additional data, one of the most common areas in which this situation is encountered is that of data analysis as a decision-support tool for business management. At the same time, the two problems call for a somewhat different methodology with respect to more classical statistical applications, thus giving this area its own specific nature. This is the setting usually called data mining.
Located at the point where statistics, computer science, and machine learning intersect, this broad field is attracting increasing interest from scientists and practitioners eager to apply the new methods to real-life problems. This interest is emerging even in areas such as business management, which are traditionally less directly connected to scientific developments.
Within this context, there are few works available if the methodology for data analysis must be inspired by and not simply illustrated with the aid of real-life problems. This limited availability of suitable teaching materials was an important reason for writing this work. Following this primary idea, methodological tools are illustrated with the aid of real data, accompanied wherever possible by some motivating background.
Because many of the topics presented here only appeared relatively recently, many professionals who gained university qualifications some years ago did not have the opportunity to study them. We therefore hope this work will be useful for these readers as well.
Although not directly linked to a specific computer package, the approach adopted here moves naturally toward a flexible computational environment, in which data analysis is not driven by an intelligent program but lies in the hands of a human being. The specific tool for actual computation is the R environment.
All that remains is to thank our colleagues Antonella Capitanio, Gianfranco Galmacci, Elena Stanghellini, and Nicola Torelli, for their comments on the manuscript. We also thank our students, some for their stimulating remarks and discussions and others for having led us to make an extra effort for clarity and simplicity of exposition.
Padua, April 2004
Adelchi Azzalini and Bruno Scarpa
PREFACE TO THE ENGLISH EDITION
This work, now translated into English, is the updated version of the first edition, which appeared in Italian (Azzalini & Scarpa, 2004).
The new material is of two types. First, we present some new concepts and methods aimed at improving the coverage of the field, without attempting to be exhaustive in an area that is becoming increasingly vast. Second, we add more case studies. The work maintains its character as a first course in data analysis, and we assume standard knowledge of statistics at graduate level.
Complementary materials (data sets, R scripts) are available at: http://azzalini.stat.unipd.it/Book-DM/ .
A major effort in this project was its translation into English, and we are very grateful to Gabriel Walton for her invaluable help in the revision stage.
Padua, April 2011
Adelchi Azzalini and Bruno Scarpa
1
Introduction
He who loves practice without theory
is like the sailor who boards ship without a rudder and compass
and never knows where he may cast.
LEONARDO DA VINCI
1.1 NEW PROBLEMS AND NEW OPPORTUNITIES
1.1.1 Data, More Data, and Data Mines
An important phase of technological innovation associated with the rise and rapid development of computer technology came into existence only a few decades ago. It brought about a revolution in the way people work, first in the field of science and then in many others, from technology to business, as well as in day-to-day life. For several years another aspect of technological innovation also developed, and, although not independent of the development of computers, it was given its own autonomy: large, sometimes enormous, masses of information on a whole range of subjects suddenly became available simply and cheaply. This was due first to the development of automatic methods for collecting data and then to improvements in electronic systems of information storage and major reductions in their costs.
This evolution was not specifically related to one invention but was the consequence of many innovative elements which have jointly contributed to the creation of what is sometimes called the information society. In this context, new avenues of opportunity and ways of working have been opened up that are very different from those used in the past. To illustrate the nature of this phenomenon, we list a few typical examples.