Handbook of Statistical Analysis and Data Mining Applications
Second Edition
Robert Nisbet, Ph.D.
University of California, Predictive Analytics Certificate Program, Santa Barbara, Goleta, California, USA
Gary Miner, Ph.D.
University of California, Predictive Analytics Certificate Program, Tulsa, Oklahoma and Rome, Georgia, USA
Ken Yale, D.D.S., J.D.
University of California, Predictive Analytics Certificate Program; and Chief Clinical Officer, Delta Dental Insurance, San Francisco, California, USA
Guest Authors of selected Chapters
John Elder IV, Ph.D.
Chairman of the Board, Elder Research, Inc., Charlottesville, Virginia, USA
Andy Peterson, Ph.D.
VP for Educational Innovation and Global Outreach, Western Seminary, Charlotte, North Carolina, USA
Copyright
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1800, San Diego, CA 92101-4495, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
© 2018 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN 978-0-12-416632-5
For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Candice Janco
Acquisition Editor: Graham Nisbet
Editorial Project Manager: Susan Ikeda
Production Project Manager: Paul Prasad Chandramohan
Cover Designer: Alan Studholme
Typeset by SPi Global, India
List of Tutorials on the Elsevier Companion Web Page
Note: This list includes all the extra tutorials published with the 1st edition of this handbook (2009). These can be considered enrichment tutorials for readers of this 2nd edition. Since the 1st edition of the handbook will not be available after the release of the 2nd edition, these extra tutorials are carried over in their original format and software versions, as they are still very useful in learning and understanding data mining and predictive analytics, and many readers will want to take advantage of them.
List of Extra Enrichment Tutorials that are available only on the ELSEVIER COMPANION web page, with data sets as appropriate, for downloading and use by readers of this 2nd edition of the handbook:
TUTORIAL O: Boston Housing Using Regression Trees [Field: Demographics]
TUTORIAL P: Cancer Gene [Field: Medical Informatics & Bioinformatics]
TUTORIAL Q: Clustering of Shoppers [Field: CRM - Clustering Techniques]
TUTORIAL R: Credit Risk using Discriminant Analysis [Field: Financial - Banking]
TUTORIAL S: Data Preparation and Transformation [Field: Data Analysis]
TUTORIAL T: Model Deployment on New Data [Field: Deployment of Predictive Models]
TUTORIAL V: Heart Disease Visual Data Mining Methods [Field: Medical Informatics]
TUTORIAL W: Diabetes Control in Patients [Field: Medical Informatics]
TUTORIAL X: Independent Component Analysis [Field: Separating Competing Signals]
TUTORIAL Y: NTSB Aircraft Accidents Reports [Field: Engineering - Air Travel - Text Mining]
TUTORIAL Z: Obesity Control in Children [Field: Preventive Health Care]
TUTORIAL AA: Random Forests Example [Field: Statistics - Data Mining]
TUTORIAL BB: Response Optimization [Field: Data Mining - Response Optimization]
TUTORIAL CC: Diagnostic Tooling and Data Mining: Semiconductor Industry [Field: Industry - Quality Control]
TUTORIAL DD: Titanic - Survivors of Ship Sinking [Field: Sociology]
TUTORIAL EE: Census Data Analysis [Field: Demography - Census]
TUTORIAL FF: Linear & Logistic Regression - Ozone Data [Field: Environment]
TUTORIAL GG: R-Language Integration - DISEASE SURVIVAL ANALYSIS Case Study [Field: Survival Analysis - Medical Informatics]
TUTORIAL HH: Social Networks Among Community Organizations [Field: Social Networks - Sociology & Medical Informatics]
TUTORIAL II: Nairobi, Kenya Baboon Project: Social Networking Among Baboon Populations in Kenya on the Laikipia Plateau [Field: Social Networks]
TUTORIAL JJ: Jackknife and Bootstrap Data Miner Workspace and MACRO [Field: Statistics Resampling Methods]
TUTORIAL KK: Dahlia Mosaic Virus: A DNA Microarray Analysis of 10 Cultivars from a Single Source: Dahlia Garden in Prague, Czech Republic [Field: Bioinformatics]
The final companion site URL will be https://www.elsevier.com/books-and-journals/book-companion/9780124166325.
Foreword 1 for 1st Edition
This book will help the novice user become familiar with data mining. Basically, data mining is doing data analysis (or statistics) on data sets (often large) that have been obtained from potentially many sources. The miner may therefore not have control of the input data, but must rely on the sources that gathered it. As a result, there are problems that every data miner must be aware of as he or she begins (or completes) a mining operation. I strongly resonated with the material on "The Top 10 Data Mining Mistakes," which gives a worthwhile checklist:
Ensure you have a response variable and predictor variables, and that they are correctly measured.
Beware of overfitting. With scads of variables, it is easy with most statistical programs to fit incredibly complex models, but their results cannot be reproduced on new data. It is good practice to hold out part of the sample to test the model. Various methods are offered in this book.