Copyright 2014 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our website at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Larose, Daniel T.
Discovering knowledge in data : an introduction to data mining / Daniel T. Larose and Chantal D. Larose. Second edition.
pages cm
Includes index.
ISBN 978-0-470-90874-7 (hardback)
1. Data mining. I. Larose, Chantal D. II. Title.
QA76.9.D343L38 2014
006.3'12--dc23
2013046021
WILEY SERIES ON METHODS AND APPLICATIONS IN DATA MINING
Series Editor: Daniel T. Larose
Discovering Knowledge in Data: An Introduction to Data Mining, Second Edition / Daniel T. Larose and Chantal D. Larose
Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data / Darius M. Dziuda
Knowledge Discovery with Support Vector Machines / Lutz Hamel
Data-Mining on the Web: Uncovering Patterns in Web Content, Structure, and Usage / Zdravko Markov and Daniel Larose
Data Mining Methods and Models / Daniel Larose
Practical Text Mining with Perl / Roger Bilisoly
Preface
What is Data Mining?
According to the Gartner Group,
Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.
Today, a variety of terms are used to describe this process, including analytics, predictive analytics, big data, machine learning, and knowledge discovery in databases. But these terms all share the objective of mining actionable nuggets of knowledge from large data sets. We shall therefore use the term data mining to represent this process throughout this text.
Why is This Book Needed?
Humans are inundated with data in most fields. Unfortunately, these valuable data, which cost firms millions to collect and collate, are languishing in warehouses and repositories. The problem is that there are not enough trained human analysts available who are skilled at translating all of these data into knowledge, and thence up the taxonomy tree into wisdom. This is why this book is needed.
The McKinsey Global Institute reports:
There will be a shortage of talent necessary for organizations to take advantage of big data. A significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data. We project that demand for deep analytical positions in a big data world could exceed the supply being produced on current trends by 140,000 to 190,000 positions. In addition, we project a need for 1.5 million additional managers and analysts in the United States who can ask the right questions and consume the results of the analysis of big data effectively.
This book is an attempt to help alleviate this critical shortage of data analysts. Discovering Knowledge in Data: An Introduction to Data Mining provides readers with:
- The models and techniques to uncover hidden nuggets of information,
- The insight into how the data mining algorithms really work, and
- The experience of actually performing data mining on large data sets.
Data mining is becoming more widespread every day, because it empowers companies to uncover profitable patterns and trends from their existing databases. Companies and institutions have spent millions of dollars to collect megabytes and terabytes of data, but are not taking advantage of the valuable and actionable information hidden deep within their data repositories. However, as the practice of data mining becomes more widespread, companies that do not apply these techniques are in danger of falling behind and losing market share, because their competitors are applying data mining and thereby gaining the competitive edge.
In Discovering Knowledge in Data, the step-by-step, hands-on solutions to real-world business problems, using widely available data mining techniques applied to real-world data sets, will appeal to managers, CIOs, CEOs, CFOs, and others who need to keep abreast of the latest methods for enhancing return on investment.
What's New for the Second Edition?
The second edition of Discovering Knowledge in Data is enhanced with an abundance of new material and useful features, including:
- Nearly 100 pages of new material.
- Three new chapters:
  - Chapter 5: Multivariate Statistical Analysis covers the hypothesis tests used for verifying whether data partitions are valid, along with analysis of variance, multiple regression, and other topics.
  - Chapter 6: Preparing to Model the Data introduces a new formula for balancing the training data set, and examines the importance of establishing baseline performance, among other topics.
  - Chapter 13: Imputation of Missing Data addresses one of the most overlooked issues in data analysis, and shows how to impute missing values for continuous variables and for categorical variables, as well as how to handle patterns in missingness.
- The R Zone. In most chapters of this book, the reader will find The R Zone, which provides the actual R code needed to obtain the results shown in the chapter, along with screen shots of some of the output, using