Commercial Data Mining
Processing, Analysis and Modeling for Predictive Analytics Projects
David Nettleton
Copyright
Acquiring Editor: Andrea Dierna
Editorial Project Manager: Kaitlin Herbert
Project Manager: Punithavathy Govindaradjane
Designer: Matthew Limbert
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
Copyright © 2014 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
Nettleton, David, 1963-
Commercial data mining : processing, analysis and modeling for predictive analytics projects / David Nettleton.
pages cm
Includes bibliographical references and index.
ISBN 978-0-12-416602-8 (paperback)
1. Data mining. 2. Management--Mathematical models. 3. Management--Data processing. I. Title.
HD30.25.N48 2014
658.056312--dc23
2013045341
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-12-416602-8
Printed and bound in the United States of America
14 15 16 17 18 10 9 8 7 6 5 4 3 2 1
For information on all Morgan Kaufmann publications, visit our website at www.mkp.com
Acknowledgments
Dr. David Nettleton is a contract researcher at the Pompeu Fabra University, Barcelona, Catalonia, Spain, and at the IIIA-CSIC, Bellaterra, Catalonia, Spain.
I would first like to thank the reviewers of the book proposal: Xavier Navarro Arnal, Dr. Anton Dries (KU Leuven, Belgium), Tim Holden (Imagination Technologies, United Kingdom), and Colin Shearer (Advanced Analytic Solutions, IBM). Second, I would like to thank the reviewers of the book manuscript: Dr. Vicenç Torra (IIIA-CSIC, Catalonia, Spain), Tim Holden, Dr. Anton Dries, Joan Gómez Escofet (Autonomous University of Barcelona, Catalonia, Spain), and the anonymous reviewers for their suggestions, corrections, and constructive criticisms, which have helped enhance the content.
Third, I would like to thank the following people and companies for their permission to adapt and include material for the book: IBM, Newprosoft, Professor Ian Witten, Dr. Ricardo Baeza-Yates (Yahoo! Research and Pompeu Fabra University, Catalonia, Spain), Dr. Joan Codina-Filba (Pompeu Fabra University, Catalonia, Spain), Dra. Liliana Calderón Benavides (Autonomous University of Bucaramanga, Colombia), and Néstor Martinez.
I would also like to thank Kaitlin Herbert and Andrea Dierna of Morgan Kaufmann Publishers for their belief in this project and their support throughout the preparation of the book.
Finally, I would like to thank my supporting institutions: Pompeu Fabra University, Catalonia, Spain, and the IIIA-CSIC, Catalonia, Spain.
Chapter 1
Introduction
Abstract
The introduction commences with an overview of the readership, scope, and reason for the book, with reference to the complete cycle of a data mining project. Then a brief summary of each chapter is given, and finally some reading recommendations are provided.
Keywords
overview
chapter summaries
data mining
analysis
data
project cycle
This book is intended to benefit a wide audience, from those who have limited experience in commercial data analysis to those who already analyze commercial data, offering a vision of the whole process and its related topics. The author draws on more than 20 years of professional business experience, as well as on a variety of research projects in which he was involved, to enrich the content and offer an original approach to commercial data analysis. In the appendix, practical case studies derived from real-world projects illustrate the concepts and techniques explained throughout the book. Numerous references are included for readers who wish to explore a given topic in greater depth.
Many of the methods, techniques, and ideas presented, such as data quality, data marts, customer relationship management, data sources, and Internet searches, can be applied by small business owners, freelance professionals, or medium-sized to large companies. The reader will see that large volumes of data are not a prerequisite, and that many data analysis tools are available at nominal cost.
Although the steps in the project cycle can be carried out sequentially, in practice, aspects such as data sources, data representation, and data quality are often addressed in parallel and iteratively. The same applies to the variable/factor selection, analysis, and modeling steps. However, the better each step is performed, the fewer iterations will be necessary.
To obtain meaningful results, data analysis requires attention to detail, an adequate project definition, meticulous preparation of the data, investigative capacity, patience, rigor, and objectives that are well defined from the beginning. With these requirements as a starting point, a basis can be built from which a data warehouse is converted into a high-value asset. One motivator for data analysis is to realize a return on investment for the database infrastructures that many businesses have installed. Another is to gain competitive leverage and insight for products and services by better understanding the marketplace, including customer and competitor behavior.
The analysis and comprehension of business data are fundamental to all organizations. Monitoring national economies and retail sales trends depends on data analysis, as does measuring the profitability, costs, and competitiveness of commercial organizations and businesses. Analyzing customer data has become easier thanks to data management infrastructures that separate operational data from analytical data, and to Internet applications and cloud computing, which facilitate the gathering of large-volume historical data logs.