Data Mining For Dummies
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com
Copyright 2014 by John Wiley & Sons, Inc., Hoboken, New Jersey
Media and software compilation copyright 2014 by John Wiley & Sons, Inc. All rights reserved.
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. Samsung and Galaxy S are registered trademarks of Samsung Electronics Co. Ltd. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit www.wiley.com/techsupport.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2014935519
ISBN 978-1-118-89317-3 (pbk); ISBN 978-1-118-89316-6 (ebk); ISBN 978-1-118-89319-7 (ebk)
Manufactured in the United States of America
Appendix A
Glossary
analysis: Thoughtful investigation of real-world systems.
analytics: Analysis that involves math. (This term is used very differently by different people, and may refer to anything from simple historical data summaries to highly complex predictive models. Always ask questions!)
association rules: Tools for identifying combinations of items often found together. The most common use of association rules is for market basket analysis.
assumption: Something presumed to be true. Assumptions are the basis of all statistical analysis. (It is important that the analyst choose methods based only on assumptions that are reasonable for the application.)
average: Any measure that describes the middle (more formally, central tendency or location) of a distribution. In analytics, the term average usually refers to the mean, but may refer to median or mode.
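Bayesian network: A graphical model of the probabilistic relationships among variables, usually drawn as a directed acyclic graph. Despite the visual resemblance, it is not a type of neural network. The Bayesian network is based on the fundamentals of probability theory. (See also neural network.)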
binary: Having exactly two alternative states.
binning: Organizing data into groups. This may be done for ease of analysis, or to protect privacy.
causation: The act of producing an effect or making something happen. The phrase "correlation does not imply causation" means that the fact that two things are observed to happen together is not enough to prove that one caused the other.
Chi-square: A test statistic, and one of the most widely used in statistical hypothesis testing. Typically used in combination with crosstabulation to test whether categorical variables are independent.
Chi-squared Automatic Interaction Detector (CHAID): A type of decision tree. CHAID is based on the chi-square statistic and tests of independence between categorical variables.
classification: Techniques for organizing data into groups associated with a particular outcome, such as the likelihood to purchase a product or earn a college degree.
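Classification and Regression Tree (C&RT): A type of decision tree. C&RT builds the tree by repeatedly splitting the data into two groups, choosing each split so that the resulting groups are as uniform as possible in the outcome being predicted. It can handle both categorical outcomes (classification) and continuous outcomes (regression).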
cluster analysis (clustering): Techniques for organizing data into groups of similar cases.
coding: In text analysis, categorization of text based on its meaning. These categorizations can be used in the same ways as any other categorical variable. Coding was historically done manually, but automated coding processes are now becoming available.
correlation: Association in the values of two or more variables.
Cross-Industry Standard Process for Data Mining (CRISP-DM): Just what it says, or as the folks from the CRISP-DM project put it, an industry- and tool-neutral data-mining process model.
crosstabulation (crosstabs): Summarizing interactions of categorical variables in a table.
dashboard: A predefined report for online viewing, usually consisting of simple tables and graphs, with some options for user interaction. Dashboards are usually designed for use by business managers to support the decision-making processes.
data mining: An umbrella term for analytic techniques that facilitate fast pattern discovery and model building, particularly with large datasets.
dataset: A collection of related measurements. In the data-mining context, this usually refers to an organized electronic file or database containing records of routine business activity or other information relevant to a particular data-mining project.
decision tree: A family of classification methods whose results are usually represented in a tree-like graph.
dependent variable: In a model, a variable whose value directly depends on the values of other (independent) variables. The dependent variable is usually the element that data miners try to predict or control. (See also independent variable.)
forecasting: Predicting future values of some variable. Forecasting methods are often used for prediction of sales, prices, or other economic measures.
frequency: The number of times a specific value occurs within a dataset.
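A few of the terms above (average, binning, and frequency) are easy to see in action. The following is only a minimal sketch in Python using the standard library; the list of customer ages, the bin boundaries, and the helper name age_bin are made-up assumptions for illustration, not data or methods from this book.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical sample: ages of ten customers (illustrative values only).
ages = [23, 35, 35, 41, 29, 52, 35, 67, 18, 45]

# average: three common measures of the "middle" of a distribution.
avg_mean = mean(ages)      # arithmetic mean
avg_median = median(ages)  # middle value of the sorted data
avg_mode = mode(ages)      # most common value

# frequency: the number of times each specific value occurs.
frequency = Counter(ages)

# binning: organizing individual values into groups.
# The boundaries (30 and 50) are arbitrary choices for this example.
def age_bin(age):
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50 and over"

binned = Counter(age_bin(a) for a in ages)

print("mean:", avg_mean, "median:", avg_median, "mode:", avg_mode)
print("frequency of age 35:", frequency[35])
print("cases per bin:", dict(binned))
```

Notice that the binned counts no longer reveal anyone's exact age, which is one reason binning is sometimes used to protect privacy.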