Data Science
Concepts and Practice
Second Edition

Vijay Kotu
Bala Deshpande

Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2019 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing.
As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

ISBN: 978-0-12-814761-0

For Information on all Morgan Kaufmann publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Jonathan Simpson
Acquisition Editor: Glyn Jones
Editorial Project Manager: Ana Claudia Abad Garcia
Production Project Manager: Sreejith Viswanathan
Cover Designer: Greg Harris
Typeset by MPS Limited, Chennai, India

Dedication
To all the mothers in our lives

Foreword
A lot has happened since the first edition of this book was published in 2014. There is hardly a day where there is no news on data science, machine learning, or artificial intelligence in the media.
It is interesting that many of those news articles have a skeptical, if not outright negative, tone. All this underlines two things: data science and machine learning are finally becoming mainstream, and people know shockingly little about them. Readers of this book will certainly do better in this regard. It continues to be a valuable resource, teaching not only how to use data science in practice but also how the fundamental concepts work. Data science and machine learning are fast-moving fields, which is why this second edition reflects many of the changes in our field.
While we used to talk a lot about data mining and predictive analytics only a couple of years ago, we have now settled on the term data science for the broader field. And even more importantly: it is now commonly understood that machine learning is at the core of many current technological breakthroughs. These are truly exciting times for all the people working in our field! I have seen situations where data science and machine learning had an incredible impact. But I have also seen situations where this was not the case. What was the difference? In most cases where organizations fail with data science and machine learning, they had used those techniques in the wrong context. Data science models are not very helpful if you only have one big decision you need to make.
Analytics can still help you in such cases by giving you easier access to the data you need to make this decision, or by presenting the data in a consumable fashion. But at the end of the day, those single big decisions are often strategic. Building a machine learning model to help you make this decision is not worth doing. And often such models also do not yield better results than just making the decision on your own.

Here is where data science and machine learning can truly help: these advanced models deliver the most value whenever you need to make lots of similar decisions quickly.
Good examples for this are:
• Defining the price of a product in markets with rapidly changing demands.
• Making offers for cross-selling in an E-Commerce platform.
• Approving credit or not.
• Detecting customers with a high risk for churn.
• Stopping fraudulent transactions.
• And many others.
You can see that a human being with access to all relevant data could make each of those decisions in a matter of seconds or minutes. But they can't without data science, since they would need to make this type of decision millions of times, every day. Consider sifting through your customer base of 50 million clients every day to identify those with a high churn risk. Impossible for any human being. But no problem at all for a machine learning model. So, the biggest value of artificial intelligence and machine learning is not to support us with those big strategic decisions.
Machine learning delivers the most value when we operationalize models and automate millions of decisions. One of the shortest descriptions of this phenomenon comes from Andrew Ng, who is a well-known researcher in the field of AI. Andrew describes what AI can do as follows: "If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future." I agree with him on this characterization. And I like that Andrew puts the emphasis on automation and operationalization of those models, because this is where the biggest value is. The only thing I disagree with is the time unit he chose.
It is safe to already go with a minute instead of a second. However, the quick pace of changes as well as the ubiquity of data science also underlines the importance of laying the right foundations. Keep in mind that machine learning is not completely new. It has been an active field of research since the 1950s. Some of the algorithms used today have even been around for more than 200 years now. And the first deep learning models were developed in the 1960s with the term deep learning being coined in 1984.
Those algorithms are well understood now. And understanding their basic concepts will help you to pick the right algorithm for the right task. To support you with this, some additional chapters on deep learning and recommendation systems have been added to the book. Another focus area is using text analytics and natural language processing. It became clear in the past years that the most successful predictive models have been using unstructured input data in addition to the more traditional tabular formats. Finally, the expanded coverage of Time Series Forecasting should get you started on one of the most widely applied data science techniques in the business.
More algorithms could mean that there is a risk of increased complexity. But thanks to the simplicity of the RapidMiner platform and the many practical examples throughout the book this is not the case here. We continue our journey towards the democratization of data science and machine learning. This journey continues until data science and machine learning are as ubiquitous as data visualization or Excel. Of course, we cannot magically transform everybody into a data scientist overnight, but we can give people the tools to help them on their personal path of development. This book is the only tour guide you need on this journey.