This edition first published 2017
© 2017 John Wiley & Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Field Cady to be identified as the author(s) of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
MATLAB is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This work's use or discussion of MATLAB software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB software. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloguing-in-Publication Data
Names: Cady, Field, 1984- author.
Title: The data science handbook / Field Cady.
Description: Hoboken, NJ: John Wiley & Sons, Inc., 2017. | Includes bibliographical references and index.
Identifiers: LCCN 2016043329 (print) | LCCN 2016046263 (ebook) | ISBN 9781119092940 (cloth) | ISBN 9781119092933 (pdf) | ISBN 9781119092926 (epub)
Subjects: LCSH: Databases-Handbooks, manuals, etc. | Statistics-Data processing-Handbooks, manuals, etc. | Big data-Handbooks, manuals, etc. | Information theory-Handbooks, manuals, etc.
Classification: LCC QA76.9.D32 C33 2017 (print) | LCC QA76.9.D32 (ebook) | DDC 005.74-dc23
LC record available at https://lccn.loc.gov/2016043329
Cover image: /Gettyimages
Cover design by Wiley
To my wife, Ryna. Thank you honey, for your support and for always believing in me.
Preface
This book was written to solve a problem. The people I interview for data science jobs have sterling mathematical pedigrees, but most of them are unable to write a simple script that computes Fibonacci numbers (in case you aren't familiar with Fibonacci numbers, this takes about five lines of code). On the other side, employers tend to view data scientists as either mysterious wizards or used-car salesmen (and when data scientists can't be trusted to write a basic script, the latter impression has some merit!). These problems reflect a fundamental misunderstanding, by all parties, of what data science is (and isn't) and what skills its practitioners need.
When I first got into data science, I was part of that problem. Years of doing academic physics had trained me to solve problems in a way that was long on abstract theory but short on common sense or flexibility. Mercifully, I also knew how to code (thanks, Google internships!), and this let me limp along while I picked up the skills and mindsets that actually mattered.
Since leaving academia, I have done data science consulting for companies of every stripe. This includes web traffic analysis for tiny start-ups, manufacturing optimizations for Fortune 100 giants, and everything in between. The problems to solve are always unique, but the skills required to solve them are strikingly universal. They are an eclectic mix of computer programming, mathematics, and business savvy. They are rarely found together in one person, but in truth they can be learned by anybody.
A few interviews I have given stand out in my mind. In each one, the candidate was smart and knowledgeable, but the interview made it painfully clear that they were unprepared for the daily work of a data scientist. What do you do as an interviewer when the candidate starts apologizing for wasting your time? We ended up filling the hour with a crash course on what they were missing and how they could fill the gaps in their knowledge. Those candidates went out, learned what they needed to, and are now successful data scientists.
I wrote this book in an attempt to help people like that out, by condensing data science's various skill sets into a single, coherent volume. It is hands-on and to the point: ideal for somebody who needs to come up to speed quickly or solve a problem on a tight deadline. The educational system has not yet caught up to the demands of this new and exciting field, and my hope is that this book will help you bridge the gap.
Field Cady
September 2016
Redmond, Washington
Chapter 1
Introduction: Becoming a Unicorn
Data science is a popular term these days, and it gets applied to so many things that its meaning has become vague. So I'd like to start this book by giving you the definition that I use. I've found that it gets right to the heart of what sets data science apart from other disciplines. Here goes:
Data science means doing analytics work that, for one reason or another, requires a substantial amount of software engineering skill.
Sometimes, the final deliverable is the kind of thing a statistician or business analyst might provide, but achieving that goal demands software skills that your typical analyst simply doesn't have. For example, a dataset might be so large that you need to use distributed computing to analyze it or so convoluted in its format that many lines of code are required to parse it. In many cases, data scientists also have to write big chunks of production software that implement their analytics ideas in real time. In practice, there are usually other differences as well. For example, data scientists usually have to extract features from raw data, which means that they tackle very open-ended problems such as how to quantify the spamminess of an e-mail.
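To make the idea of feature extraction a little more concrete, here is a minimal Python sketch of how raw e-mail text might be turned into a few numbers that hint at spamminess. The function name spam_features, the phrase list SPAM_PHRASES, and the particular features are all assumptions chosen for illustration. They are not a recommended or complete set, and a real project would choose and validate its features against labeled data.

```python
import re

# Hypothetical trigger phrases, chosen only for illustration.
SPAM_PHRASES = ("free", "winner", "act now", "click here")


def spam_features(email_text):
    """Turn raw e-mail text into a few numeric features (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", email_text)
    lowered = email_text.lower()
    n_words = max(len(words), 1)  # avoid dividing by zero on empty input
    return {
        # Fraction of words written entirely in capitals (ignoring one-letter words).
        "frac_all_caps": sum(w.isupper() and len(w) > 1 for w in words) / n_words,
        # Number of exclamation marks in the raw text.
        "n_exclamations": email_text.count("!"),
        # Total occurrences of the hypothetical trigger phrases.
        "n_spam_phrases": sum(lowered.count(p) for p in SPAM_PHRASES),
        # Number of http:// or https:// links.
        "n_links": len(re.findall(r"https?://", lowered)),
    }


if __name__ == "__main__":
    example = "WINNER! Click here to claim your FREE prize: http://example.com"
    print(spam_features(example))
```

The resulting dictionary is the kind of input a statistical model could consume. Deciding which quantities are worth computing in the first place is the open-ended part of the problem.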
It's very hard to find people who can construct good statistical models, hack quality software, and relate this all in a meaningful way to business problems. It's a lot of hats to wear! These individuals are so rare that recruiters often call them unicorns.