Agile Data Science 2.0
by Russell Jurney
Copyright 2017 Data Syndrome LLC. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editor: Shannon Cutt
- Production Editor: Shiny Kalapurakkel
- Copyeditor: Rachel Head
- Proofreader: Kim Cofer
- Indexer: Lucie Haskins
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
Revision History for the First Edition
- 2017-05-26: First Release
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Agile Data Science 2.0, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-96011-0
[LSI]
Preface
I wrote the first edition of this book while disabled from a car accident after which I developed chronic pain and lost partial use of my hands. Unable to chop vegetables, I wrote it from bed and the couch on an iPad to get over a failed project that haunted me called Career Explorer. Having been injured weeks before the ship date, getting the product over the line, staying up for days and doing whatever it took, became a traumatic experience. During the project, we made many mistakes I knew not to make, and I was continuously frustrated. The product bombed. A sense of failure routinely bugged me while I was stuck, horizontal on my back most of the time with intractable chronic pain. Also suffering from a heart condition, missing a third of my heartbeats, I developed dementia. My mind sank to a dark place. I could not easily find a way out. I had to find a way to fix things, to grapple with failure. Strange to say that to fix myself, I wrote a book. I needed to write directions I could give to teammates to make my next project a success. I needed to get this story out of me. More than that, I thought I could bring meaning back to my life, most of which had been shed by disability, by helping others. By doing something for the greater good. I wanted to ensure that others did not repeat my mistakes. I thought that was worth doing. There was a problem this project illustrated that was bigger than me. Most research sits on a shelf and never gets into the hands of people it can benefit. This book is a prescription and methodology for doing applied research that makes it into the world in the form of a product.
This may sound quite dramatic, but I wanted to put the first edition in personal context before introducing the second. Although it was important to me, of course, the first edition of this book was only a small contribution to the emerging field of data science. But I'm proud of it. I found salvation in its pages, it made me feel right again, and in time I recovered from illness and found a sense of accomplishment that replaced the sting of failure. So that's the first edition.
In this second edition, I hope to do more. Put simply, I want to take a budding data scientist and accelerate her into an analytics application developer. In doing so, I draw from and reflect upon my experience building analytics applications at three Hadoop shops and one Spark shop. I hope this new edition will become the go-to guide for readers to rapidly learn how to build analytics applications on data of any size, using the lingua franca of data science, Python, and the platform of choice, Spark.
Spark has replaced Hadoop/MapReduce as the default way to process data at scale, so we adopt Spark for this new edition. In addition, the theory and process of the Agile Data Science methodology have been updated to reflect an increased understanding of working in teams. It is hoped that readers of the first edition will become readers of the second. It is also hoped that this book will serve Spark users better than the original served Hadoop users.
Agile Data Science has two goals: to provide a how-to guide for building analytics applications with data of any size using Python and Spark, and to help product teams collaborate on building analytics applications in an agile manner that will ensure success.
Agile Data Science Mailing List
You can learn the latest on Agile Data Science on .
I maintain a web page for this book that contains the latest updates and related material for readers of the book.
Data Syndrome, Product Analytics Consultancy
I have founded a consultancy called Data Syndrome.
Data Syndrome offers a video course, .
Figure P-1. Data Syndrome
Figure P-2. Realtime Predictive Analytics video course
Live Training
Data Syndrome is developing a complete curriculum for live big data training for data science and data engineering teams. Current course offerings are customizable for your needs and include:
Agile Data Science
A three-day course covering the construction of full-stack analytics applications. Similar in content to this book, this course trains data scientists to be full-stack application developers.
Realtime Predictive Analytics
A one-day, six-hour course covering the construction of entire realtime predictive systems using Kafka and Spark Streaming with a web application frontend.
Introduction to PySpark
A one-day, three-hour course introducing students to basic data processing with Spark through the Python interface, PySpark. Culminates in the construction of a classifier model to predict flight delays using Spark MLlib.
For more information, visit .
Who This Book Is For
Agile Data Science is intended to help beginners and budding data scientists become productive members of data science and analytics teams. It aims to help engineers, analysts, and data scientists work with big data in an agile way using Spark. It introduces an agile methodology well suited for big data.
This book is targeted at programmers with some exposure to developing software and working with data. Designers and product managers might particularly enjoy Chapters , which will serve as an introduction to the agile process without focusing on running code.
Agile Data Science assumes you are working in a *nix environment. Examples for Windows users aren't available, but are possible via Cygwin.
How This Book Is Organized