Butch Quinto
Next-Generation Machine Learning with Spark
Covers XGBoost, LightGBM, Spark NLP, Distributed Deep Learning with Keras, and More
Butch Quinto
Carson, CA, USA
Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at www.apress.com/9781484256688. For more detailed information, please visit http://www.apress.com/source-code.
ISBN 978-1-4842-5668-8 e-ISBN 978-1-4842-5669-5
https://doi.org/10.1007/978-1-4842-5669-5
© Butch Quinto 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 New York Plaza, New York, NY 10004. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
This book is dedicated to my wife Aileen; my children, Matthew, Timothy, and Olivia; my sisters, Kat and Kristel; and my parents, Edgar and Cynthia.
Introduction
This book provides an accessible introduction to Spark and Spark MLlib. However, this is not yet another book on the standard Spark MLlib algorithms. This book focuses on powerful third-party machine learning algorithms and libraries beyond what is available in the standard Spark MLlib library. Some of the advanced topics I cover include XGBoost4J-Spark, LightGBM on Spark, Isolation Forest, Spark NLP, Stanford CoreNLP, Alluxio, Distributed Deep Learning with Keras and Spark using Elephas and Distributed Keras, and more.
I assume no prior experience with Spark and Spark MLlib. However, some knowledge of machine learning, Scala, and Python is helpful if you want to follow the examples in this book. I highly recommend that you work through the examples and experiment with the code samples to get the most out of this book. An introductory chapter covers Spark and Spark MLlib. If you want to jump right into more advanced topics, feel free to go straight to the chapter that interests you. This book is for practitioners. I tried to keep the book as simple and practical as possible, focusing on a hands-on approach rather than concentrating on theory (even though there is also plenty of that in this book). If you need a more thorough introduction to machine learning, I suggest you use a companion reference such as An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (Springer, 2017) or The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer, 2016). For more information on Spark MLlib, consult Apache Spark's Machine Learning Library (MLlib) Guide online. For a thorough treatment of deep learning, I recommend Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (MIT Press, 2016).
Acknowledgments
I would like to thank everyone at Apress, particularly Rita Fernando Kim, Laura C. Berendson, and Susan McDermott for all the help and support in getting this book published. It was a pleasure working with the Apress team. Several people have contributed to this book directly and indirectly. Thanks to Matei Zaharia, Joeri Hermans, Max Pumperla, Fangzhou Yang, Alejandro Correa Bahnsen, Zygmunt Zawadzki, and Irfan Elahi. Thanks to Databricks and the entire Apache Spark, ML, and AI community. A special acknowledgment to Kat, Kristel, Edgar, and Cynthia for the encouragement and support. Last but not least, thanks to my wife Aileen and children Matthew, Timothy, and Olivia.
About the Author
Butch Quinto is Founder and Chief AI Officer at Intelvi AI, an artificial intelligence company that develops cutting-edge solutions for the defense, industrial, and transportation industries. As Chief AI Officer, Butch heads strategy, innovation, research, and development. Previously, he was the Director of Artificial Intelligence at a leading technology firm and Chief Data Officer at an AI start-up. As Director of Analytics at Deloitte, he led the development of several enterprise-grade AI and IoT solutions as well as strategy, business development, and venture capital due diligence. Butch has more than 20 years of experience in various technology and leadership roles in several industries including banking and finance, telecommunications, government, utilities, transportation, e-commerce, retail, manufacturing, and bioinformatics. He is also the author of Next-Generation Big Data (Apress, 2018) and a member of the Association for the Advancement of Artificial Intelligence and the American Association for the Advancement of Science.
About the Technical Reviewer
Irfan Elahi has years of multidisciplinary experience in data science and machine learning. He has worked in a range of settings, including consultancy firms, his own start-ups, and an academic research lab. Over the years he has worked on a number of data science and machine learning projects in different niches such as telecommunication, retail, Web, public sector, and energy, with the goal of enabling businesses to derive immense value from their data assets.