Fast Data Processing with Spark Second Edition Perform real-time analytics using Spark in a fast, distributed, and scalable way Krishna Sankar Holden Karau
BIRMINGHAM - MUMBAI Fast Data Processing with Spark Second Edition Copyright 2015 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2013
Second edition: March 2015
Production reference: 1250315

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78439-257-4
www.packtpub.com

Credits

Authors: Krishna Sankar, Holden Karau
Reviewers: Robin East, Toni Verbeiren, Lijie Xu
Commissioning Editor: Akram Hussain
Acquisition Editors: Shaon Basu, Kunal Parikh
Content Development Editor: Arvind Koul
Technical Editors: Madhunikita Sunil Chindarkar, Taabish Khan
Copy Editor: Hiral Bhat
Project Coordinator: Neha Bhatnagar
Proofreaders: Maria Gould, Ameesha Green, Joanna McMahon
Indexer: Tejal Soni
Production Coordinator: Nilesh R. Mohite
Cover Work: Nilesh R. Mohite

About the Authors

Krishna Sankar is a chief data scientist at http://www.blackarrow.tv/, where he focuses on optimizing user experiences via inference, intelligence, and interfaces.
His earlier roles include principal architect, data scientist at Tata America Intl, director of a data science and bioinformatics start-up, and a distinguished engineer at Cisco. He has spoken at various conferences, such as Strata-Sparkcamp, OSCON, PyCon, and PyData, about predicting NFL (http://goo.gl/movfds), Spark (http://goo.gl/E4kqMD), data science (http://goo.gl/9pyJMH), machine learning (http://goo.gl/SXF53n), and social media analysis (http://goo.gl/D9YpVQ). He was a guest lecturer at Naval Postgraduate School, Monterey. His blogs can be found at https://doubleclix.wordpress.com/. His other passion is Lego Robotics. You can find him at the St. Louis FLL World Competition as the robots design judge.

The credit goes to my coauthor, Holden Karau, the reviewers, and the editors at Packt Publishing. Holden wrote the first edition, and I hope I was able to contribute to the same depth. I am deeply thankful to the reviewers Lijie, Robin, and Toni. They spent time diligently reviewing the material and code. They have added lots of insightful tips to the text, which I have gratefully included.
In addition, their sharp eyes caught tons of errors in the code and text. Thanks to Arvind Koul, who has been the chief force behind the book. A great editor is absolutely essential for the completion of a book, and I was lucky to have Arvind. I also want to thank the editors at Packt Publishing: Anila, Madhunikita, Milton, Neha, and Shaon, with whom I had the fortune to work at various stages. The guidance and wisdom from Joe Matarese, my boss at http://www.blackarrow.tv/, and from Paco Nathan at Databricks are invaluable.
My spouse, Usha, and son, Kaushik, were always with me, cheering me on for any endeavor that I embark upon (mostly successful, like this book, and occasionally foolhardy efforts)! I dedicate this book to my mom, who unfortunately passed away last month; she was always proud to see her eldest son as an author.

Holden Karau is a software development engineer and is active in the open source sphere. She has worked on a variety of search, classification, and distributed systems problems at Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics degree in computer science. Other than software, she enjoys playing with fire and hula hoops, and welding.

About the Reviewers

Robin East has served in a wide range of roles covering operations research, finance, IT system development, and data science.
In the 1980s, he was developing credit scoring models using data science and big data before anyone (including himself) had even heard of those terms! In the last 15 years, he has worked with numerous large organizations, implementing enterprise content search applications, content intelligence systems, and big data processing systems. He has created numerous solutions, ranging from swaps and derivatives in the banking sector to fashion analytics in the retail sector. Robin became interested in Apache Spark after realizing the limitations of the traditional MapReduce model with respect to running iterative machine learning models. His focus is now on trying to further extend the Spark machine learning libraries, and also on teaching how Spark can be used in data science and data analytics through his blog, Machine Learning at Speed (http://mlspeed.wordpress.com). Before NoSQL databases became the rage, he was an expert on tuning Oracle databases and extracting maximum performance from EMC Documentum systems.
This work took him to clients around the world and led him to create the open source profiling tool called DFCprof that is used by hundreds of EMC users to track down performance problems. For many years, he maintained the popular Documentum internals and tuning blog, Inside Documentum (http://robineast.wordpress.com), and contributed hundreds of posts to EMC support forums. These community efforts bore fruit in the form of the award of EMC MVP and acceptance into the EMC Elect program.

Toni Verbeiren graduated with a PhD in theoretical physics in 2003. He used to work on models of artificial neural networks, entailing mathematics, statistics, simulations, (lots of) data, and numerical computations.
Since then, he has been active in the industry in diverse domains and roles: infrastructure management and deployment, service management, IT management, ICT/business alignment, and enterprise architecture. Around 2010, Toni started picking up his earlier passion, which by then was called data science. The combination of data and common sense can be a very powerful basis to make decisions and analyze risk. Toni is active as an owner and consultant at Data Intuitive (http://www.data-intuitive.com/) in everything related to big data science and its applications to decision and risk management. He is currently involved in Exascience Life Lab (http://www.exascience.com/) and the Visual Data Analysis Lab (http://vda-lab.be/), which is concerned with scaling up visual analysis of biological and chemical data.
I'd like to thank various employers, clients, and colleagues for the insight and wisdom they shared with me. I'm grateful to the Belgian and Flemish governments (FWO, IWT) for financial support of the aforementioned academic projects.

Lijie Xu is a PhD student at the Institute of Software, Chinese Academy of Sciences. His research interests focus on distributed systems and large-scale data analysis. He has both academic and industrial experience at Microsoft Research Asia, Alibaba Taobao, and Tencent.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.