Scala and Spark for Big Data Analytics
Tame big data with Scala and Apache Spark!
Md. Rezaul Karim
Sridhar Alla
BIRMINGHAM - MUMBAI
Copyright 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2017
Production reference: 1210717
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78528-084-9
www.packtpub.com
Credits
Authors: Md. Rezaul Karim, Sridhar Alla
Reviewers: Andrea Bessi, Sumit Pal
Commissioning Editor: Aaron Lazar
Acquisition Editor: Nitin Dasan
Content Development Editor: Vikas Tiwari
Technical Editor: Subhalaxmi Nadar
Copy Editor: Safis Editing
Project Coordinator: Ulhas Kambali
Proofreader: Safis Editing
Indexer: Rekha Nair
Cover Work: Melwyn Dsa
Production Coordinator: Melwyn Dsa
About the Authors
Md. Rezaul Karim is a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Aachen, Germany. He holds a BSc and an MSc in computer science. Before joining Fraunhofer FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead engineer with Samsung Electronics' distributed R&D centers in Korea, India, Vietnam, Turkey, and Bangladesh. Earlier, he worked as a research assistant in the Database Lab at Kyung Hee University, Korea, and as an R&D engineer with BMTech21 Worldwide, Korea. Even before that, he worked as a software engineer with i2SoftTechnology, Dhaka, Bangladesh.
He has more than 8 years of experience in research and development, with a solid knowledge of algorithms and data structures in C/C++, Java, Scala, R, and Python; of big data technologies such as Spark, Kafka, DC/OS, Docker, Mesos, Zeppelin, Hadoop, and MapReduce; and of deep learning technologies such as TensorFlow, DeepLearning4j, and H2O Sparkling Water. His research interests include machine learning, deep learning, semantic web, linked data, big data, and bioinformatics. He is the author of the following Packt titles:
- Large-Scale Machine Learning with Spark
- Deep Learning with TensorFlow
I am very grateful to my parents, who have always encouraged me to pursue knowledge. I also want to thank my wife Saroar, son Shadman, elder brother Mamtaz, elder sister Josna, and friends, who have endured my long monologues about the subjects in this book and have always encouraged and listened to me. Writing this book was made easier by the amazing efforts of the open source community and the great documentation of many projects out there related to Apache Spark and Scala. Furthermore, I would like to thank the acquisition, content development, and technical editors of Packt (and others who were involved in this book title) for their sincere cooperation and coordination. Additionally, without the work of numerous researchers and data analytics practitioners who shared their expertise in publications, lectures, and source code, this book might not exist at all!
Sridhar Alla is a big data expert helping small and big companies solve complex problems, such as data warehousing, governance, security, real-time processing, high-frequency trading, and establishing large-scale data science practices. He is an agile practitioner as well as a certified agile DevOps practitioner and implementer. He started his career as a storage software engineer at Network Appliance, Sunnyvale, and then worked as the chief technology officer at a cyber security firm, eIQNetworks, Boston. His job profile includes the role of director of data science and engineering at Comcast, Philadelphia. He is an avid presenter at numerous Strata, Hadoop World, Spark Summit, and other conferences. He also provides onsite and online training on several technologies. He has filed several patents with the USPTO on large-scale computing and distributed systems. He holds a bachelor's degree in computer science from JNTU, Hyderabad, India, and lives with his wife in New Jersey.
Sridhar has over 18 years of experience writing code in Scala, Java, C, C++, Python, R, and Go. He also has extensive hands-on knowledge of Spark, Hadoop, Cassandra, HBase, MongoDB, Riak, Redis, Zeppelin, Mesos, Docker, Kafka, ElasticSearch, Solr, H2O, machine learning, text analytics, distributed computing, and high-performance computing.
I would like to thank my wonderful wife, Rosie Sarkaria, for all the love and patience during the many months I spent writing this book, as well as for reviewing the countless edits I made. I would also like to thank my parents, Ravi and Lakshmi Alla, for all the support and encouragement they continue to bestow upon me. I am very grateful to my many friends, especially Abrar Hashmi and Christian Ludwig, who helped me bounce ideas around and get clarity on the various topics. Writing this book would not have been possible without the fantastic larger Apache community and the Databricks folks who are making Spark so powerful and elegant. Further, I would like to thank the acquisition, content development, and technical editors of Packt Publishing (and others who were involved in this book title) for their sincere cooperation and coordination.
About the Reviewers
Andre Baianov is an economist-turned-software developer, with a keen interest in data science. After a bachelor's thesis on data mining and a master's thesis on business intelligence, he started working with Scala and Apache Spark in 2015. He is currently working as a consultant for national and international clients, helping them build reactive architectures, machine learning frameworks, and functional programming backends.
To my wife: beneath our superficial differences, we share the same soul.
Sumit Pal is a published author with Apress for SQL on Big Data - Technology, Architecture and Innovations. He has more than 22 years of experience in the software industry in various roles, spanning companies from start-ups to enterprises.