Copyright 2015
All rights reserved. No portion of this book may be reproduced mechanically, electronically, or by any other means, including photocopying, without the permission of the publisher.
LEARN SPARK IN A DAY
The Ultimate Crash Course to Learning the Basics of Spark in No Time
Disclaimer
The information provided in this book is designed to provide helpful information on the subjects discussed. The author's books are only meant to provide the reader with the basic knowledge of a certain topic, without any warranties regarding whether the student will, or will not, be able to incorporate and apply all the information provided. Although the writer will make his best effort to share his insights, learning is a difficult task and each person needs a different timeframe to fully incorporate a new topic. Neither this book nor any of the author's books constitutes a promise that the reader will learn a certain topic within a certain timeframe.
Table of Contents
Ignite Solutions with Apache Spark - An Introduction
Time to Stop Worrying: Let's Play the Magic with Apache Spark
Installation of Apache Spark on Cluster
Is Scala Better Than Python for Apache Spark?
Why Developers Love to Work with Apache Spark?
Apache Spark Resource Administration and YARN Application Models
Apache Spark on YARN
REPL with Apache Spark Shell
Spark vs. Hadoop
Faster Application Development with Apache Spark
Spark GraphX and Its Importance
Let's Manipulate Structured Data with the Help of Spark SQL
Big Streaming Data with Apache Spark
Let's Build a Lambda Architecture with the Help of Spark Streaming
When Things Go Wrong, Get Help from Spark (Lambda Architecture)
Chapter 1: Ignite Solutions with Apache Spark - An Introduction
Millions and millions of people are communicating through massively connected networks, generating vast amounts of data daily. With the advancement of technology, researchers can now extract huge samples of data within a few hours. These kinds of applications leave us with one major concern: how are we supposed to process and analyze such vast amounts of large-scale data, and keep up with the speed at which they arrive, in order to provide better and faster solutions?
In this chapter, let us get introduced to Apache Spark, an open-source framework for big data analysis, look at the features it comes with, and install the framework and perform a simple analysis with it as an example.
How Spark Ignited
Originating in the AMPLab at the University of California, Berkeley in 2009, and open-sourced a year later before becoming an Apache project, Apache Spark emerged as a fast and convenient solution for performing complex analysis of large-scale data. A significant advantage of Apache Spark over existing technologies like Hadoop and Storm is that it comes as a complete solution for analyzing data from different sources, whether arriving in real time or in batches, and in various formats such as images, text and graphs. We will compare it with Hadoop, an existing MapReduce solution, later in this chapter.
Beyond MapReduce procedures, the Spark framework comes with tools for machine learning, data streaming, processing graph data and running SQL queries. These functions can either be performed individually or be combined into a pipeline, according to the user's requirements.
Some of the eye-catching features of Apache Spark are that it enables users to query data easily with its built-in set of more than 80 high-level operators, and that it allows programming in Java, Python or Scala. It can also run applications up to 100 times faster than Hadoop MapReduce in memory, and up to 10 times faster on disk.
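As a small taste of those high-level operators, here is a minimal word-count sketch in Scala. The file name input.txt and the application name are placeholders, and the local master setting is only for a quick single-machine test:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on all local cores; on a real cluster the
    // master is normally supplied by the launcher instead
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // input.txt is a placeholder path; any text file will do
    val counts = sc.textFile("input.txt")
      .flatMap(line => line.split("\\s+")) // split each line into words
      .map(word => (word, 1))              // pair every word with a count of 1
      .reduceByKey(_ + _)                  // sum the counts per word

    counts.take(10).foreach(println)       // show a few results
    sc.stop()
  }
}
```

The entire analysis is three chained operators; the same logic in plain MapReduce would require writing separate mapper and reducer classes plus job configuration.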
Next, let us look in more detail at the interesting features Spark brings to its users.
Apache Spark Support for Large-scale Data Analysis
Compared to existing technologies, Spark significantly speeds up analysis and generates results in real time by keeping data in memory. Spark can work both in-memory and on-disk, and hence handles cases where the data set is too large to fit in the combined memory of the cluster. Whether execution happens in memory or on disk can be adjusted according to your application's requirements. Since Spark keeps the intermediate results of many operations in memory, it achieves improved performance when you repeatedly work with the same set of data.
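As a rough sketch of how this trade-off is exposed to the programmer, an RDD's persist method accepts a storage level; MEMORY_AND_DISK below tells Spark to spill partitions to disk when they do not fit in memory. The example assumes the sc SparkContext that the Spark shell predefines, and logs/*.txt is a placeholder path:

```scala
import org.apache.spark.storage.StorageLevel

// Cache the data set in memory, spilling partitions to disk if memory runs out
val logs = sc.textFile("logs/*.txt")
logs.persist(StorageLevel.MEMORY_AND_DISK)

// Both passes below reuse the cached partitions instead of re-reading the files
val errors   = logs.filter(_.contains("ERROR")).count()
val warnings = logs.filter(_.contains("WARN")).count()
println(s"$errors errors, $warnings warnings")
```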
Apache Spark has optimized MapReduce functions, reducing the computational cost involved in processing the data, and it also enables the user to optimize data processing pipelines through its support for lazy evaluation. As mentioned above, Spark supports additional functionality beyond MapReduce for data processing and analysis, and has improved performance in executing arbitrary operator graphs as well.
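Lazy evaluation is easy to see in the shell (again assuming the predefined sc). Transformations such as filter and map only record what should be computed; nothing actually runs until an action such as count forces the pipeline, which is what lets Spark optimize it as one unit:

```scala
// Transformations are lazy: these three lines build a lineage graph
// but read and compute nothing yet
val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// Only this action triggers execution, as a single optimized pass
println(squares.count() + " even squares computed")
```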
Spark was originally written in the Scala language and runs inside a Java Virtual Machine (JVM). Command-line shells are available for Scala and Python, and the framework provides programming interfaces in the Scala, Java and Python languages. In addition to these three languages, applications running on the Spark framework can also be implemented in R and Clojure.
Comparison with Hadoop
As far as big data analysis is concerned, Hadoop has been the established solution for years. With its added features and optimizations, Apache Spark comes as a promising alternative to Hadoop MapReduce. Let's dig deeper to see how.
Hadoop is a good fit for sequential data processing, where each step consists of a Map and a Reduce function making a single pass over the input. But for applications involving several passes over the input, it carries the cost of a full MapReduce operation at each step of the computation. This workflow requires constant writing to and reading from disk across the cluster, and the time taken for these storage operations can significantly slow down the system. Other considerations that make using Hadoop painful are the complexity of configuring and maintaining the clusters, and the need to integrate different third-party tools depending on whether the application involves machine learning, processing of streamed data, and so on.
With its optimized design, Apache Spark runs on a setup similar to the Hadoop Distributed File System and achieves improved performance over the existing solution while providing added functionality. Because its computations are expressed as Directed Acyclic Graphs (DAGs), Spark lets you develop pipelines involving multiple processing steps over the data. It shares data across these steps in memory and allows the same set of data to be processed in parallel. Spark comes with utilities for building applications that combine different types of data analysis and processing, and hence serves as a comprehensive solution, as illustrated in the next section.
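To make the multi-pass contrast concrete, here is a toy iterative sketch, again assuming the shell's sc and a placeholder file points.txt. In Hadoop MapReduce each iteration would be a separate job re-reading its input from disk, whereas Spark keeps the cached data in memory across all ten passes:

```scala
// A toy iterative computation that repeatedly refines an estimate
// over the same data set
val points = sc.textFile("points.txt").map(_.toDouble).cache()

var estimate = 0.0
for (i <- 1 to 10) {
  // Each pass reuses the in-memory partitions instead of re-reading the file
  estimate = points.map(p => math.abs(p - estimate)).mean()
}
println(s"Final estimate: $estimate")
```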
Apache Spark Libraries
In addition to its core, Spark also provides a useful set of libraries for large-scale data analytics. A summary of some important libraries is presented in the table below.
Apache Spark Library | Usage
Spark Streaming | Facilitates real-time processing of streamed data in a micro-batch manner, using a DStream: a series of Resilient Distributed Datasets (RDDs) produced in real time (see the sketch after this table).
Spark SQL | Enables executing queries on structured data, either within the Spark application itself or through JDBC and ODBC connectors if necessary. It also supports querying data in different formats such as JSON and Apache Parquet.
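To give a flavor of the streaming library in the table above, here is a minimal DStream word-count sketch. The host and port are placeholders (locally you could feed the socket with nc -lk 9999), and each one-second micro-batch arrives as an RDD processed by the same operators used earlier:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one to receive data, one to process it
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // one-second micro-batches

    // localhost:9999 is a placeholder source
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print() // print a sample of each batch's counts

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // run until the job is stopped
  }
}
```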