Anirudh Kala - Optimizing Databricks Workloads: Harness the power of Apache Spark in Azure and maximize the performance of modern big data workloads


  • Book:
    Optimizing Databricks Workloads: Harness the power of Apache Spark in Azure and maximize the performance of modern big data workloads
  • Author:
    Anirudh Kala
  • Publisher:
    Packt Publishing
  • Genre:
    Computer
  • Year:
    2021

Optimizing Databricks Workloads: Harness the power of Apache Spark in Azure and maximize the performance of modern big data workloads: summary, description and annotation


Accelerate computations and make the most of your data effectively and efficiently on Databricks

Key Features
  • Understand Spark optimizations for big data workloads and maximize performance
  • Build efficient big data engineering pipelines with Databricks and Delta Lake
  • Efficiently manage Spark clusters for big data processing
Book Description

Databricks is an industry-leading, cloud-based platform for data analytics, data science, and data engineering, supporting thousands of organizations across the world in their data journey. It is a fast, easy, and collaborative Apache Spark-based big data analytics platform for data science and data engineering in the cloud.

Optimizing Databricks Workloads begins with a brief introduction to Azure Databricks before moving quickly to the important optimization techniques. The book covers how to select the optimal Spark cluster configuration for running big data processing workloads in Databricks, useful optimization techniques for Spark DataFrames, best practices for optimizing Delta Lake, and techniques to optimize Spark jobs through Spark Core. You will also learn about real-world scenarios in which optimizing workloads in Databricks has helped organizations increase performance and save costs across various domains.

By the end of this book, you will be prepared with the necessary toolkit to speed up your Spark jobs and process your data more efficiently.

What you will learn
  • Get to grips with Spark fundamentals and the Databricks platform
  • Process big data using the Spark DataFrame API with Delta Lake
  • Analyze data using graph processing in Databricks
  • Use MLflow to manage machine learning life cycles in Databricks
  • Find out how to choose the right cluster configuration for your workloads
  • Explore file compaction and clustering methods to tune Delta tables
  • Discover advanced optimization techniques to speed up Spark jobs
Who this book is for

This book is for data engineers, data scientists, and cloud architects who have a working knowledge of Spark/Databricks and a basic understanding of data engineering principles. A working knowledge of Python is required, and prior experience with PySpark and Spark SQL is beneficial.

Table of Contents
  1. Discovering Databricks
  2. Batch and Real-Time Processing in Databricks
  3. Learning about Machine Learning and Graph Processing in Databricks
  4. Managing Spark Clusters
  5. Big Data Analytics
  6. Databricks Delta Lake
  7. Spark Core
  8. Case Studies


Chapter 1: Discovering Databricks

The original creators of Apache Spark established Databricks to solve the world's toughest data problems. Databricks was launched as a Spark-based unified data analytics platform in the cloud.

In this chapter, we will begin by understanding the internal architecture of Apache Spark. This will be followed by an introduction to the basic components of Databricks. The following topics will be covered in this chapter:

  • Introducing Spark fundamentals
  • Introducing Databricks
  • Learning about Delta Lake
Technical requirements

For this chapter, you will need the following:

  • An Azure subscription
  • Azure Databricks

The code samples for this chapter are available at https://github.com/PacktPublishing/Optimizing-Databricks-Workload/tree/main/Chapter01.

Introducing Spark fundamentals

Spark is a distributed data processing framework capable of analyzing large datasets. At its very core, it consists of the following:

  • DataFrames: Fundamental data structures consisting of rows and columns.
  • Machine Learning (ML): Spark ML provides ML algorithms for processing big data.
  • Graph processing: GraphX helps to analyze relationships between objects.
  • Streaming: Spark's Structured Streaming helps to process real-time data.
  • Spark SQL: A SQL-to-Spark engine with query plans and a cost-based optimizer.

DataFrames in Spark are built on top of Resilient Distributed Datasets (RDDs), which are now treated as the assembly language of the Spark ecosystem. Spark is compatible with several programming languages: Scala, Python, R, Java, and SQL.
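To make this concrete, here is a minimal PySpark sketch that builds a small DataFrame; the column names and values are illustrative. In a Databricks notebook the spark session already exists, so the builder lines are only needed when running outside Databricks:

from pyspark.sql import SparkSession

# Entry point; already available as `spark` in a Databricks notebook
spark = SparkSession.builder.appName("intro").getOrCreate()

# A DataFrame is rows and columns, built here from a local list
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],  # rows (illustrative data)
    ["name", "age"],               # column names
)
df.printSchema()
df.show()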

Spark uses an architecture with one driver node and multiple worker nodes; together, the driver and worker nodes constitute a Spark cluster. Under the hood, these nodes run as Java Virtual Machine (JVM) processes. The driver is responsible for assigning and coordinating work between the workers.

Figure 1.1: Spark architecture (driver and workers)

The worker nodes run executors, which host the Spark program. Each executor consists of one or more slots that act as the compute resources. Each slot can process a single unit of work at a time.

Figure 1.2: Spark architecture (executors and slots)

Every executor reserves memory for two purposes:

  • Cache
  • Computation

The cache section of the memory is used to store the DataFrames in a compressed format (called caching), while the compute section is utilized for data processing (aggregations, joins, and so on). For resource allocation, Spark can be used with a cluster manager that is responsible for provisioning the nodes of the cluster. Databricks has an in-built cluster manager as part of its overall offering.
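As a minimal sketch of how that cache memory gets used (the dataset below is illustrative), a DataFrame can be marked for caching and then materialized by an action:

df = spark.range(10_000_000)  # illustrative dataset

df.cache()      # lazily marks the DataFrame for the executors' cache memory
df.count()      # the first action materializes the cached, compressed blocks
df.count()      # subsequent actions read from cache instead of recomputing

df.unpersist()  # release the cached blocks when no longer needed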

Note

Executor slots are also called cores or threads.

Spark supports parallelism in two ways:

  • Vertical parallelism: Scaling the number of slots in the executors
  • Horizontal parallelism: Scaling the number of executors in a Spark cluster

Spark processes data by breaking it down into chunks called partitions, usually 128 MB blocks that the driver assigns to the executors to read. The size and number of partitions are decided by the driver node. While writing Spark code, we come across two kinds of operations: transformations and actions. Transformations instruct the Spark cluster to perform changes to the DataFrame. They are further categorized into narrow transformations and wide transformations. Wide transformations lead to the shuffling of data because data must move across executors, whereas narrow transformations do not cause re-partitioning across executors.
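A short sketch of the difference, reusing the spark session from earlier (the data is illustrative): a filter is narrow because each partition can be filtered where it sits, while a groupBy is wide because rows with the same key must be brought onto the same executor:

df = spark.range(1_000_000)  # a single 'id' column, illustrative

# Narrow transformation: no data movement between executors
evens = df.filter(df.id % 2 == 0)

# Wide transformation: rows are shuffled across executors by key
counts = df.groupBy((df.id % 10).alias("bucket")).count()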

Running these transformations does not make the Spark cluster do anything. Only when an action is called does the Spark cluster begin execution; hence the saying, Spark is lazy. Before executing an action, all Spark does is build a data processing plan. We call this plan the Directed Acyclic Graph (DAG). The DAG consists of the various transformations, such as read, filter, and join, and is triggered by an action.
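A small sketch of this laziness: the chained transformations below only build the plan, which explain() prints; nothing executes until the collect() action:

df = spark.range(100)

# Transformations: Spark only records them in the DAG
plan = df.filter("id > 50").selectExpr("id * 2 AS doubled")

plan.explain()           # prints the derived plan; still no execution
result = plan.collect()  # the action: only now does the cluster run a job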

Figure 1.3: Transformations and actions

Every time a DAG is triggered by an action, a Spark job is created. A Spark job is further broken down into stages; the number of stages depends on the number of times a shuffle occurs. All narrow transformations occur within one stage, while wide transformations lead to the formation of new stages. Each stage comprises one or more tasks, and each task processes one partition of data in a slot. For wide transformations, the stage execution time is determined by its slowest-running task. This is not the case with narrow transformations.

At any moment, one or more tasks run in parallel across the cluster. Every time a Spark cluster is set up, a Spark session is created. This Spark session provides the entry point into the Spark program and is accessible through the spark keyword.
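For example, in a Databricks notebook the session is already there:

print(spark.version)   # the Spark version the cluster is running
spark.range(5).show()  # a trivial job submitted through the session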

Sometimes a few tasks process small partitions while others process larger chunks; we call this data skew. Skew should always be avoided if you hope to run efficient Spark jobs. In a wide-transformation stage, execution time is determined by the slowest task, so if one task is slow, the overall stage is slow and everything waits for it to finish. Also, whenever a wide transformation is run, the number of partitions of the data in the cluster changes to 200. This is a default setting, but it can be modified using the Spark configuration.
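The setting in question is spark.sql.shuffle.partitions, which can be read and changed through the session configuration; the value 64 below is just an example:

# Defaults to '200' partitions after any shuffle
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Override for the current session, e.g. to match the cluster's slots
spark.conf.set("spark.sql.shuffle.partitions", 64)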

Figure 1.4: Jobs, stages, and tasks

As a rule of thumb, the total number of partitions should always be a multiple of the total number of slots in the cluster. For instance, if a cluster has 16 slots and the data has 16 partitions, each slot receives 1 task that processes 1 partition. With 15 partitions instead, 1 slot remains empty, leaving the cluster underutilized. With 17 partitions, the stage takes roughly twice as long to complete, because 16 tasks finish in a first wave and everything then waits for the 1 extra task to finish processing.
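A sketch under the 16-slot assumption above (the input path is illustrative): inspect the partition count, then repartition so every slot gets exactly one task:

df = spark.read.parquet("/data/events")  # illustrative path

print(df.rdd.getNumPartitions())  # current partition count
df16 = df.repartition(16)         # one partition per slot, one task each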

Let's move on from Spark for now and get acquainted with Databricks.

Introducing Databricks

Databricks provides a collaborative platform for data engineering and data science. Powered by Apache Spark, Databricks helps enable ML at scale. It has also revolutionized existing data lakes by introducing the
