
Jan Lukavský - Building Big Data Pipelines with Apache Beam: Use a single programming model for both batch and stream data processing
  • Book: Building Big Data Pipelines with Apache Beam: Use a single programming model for both batch and stream data processing
  • Author: Jan Lukavský
  • Publisher: Packt Publishing
  • Genre: Computer
  • Year: 2022


Implement, run, operate, and test data processing pipelines using Apache Beam

Key Features
  • Understand how to improve usability and productivity when implementing Beam pipelines
  • Learn how to use stateful processing to implement complex use cases using Apache Beam
  • Implement, test, and run Apache Beam pipelines with the help of expert tips and techniques
Book Description

Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing.

This book will help you to confidently build data processing pipelines with Apache Beam. You'll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You'll also learn how to test and run the pipelines efficiently. As you progress, you'll explore how to structure your code for reusability and also use various Domain Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you'll understand advanced Apache Beam concepts, such as implementing your own I/O connectors.

By the end of this book, you'll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.

What you will learn
  • Understand the core concepts and architecture of Apache Beam
  • Implement stateless and stateful data processing pipelines
  • Use state and timers to process real-time events
  • Structure your code for reusability
  • Use streaming SQL to process real-time data, increasing productivity and data accessibility
  • Run a pipeline using a portable runner and implement data processing using the Apache Beam Python SDK
  • Implement Apache Beam I/O connectors using the Splittable DoFn API
Who this book is for

This book is for data engineers, data scientists, and data analysts who want to learn how Apache Beam works. Intermediate-level knowledge of the Java programming language is assumed.

Table of Contents
  1. Introduction to Data Processing with Apache Beam
  2. Implementing, Testing, and Deploying Basic Pipelines
  3. Implementing Pipelines Using Stateful Processing
  4. Structuring Code for Reusability
  5. Using SQL for Pipeline Implementation
  6. Using Your Preferred Language with Portability
  7. Extending Apache Beam's I/O Connectors
  8. Understanding How Runners Execute Pipelines

Building Big Data Pipelines with Apache Beam

Copyright © 2021 Packt Publishing

This is an Early Access product. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the content and extracts of this book may evolve as it is being developed to ensure it is up-to-date.

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

The information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing or its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Early Access Publication: Building Big Data Pipelines with Apache Beam

Early Access Production Reference: B16918

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK

ISBN: 978-1-80056-493-0

www.packt.com
Building Big Data Pipelines with Apache Beam: Use a single programming model for both batch and streaming use cases

Welcome to Packt Early Access. We're giving you an exclusive preview of this book before it goes on sale. It can take many months to write a book, but our authors have cutting-edge information to share with you today. Early Access gives you an insight into the latest developments by making chapter drafts available. The chapters may be a little rough around the edges right now, but our authors will update them over time. You'll be notified when a new version is ready.

This title is in development, with more chapters still to be written, which means you have the opportunity to have your say about the content. We want to publish books that provide useful information to you and other customers, so we'll send questionnaires out to you regularly. All feedback is helpful, so please be open about your thoughts and opinions. Our editors will work their magic on the text of the book, so we'd like your input on the technical elements and your experience as a reader. We'll also provide frequent updates on how our authors have changed their chapters based on your feedback.

You can dip in and out of this book or follow along from start to finish; Early Access is designed to be flexible. We hope you enjoy getting to know more about the process of writing a Packt book. Join the exploration of new topics by contributing your ideas and see them come to life in print.

  1. Introduction to Data Processing with Apache Beam
  2. Implementing, Testing, and Deploying Basic Pipelines
  3. Implementing Pipelines Using Stateful Processing
  4. Structuring Code for Reusability
  5. Using SQL for Pipeline Implementation
  6. Using Your Preferred Language with Portability
  7. Extending Apache Beam's I/O Connectors
  8. Understanding How Runners Execute Pipelines
1. Introduction to Data Processing with Apache Beam

Data. Big Data. Real-time Data. Data Streams. Many buzzwords that describe many things, and yet they share many common properties. Many mind-blowing applications arise from the successful application of theoretically simple logic: take data and produce knowledge. But even such a simple-sounding task turns out to be difficult when the amount of data needed to produce knowledge is ... huge, and still growing. Given the vast volumes of data produced by humanity every day, which tools should we choose to turn our simple logic into scalable solutions? Into solutions that protect our investment in creating the data extraction logic, even as new requirements arise or change on a daily basis, and as new data processing technologies are created? This book focuses on describing why Apache Beam might be a good solution to these challenges and provides a guide through the learning process.

In this chapter, we will cover the following topics:

  • Why Apache Beam?
  • Writing your first Pipeline
  • Running a Pipeline against streaming data
  • Measuring event time progress inside data streams
  • Assigning data to windows
  • Defining the lifecycle of a state in terms of windows
  • Unifying batch and streaming data processing
Technical requirements

In this first chapter, we will introduce some elementary Pipelines written using Apache Beam's Java SDK. We will use the code located on GitHub, at https://github.com/PacktPublishing/Building-Big-Data-Pipelines-with-Apache-Beam.

We will need the following tools:

  • JDK 11 (OpenJDK 11 works as well) installed, with JAVA_HOME set appropriately
  • git
  • Bash

Important note

Although it is possible to run many of the tools in this book using the Windows shell, we will focus on Bash scripting only throughout this book. We hope Windows users will be able to run Bash using virtualization, the Windows Subsystem for Linux, or any similar technology.

First of all, we need to clone the repository.

  1. We create a suitable directory for that and run:

    $ git clone https://github.com/PacktPublishing/Building-Big-Data-Pipelines-with-Apache-Beam.git
  2. This will result in a directory Building-Big-Data-Pipelines-with-Apache-Beam being created in the working directory. We then run the following in this newly created directory:

    $ ./mvnw clean install

Throughout this book, we will use the convention that the character $ denotes a Bash shell prompt; therefore, $ ./mvnw clean install means running the command ./mvnw clean install in the top-level directory of the git clone (named Building-Big-Data-Pipelines-with-Apache-Beam). By chapter1$ ../mvnw clean install we mean running the specified command in the subdirectory called chapter1, as the following snippet illustrates.
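The two forms of this convention look as follows side by side (the paths assume the repository clone described above):

    $ ./mvnw clean install             # run in the top-level directory of the clone
    chapter1$ ../mvnw clean install    # run in the chapter1 subdirectory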

Why Apache Beam?

There are several basic questions one might ask when considering a new technology to learn and apply in practice. Two fundamental questions might be:

  • What problem that I'm struggling with can the new technology help me solve?
  • What would be the costs?

Each sound technology has a well-defined selling point, something that can justify its existence in the presence of other competing technologies. In the case of Apache Beam, this selling point can be reduced to a single word: portability. Apache Beam is portable on several layers, namely:

  • Portability of Apache Beam Pipelines between multiple Runners (by Runner we mean a technology that actually executes the distributed computation described by a Pipeline's author).
  • Portability of Apache Beam's Data Processing Model to various programming languages.
  • Portability of Data Processing logic between Bounded and Unbounded Data.

Each of these points deserves a few words of explanation. By Runner portability, we mean the possibility of running existing Pipelines, written in one of the supported programming languages (for instance Java, Python, Go, Scala, or even SQL), against a data processing engine that can be chosen
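To make Runner portability more tangible, here is a minimal sketch of a Pipeline written with the Beam Java SDK (the class name FirstPipeline and the transforms used are illustrative, not taken from the book's repository). Note that the Pipeline code itself contains no reference to any particular execution engine:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class FirstPipeline {
  public static void main(String[] args) {
    // The Runner is chosen at launch time from the command line,
    // e.g. --runner=DirectRunner, so this code needs no change
    // in order to switch execution engines.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(Create.of("first", "pipeline"))
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((String word) -> word.toUpperCase()));

    pipeline.run().waitUntilFinish();
  }
}

Running this class with --runner=DirectRunner executes it locally for testing, while, for instance, --runner=FlinkRunner (with the corresponding Runner dependency on the classpath) hands the very same Pipeline to Apache Flink.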

