Kafka Connect
by Mickael Maison and Kate Stanley
Copyright 2023 Mickael Maison and Kate Stanley. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Acquisitions Editor: Jessica Haberman
- Development Editor: Jeff Bleiel
- Production Editor: Gregory Hyman
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Kate Dullea
- October 2023: First Edition
Revision History for the Early Release
- 2022-02-18: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781098126537 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Kafka Connect, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O'Reilly and Red Hat. See our statement of editorial independence.
978-1-098-12653-7
Chapter 1. Apache Kafka Basics
A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form (the authors' raw and unedited content as they write) so you can take advantage of these technologies long before the official release of these titles.
This will be the second chapter of the final book.
If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the authors at .
Connect is one of the components of the Apache Kafka project. While you don't need to be a Kafka expert to use Connect, it's useful to have a basic understanding of the main concepts in order to build reliable data pipelines.
In this chapter, we give a quick overview of Kafka so that you have the basics needed to understand the rest of this book. (If you already have a good understanding of Kafka, you can skip this chapter and go directly to Chapter 3.) We explain what Kafka is, describe its use cases, and briefly introduce some of its inner workings. Finally, we discuss the different Kafka clients, including Kafka Streams, and show you how to run them against a local Kafka cluster.
If you want a deeper dive into Apache Kafka, we recommend the book Kafka: The Definitive Guide.
A Distributed Event Streaming Platform
On the official website, Kafka is described as an open-source distributed event streaming platform. While that description is technically accurate, for most people it's not immediately clear what it means, what Kafka is, or what you can use it for. Let's first look at the individual words of that description and explain what they mean.
Open Source
The project was originally created at LinkedIn, which needed a performant and flexible messaging system to process the very large amount of data generated by its users. It was released as an open source project in 2010 and joined the Apache Software Foundation in 2011. This means all the code of Apache Kafka is publicly available and can be freely used and shared as long as the Apache License 2.0 is respected.
Note
The Apache Software Foundation is a nonprofit corporation created in 1999 whose objective is to support open source projects. It provides infrastructure, tools, processes, and legal support to projects to help them develop and succeed. It is the world's largest open source foundation and, as of 2021, it supports over 300 projects totalling over 200 million lines of code.
Not only is the source code of Kafka available, but the protocols used by clients and servers are also documented. This allows third parties to write their own compatible clients and tools. It's also noteworthy that the development of Kafka happens in the open. All discussions (new features, bugs, fixes, releases) happen on public mailing lists, and any changes that may impact users have to be voted on by the community.
This also means Apache Kafka is not controlled by a single company that can change the terms of use, arbitrarily increase prices, or simply disappear. Instead, it is managed by an active group of diverse contributors. To date, Kafka has received contributions from over 800 different contributors. Out of this large group, a small subset (~50) are committers who can accept contributions and merge them into the Kafka codebase. Finally, there's an even smaller group of people (25-30) called Project Management Committee (PMC) members who oversee the governance (they can elect new committers and PMC members), set the technical direction of the project, and ensure the community around the project stays healthy. You can find the current committer and PMC member roster for Kafka on the website: https://kafka.apache.org/committers.
Distributed
Traditionally, enterprise software was deployed on a few servers, and each server was expensive and often used custom hardware. In the past 10 years, there has been a shift towards using off-the-shelf servers (with common hardware) that are cheaper and easily replaceable. This trend is highly visible in the huge popularity of cloud infrastructure services that allow you to provision standardized servers within minutes whenever needed.
Kafka is designed to be deployed over multiple servers. A server running Kafka is called a broker, and interconnected brokers form a cluster. Kafka is a distributed system because the workload is shared across all the available brokers. In addition, brokers can be added to or removed from the cluster dynamically to increase or decrease its capacity. This horizontal scalability enables Kafka to offer high throughput while providing very low latencies. Small clusters with a handful of brokers can easily handle several hundred megabytes per second, and several Internet giants, such as LinkedIn and Microsoft, have large Kafka clusters handling several trillion events per day (LinkedIn: https://engineering.linkedin.com/blog/2019/apache-kafka-trillion-messages; Microsoft: https://azure.microsoft.com/fr-fr/blog/processing-trillions-of-events-per-day-with-apache-kafka-on-azure/).
Finally, distributed systems offer resilience to failures. Kafka is able to detect when brokers leave the cluster, whether due to an issue or for scheduled maintenance. With appropriate configuration, Kafka is able to remain fully functional during these events by automatically redistributing the workload across the remaining brokers.
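To make the idea of sharing and redistributing workload more concrete, here is a minimal toy sketch in Python. It is not how Kafka actually assigns or moves work, and it does not use any Kafka API. The function names (assign, remove_broker), the broker names, and the "unit" labels are all hypothetical and chosen only for illustration. The sketch spreads units of work over three brokers and then shows one simple way the units of a departed broker could be taken over by the remaining ones.

```python
def assign(work_units, brokers):
    """Spread work units over brokers round-robin.

    Toy model only: this is an assumption for illustration,
    not Kafka's actual assignment algorithm.
    """
    if not brokers:
        raise ValueError("at least one broker is required")
    assignment = {broker: [] for broker in brokers}
    for i, unit in enumerate(work_units):
        assignment[brokers[i % len(brokers)]].append(unit)
    return assignment


def remove_broker(assignment, departed):
    """Return a new assignment without the departed broker.

    Each unit the departed broker held is handed to whichever
    remaining broker currently holds the fewest units
    (ties are broken by broker name).
    """
    remaining = {b: list(units) for b, units in assignment.items() if b != departed}
    if not remaining:
        raise ValueError("no brokers left to take over the workload")
    for unit in assignment.get(departed, []):
        target = min(remaining, key=lambda b: (len(remaining[b]), b))
        remaining[target].append(unit)
    return remaining


# Hypothetical cluster of three brokers sharing six units of work.
units = [f"unit-{i}" for i in range(6)]
cluster = assign(units, ["broker-0", "broker-1", "broker-2"])

# broker-1 leaves the cluster; its units move to the remaining brokers.
after_departure = remove_broker(cluster, "broker-1")

print(cluster)
print(after_departure)
```

In this toy model every unit of work is still held by some broker after broker-1 leaves, which is the property the paragraph above describes. The real system needs additional machinery to achieve this safely, which is why the text stresses appropriate configuration.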