Apache Pulsar in Action
David Kjerrumgaard
Foreword by Matteo Merli
Manning
Shelter Island
For more information on this and other Manning titles go to
www.manning.com
Copyright
For online information and ordering of these and other Manning books, please visit www.manning.com. The publisher offers discounts on these books when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com
© 2021 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
| Manning Publications Co., 20 Baldwin Road, PO Box 761, Shelter Island, NY 11964 |
Development editor: | Karen Miller |
Technical development editor: | Alain Couniot |
Review editor: | Adriana Sabo |
Production editor: | Keri Hales |
Copy editor: | Christian Berk |
Proofreader: | Melody Dolab |
Technical proofreader: | Ninoslav Čerkez |
Typesetter: | Gordan Salinović |
Cover designer: | Marija Tudor |
ISBN: 9781617296888
dedication
To my father, who promised to read this book even if he can't understand a word of it. May your heart be filled with pride every time you tell your friends that your son is a published author.
To my mother, who believed in me when no one else did and worked so hard to ensure that I had every opportunity in life to succeed. Your hard work, faith, and tenacity have been a source of inspiration for me. Thank you for passing these traits down to me, as they have been the foundation of all my achievements, including this book.
front matter
foreword
Apache Pulsar in Action is the missing guide that will walk you through your journey with Apache Pulsar. It is a book that I'd recommend to anyone, from developers starting to explore pub-sub messaging, to someone with messaging experience, up to experienced Pulsar power users.
The Apache Pulsar project was started at Yahoo! around 2012 with the mission of experimenting with a new architecture that would be able to solve the operational challenges of existing messaging platforms. This was also a time when some significant shifts in the world of data infrastructure were starting to become more visible. Application developers started to look more and more at scalable and reliable messaging as the core component for building the next generation of products. At the same time, companies started to see large-scale real-time streaming data analytics as an essential component and business advantage.
Pulsar was designed from the ground up with the objective of bridging these two worlds, pub-sub messaging and streaming analytics, that are too often isolated in different silos. We worked toward creating an infrastructure that would represent a next generation of real-time data platforms, where one single system would be able to support all the use cases throughout the entire life cycle of data events.
Over time, that vision has expanded further, as can be clearly seen from the wide range of components described in this book. The project has added support for lightweight processing with Pulsar Functions, the Pulsar IO connectors framework, support for data schema, and many other features. What has not changed is the ultimate goal of creating the most scalable, flexible, and reliable platform for real-time data, and allowing any user to process the data stored in Pulsar in the most convenient form.
I have known and worked with this book's author, David Kjerrumgaard, for several years. Throughout this time, I've seen his passion for working with the Pulsar community. He is always able to help users make sense of technical issues, as well as to show them how Pulsar fits into the bigger picture of solving their data problems.
I particularly appreciate how Pulsar in Action is able to seamlessly mix the theory and abstract concepts with the clarity of practical step-by-step examples, and how these examples are rooted in common use cases and messaging design patterns that will surely resonate with many readers. There is truly something for everyone, and everyone will be able to get acquainted with all the aspects and the possibilities that Pulsar offers.
Matteo Merli
CTO at StreamNative
Co-creator and PMC Chair of Apache Pulsar
preface
Back in 2012, the Yahoo! team was looking for a global, geo-replicated platform that could stream all of Yahoo!'s messaging data between various apps such as Yahoo Mail and Yahoo Finance. At the time, there were generally two types of systems for handling in-motion data: message queues that handled mission-critical business events in real time, and streaming systems that handled data pipelines at scale. But there wasn't a platform that provided both of the capabilities Yahoo! required.
After vetting the messaging and streaming landscape, the team at Yahoo! concluded that existing technologies could not serve its needs, so it started building a unified messaging and streaming platform for in-motion data, named Pulsar. After four years of operation across 10 datacenters, processing billions of messages per day, Yahoo! decided to open source its messaging platform under the Apache license in 2016.
I first encountered Pulsar in the fall of 2017. I was leading the professional services team at Hortonworks focused on the streaming data platform known as Hortonworks Data Flow (HDF) that comprised Apache NiFi, Kafka, and Storm. It was my job to oversee the deployment of these technologies into a customer's infrastructure and help them get started developing streaming applications.
The greatest challenge we faced when working with Kafka was helping our customers administer it properly, and specifically determining the right number of partitions for a given topic to balance speed and efficiency while allowing for future data growth. Those of you who are familiar with Kafka are painfully aware that this seemingly simple decision has a profound impact on the scalability of your topics, and that changing this value (even from 3 to 4) necessitates a slow rebalancing process during which the topic being rebalanced is unavailable for reading or writing.