Apache Pulsar Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/apache-pulsar/
Technology Evangelist - Big Data Analytics - Middleware - Apache Kafka

The Data Streaming Landscape 2025
https://www.kai-waehner.de/blog/2024/12/04/the-data-streaming-landscape-2025/
Wed, 04 Dec 2024 13:49:37 +0000

Data streaming is a new software category. It has grown from niche adoption to becoming a fundamental part of modern data architecture. With real-time data processing transforming industries, the ecosystem of tools, platforms, and services has evolved significantly. This blog post explores the data streaming landscape of 2025, analyzing key players, trends, and market dynamics shaping this space.

The data streaming landscape of 2025 categorizes solutions by their adoption and completeness as fully-featured data streaming platforms, as well as their deployment models, which range from self-managed setups to BYOC (Bring Your Own Cloud) and fully managed cloud services like PaaS and SaaS. While Apache Kafka remains the backbone of this ecosystem, the landscape also includes stream processing engines like Apache Flink and competitive technologies such as Pulsar and Redpanda that are built on the Kafka protocol.

This blog also explores the latest market trends and provides an outlook for 2025 and beyond, highlighting potential new entrants and evolving use cases. By the end, you’ll gain a clear understanding of the data streaming platform landscape and its trajectory in the years to come.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free ebook about data streaming use cases and industry-specific success stories.

Data Streaming in 2025: The Rise of a New Software Category

Real-time data beats slow data. That’s true across almost all use cases in any industry. Event-driven applications powered by data streaming continuously process data from any data source. This approach increases business value by increasing revenue, reducing cost, reducing risk, or improving the customer experience. The event-driven approach also ensures a future-ready architecture.

Even top researchers and advisory firms such as Forrester, Gartner, and IDG recognize data streaming as a new software category. In December 2023, Forrester released “The Forrester Wave™: Streaming Data Platforms, Q4 2023,” highlighting Microsoft, Google, and Confluent as leaders, followed by Oracle, Amazon, and Cloudera.

Data Streaming is NOT just another data integration tool. Plenty of software categories and related data platforms exist to process and analyze data. I explored how data streaming differs in a few dedicated blog series:

The Business Value of Data Streaming

A new software category opens use cases and adds business value across all industries:

Use Cases for Data Streaming with Apache Kafka by Business Value
Source: Lyndon Hedderly (Confluent)

Adding business value is crucial for any enterprise. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products.

Search my blog for your favorite industry to find plenty of case studies and architectures. Or read about use cases for Apache Kafka across industries to get started.

The Data Streaming Landscape of 2025

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category. Several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Various SaaS startups have emerged in this category in the last few years.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

It all began with the open-source framework Apache Kafka, and today, the Kafka protocol is widely adopted across various implementations, including proprietary ones. What truly matters now is leveraging the capabilities of a complete data streaming platform—one that is fully compatible with the Kafka protocol. This includes built-in features like connectors, stream processing, security, data governance, and the elimination of self-management, reducing risks and operational effort.

The Kafka Protocol is the De Facto Standard of Data Streaming

Most software vendors use Kafka (or its protocol) at the core of their data streaming platforms. Apache Kafka has become the de facto standard for data streaming.

Apache Kafka is the de facto standard API for data streaming in 2024
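To make the protocol argument concrete, here is a minimal sketch of a Java producer using the standard Apache Kafka client library (kafka-clients). The broker address, topic name, and payload are assumptions for illustration only. Against a product or cloud service that implements the Kafka protocol, typically only the bootstrap servers and security properties change; the producer code itself stays the same.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProtocolCompatibleProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumption: a local broker for illustration. For a Kafka-compatible cloud service,
        // usually only the bootstrap servers plus SASL/TLS settings change.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical topic and payload, just for the example.
            producer.send(new ProducerRecord<>("sensor-readings", "sensor-42", "{\"temperature\": 21.5}"));
            producer.flush();
        }
    }
}
```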

Additionally, “benchmarketing” (i.e., picking a sweet spot or niche scenario where you perform better than your competitor) is the favorite marketing technique to “prove” differentiators to the real Apache Kafka. Some vendors also present misleading cost-efficiency comparisons by excluding critical cloud costs such as data transfer or storage, giving an incomplete picture of the true expenses.

Apache Kafka vs. Data Streaming Platform

Many still use Kafka merely as a dumb ingestion pipeline, overlooking its potential to power sophisticated, real-time data streaming use cases. One reason is that Kafka alone lacks the full capabilities of a comprehensive data streaming platform.

A complete solution requires more than “just” Kafka. Apache Flink is becoming the de facto standard for stream processing. Data integration capabilities (connectors, clients, APIs), data governance, security, and critical 24/7 SLAs and support are important for many data streaming projects.
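To give a feeling for what Flink-based stream processing looks like in code, the following is a minimal sketch of a Flink job in Java that reads from a Kafka topic and applies a stateless filter. The broker address, topic name, and consumer group are assumptions for illustration, and the job assumes the flink-streaming-java and flink-connector-kafka dependencies are on the classpath.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FilterNewOrders {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumption: local broker and a hypothetical "orders" topic with JSON strings as values.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("orders")
                .setGroupId("flink-demo")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders-source")
           .filter(value -> value.contains("\"status\":\"NEW\"")) // stateless transformation
           .print();                                              // console sink for the demo

        env.execute("filter-new-orders");
    }
}
```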

The Data Streaming Landscape 2025 summarizes the current status of relevant products and cloud services, focusing on deployment models and the adoption/completeness of the data streaming platforms.

Data Streaming Vendors and Categories for the 2025 Landscape

The data streaming landscape changed this year. As most solutions evolve, I do not distinguish anymore between Kafka, non-Kafka, and stream processing as categories. Instead, I look at the adoption and completeness to assess the maturity of a data streaming solution from an open-source framework to a complete platform.

The deployment models also changed in the 2025 landscape. Instead of categorizing it into Self Managed, Partially Managed, and Fully Managed, I sort as follows: Self Managed, Bring Your Own Cloud (BYOC), and Cloud. The Cloud category is separated into PaaS (Platform as a Service) and SaaS (Software as a Service) to indicate that many Kafka cloud offerings are still NOT fully managed!

Please note: Intentionally, this data streaming landscape is not a complete list of frameworks, cloud services, or vendors. It is also not official research, and there is no statistical evidence behind it. If your favorite technology is missing from this diagram, I simply did not see it in my conversations with customers, prospects, partners, analysts, or the broader data streaming community.

Also, note that I focus on general data streaming infrastructure. Brilliant solutions exist for using and analyzing streaming data for specific scenarios, like time-series databases, machine learning engines, observability platforms, or purpose-built IoT solutions. These are usually complementary, often connected out of the box via a Kafka connector, or even built on top of a data streaming platform (invisible for the end user).

Adoption and Completeness of Data Streaming (X-Axis)

Data streaming is adopted more and more across all industries. The concept is not new. In “The Past, Present and Future of Stream Processing“, I explored how the data streaming journey started decades ago with research and the first purpose-built proprietary products for specific use cases like stock trading.

Open source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Does anyone still remember (or even still use) Apache Storm? 🙂

Today, most enterprises are realizing the value of data streaming for both analytical and operational use cases across industries. The cloud has brought a transformative shift, enabling businesses to start streaming and processing data with just a click, using fully managed SaaS solutions and intuitive UIs. Complete data streaming platforms now offer many built-in features that users previously had to develop themselves, including connectors, encryption, access control, governance, data sharing, and more.
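As a small example of such a built-in capability, a Kafka Connect connector is configured declaratively and registered via the Connect REST API instead of writing integration code by hand. The sketch below registers the simple file source connector that ships with Apache Kafka as a demo (in newer versions it may need to be added to the Connect plugin path explicitly); the Connect endpoint, file path, and topic name are assumptions for a local setup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterFileSourceConnector {

    public static void main(String[] args) throws Exception {
        // Hypothetical connector config for Kafka's demo file source connector.
        String connectorJson = """
            {
              "name": "demo-file-source",
              "config": {
                "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                "tasks.max": "1",
                "file": "/tmp/demo.txt",
                "topic": "demo-topic"
              }
            }
            """;

        // Assumption: a Kafka Connect worker listening on its default local REST port 8083.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode() + " " + response.body());
    }
}
```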

Capabilities of a Complete Data Streaming Platform

Data streaming vendors are on the way to building a complete Data Streaming Platform (DSP). Capabilities include:

  • Messaging (“Streaming”): Transfer messages in real-time and persist for durability, decoupling, and slow consumers (near real-time, batch, API, file).
  • Data Integration: Connect to any legacy and cloud-native sources and sinks.
  • Stream Processing: Correlate events with stateless and stateful transformation or business logic.
  • Data Governance: Ensure security, observability, data sovereignty, and compliance.
  • Developer Tooling: Enable flexibility for different personas such as software engineers, data scientists, and business analysts by providing different APIs (such as Java, Python, SQL, REST/HTTP), graphical user interfaces, and dashboards.
  • Operations Tooling and SaaS: Ease infrastructure management on premise, or take over the entire operations burden in the public cloud with serverless offerings.
  • Uptime SLAs and Support: Provide the required guarantees and expertise for critical use cases.

Evolution from Open Source Adoption to a Data Streaming Organization

Modern data streaming is not just about adopting a product; it’s about transforming the way organizations operate and derive value from their data. Hence, the adoption goes beyond product features:

  • From open source and self-operations to enterprise-grade products and SaaS.
  • From human scale to automated, serverless elasticity with consumption-based pricing.
  • From dumb ingestion pipes to complex data pipelines and business applications.
  • From analytical workloads to critical transactional (and analytical) use cases.
  • From a single data streaming cluster to a powerful edge, hybrid, and multi-cloud architecture, including integration, migration, aggregation, and disaster recovery scenarios.
  • From wild adoption across business units with uncontrolled growth using various frameworks, cloud services, and integration tools to a center of excellence (CoE) with a strategic approach, standards, best practices, and knowledge sharing in an internal community.
  • From effortful and complex human management to enterprise-wide data governance, automation, and self-service APIs.

Data Streaming Deployment Models: Self-Managed vs. BYOC vs. Cloud (Y-Axis)

Different data streaming categories exist regarding the deployment model:

  • Self-Managed: Operate nodes like Kafka Broker, Kafka Connect, and Schema Registry by yourself with your favorite scripts and tools. This can be on-premise or in the public cloud in your VPC. Reduce the operations burden via a cloud-native platform (usually Kubernetes) and related operator tools that automate operations tasks like rolling upgrades or rebalancing Kafka Partitions.
  • Bring Your Own Cloud (BYOC): Allow organizations to host Kafka within their own cloud VPC. BYOC combines some of the benefits of cloud flexibility with enhanced security and control, while it outsources most of Kafka’s management to specialized vendors. The data plane is still customer-managed, but in contrast to self-managed Kafka, the customer does not need to worry about the complexity under the hood (like rebalancing, rolling upgrades, backups) – that is what cloud-native object storage and other magic code of the BYOC control plane service take over.
  • Cloud (PaaS or SaaS): Both PaaS and SaaS solutions operate within the cloud provider’s VPC. Fully managed SaaS for data streaming takes over all operational responsibilities, including scaling, failover handling, upgrades, and performance tuning, allowing users to focus solely on integration and business logic. In contrast, partially managed PaaS reduces the operational burden by automating certain tasks like rolling upgrades and rebalancing, but still requires some level of user involvement in managing the infrastructure. Fully managed SaaS typically provides critical SLAs for support and uptime, while partially managed PaaS cannot provide such guarantees.

Most organizations prefer SaaS for data streaming when business and technical requirements allow, as it minimizes operational complexity and maximizes scalability. Other deployment models are chosen when specific constraints or needs require them.

The Evolution of BYOC Kafka Cloud Services

Cloud and On-Premise deployment options are typically well understood, but BYOC (Bring Your Own Cloud) often requires more explanation due to its unique operating model and varying implementations across vendors.

In last year’s data streaming landscape 2024, I wrote the following about BYOC for Kafka:

I do NOT believe in this approach as too many questions and challenges exist with BYOC regarding security, support, and SLAs in the case of P1 and P2 tickets and outages. Hence, I put this in the category of self-managed. That is what it is, even though the vendor touches your infrastructure. In the end, it is your risk because you have to and want to control your environment.

This statement made sense because BYOC vendors at that time required access to the customer VPC and offered a shared support model. While this is still true for some BYOC solutions, my mind changed with the innovation of BYOC by one emerging vendor: WarpStream.

WarpStream’s BYOC Operating Model with Stateless Agents in the Customer VPC

WarpStream published a new operating model for BYOC: The customer only deploys stateless agents in its VPC and provides an object storage bucket to store the data. The control plane and metadata store are fully managed by the vendor as SaaS and the vendor takes over all the complexity.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

With this innovation, BYOC is now a worthwhile third deployment option besides a self-managed and fully managed data streaming platform. It brings several benefits:

  • No access is needed by the BYOC cloud vendor to the customer VPC. The data plane (i.e., the “brokers” in the customer VPC) is stateless. The metadata/consensus is in the control plane (i.e., the cloud service in the WarpStream VPC).
  • The architecture solves sovereignty challenges and is a great fit for security and compliance requirements.
  • The BYOC offering is cheaper than self-managed Apache Kafka because it is built with cloud-native concepts and technologies in mind (e.g., zero disks and zero interzone networking fees, leveraging cloud object storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage).
  • The stateless architecture in the customer VPC makes autoscaling and elasticity very easy to implement/configure.

When to use BYOC?

WarpStream introduced an innovative share-nothing operating model that makes BYOC practical, secure, and cost-efficient. With that being said, I still recommend only looking at BYOC options for Apache Kafka in the public cloud if a fully managed and serverless data streaming platform does NOT work for you because of cost, security, or compliance reasons! When it comes to simplicity and ease of operation, nothing beats a fully managed cloud service.

And please keep in mind that NOT every BYOC cloud service provides these characteristics and benefits. Make sure to make a proper evaluation of your favorite solutions. For more details, look at my blog post: “Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud)“.

Changes in the Data Streaming Landscape from 2024 to 2025

My goal is NOT a growing landscape with tens or even hundreds of vendors and cloud services. Plenty of these pictures exist. Instead, I focus on a few technologies, vendors, and cloud offerings that I see in the field, with adoption by the broader open-source and cloud community.

I already discussed the important conceptual changes in the data streaming landscape:

  • Deployment Model: From self-managed, partially managed, and fully managed to self-managed, BYOC and cloud.
  • Streaming Categories: From different streaming categories to a single category for all data streaming platforms sorted by adoption and completeness.

Additionally, I modified the list of solutions compared to the data streaming landscape 2024 published one year ago, as I do every year.

Additions to the Data Streaming Landscape 2025

The following data streaming services were added:

  • Alibaba (Cloud): Confluent Data Streaming Service on Alibaba Cloud is an OEM partnership between Alibaba Cloud and Confluent to offer a fully managed SaaS in Mainland China. The service was announced at the end of 2021 and sees more and more traction in Asia. Alibaba acts as the contractor and provides first-level support for the end user.
  • Google Managed Service for Kafka (Cloud): Google announced this Kafka PaaS recently. The strategy looks very similar to Amazon’s MSK. Even the acronym is the same: MSK. I explored when (not) to choose Google’s Kafka cloud service after the announcement. The service is still in preview, but available to a huge customer base already.
  • Oracle Streaming with Apache Kafka (Cloud): A partially managed Apache Kafka PaaS on Oracle Cloud Infrastructure (OCI). The service is in early access, but available to a huge customer base already.
  • WarpStream (BYOC): WarpStream was acquired by Confluent. It is still included with its logo as Confluent continues to keep the brand and solution separated (at least for now).

Removals from the Data Streaming Landscape 2025

There are NO REMOVALS this year, BUT I was close to removing two technologies:

  • Apache Pulsar and StreamNative: I was close to removing Apache Pulsar as I see zero traction in the market. Around 2020, Pulsar had some traction but focused too much on Kafka FUD instead of building a vibrant community. While Kafka simplified its architecture (ZooKeeper removal), Pulsar still includes three distributed systems (ZooKeeper or alternatives like etcd, BookKeeper, and Pulsar Broker). It also pivots to the Kafka protocol trying to get some more traction again. But it seems to be too late.
  • ksqlDB (formerly called KSQL): The product is feature complete. While it is still supported by Confluent, it will not get any new features. ksqlDB is still a great piece of software for some (!) Kafka-native stream processing projects but might be removed in the future. Confluent co-founder and Kafka co-creator Jay Kreps commented on X (formerly Twitter): “Confluent went through a set of experiments in this area. What we learned is that for *platform* layers you want a clean separation. We learned this the hard way: our source available stream processing layer KSQL, lost to open-source Apache Flink. We pivoted to Flink.”

Vendor Overview for Data Streaming Platforms

All vendors of the landscape do some kind of data streaming. However, the offerings differ a lot in adoption, completeness, and vision. And many solutions are not available everywhere but only in one public cloud or only as self-managed. For detailed product information and experiences, the vendor websites and other blogs/conference talks are the best and most up-to-date resources. The following is just a summary to get an overview.

Before we do the deep dive, here again, the entire data streaming landscape for 2025:

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Self-Managed Data Streaming with Open Source and Proprietary Products

Self Managed Data Streaming Platforms including Kafka Flink Spark IBM Confluent Cloudera

The following list describes the open-source frameworks and proprietary products for self-managed data streaming deployments (in order of adoption and completeness):

  • Apache Pulsar: A competitor to Apache Kafka. Similar story and use cases, but different architecture. Kafka is a single distributed cluster – after removing the ZooKeeper dependency in 2022. Pulsar is three (!) distributed clusters: Pulsar brokers, ZooKeeper, and BookKeeper. Pulsar vs. Kafka explored the differences. And Kafka is catching up on some missing features, like Queues for Kafka.
  • StreamNative: The primary vendor behind Apache Pulsar. Not much market traction.
  • ksqlDB (usually called KSQL, even after Confluent’s rebranding): An abstraction layer on top of Kafka Streams to provide stream processing with streaming SQL. Hence, also Kafka-native. It comes with a Confluent Community License and is free to use. Sweet spot: Streaming ETL.
  • Redpanda: Implements the Kafka protocol with C++. Trying out different market strategies to define Redpanda as an alternative to a Kafka-native offering. Still in the early stage in the maturity curve. Adding tons of (immature) product features in parallel to find the right market fit in a growing Kafka market. Recently acquired Benthos to provide connectivity to data sources and sinks (similar to Kafka Connect).
  • Ververica: Well-known Flink company. Acquired by Chinese tech giant Alibaba in 2019. Not much traction in Europe and the US. Sweet spot: Flink in Mainland China.
  • Apache Flink: Becoming the de facto standard for stream processing. Open-source implementation. Provides advanced features including a powerful scalable compute engine, freedom of choice for developers between SQL, Java, and Python, APIs for Complex Event Processing (CEP), and unified APIs for stream and batch workloads.
  • Spark Streaming: The streaming part of the open-source big data processing framework Apache Spark. The enormous installed base of Spark clusters in enterprises broadens adoption thanks to solutions from Cloudera, Databricks, and the cloud service providers. Sweet spot: Analytics in (micro)batches with data stored at rest in the data lake/lakehouse.
  • Apache Kafka: The de facto standard for data streaming. Open-source implementation with a vast community. Almost all vendors rely on (parts of) this project. Often underestimated: Kafka includes data integration and stream processing capabilities with Kafka Connect and Kafka Streams, making even the open-source Apache download already more powerful than many other data streaming frameworks and products.
  • IBM / Red Hat AMQ Streams: Provides self-managed Kafka on Kubernetes via OpenShift. Kafka is part of the integration portfolio that includes other open-source frameworks like Apache Camel. Sweet spot: Existing IBM customers.
  • Cloudera: Provides Kafka, Flink, and other open-source data and analytics frameworks as a self-managed offering. The main strategy is offering one product with a vast combination of many open-source frameworks that can be deployed on any infrastructure. Sweet spot: Analytics.
  • Confluent Platform: Focuses on a complete data streaming platform including Kafka and Flink, and various advanced data streaming capabilities for operations, integration, governance, and security. Sweet spot: Unifying operational and analytical workloads, and combination with the fully managed cloud service.

Data Streaming with Bring Your Own Cloud (BYOC)

Bring Your Own Cloud BYOC Data Streaming Platforms Redpanda Databricks WarpStream

BYOC is an emerging category and is mainly used for specific challenges such as strict data security and compliance requirements. The following vendors provide dedicated BYOC offerings for data streaming (in order of adoption and completeness):

  • WarpStream (Confluent): A new entrant into the data streaming market. The cloud service is a Kafka-compatible data streaming platform built directly on top of S3. Innovated the BYOC model to enable secure and cost-effective data streaming for workloads that don’t have strong latency requirements.
  • Redpanda: The first BYOC offering on the market for data streaming. The biggest concern is the shared responsibility model of this solution because the vendor requires access to the customer VPC for operations and support. This contradicts the key BYOC principles regarding security and compliance, which are the very reasons organizations (have to) look at BYOC instead of SaaS solutions.
  • Databricks: Cloud-based data platform that provides a collaborative environment for data engineering, data science, and machine learning, built on top of Apache Spark. Data Streaming is enabled by Spark Streaming and focuses mainly on analytical workloads that are optimized from batch to near real-time.

Partially Managed Data Streaming Cloud Platforms (PaaS)

PaaS Cloud Streaming Platforms like Red Hat Amazon MSK Google Oracle Aiven

Here is an overview of relevant PaaS data streaming cloud services (in order of adoption and completeness):

  • Google Cloud Managed Service for Apache Kafka (MSK): Initially branded as Google Managed Kafka for BigQuery (likely for a better marketing push), the service enables data ingestion into lakehouses on GCP such as Google BigQuery.
  • Amazon Managed Service for Apache Flink (MSF): A partially managed service by AWS that allows customers to transform and analyze streaming data in real-time with Apache Flink. It still has some (costly) gaps around auto-scaling and is not truly serverless. Supports all Flink interfaces, i.e., SQL, Java, and Python. And non-Kafka connectors, too. Only available on AWS.
  • Oracle OCI Streaming with Apache Kafka: The service is still in early access, but available to a huge customer base already on Oracle’s cloud infrastructure.
  • Microsoft Azure HDInsight: A piece of Azure’s Hadoop infrastructure. Not intended for use cases beyond data ingestion for batch analytics.
  • Instaclustr: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies. Instaclustr also offers a (semi-)managed offering for on-premise infrastructure.
  • Amazon Kinesis: Data ingestion into AWS data stores. Mature product for a specific problem. Only available on AWS.
  • Aiven: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies.
  • IBM / Red Hat AMQ Streams: Provides Kafka as a partially managed cloud offering on OpenShift (through Red Hat). Sweet spot: Existing IBM customers.
  • Amazon Managed Service for Apache Kafka (MSK): AWS has hundreds of cloud services, and Kafka is part of that broad spectrum. MSK is only available in public AWS cloud regions; not on Outposts, Local Zones, Wavelength, etc. MSK is likely the largest partially managed Kafka service across all clouds. It evolved with new features like support for Kafka Connect and Tiered Storage. But lacks connectivity outside the AWS ecosystem and a data governance narrative.

Fully Managed Data Streaming Cloud Services (SaaS)

Fully Managed SaaS Data Streaming Platform including Confluent Alibaba Microsoft Event Hubs

Here is an overview of relevant SaaS data streaming cloud services (in order of adoption and completeness):

  • Decodable: A relatively new cloud service for Apache Flink in the early stage. Huge potential if it is combined with existing Kafka infrastructures in enterprises. But also provides pre-built connectors for non-Kafka systems. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
  • StreamNative Cloud: The primary vendor behind Apache Pulsar. Offers self-managed and fully managed solutions. StreamNative Cloud for Kafka is still at a very early stage of maturity; it is unclear whether it will ever gain real traction.
  • Ververica: Stream processing as a service powered by Apache Flink on all major cloud providers. Huge potential if it is combined with existing Kafka infrastructures in enterprises. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
  • Redpanda Cloud: Redpanda provides its data streaming as a serverless offering. Not much information is available on the website about this part of the vendor’s product portfolio.
  • Amazon MSK Serverless: Different functionalities and limitations than Amazon MSK. MSK Serverless still does not get much traction because of its limitations. Both MSK offerings exclude Kafka support in their SLAs (please read the terms and conditions).
  • Google Cloud DataFlow: Fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. Mature product for a specific problem. Only available on GCP.
  • Azure Event Hubs: A mature, fully managed cloud service. The service does one thing, and that is done very well: Data ingestion via the Kafka protocol. Hence, it is not a complete streaming platform but is more comparable to Amazon Kinesis or Google Cloud PubSub. The limited compatibility with the Kafka protocol and the high cost of the service for higher throughput are the two major blockers that come up regularly in conversations.
  • Confluent Cloud: A complete data streaming platform including Kafka and Flink as a fully managed offering. In addition to deep integration, the platform provides connectivity, security, data governance, self-service portal, disaster recovery tooling, and much more to be the most complete DSP on all major public clouds.

Potential for the Data Streaming Landscape 2026

Data streaming is a journey. So is the development of event streaming platforms and cloud services. Several established software and cloud vendors might get more traction with their data streaming offerings. And some startups might grow significantly. The following shows a few technologies that might evolve and see growing adoption in 2025:

  • New startups around the Kafka protocol emerge. The list of new frameworks and cloud services is growing every quarter. A few names I saw in some social network posts (but not much beyond that in the real world): AutoMQ, S2, Astradot, Bufstream, Responsive, tansu, Tektite, Upstash. While some focus on the messaging/streaming part, others focus on a particular piece such as building database capabilities.
  • Streaming databases like Materialize or RisingWave might become a new software category. My feeling: Very early stage of the hype cycle. We will see in 2025 if and where this technology gets more broadly adopted and what the use cases are. It is hard to answer how these will compete with emerging real-time analytics databases like Apache Druid, Apache Pinot, ClickHouse, Timeplus, Tinybird, et al. I know there are differences, but the broader community and companies need to a) understand these differences and b) find business problems for them.
  • Stream Processing SaaS startups emerge: Quix and Bytewax provide stream processing with Python. Quix now also offers a hosted offering based on Kafka Streams; as does Responsive. DeltaStream provides Apache Flink as SaaS. And many more startups emerge these days. Let’s see which of these gets traction in the market with an innovative product and business model.
  • Traditional data management vendors like MongoDB or Snowflake try to get deeper into the data streaming business. I am still a fan of separation of concerns; so I think these should keep their sweet spot and (only) provide streaming ingestion and CDC as use cases, but not (try to) compete with data streaming vendors.

Fun fact: The business model of almost all emerging startups is fully managed cloud services, not selling licenses for on-premise deployments. Many are based on open-source or open-core, and others only provide a proprietary implementation.

Although they are not aiming to be full data streaming platforms (and thus are not part of the platform landscape), other complementary tools are gaining momentum in the data streaming ecosystem. For instance, Conduktor is developing a proxy for Kafka clusters, and Lenses, though relatively quiet since its acquisition by Celonis, has recently introduced updates to its user-friendly management and developer tools. These tools address gaps that some data streaming platforms leave unfilled.

Data Streaming: A Journey, Not a Sprint

Data streaming isn’t a sprint—it’s a journey! Adopting event-driven architectures with technologies like Apache Kafka or Apache Flink requires rethinking how applications are designed, developed, deployed, and monitored. Modern data strategies involve legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud environments.

The data streaming landscape in 2025 highlights the emergence of a new software category, though it’s still in its early stages. Building such a category takes time. In discussions with customers, partners, and the community, a common sentiment emerges: “We understand the value but are just starting with the first data streaming pipelines and have a long-term plan to implement advanced stream processing and build a strategic data streaming organization.”

The Forrester Wave: Streaming Data Platforms, Q4 2023, and other industry reports from Gartner and IDG signal that this category is progressing through the hype cycle.

Last but not least, check out my Top Data Streaming Trends for 2025 to understand how the data streaming landscape fits into emerging trends:

  1. Democratization of Kafka: Apache Kafka has transitioned from a specialized tool to a foundational platform in modern data infrastructure.
  2. Kafka Protocol as the Standard: Vendors adopt the Kafka protocol over the framework, enabling flexibility with compatibility and performance trade-offs.
  3. BYOC Deployment Model: Bring Your Own Cloud gains traction for balancing security, compliance, and managed services.
  4. Flink Becomes the Standard for Stream Processing: Apache Flink rises as the premier framework for stream processing, building integration pipelines and business applications.
  5. Data Streaming for Real-Time Predictive AI and GenAI: Real-time model inference drives predictive and generative AI applications.
  6. Data Streaming Organizations: Companies unify real-time data strategies to standardize tools, governance, and collaboration.

What are your favorite open-source frameworks or cloud services for data streaming? What are your most relevant and exciting trends around Apache Kafka and Flink in 2025 to set data in motion? What does your enterprise landscape for data streaming look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. Make sure to download my free ebook about data streaming use cases and industry examples.

When NOT to Use Apache Kafka? (Lightboard Video)
https://www.kai-waehner.de/blog/2024/03/26/when-not-to-use-apache-kafka-lightboard-video/
Tue, 26 Mar 2024 06:45:11 +0000

Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post contains a lightboard video that gives you a twenty-minute explanation of the DOs and DONTs.

When NOT to Use Apache Kafka?

Disclaimer: This blog post shares a lightboard video that explains when NOT to use Apache Kafka. For a much more detailed and technical article with various use cases and case studies, check out this blog post from 2022 (which is still valid today – whenever you read it).

What is Apache Kafka, and what is it NOT?

Kafka is often misunderstood. For instance, I still hear way too often that Kafka is a message queue. Part of the reason is that some vendors only pitch it for a specific problem (such as data ingestion into a data lake or data warehouse) to sell their products. So, in short:

Kafka is…

  • a scalable real-time messaging platform to process millions of messages per second.
  • a data streaming platform for massive volumes of big data analytics and small volumes of transactional data processing.
  • a distributed storage layer that provides true decoupling for backpressure handling, support of various communication protocols, and replayability of events with guaranteed ordering.
  • a data integration framework (Kafka Connect) for streaming ETL.
  • a data processing framework (Kafka Streams) for continuous stateless or stateful stream processing.

This combination of characteristics in a single platform makes Kafka unique (and successful).
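As a minimal sketch of the Kafka Streams point above, the following Java application performs a stateful aggregation, counting events per key. The topic names and broker address are assumptions for illustration; Kafka Streams itself manages the state store and its backing changelog topic.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class PaymentsPerCustomer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-per-customer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topics: "payments" is keyed by customer ID; counts go to "payments-count".
        KStream<String, String> payments = builder.stream("payments");
        KTable<String, Long> countsPerCustomer = payments
                .groupByKey() // stateful: backed by a local state store and a changelog topic
                .count();

        countsPerCustomer.toStream()
                .to("payments-count", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```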

Kafka is NOT…

  • a proxy for millions of clients (like mobile apps) – but Kafka-native proxies (like REST or MQTT) exist for some use cases.
  • an API Management platform – but these tools are usually complementary and used for the creation, life cycle management, or the monetization of Kafka APIs.
  • a database for complex queries and batch analytics workloads – but good enough for transactional queries and relatively simple aggregations (especially with ksqlDB).
  • an IoT platform with features such as device management  – but direct Kafka-native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the appropriate approach for (some) use cases.
  • a technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different kind of software!

For these reasons, Kafka is complementary, not competitive, to these other technologies. Choose the right tool for the job and combine them!

Lightboard Video: When NOT to use Apache Kafka

The following video explores the key concepts of Apache Kafka. Afterwards, the DOs and DONTs of Kafka show how to complement data streaming with other technologies for analytics, APIs, IoT, and other scenarios.

Data Streaming Vendors and Cloud Services

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations.

Plenty of vendors offer Kafka platforms and cloud services. Many complementary open source stream processing frameworks like Apache Flink and related cloud offerings emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to get market share leveraging the Kafka protocol. Check out the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook to potential new entrants in 2025.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

Apache Kafka is a Data Streaming Platform: Combine it with other Platforms when needed!

By now, over 150,000 organizations use Apache Kafka. The Kafka protocol is the de facto standard for many open source frameworks, commercial products and serverless cloud SaaS offerings.

However, Kafka is not an all-rounder for every use case. Many projects combine Kafka with other technologies, such as databases, data lakes, data warehouses, IoT platforms, and so on. Additionally, Apache Flink is becoming the de facto standard for stream processing (but Kafka Streams is not going away and is the better choice for specific use cases).

Where do you (not) use Apache Kafka? What other technologies do you combine Kafka with? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The Data Streaming Landscape 2024
https://www.kai-waehner.de/blog/2023/12/21/the-data-streaming-landscape-2024/
Thu, 21 Dec 2023 08:47:41 +0000

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations. Plenty of vendors offer Kafka platforms and cloud services. Many complementary open source stream processing frameworks like Apache Flink and related cloud offerings emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to get market share leveraging the Kafka protocol. This blog post explores the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook to potential new entrants in 2025.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

Data Streaming is a New Software Category

Real-time data beats slow data. That’s true across almost all use cases in any industry. Event-driven applications powered by data streaming are the new black. This approach increases business value by increasing revenue, reducing cost, reducing risk, or improving the customer experience.

Plenty of software categories and related data platforms exist to process and analyze data:

  • Database: Store and execute transactional workloads.
  • Data Warehouse: Processing structured historical data to create recurring reports and unique insights.
  • Data Lake: Processing structured and semi- or unstructured big data sets with batch processing to create recurring reports and unique insights.
  • Lakehouse: A mix of data warehouse and data lake to process all data on one platform.
  • Data Streaming: Continuously process data in motion and provide data consistency across communication paradigms (like real-time, batch, request-response) instead of storing and analyzing data only at rest.

Of course, these data platforms often overlap a bit. I did a complete blog series exploring the use cases and how they complement each other.

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

The Forrester Wave™: Streaming Data Platforms, Q4 2023

Forrester is a leading research and advisory company that provides insights and analysis on various aspects of technology, business, and market trends.

The company is known for its in-depth analysis, market research reports, and frameworks that help organizations navigate the rapidly changing landscape of technology and business. Businesses and IT leaders often use Forrester’s research to understand market trends, evaluate technology solutions, and develop strategies to stay competitive in their respective industries.

In December 2023, the research company published “The Forrester Wave™: Streaming Data Platforms, Q4 2023“. Get free access to the report here. The leaders are Microsoft, Google and Confluent, followed by Oracle, Amazon, Cloudera, and a few others.

You might agree or disagree with the positions of a specific vendor regarding its offering or strategy strength. But the emergence of this new wave is proof that data streaming is a new software category, not just another hype or next-generation ETL / ESB / iPaaS tool.

Data Streaming Use Cases by Business Value

A new software category opens use cases and adds business value across all industries:

Use Cases for Data Streaming with Apache Kafka by Business Value

Adding business value is crucial for any enterprise. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products. Search my blog for your favorite industry to find plenty of case studies and architectures. Or read about use cases for Apache Kafka across industries to get started.

The Data Streaming Landscape of 2024

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category. The data streaming landscape shows that most vendors use Kafka or implement its protocol because Apache Kafka has become the de facto standard for data streaming.

New software companies have emerged in this category in the last few years. And several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Most software vendors use Kafka for their data streaming platforms. However, there is more than Kafka. Some vendors only use the Kafka protocol (Azure Event Hubs) or entirely different APIs (like Amazon Kinesis).

The following Data Streaming Landscape 2024 summarizes the current status of relevant products and cloud services.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

Please note: Intentionally, this is not a complete list of frameworks, cloud services, or vendors. It is not an official research landscape. There is no statistical evidence. If your favorite technology is not in this diagram, then I did not see it in my conversations with customers, prospects, partners, analysts, or the broader data streaming community.

Also, note that I focus on general data streaming infrastructure. Brilliant solutions exist for using and analyzing streaming data for specific scenarios, like time series databases, machine learning engines, or observability platforms. These are complementary and often connected out of the box to a streaming cluster.

Deployment Models: Self Managed vs. Fully Managed

Different data streaming categories exist regarding the deployment model:

  • Self Managed: Operate nodes like Kafka broker, Kafka Connect, Schema Registry by yourself with your favorite scripts and tools. This can be on-premise or in the public cloud in your VPC.
  • Partially Managed: Reduce the operations burden via a cloud-native platform (usually Kubernetes) and related operator tools that automate operations tasks like rolling upgrades or rebalancing Kafka Partitions. This can be on-premise or in the public cloud either in your VPC or in the vendor’s VPC.
  • Fully Managed: Leverage (truly) fully managed SaaS offerings that do 100 percent of the operations and provide critical SLAs and support to focus on integration and business logic.

What about Bring Your Own Cloud (BYOC)?

Some vendors offer a fourth deployment model: Bring Your Own Cloud (BYOC), an approach where the software vendor operates a cluster for you in your environment. BYOC is a deployment model which sits somewhere between a SaaS cloud service and a self-managed deployment.

I do NOT believe in this approach as too many questions and challenges exist with BYOC regarding security, support, and SLAs in the case of P1 and P2 tickets and outages. Hence, I put this in the category of self-managed. That is what it is, even though the vendor touches your infrastructure. In the end, it is your risk because you have to and want to control your environment.

Jack Vanlightly wrote an excellent article “On The Future Of Cloud Services And BYOC“. Jack explores the following myths:

  • Myth 1: BYOC enables better security by keeping the data in your account.
  • Myth 2: BYOC is cheaper, with a lower total cost of ownership (TCO).

I summarize the story with these two drawings and highly recommend reading Jack’s detailed article about BYOC and its trade-offs:

Encryption Challenges and Risk with BYOC
Source: Jack Vanlightly
Total Cost of Ownership with Serverless SaaS vs Bring Your Own Cloud
Source: Jack Vanlightly

Here is Jack’s conclusion: “Just as customers moved away from racking their own hardware to move to the cloud, those that try BYOC will similarly migrate away to SaaS for its simplicity, reliability, scalability and cost-effectiveness.” I fully agree.

Streaming Categories: Native Kafka vs. Protocol Compatibility vs. Stream Processing

Apache Kafka became the de facto standard for data streaming, like Amazon S3 is the de facto standard for object storage:

Apache Kafka is the de facto standard API for data streaming in 2024

Read the detailed blog post to learn more about the differences between an open-source standard like Kafka and a proprietary protocol like S3. When you explore the data streaming world, there is no way not to look at the Apache Kafka ecosystem.
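To illustrate what “de facto standard API” means in practice, here is a minimal sketch of a plain Java consumer built with the standard kafka-clients library. The endpoint and topic name are assumptions for illustration; a vendor that claims Kafka protocol compatibility should be able to run exactly this kind of client, typically with only the connection and security properties changed.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class StandardKafkaConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumption: local broker and a hypothetical topic. Against a Kafka-protocol-compatible
        // cloud service, usually only the endpoint and security properties differ.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "landscape-demo");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sensor-readings"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```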

The data streaming landscape covers three streaming categories:

  • Native Apache Kafka: The product or cloud service leverages the open source framework for real-time messaging and event storage. The Kafka API is 100% compliant. Not included are Kafka Streams and Kafka Connect; many vendors exclude these Kafka features.
  • Kafka protocol: The product or cloud service implements its own code but supports the Kafka API. These offerings are usually NOT 100% compliant. Kafka Connect and Kafka Streams are usually not part of these offerings.
  • Stream processing: Frameworks and cloud services correlate data in a stateless or stateful manner. Solutions are either Kafka-native, work together with the Kafka protocol, or run completely independent.

It is really tough to define these categories. For instance, I could add another section for Kafka Connect, or more generally, data integration. Another debate is how to clarify if a vendor supports the complete Kafka API (with or without Kafka Connect or Kafka Streams). But I do not want to have an endless list of solutions. Therefore, the focus is on the Kafka protocol (being the de facto standard for messaging and storage) and related stream processing with Kafka and non-Kafka technologies.

Changes in the Data Streaming Landscape from 2023 to 2024

My goal is NOT a growing landscape with tens or even hundreds of vendors and cloud services. Plenty of these pictures exist. Instead, I focus on a few technologies, vendors, and serverless offerings that I really see in the field used in practice, with excitement by the broader open source and cloud community. Therefore, the following changes were made compared to the data streaming landscape 2023 published a year ago:

Replaced

  • Category: I changed “Non-Kafka” to “Stream Processing” to get the right focus on data streaming, not including core messaging solutions like Google Pub/Sub or Amazon Kinesis.
  • Red Hat: Strategy shift at IBM (again). Red Hat shut down its cloud offering “Red Hat OpenShift Streams for Apache Kafka” and shifted conversations to IBM products. I replaced the logo with IBM, which still offers Kafka cloud products. But to be clear: Red Hat AMQ Streams (i.e., self-managed Kafka + Strimzi Kubernetes operator) is still an actively sold product.

Added

  • WarpStream: A new entrant into fully managed cloud services using the Kafka protocol.
  • Confluent: Added to the stream processing category with Kafka Streams, ksqlDB, and Apache Flink (via its Immerok acquisition).
  • Ververica: Provides a Flink platform for many years. Honestly, not sure why I missed them last year.
  • Amazon Managed Service for Apache Flink (MSF): Formerly known as Amazon Kinesis Data Analytics (KDA). Added to partially managed stream processing.
  • Google DataFlow: Added to fully managed stream processing.

Removed

This is a controversial section. Hence, once again, I emphasize that this is just what I see in the field, not as a statistical research or survey.

  • Google Pub/Sub: Focus on data streaming, not message brokers.
  • TIBCO: Not used for data streaming (beyond TIBCO StreamBase in the financial services market for low-latency trading).
  • DataStax: I do not see open source Apache Pulsar much. And I did not see DataStax’s Pulsar offering Luna Streaming get any traction in the market at all. Also, it seems like the vendor made another strategic shift and now focuses much more on vector databases and Generative AI (GenAI).
  • Lenses + Conduktor: The reason I removed them is not because they are not used. These are great tools. But this landscape focuses on data streaming platforms, not complementary management, monitoring or proxy tools. So many tools exist in the meantime around Kafka – this deserves its own landscape or comparison article.
  • Pravega: I did not see this technology in the field a single time in 2023.
  • Immerok: Acquired by Confluent.
  • Hazelcast: I did not see it in the real world for data streaming scenarios. The technology is well-known as an in-memory data-grid, but not for stream processing.

Since I mentioned the Forrester Wave above, you might notice that I did not include every “strong performer” from the report, e.g., TIBCO, SAS, or Hazelcast, because I don’t see any traction among these vendors in my conversations about event-driven architectures and stream processing. This is not statistical evidence, nor an attempt to make other tools look bad.

Evaluation Criteria for Data Streaming Platforms

I often recommend using the following four aspects to look at different frameworks, platforms, and cloud services to evaluate a technology for your business project or enterprise architecture strategy:

  • Cloud-native: Is the solution elastic to scale up and down? Is it fully managed / serverless, or just a bunch of server instances hosted in the cloud? Can you automate the development, operations, and testing process using DevOps, GitOps, test-driven development, and similar principles?
  • Complete: Does the solution offer all required capabilities? Data streaming requires more than just messaging or data ingestion. Hence, does it provide connectors, data processing, governance, security, self-service, open APIs, and so on?
  • Everywhere: Where can you use the solution? Cloud-only? Are all required cloud service providers supported? Is there an option to deploy in a data center or even at the edge (i.e., outside a data center), like a factory, cell tower, or retail store? How can you share data between regions, clouds or data centers? What use cases does it support (e.g., aggregation, disaster recovery, hybrid integration, migration, etc.)?
  • Supported: Is the solution mature and battle-tested? Are public case studies available for your use case or industry? Does the vendor fully support the product? What are the SLAs? Are specific features excluded from commercial enterprise support? Some vendors offer data streaming cloud services (like a managed Kafka service) and exclude support in the terms and conditions (that many people don’t read in public cloud services, unfortunately).

Let’s take a deeper look into the different data streaming categories and start with the leading technology: Native Apache Kafka…

Native Apache Kafka for Data Streaming

Starting with the leading data streaming technology, Apache Kafka, and related vendors and SaaS offerings.

The growth of the Apache Kafka community in the last few years is impressive. Here are some statistics that Jay Kreps presented last year at the data streaming conference “Current – The Next Generation of Kafka Summit” in Austin, Texas:

  • >100,000 organizations using Apache Kafka
  • >41,000 Kafka meetup attendees
  • >32,000 Stack Overflow questions
  • >12,000 Jiras for Apache Kafka
  • >31,000 Open job listings request Kafka skills

And look at the increased number of active monthly unique users downloading the Kafka Java client library with Maven:

Sonatype Maven Kafka Client Downloads
Source: Sonatype

Apache Kafka Vendors: Self-managed vs. Cloud Offerings

New software companies focus on data streaming. And even long-standing companies such as IBM and Oracle followed the trend in the past few years. On a top level – to keep it simple – three kinds of offerings exist for Apache Kafka:

Comparison of Apache Kafka Data Streaming Offerings

I made a detailed comparison of on-premise Kafka vendors and cloud services using this car analogy. Only Amazon MSK Serverless (i.e., the fully managed service, not the partially managed MSK) was not available when writing this comparison. Hence, also read Confluent Cloud versus Amazon MSK Serverless.

Kafka-native Data Streaming Products and SaaS Cloud Services in 2024

Here are a few notes on each technology as a summary.

  • Apache Kafka: The de facto standard for data streaming. Open source with a vast community. All the vendors in this list rely on (parts of) this project.
  • Confluent: Provides data streaming everywhere with Confluent Platform (self-managed) and Confluent Cloud (fully managed and available across all major cloud providers).
  • Cloudera: Provides Kafka as a self-managed offering. Focuses on combining many data technologies like Kafka, Hadoop, Spark, Flink, NiFi, and many more.
  • IBM: Provides Kafka as a partially managed cloud offering and self-managed Kafka on Kubernetes via OpenShift (through Red Hat). Kafka is part of the integration portfolio that includes other open-source frameworks like Apache Camel.
  • AWS: Provides two separate products with Amazon MSK (partially managed) and Amazon MSK Serverless (fully managed). Both products have very different functionalities and limitations. Both MSK offerings exclude Kafka support (read the terms and conditions). AWS has hundreds of cloud services, and Kafka is part of that broad spectrum. Only available in public AWS cloud regions; not on Outposts, Local Zones, Wavelength, etc.
  • Instaclustr and Aiven: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies. Instaclustr also offers a (semi-)managed offering for on-premise infrastructure.
  • Microsoft Azure HDInsight: A piece of Azure’s Hadoop infrastructure. Not intended for other use cases. Only available in public Azure cloud regions.

This is not a comparison, just a list with a few notes. Make your own evaluation of your favorite vendors. Check what you need: Cloud-native? Complete? Everywhere? Supported?

And keep in mind that many vendors exclude or do not focus on Kafka Streams and Kafka Connect and only offer incomplete Kafka; they want to sell their own integration and processing products instead. Don’t compare apples and oranges!

Open Source Frameworks and SaaS using the Kafka Protocol

A few vendors don’t rely on open-source Apache Kafka but built their own implementations of the Kafka protocol for different reasons.

Marketing will not tell you this, but the Kafka protocol compatibility is limited. This can create risk when operating existing Kafka workloads against such a cluster, and these products differ in operations and execution (which can be good or bad).

Products and Cloud Services using the Kafka Protocol in 2024

Here are a few notes on each technology as a summary:

  • Apache Pulsar: A competitor to Apache Kafka. Similar story and use cases, but a different architecture. Kafka is a single distributed cluster (after removing the ZooKeeper dependency in 2022). Pulsar is three (!) distributed clusters: Pulsar brokers, ZooKeeper, and BookKeeper. I wrote about Pulsar vs. Kafka two years ago, and the status is still more or less the same: it is too late now to gain more market traction. And Kafka is catching up on missing features such as Queues for Kafka.
  • StreamNative: The primary vendor behind Apache Pulsar. Offers self-managed and fully managed solutions. StreamNative Cloud for Kafka is still at a very early stage of maturity; I am not sure if it will ever gain strength. No surprise that you can now also choose BYOC instead.
  • Redpanda: A relatively new entrant into the data streaming market offering self-managed and fully managed products. Interesting approach to implementing the Kafka protocol with C++. It might take some market share if they can find the proper use cases and differentiators. Today, I don’t see Redpanda as an alternative to a Kafka-native offering because of its early stage in the maturity curve and no added value for solving business problems versus the added risk compared to Apache Kafka.
  • Azure Event Hubs: A mature, fully managed cloud service. The service does one thing, and that is done very well: Data ingestion via the Kafka protocol. Hence, it is not a complete streaming platform, but is more comparable to Amazon Kinesis or Google Cloud PubSub. Only available on public Azure cloud regions. The limited compatibility with the Kafka protocol and high cost of the service are the two blockers I hear regularly.
  • WarpStream: A new entrant into the data streaming market. The cloud service is a Kafka-compatible data streaming platform built directly on top of S3. The higher latency is probably acceptable for some Kafka use cases. But only the future will show whether this differentiating architecture plays a key role once other Kafka cloud services adopt Kafka’s KIP-405 for Tiered Storage (available in early access since Kafka 3.6).

Be careful about statements of vendors that reimplement the Kafka protocol. Most of these vendors oversell the Kafka protocol compatibility. Additionally, “benchmarketing” (i.e., picking a sweet spot or niche scenario where you perform better than your competitor) is the favorite marketing technique to “prove” differentiators to the real Apache Kafka.

Stream Processing Technologies

While Apache Kafka is the de facto standard for message and event storage, many complementary and competitive technologies exist for stream processing:

Stream Processing Open Source Frameworks and Cloud Services in 2024

Even more technologies emerge these days because of the growth of this software category across the globe and all industries. That’s excellent news. Data streaming is here to stay and grow.

The situation is challenging to explore as part of the data streaming landscape, as some products are complementary and competitive to the Apache Kafka ecosystem.

Fun fact: The leading conference for Kafka was rebranded from “Kafka Summit” to “Current – The Next Generation of Kafka Summit” in 2022. Why? Because data streaming is more than Kafka. Many complementary and competitive technologies were present, including vendors, booths, demos, and customer case studies. That’s a remarkable evolution of data streaming for the community and enterprises across the globe!

Apache Flink is becoming the de facto standard for stream processing. The rise of Flink looks very similar to Kafka’s adoption a few years ago:

The Rise of Open Source Streaming Processing with Apache Kafka and Apache Flink

But don’t underestimate the power and use cases of Kafka-native stream processing with Kafka Streams. The adoption rate is massive, as Kafka Streams is easy to use. And it is part of Apache Kafka. I wrote a comparison of the benefits of Apache Flink and the differences between Flink and Kafka Streams.
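To give a feel for how lightweight Kafka Streams is, here is a minimal sketch of a stream processing application. The topic names, broker address, and the assumption that payment amounts arrive as plain string values are hypothetical placeholders, not a reference implementation.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class LargePaymentFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "large-payment-filter"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read payment events, keep only the large ones, and write them to another topic.
        KStream<String, String> payments = builder.stream("payments");
        payments.filter((key, amount) -> Double.parseDouble(amount) > 1000.0)
                .to("large-payments");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```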

Some Stream Processing Products are Complementary to Kafka

Each stream processing framework or cloud service has trade-offs. While Flink gets a lot of traction, there are others, too. There is no single size that fits all use cases. Here are a few mature and emerging technologies that complement Apache Kafka.

Open Source Stream Processing Frameworks

  • Kafka Streams: Part of open-source Apache Kafka. Hence, if you download Kafka from the Apache website, it always includes the library. You should always ask yourself if you need another framework besides Kafka Streams for stream processing. The significant benefit: One technology, one vendor, one infrastructure.
  • ksqlDB (usually called KSQL, even after the rebranding): An abstraction layer on top of Kafka Streams to provide stream processing with streaming SQL. Hence, also Kafka-native. It comes with a Confluent Community License and is free to use. Sweet spot: Streaming ETL.
  • Apache Flink: Leading open-source stream processing framework. Advanced features include a powerful scalable compute engine (separated from Kafka), freedom of choice for developers between ANSI SQL, Java and Python, APIs for Complex Event Processing (CEP), and unified APIs for stream and batch workloads. A small Flink SQL sketch follows after this list.
  • Spark Streaming: The streaming part of the open source big data processing framework Apache Spark. I am still not 100 percent convinced. Kafka Streams and Apache Flink are the better choices for stream processing. However, the enormous installed base of Spark clusters in enterprises broadens adoption.
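As a rough impression of Flink’s unified SQL approach mentioned in the list above, here is a small Table API sketch that reads from a Kafka topic and computes a windowed aggregate. The table definition, topic name, and broker address are illustrative assumptions only.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkRevenuePerMinute {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical Kafka-backed source table; topic and broker are placeholders.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  order_id STRING, amount DOUBLE, order_time TIMESTAMP(3)," +
            "  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'format' = 'json'," +
            "  'scan.startup.mode' = 'earliest-offset')");

        // Continuous query: revenue per one-minute tumbling window.
        tEnv.executeSql(
            "SELECT window_start, SUM(amount) AS revenue " +
            "FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE)) " +
            "GROUP BY window_start, window_end").print();
    }
}
```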

Stream Processing Vendors and Cloud Services

  • Ververica: Well-known Flink company. Acquired by Chinese giant Alibaba in 2019. Not much traction in Europe and the US, but definitely an expert player for Flink in Asia. Personally, I have never seen the vendor used outside of Asia.
  • Decodable: A new cloud service. Very early stage. I still added it, as I think it is an excellent strategic move to build a data streaming cloud service on top of Apache Flink. Huge potential if it is combined with existing Kafka infrastructures in enterprises. But also provides pre-built connectors for non-Kafka systems.
  • Amazon Managed Service for Apache Flink (MSF): An (almost) fully managed service by AWS that allows customers to transform and analyze streaming data in real time with Apache Flink. It still has some (costly) gaps, such as auto-scaling, and is not truly serverless. Supports all Flink interfaces, i.e., SQL, Java, and Python. And non-Kafka connectors, too. Only available on AWS.
  • Confluent Cloud: A truly serverless Flink offering that is elastic and scales to zero if not used. Only supports SQL in the beginning. Java and Python support coming in 2024. Starts on AWS, but expected to be available on GCP and Azure in early 2024. Deeply integrated with fully managed Kafka, Schema Registry, Connectors, Data Governance, and other features of Confluent. Provides a seamless end-to-end developer experience for data streaming pipelines.
  • Databricks: Was the leading vendor behind Apache Spark. Today, nobody talks about Spark anymore (even if it is a key technology piece of Databricks’ cloud platform). Trying to get much more into the business of real-time data. I like the platform for analytics, reporting and AI / machine learning. But I am not convinced by the data lakehouse story around “doing everything within one big data lake”. Check out my blog series “Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?“.

Most of these services work well with solutions of other vendors. For instance, Databricks integrates easily with any Kafka environment, or Amazon MSF connects directly to Confluent’s Kafka.

Most stream processing technologies complement Apache Kafka. But stream processing frameworks like Flink or cloud services like Databricks do NOT need Kafka as an ingestion layer. There are other options…

Flink, Spark, et al. can consume data from other streaming platforms or directly from data stores like a file or database. Be careful with the latter: If you use Flink or Spark Streaming for stream processing, that’s fine. But if the first thing you do is read the data from an S3 object store, well, that is data at rest.

BUT: A common trend in the data streaming market is long-term storage of (some) events within the event streaming platform. In particular, the introduction of Tiered Storage for Kafka changed the capabilities and use cases. The support for object storage by some vendors via the S3 interface can be a complete game changer for storing and processing events in real-time with the Kafka protocol or with other analytics engines and databases in near-real-time or batch. And Apache Iceberg might be the next trend we talk about for the 2025 streaming landscape.

Understand when to use Reverse ETL from a data lake and when it is an anti-pattern!

Reverse ETL Anti Pattern vs Event Streaming with Apache Kafka

And understand that stream processing applications can also keep state: the backend of your Kafka Streams or Flink app can store state for tasks such as enrichment. A stream processing application is not just about real-time data feeds. It also correlates these real-time feeds with (already ingested) historical data. This is a common approach for metadata or business data that is updated less frequently (like from an SAP ERP system).
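Here is a minimal sketch of that enrichment pattern with Kafka Streams, assuming hypothetical topics where customer master data (e.g., replicated from an ERP system) is materialized as a table and joined with a real-time order stream:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class OrderEnrichmentTopology {

    // Builds a topology that joins a real-time order stream with slowly
    // changing customer master data held in a local state store.
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Changelog topic with customer records, materialized as a KTable (state store).
        KTable<String, String> customers = builder.table("customers");

        // Order events keyed by customer id.
        KStream<String, String> orders = builder.stream("orders");

        // Each order is enriched with the latest known customer record.
        orders.join(customers, (order, customer) -> order + " | " + customer)
              .to("orders-enriched");

        return builder.build();
    }
}
```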

Some Stream Processing Products are Competitive to Kafka

In some situations, you must evaluate whether Apache Kafka or another technology is the right choice. Here are a few open-source and cloud competitors:

  • Amazon Kinesis: Data ingestion into AWS data stores. Mature product for a specific problem. Only available on AWS.
  • Google Cloud DataFlow: Fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. Mature product for a specific problem. Only available on GCP.
  • Many competitive startups emerge around stream processing outside of the Kafka and Flink world. Let’s see if some get traction in 2024.

Amazon Kinesis and Google Cloud DataFlow are excellent cloud services if you “just” want to ingest data into a specific cloud storage. If there are no other use cases, these tools might be the right choice (if pricing at scale and other limitations work for you).

Apache Kafka is a much more flexible and strategic data streaming platform. Many projects still start with data ingestion and build the first pipeline. But providing access to the same stream of events to any other data sink or for powerful stream processing with tools like Kafka Streams or Apache Flink is a significant advantage.

Potential for the Data Streaming Landscape 2025

Data streaming is a journey. So is the development of event streaming platforms and cloud services. Several established software and cloud vendors might get more traction with their data streaming offerings. And some startups might grow significantly. The following shows a few technologies that might evolve and see growing adoption in 2024:

  • Additional Kafka cloud services like Digital Ocean or Oracle Cloud Infrastructure (OCI) Streaming might get more traction. I have not yet seen these in the field. But, e.g., combining Oracle GoldenGate with OCI Streaming might be interesting to some organizations for some use cases.
  • Hazelcast rebrands significantly and is part of Forrester’s Streaming Data Wave. With both an on-premise and serverless cloud offering, the technology might get traction for data streaming and come back to my landscape next year.
  • Streaming Databases like Materialize or RisingWave might become their own category. My feeling: super early stage of the hype cycle. We will see in 2024 if and where this technology gets more broadly adopted and what the use cases are. It is hard to say how these will compete with emerging real-time analytics databases like Apache Druid, Apache Pinot, ClickHouse, Rockset, Timeplus, et al. I know there are differences, but the broader community and companies need to a) understand these differences and b) find business problems for them.
  • SaaS Startups like Quix and Bytewax (both stream processing with Python), DeltaStream (powered by Apache Flink), and many more emerge. Let’s see which of these gets traction in the market with an innovative product and business model.
  • Existing multi-product enterprises extend their offerings around Kafka with a separate Flink service. For instance, Aiven already offers a Flink product that might get traction like its Kafka offering.
  • Traditional data management vendors like MongoDB or Snowflake try to get deeper into the data streaming business. I am still a fan of separation of concerns; so I think these should keep their sweet spot and (only) provide streaming ingestion and CDC as use cases, but not (try to) compete with data streaming vendors.

Fun fact: The business model of almost all emerging startups is fully managed cloud services, not selling licenses for on-premise deployments. Many are based on open-source or open-core, others only provide a proprietary implementation.

The Data Streaming Journey is a Long One…

Data Streaming is not a race, it is a journey! Event-driven architectures and technologies like Apache Kafka or Apache Flink require a mind shift in architecting, developing, deploying, and monitoring applications. Legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud setups are the norm, not an exception.

The data streaming landscape 2024 shows how a new software category is emerging. We are still in an early stage. A new software category takes time to create.  In most conversations with customers, partners, and the community, I hear statements like: “We see the value, but we are not there yet – we now start with building first data streaming pipelines and have a roadmap for the next years to add more advanced stream processing”.

The Forrester Wave: Streaming Data Platforms, Q4 2023 is a first proof that the category is moving forward in the hype cycle.

Looking at the competitive data streaming market, one of my favorite real-world examples for choosing the right stream processing technologies comes from DoorDash: Why companies migrate from Amazon SQS and Kinesis to Apache Kafka and Flink. The article explores the trade-offs between cloud-specific solutions like Kinesis and an open ecosystem around open-source technologies like Kafka and Flink.

Last but not least, check out my Top 5 Data Streaming Trends for 2024 to understand how the data streaming landscape fits into emerging trends: data sharing, data contracts, serverless stream processing, multi-cloud architectures, and GenAI.

What are your most relevant and exciting trends for data streaming with Apache Kafka and Flink in 2024 to set data in motion? What does your enterprise landscape for data streaming look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Data Streaming Landscape 2024 appeared first on Kai Waehner.

]]>
The Data Streaming Landscape 2023 https://www.kai-waehner.de/blog/2022/12/21/data-streaming-landscape-2023/ Wed, 21 Dec 2022 07:29:58 +0000 https://www.kai-waehner.de/?p=4953 Data streaming is a new software category to process data in motion. Apache Kafka is the de facto standard used by over 100,000 organizations. Plenty of vendors offer Kafka platforms and cloud services. Many complementary stream processing engines like Apache Flink and SaaS offerings have emerged. And competitive technologies like Pulsar and Redpanda try to get market share. This blog post explores the data streaming landscape of 2023 to summarize existing solutions and market trends.

The post The Data Streaming Landscape 2023 appeared first on Kai Waehner.

]]>
Data streaming is a new software category to process data in motion. Apache Kafka is the de facto standard used by over 100,000 organizations. Plenty of vendors offer Kafka platforms and cloud services. Many complementary stream processing engines like Apache Flink and SaaS offerings have emerged. And competitive technologies like Pulsar and Redpanda try to get market share. This blog post explores the data streaming landscape of 2023 to summarize existing solutions and market trends.

Data Streaming Landscape 2023 with Apache Kafka Flink and much more

Data streaming is a new software category

Data-driven applications are the new black. This approach increases the business value as the overall goal by increasing revenue, reducing cost, reducing risk, or improving the customer experience.

Plenty of software categories and related data platforms exist to process and analyze data:

  • Database: Store and execute transactional workloads.
  • Data Warehouse: Processing structured historical data to create recurring reports and unique insights.
  • Data Lake: Processing structured and semi- or unstructured big data sets with batch processing to create recurring reports and unique insights.
  • Lakehouse: A mix of data warehouse and data lake to process all data on one platform.
  • Data Streaming: Continuously process data in motion and provide data consistency across communication paradigms instead of storing and analyzing data at rest.

Of course, these data platforms often overlap a bit. I did a complete blog series exploring the use cases and how they complement each other.

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Data streaming use cases by business value

Use cases for data streaming exist across all industries:

Use Cases for Data Streaming with Apache Kafka by Business Value

Adding business value is crucial for any enterprise. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products. Search my blog for your favorite industry to find plenty of case studies and architectures. Or read about use cases for Apache Kafka across industries to get started.

The data streaming landscape of 2023

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category.

The data streaming landscape shows that most vendors use Kafka or implement its protocol because it has become the de facto standard.

New software companies have emerged in this category in the last few years. And several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem.

Apache Kafka is the de facto standard for data streaming, like Amazon S3 is the de facto standard for S3 object storage. Most software vendors use Kafka for their data streaming platforms. However, there is more than Kafka. Some vendors only use the Kafka protocol (Azure Event Hubs) or utterly different APIs (like Amazon Kinesis).

The following Data Streaming Landscape 2023 summarizes the current status of relevant products and cloud services:

Data Streaming Landscape 2023 around Apache Kafka and Cloud

Please note: This is not a complete list of frameworks, cloud services, or vendors. It is not an official research landscape. If your favorite technology is not in this diagram, then I did not see it in my conversations with customers, prospects, partners, analysts, or the broader data streaming community. We will probably see many more logos in this diagram in a year or two, as this is still the beginning of the data streaming era.

Also, note that I focus on general data streaming infrastructure. Brilliant solutions exist for using and analyzing streaming data for specific scenarios, like time series databases, machine learning engines, or observability platforms. These are complementary and often connected out of the box to a streaming cluster.

Evaluation criteria for data streaming platforms

I often recommend using the following four aspects to look at different frameworks, platforms, and cloud services to evaluate a technology for your business project or enterprise architecture strategy:

  • Cloud-native: Is the solution elastic to scale up and down? Is it fully managed / serverless, or just a bunch of server instances hosted in the cloud? Can you automate the development, operations, and testing process using DevOps, GitOps, test-driven development, and similar principles?
  • Complete: Does the solution offer all required capabilities? Data streaming requires more than just messaging or data ingestion. Hence, does it provide connectors, data processing, governance, security, self-service, and so on?
  • Everywhere: Where can you use the solution? Cloud-only? Are all required cloud service providers supported? Is there an option to deploy in a data center or even at the edge (i.e., outside a data center)? How can you share data between regions, clouds or data centers? What use cases are supported (e.g., aggregation, disaster recovery, hybrid integration, etc.)?
  • Supported: Is the solution mature and battle-tested? Are public case studies available for your use case or industry? Does the vendor fully support the product? What are the SLAs? Are specific features excluded from commercial enterprise support? It is a shame that this aspect needs to be evaluated. Still, some vendors offer data streaming cloud services and exclude support in the terms and conditions (that many people don’t read in cloud services, unfortunately).

Let’s take a deeper look into the different categories and start with the leading technology: Native Apache Kafka…

Apache Kafka is the de facto standard for data streaming

Starting with the leader and de facto standard Apache Kafka and related vendors and SaaS offerings. Apache Kafka became the de facto standard for data streaming like Amazon S3 is the de facto standard for object storage:

De Facto Standard API - Amazon S3 for Object Storage and Apache Kafka for Event Streaming

Read the detailed blog post to learn more about the differences between an open-source standard like Kafka and a proprietary protocol like S3.

When you explore the data streaming world, there is no way not to look at the Apache Kafka ecosystem.

Apache Kafka adoption and growth

The growth of the Apache Kafka community in the last few years is impressive. Here are some statistics that Jay Kreps presented at the data streaming conference “Current – The Next Generation of Kafka Summit” in Austin, Texas, in October 2022:

  • >100,000 organizations using Apache Kafka
  • >41,000 Kafka meetup attendees
  • >32,000 Stack Overflow questions
  • >12,000 Jiras for Apache Kafka
  • >31,000 Open job listings request Kafka skills

And look at the increased number of active monthly unique users downloading the Kafka Java client library with Maven:

Sonatype Maven Kafka Client Downloads
Source: Sonatype

Fun fact: The leading conference for Kafka was rebranded from “Kafka Summit” to “Current 2022 – The Next Generation of Kafka Summit”. Why? Because data streaming is more than Kafka. Many complementary and competitive technologies were present, including vendors, booths, demos, and customer case studies. That’s a remarkable evolution of data streaming for the community and enterprises across the globe!

Apache Kafka Vendors: self-managed vs. cloud offerings

New software companies focus on data streaming. And traditional players like IBM and Amazon jumped on the bandwagon in the past few years. On a top level – to keep it simple – three kinds of offerings exist for Apache Kafka:

Comparison of Apache Kafka Data Streaming Offerings

I made a detailed comparison of on-premise Kafka vendors and cloud services using this car analogy. Only Amazon MSK Serverless (i.e., the fully managed service, not the partially Managed MSK) was not available when writing this comparison. Hence, read Confluent Cloud versus Amazon MSK Serverless.

Kafka-native Data Streaming Products and SaaS

Here are a few notes on each vendor as a summary.

  • Apache Kafka: The de facto standard for data streaming. Open source with a vast community. All the vendors in this list rely on (parts of) this project.
  • Confluent: Provides data streaming everywhere with Confluent Platform (self-managed) and Confluent Cloud (fully managed and available across cloud providers).
  • Cloudera: Provides Kafka as a self-managed offering. Focuses on combining many data technologies like Kafka, Hadoop, Spark, Flink, NiFi, and many more.
  • Red Hat: Provides Kafka as a partially managed cloud offering and self-managed Kafka on Kubernetes via OpenShift. Kafka is part of the integration portfolio that includes other open-source frameworks like Apache Camel.
  • TIBCO: Offers Kafka for Linux and Windows. Strange product (as Kafka experts know Kafka does not work well on Windows) and minimal documentation.
  • AWS: Provides two separate products with Amazon MSK (partially managed) and Amazon MSK Serverless (fully managed). Kafka support is excluded in the MSK offerings. AWS has hundreds of cloud services, and Kafka is part of that broad spectrum. Only available on AWS clouds.
  • Instaclustr and Aiven: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies. Instaclustr also offers a (semi-)managed offering for on-premise infrastructure.
  • Microsoft Azure HDInsight. A piece of Azure’s Hadoop infrastructure. Not intended for other use cases. Only available on Azure clouds.
  • Lenses and Conduktor: Tools for managing and monitoring Kafka clusters. Complementary to the other vendors.

This is not a comparison, just a list with a few notes. Make your own evaluation of your favorite vendors. Check what you need: Cloud-native? Complete? Everywhere? Supported?

Kafka-compatible open-source frameworks and SaaS

A few vendors don’t rely on open-source Apache Kafka but built their own implementations for different reasons. The Kafka protocol compatibility is limited (though marketing will not tell you). This can create risk in operating existing Kafka workloads against the cluster and differs in operations and execution (which can be good or bad).

Kafka-compatible Open-Source Frameworks and Cloud Services

Here are a few notes on each vendor as a summary:

  • Apache Pulsar: A competitor to Apache Kafka. Similar story and use cases, but a different architecture: Kafka is one distributed cluster (after removing the ZooKeeper dependency in 2022), while Pulsar is three distributed clusters (Pulsar brokers, ZooKeeper, BookKeeper). I wrote about Pulsar vs. Kafka two years ago, and I think the status is still the same (and it is too late now to gain more market traction).
  • StreamNative: The primary vendor behind Apache Pulsar. Offers self-managed and fully managed solutions. StreamNative Cloud for Kafka is in beta and not production ready.
  • DataStax: A Pulsar offering integrated into the database-focused product portfolio. Not sure if the streaming product is just marketing or not. If you want to try out the Astra Streaming cloud service powered by Pulsar, it refers you to the multi-cloud DBaaS built on Apache Cassandra.
  • Redpanda: A new entrant into the data streaming market offering self-managed and fully managed products. Interesting approach to implementing the Kafka protocol with C++. It might take some market share if they can find the proper use cases and differentiators. Today, I don’t see Redpanda as an alternative to a Kafka-native offering because of its early stage in the maturity curve and no added value for solving business problems versus the added risk compared to Apache Kafka.
  • Azure Event Hubs: A mature, fully managed cloud service. The service does one thing, and that is done very well: Data ingestion via the Kafka protocol (with limited compatibility). Hence, it is not a complete streaming platform, but is more comparable to Amazon Kinesis or Google Cloud PubSub. Only available on Azure cloud.

Be careful about statements of vendors that reimplement the Kafka protocol. Most of these vendors oversell the Kafka protocol compatibility. Additionally, “benchmarketing” (i.e., picking a sweet spot or niche scenario where you perform better than your competitor) is the favorite marketing technique to “prove” differentiators to the real Apache Kafka.

Data streaming is more than Apache Kafka…

While Apache Kafka is the de facto standard for data streaming, many complementary and competitive technologies exist.

Data Streaming SaaS like Apache Flink Spark Databricks Amazon Kinesis and Google PubSub

Even more technologies emerge these days because of the growth of this software category across the globe and all industries. That’s excellent news. Data streaming is here to stay and grow.

The situation is challenging to explore as part of the data streaming landscape, as some products are complementary and competitive to the Apache Kafka ecosystem.

Some data streaming technologies are competitive to Kafka

In some situations, you must evaluate whether Apache Kafka or another technology is the right choice. Here are a few open-source and cloud competitors:

  • Amazon Kinesis: Data ingestion into AWS data stores. Mature product for a specific problem. Only available on AWS.
  • Google Cloud PubSub: Data ingestion into GCP data stores. Mature product for a specific problem. Only available on GCP.
  • Pravega and Hazelcast Jet: Open-source frameworks for stream processing. I added these to show that there are more than Kafka and Flink in the open-source world. Though, I see little market traction.

Amazon Kinesis and Google Cloud PubSub are excellent cloud services if you “just” want to ingest data into a specific cloud storage. If there are no other use cases, these tools might be the right choice (if pricing at scale and other limitations work for you).

Apache Kafka is a much more flexible and strategic data streaming platform. Many projects still start with data ingestion and build the first pipeline. But providing access to the same stream of events to any other data sink or for powerful stream processing with tools like Kafka Streams or Apache Flink is a significant advantage.

Some data streaming technologies are complementary to Kafka

Each stream processing framework or cloud service has trade-offs. There is no single size that fits all use cases. Here are a few mature and emerging technologies that complement Apache Kafka:

  • Apache Flink: Together with Kafka Streams (part of Apache Kafka), the leading open-source stream processing framework. Advanced features include ANSI SQL support and APIs for stream and batch workloads.
  • Decodable and Immerok: Two brand new cloud services. Very early stage. I still added them, as I think it is an excellent strategic move to build a data streaming cloud service on top of Apache Flink. Huge potential if it is combined with existing Kafka infrastructures in enterprises.
  • Spark Streaming: The streaming part of Apache Spark. I am still not 100 percent convinced. Kafka Streams and Apache Flink are the better choices for stream processing. However, the enormous installed base of Spark clusters in enterprises broadens adoption.
  • Databricks: The leading vendor behind Apache Spark. Getting or at least trying to get much more into the business of real-time data. I like the platform, but I am not convinced by the lakehouse story around “doing everything within one big data lake”. Check out my blog series “Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?“.

Most of these technologies complement Apache Kafka. But stream processing frameworks like Flink or cloud services like Databricks do NOT need Kafka as an ingestion layer. There are other options…

Flink, Spark, et al. can consume data from other streaming platforms or directly from data stores. However, be careful with the latter: If you use Flink or Spark Streaming for stream processing, that’s fine. But if the first thing to do is read the data from an S3 object store, well, that is data at rest. Don’t do stream processing with data at rest.

Or in other words, don’t store data in a database or data lake just to reverse it later. Almost all Spark Streaming examples and case studies I saw last year at conferences and customer meetings looked like this. That is an anti-pattern for stream processing!

To be clear: It is okay to ingest data from S3 or another data store to a stream processing application built with Kafka Streams, Flink, et al. This data can be used in the stateful backend for your tasks like enrichment purposes. A stream processing application is not just about real-time data feeds. It also correlates these real-time feeds with (already ingested) historical data. This is a common approach for metadata or business data that is updated less frequently (like from an SAP ERP system).

Why are Kafka Streams and KSQL missing in the data streaming landscape?

I intentionally did not put Kafka Streams and KSQL into the data streaming landscape. Both are Kafka-native stream processing technologies.

Kafka Streams, like Kafka Connect, is part of open-source Apache Kafka. Hence, the Java library is included if you download Kafka from the Apache website. It is already included in the data streaming landscape with the Kafka logo. You should always ask yourself if you need another framework besides Kafka Streams for stream processing. The significant benefit: One technology, one vendor, one infrastructure.

Many vendors exclude or do not focus on Kafka Streams and Kafka Connect and only offer incomplete Kafka; they want to sell their own integration and processing products instead.

KSQL is an abstraction layer on top of Kafka Streams to provide stream processing with streaming SQL. A great tool, also Kafka-native. It comes with a Confluent Community License and is free to use. Hence, like Kafka Streams, I see it as part of Kafka and did not explicitly put it into the data streaming landscape as a separate product. But you need to evaluate it against Flink, Decodable, and others, for your use case, of course.

The data streaming era is just beginning…

The data streaming landscape 2023 shows how a new software category is emerging. We are still in a very early stage. In most conversations with customers, partners, and the community, I hear statements like:

“We see the value, but we are not there yet – we now start with building first data streaming pipelines and have a roadmap for the next years to add more advanced stream processing”.

Data streaming is a long journey, as it is a paradigm shift. We will hopefully see a Gartner Magic Quadrant for Event Streaming and a Forrester Wave for Data Streaming in the foreseeable future, too. A new category takes time to create. But did you already notice how much more the analysts of Gartner, Forrester, and others already write about data streaming and the various vendors? I also wrote a dedicated blog explaining why data streaming is its own software category.

Looking at the competitive data streaming market, one of my favorite real-world examples for choosing the right stream processing technologies comes from DoorDash: Why companies migrate from Amazon SQS and Kinesis to Apache Kafka and Flink. The article explores the trade-offs between cloud-specific solutions like Kinesis or PubSub and an open ecosystem around open-source technologies like Kafka and Flink.

Last but not least, check out my Top 5 Data Streaming Trends for 2023 to understand how the data streaming landscape fits into emerging trends like data mesh, data sharing, and data governance.

What are your most relevant and exciting trends for data streaming and Apache Kafka in 2023 to set data in motion? What does your enterprise landscape for data streaming look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Data Streaming Landscape 2023 appeared first on Kai Waehner.

]]>
Comparison: JMS Message Queue vs. Apache Kafka https://www.kai-waehner.de/blog/2022/05/12/comparison-jms-api-message-broker-mq-vs-apache-kafka/ Thu, 12 May 2022 05:13:19 +0000 https://www.kai-waehner.de/?p=4430 Comparing JMS-based message queue (MQ) infrastructures and Apache Kafka-based data streaming is a widespread topic. Unfortunately, the battle is an apple-to-orange comparison that often includes misinformation and FUD from vendors. This blog post explores the differences, trade-offs, and architectures of JMS message brokers and Kafka deployments. Learn how to choose between JMS brokers like IBM MQ or RabbitMQ and open-source Kafka or serverless cloud services like Confluent Cloud.

The post Comparison: JMS Message Queue vs. Apache Kafka appeared first on Kai Waehner.

]]>
Comparing JMS-based message queue (MQ) infrastructures and Apache Kafka-based data streaming is a widespread topic. Unfortunately, the battle is an apple-to-orange comparison that often includes misinformation and FUD from vendors. This blog post explores the differences, trade-offs, and architectures of JMS message brokers and Kafka deployments. Learn how to choose between JMS brokers like IBM MQ or RabbitMQ and open-source Kafka or serverless cloud services like Confluent Cloud.

JMS Message Queue vs Apache Kafka Comparison

Motivation: The battle of apples vs. oranges

I have to discuss the differences and trade-offs between JMS message brokers and Apache Kafka every week in customer meetings. What annoys me most is the common misunderstandings and (sometimes) intentional FUD in various blogs, articles, and presentations about this discussion.

I recently discussed this topic with Clement Escoffier from Red Hat in the “Coding over Cocktails” Podcast: JMS vs. Kafka: Technology Smackdown. A great conversation with more agreement than you might expect from such an episode where I picked the “Kafka proponent” while Clement took over the role of the “JMS proponent”.

These aspects motivated me to write a blog series about “JMS, Message Queues, and Apache Kafka”:

I will link the other posts here as soon as they are available. Please follow my newsletter to get updated in real-time about new posts. (no spam or ads)

Special thanks to my colleague and long-term messaging and data streaming expert Heinz Schaffner for technical feedback and review of this blog series. He has worked for TIBCO, Solace, and Confluent for 25 years.

10 comparison criteria: JMS vs. Apache Kafka

This blog post explores ten comparison criteria. The goal is to explain the differences between message queues and data streaming, clarify some misunderstandings about what an API or implementation is, and give some technical background to do your evaluation to find the right tool for the job.

The list of products and cloud services is long for JMS implementations and Kafka offerings. A few examples:

  • JMS implementations of the JMS API (open source and commercial offerings): Apache ActiveMQ, Apache Qpid (using AMQP), IBM MQ (formerly MQSeries, then WebSphere MQ), JBoss HornetQ, Oracle AQ, RabbitMQ, TIBCO EMS, Solace, etc.
  • Apache Kafka products, cloud services, and rewrites (beyond the valid option of using just open-source Kafka): Confluent, Cloudera, Amazon MSK, Red Hat, Redpanda, Azure Event Hubs, etc.

Here are the criteria for comparing JMS message brokers vs. Apache Kafka and its related products/cloud services:

  1. Message broker vs. data streaming platform
  2. API Specification vs. open-source protocol implementation
  3. Transactional vs. analytical workloads
  4. Push vs. pull message consumption
  5. Simple vs. powerful and complex API
  6. Storage for durability vs. true decoupling
  7. Server-side data-processing vs. decoupled continuous stream processing
  8. Complex operations vs. serverless cloud
  9. Java/JVM vs. any programming language
  10. Single deployment vs. multi-region (including hybrid and multi-cloud) replication

Let’s now explore the ten comparison criteria.

1. Message broker vs. data streaming platform

TL;DR: JMS message brokers provide messaging capabilities to produce and consume messages. Apache Kafka is a data streaming platform that combines messaging, storage, data integration, and stream processing capabilities.

The most important aspect first: The comparison of JMS and Apache Kafka is an apple-to-orange comparison for several reasons. I would go even further and say that they are not even both fruit, as they are so different from each other.

JMS API (and implementations like IBM MQ, RabbitMQ, et al)

JMS (Java Message Service) is a Java application programming interface (API) that provides generic messaging models. The API handles the producer-consumer problem, which can facilitate the sending and receiving of messages between software systems.

Therefore, the central capability of JMS message brokers (that implement the JMS API) is to send messages from a source application to another destination in real-time. That’s it. And if that’s what you need, then JMS is the right choice for you! But keep in mind that projects must use additional tools for data integration and advanced data processing tasks.
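For illustration, this is roughly what that central capability looks like with the JMS 2.0 simplified API. The JNDI names and the queue are hypothetical, and the concrete ConnectionFactory setup depends on the broker you use.

```java
import javax.jms.ConnectionFactory;
import javax.jms.JMSContext;
import javax.jms.Queue;
import javax.naming.InitialContext;

public class JmsSendReceive {
    public static void main(String[] args) throws Exception {
        // Look up the broker-specific ConnectionFactory and Queue via JNDI;
        // the JNDI names here are hypothetical and depend on your broker.
        InitialContext jndi = new InitialContext();
        ConnectionFactory factory = (ConnectionFactory) jndi.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) jndi.lookup("jms/OrdersQueue");

        // JMS 2.0 simplified API: one context for producing and consuming.
        try (JMSContext context = factory.createContext()) {
            context.createProducer().send(queue, "order-4711 created");

            String body = context.createConsumer(queue)
                                 .receiveBody(String.class, 5000); // wait up to 5 seconds
            System.out.println("Received: " + body);
        }
    }
}
```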

Apache Kafka (open source and vendors like Confluent, Cloudera, Red Hat, Amazon, et al)

Apache Kafka is an open-source protocol implementation for data streaming. It includes:

  • Apache Kafka is the core for distributed messaging and storage. High throughput, low latency, high availability, secure.
  • Kafka Connect is an integration framework for connecting external sources/destinations to Kafka.
  • Kafka Streams is a simple Java library that enables streaming application development within the Kafka framework.

This combination of capabilities enables the building of end-to-end data pipelines and applications. That’s much more than what you can do with a message queue.
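To make the core messaging part tangible, here is a minimal producer sketch using the plain Kafka Java client. The topic name, key, and broker address are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // wait for full replication before acknowledging

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-4711", "{\"amount\": 42.0}"));
            producer.flush();
        }
    }
}
```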

2. JMS API specification vs. Apache Kafka open-source protocol implementation

TL;DR: JMS is a specification that vendors implement and extend in their opinionated way. Apache Kafka is the open-source implementation of the underlying specified Kafka protocol.

It is crucial to clarify the terms first before you evaluate JMS and Kafka:

  • Standard API: Specified by industry consortiums or other industry-neutral (often global) groups or organizations. Requires compliance tests for all features and complete certifications to become standard-compliant. Example: OPC-UA.
  • De facto standard API: Originates from an existing successful solution (an open-source framework, a commercial product, or a cloud service). Examples: Amazon S3 (proprietary from a single vendor). Apache Kafka (open source from the vibrant community).
  • API Specification: A specification document to define how vendors can implement a related product. There are no complete compliance tests or complete certifications for the implementation of all features. The consequence is a “standard API” but no portability between implementations. Example: JMS. Specifically for JMS, note that in order to be able to use the compliance suite for JMS, a commercial vendor has to sign up to very onerous reporting requirements towards Oracle.

The alternative kinds of standards have trade-offs. If you want to learn more, check out how Apache Kafka became the de facto standard for data streaming in the last few years.

Portability and migrations became much more relevant in hybrid and multi-cloud environments than in the past decades where you had your workloads in a single data center.

JMS is a specification for message-oriented middleware

JMS is a specification currently maintained under the Java Community Process as JSR 343. The latest (not yet released) version JMS 3.0 is under early development as part of Jakarta EE and rebranded to Jakarta Messaging API. Today, JMS 2.0 is the specification used in prevalent message broker implementations. Nobody knows where JMS 3.0 will go at all. Hence, this post focuses on the JMS 2.0 specification to solve real-world problems today.

I often use the term “JMS message broker” in the following sections as JMS (i.e., the API) does not specify or implement many features you know in your favorite JMS implementation. Usually, when people talk about JMS, they mean JMS message broker implementations, not the JMS API specification.

JMS message brokers and the JMS portability myth

The JMS specification was developed to provide a common Java library to access different messaging vendor’s brokers. It was intended to act as a wrapper to the messaging vendor’s proprietary APIs in the same way JDBC provided similar functionality for database APIs.

Unfortunately, this simple integration turned out not to be the case. The migration of the JMS code from one vendor’s broker to another is quite complex for several reasons:

  • Not all JMS features are mandatory (security, topic/queue labeling, clustering, routing, compression, etc.)
  • There is no JMS specification for transport
  • No specification to define how persistence is implemented
  • No specification to define how fault tolerance or high availability is implemented
  • Different interpretations of the JMS specification by different vendors result in potentially other behaviors for the same JMS functions
  • No specification for security
  • There is no specification for value-added features in the brokers (such as topic to queue bridging, inter-broker routing, access control lists, etc.)

Therefore, simple source code migration and interoperability between JMS vendors is a myth! This sounds crazy, doesn’t it?

Vendors provide a great deal of unique functionality within the broker (such as topic-to-queue mapping, broker routing, etc.) that provides architectural functionality to the application but is part of the broker, not the application or the JMS specification.

Apache Kafka is an open-source protocol implementation for data streaming

Apache Kafka is an implementation to do reliable and scalable data streaming in real-time. The project is open-source and available under Apache 2.0 license, and is driven by a vast community.

Apache Kafka is NOT a standard like OPC-UA or a specification like JMS. However, Kafka at least provides the source code reference implementation, protocol and API definitions, etc.

Kafka established itself as the de facto standard for data streaming. Today, over 100,000 organizations use Apache Kafka. The Kafka API became the de facto standard for event-driven architectures and event streaming. Use cases across all industries and infrastructure. Including various kinds of transactional and analytic workloads. Edge, hybrid, multi-cloud. I collected a few examples across verticals that use Apache Kafka to show the prevalence across markets.

Now, hold on. I used the term Kafka API in the above section. Let’s clarify this: As discussed, Apache Kafka is an implementation of a distributed data streaming platform including the server-side and client-side and various APIs for producing and consuming events, configuration, security, operations, etc. The Kafka API is relevant, too, as Kafka rewrites like Azure Event Hubs and Redpanda use it.

Portability of Apache Kafka – yet another myth?

If you use Apache Kafka as an open-source project, this is the complete Kafka implementation. Some vendors use the full Apache Kafka implementation and build a more advanced product around it.

Here, the migration is super straightforward, as Kafka is not just a specification that each vendor implements differently. Instead, it is the same code, libraries, and packages.

For instance, I have seen several successful migrations from Cloudera to Confluent deployments or from self-managed Apache Kafka open-source infrastructure to serverless Confluent Cloud.

The Kafka API – Kafka rewrites like Azure Event Hubs, Redpanda, Apache Pulsar

With the global success of Kafka, some vendors and cloud services did not build a product on top of the Apache Kafka implementation. Instead, they built their own implementation on top of the Kafka API. The underlying implementation is proprietary (like in Azure’s cloud service Event Hubs) or open-source (like Apache Pulsar’s Kafka bridge or Redpanda’s rewrite in C++).

Be careful and analyze if vendors integrate the whole Apache Kafka project or rewrote the complete API. Contrary to the battle-tested Apache Kafka project, a Kafka rewrite using the Kafka API is a completely new implementation!

Many vendors even exclude some components or APIs (like Kafka Connect for data integration or Kafka Streams for stream processing) completely or exclude critical features like exactly-once semantics or long-term storage in their support terms and conditions.

It is up to you to evaluate the different Kafka offerings and their limitations. Recently, I compared Kafka vendors such as Confluent, Cloudera, Red Hat, or Amazon MSK and related technologies like Azure Event Hubs, AWS Kinesis, Redpanda, or Apache Pulsar.

Just battle-test the requirements yourself. If you find a Kafka-to-XYZ bridge with less than a hundred lines of code, or a .exe Windows Kafka server download from a middleware vendor, be skeptical! 🙂

All that glitters is not gold. Some frameworks or vendors sound too good to be true. Just saying that you support the Kafka API, that you provide a fully managed serverless Kafka offering, or that you scale much better is not trustworthy if you constantly have to spread fear, uncertainty, and doubt (FUD) about Kafka to claim you are much better. For instance, I was annoyed by Pulsar always trying to look better than Kafka by creating a lot of FUD and myths in the open-source community. I responded in my Apache Pulsar vs. Kafka comparison two years ago. FUD is the wrong strategy for any vendor. It does not work. For that reason, Kafka’s adoption still grows like crazy while Pulsar grows much slower percentage-wise (even though the download numbers are on a much lower level anyway).

3. Transactional vs. analytical workloads

TL;DR: A JMS message broker provides transactional capabilities for low volumes of messages. Apache Kafka supports low and high volumes of messages supporting transactional and analytical workloads.

JMS – Session and two-phase commit (XA) transactions

Most JMS message brokers have good support for transactional workloads.

A transacted session supports a single series of transactions. Each transaction groups a set of produced messages and a set of consumed messages into an atomic unit of work.

Two-phase commit transactions (XA transactions) work on a limited scale. They are used to integrate with other systems like Mainframe CICS / DB2 or Oracle database. But it is hard to operate and not possible to scale beyond a few transactions per second.

It is important to note that support for XA transactions is not mandatory with the JMS 2.0 specification. This differs from the session transaction.
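As a sketch (not taken from any specific broker’s documentation), a local, non-XA transacted session with the JMS 2.0 API looks roughly like this; the consume and the produce only become visible together when the transaction commits. The queues are hypothetical.

```java
import javax.jms.ConnectionFactory;
import javax.jms.JMSConsumer;
import javax.jms.JMSContext;
import javax.jms.Queue;

public class TransactedForwarder {
    // Consumes from an inbound queue and forwards to an outbound queue
    // as one atomic unit of work (local JMS transaction, no XA).
    public static void forwardOne(ConnectionFactory factory, Queue in, Queue out) {
        try (JMSContext context = factory.createContext(JMSContext.SESSION_TRANSACTED)) {
            JMSConsumer consumer = context.createConsumer(in);
            String body = consumer.receiveBody(String.class, 5000);
            if (body != null) {
                context.createProducer().send(out, body);
                context.commit();   // consume + produce become visible together
            } else {
                context.rollback(); // nothing received; undo the receive attempt
            }
        }
    }
}
```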

Kafka – Exactly-once semantics and transaction API

Kafka is a distributed, fault-tolerant system that is resilient by nature (if you deploy and operate it correctly). Zero downtime and zero data loss can be guaranteed, just like with your favorite database, mainframe, or other core platforms.

And even better: Kafka’s Transaction API, i.e., Exactly-Once Semantics (EOS), has been available since Kafka 0.11 (GA’ed many years ago). EOS makes building transactional workloads even easier as you don’t need to handle duplicates anymore.

Kafka supports atomic writes across multiple partitions through the transactions API. This allows a producer to send a batch of messages to multiple partitions. Either all messages in the batch are eventually visible to any consumer, or none are ever visible to consumers.

Kafka transactions work very differently than JMS transactions. But the goal is the same: Each consumer receives the produced event exactly once. Find more details in the blog post “Analytics vs. Transactions in Data Streaming with Apache Kafka“.
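A minimal sketch of the Kafka transactions API, assuming two hypothetical topics: both records become visible atomically to consumers that read with isolation.level=read_committed, or not at all.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionalWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("transactional.id", "payment-writer-1"); // enables idempotence + transactions

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both writes are committed atomically across the two topics/partitions.
                producer.send(new ProducerRecord<>("payments", "p-1", "debit 100"));
                producer.send(new ProducerRecord<>("balances", "acct-42", "-100"));
                producer.commitTransaction();
            } catch (KafkaException e) {
                // For a sketch: abort on error so consumers never see a partial result.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```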

4. Push vs. pull message consumption

TL;DR: JMS message brokers push messages to consumer applications. Kafka consumers pull messages providing true decoupling and backpressure handling for independent consumer applications.

Pushing messages seems to be the obvious choice for a real-time messaging system like JMS-based message brokers. However, push-based messaging has various drawbacks regarding decoupling and scalability.

JMS expects the broker to provide backpressure and implement a “pre-fetch” capability, but this is not mandatory. If used, the broker controls the backpressure; the consuming application cannot influence it.

With Kafka, the consumer controls the backpressure. Each Kafka consumer consumes events in real-time, batch, or only on demand – in the way the particular consumer supports and can handle the data stream. This is an enormous advantage for many inflexible and non-elastic environments.

So while JMS has some kind of backpressure (the producer stops if the queue is full), Kafka puts the backpressure control on the consumer side. There is also no way to scale a producer with JMS (as there are no partitions in a JMS queue or topic).

JMS consumers can be scaled, but then you lose guaranteed ordering. Guaranteed ordering in JMS message brokers only works via a single producer, single consumer, and transaction.
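
To make the pull model concrete, here is a minimal consumer sketch (topic and group names are assumed for illustration). The application decides when to call poll() and when to commit, so a slow consumer simply lags behind instead of overloading the broker or being overwhelmed itself:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PullConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "notification-service"); // hypothetical group
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // hypothetical topic
            while (true) {
                // The consumer decides when and how much to fetch - that is the backpressure control.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // a slow consumer simply polls less often; the broker is not affected
                }
                consumer.commitSync(); // commit offsets only after successful processing
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d key=%s value=%s%n", record.offset(), record.key(), record.value());
    }
}
```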

5. Simple JMS API vs. powerful and complex Kafka API

TL;DR: The JMS API provides simple operations to produce and consume messages. Apache Kafka has a more granular API that brings additional power and complexity.

JMS vendors hide all the cool stuff in the implementation under the spec. You only get the 5% exposed by the standard API (no control, built by the vendor) and have to build the rest yourself. On the other side, Kafka exposes everything, even though most developers only need 5% of it.

In summary, be aware that JMS message brokers are built to send messages from a data source to one or more data sinks. Kafka is a data streaming platform that provides many more capabilities, features, event patterns, and processing options; and a much larger scale. With that in mind, it is no surprise that the APIs are very different and have different complexity.

If your use case only requires sending a few messages per second from A to B, then JMS is the right choice and simple to use! If you need a streaming data hub at any scale, including data integration and data processing, that is what only Kafka provides.

Asynchronous request-reply vs. data in motion

One of the most common wishes of JMS developers is a request-response function in Kafka. Note that this design pattern in messaging systems is different from an RPC (remote procedure call) as you know it from legacy tools like CORBA or web service standards like SOAP/WSDL or HTTP. Request-reply in message brokers is an asynchronous communication that leverages a correlation ID.

Asynchronous messaging to get events from a producer (say, a mobile app) to a consumer (say, a database) is a very traditional workflow. No matter if you do fire-and-forget or request-reply, you put data at rest for further processing. JMS supports request-reply out-of-the-box. The API is very simple.

Data in motion with event streaming continuously processes data. The Kafka log is durable. The Kafka application maintains and queries the state in real-time or in batch. Data streaming is a paradigm shift for most developers and architects. The design patterns are very different. Don’t try to reimplement your JMS application within Kafka using the same pattern and API. That is likely to fail! That is an anti-pattern.

Request-reply is inefficient and can suffer from high latency depending on the use case. HTTP or, better, gRPC is suitable for some use cases. With Kafka, request-reply is replaced by the CQRS (Command Query Responsibility Segregation) pattern for streaming data. CQRS is not possible with the JMS API, since JMS provides no state capabilities and lacks event sourcing.

A Kafka example for the request-response pattern

CQRS is the better design pattern for many Kafka use cases. Nevertheless, the request-reply pattern can be implemented with Kafka, too. But differently. Trying to do it like in a JMS message broker (with temporary queues etc.) will ultimately kill the Kafka cluster (because it works differently).
The Spring project shows how you can do better. The Spring Kafka template libraries include a great example of the request-reply pattern built with Kafka.
Check out “org.springframework.kafka.requestreply.ReplyingKafkaTemplate“. It makes it easy to create request/reply applications using the Kafka API. The example is interesting since it implements the asynchronous request/reply, which is more complicated to write with, for example, the JMS API. Another nice DZone article talks about synchronous request/reply using Spring Kafka templates.
The Spring documentation for Kafka templates has a lot of details about the request/reply pattern for Kafka. So if you are using Spring, the request/reply pattern is pretty simple to implement with Kafka. If you are not using Spring, you can learn how to do request-reply with Kafka in your framework of choice.
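
As a rough sketch of what this looks like (based on the Spring for Apache Kafka documentation; the topic names, group id, and timeout are assumptions), the wiring and usage boil down to a reply container plus the template:

```java
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.kafka.listener.ConcurrentMessageListenerContainer;
import org.springframework.kafka.requestreply.ReplyingKafkaTemplate;
import org.springframework.kafka.requestreply.RequestReplyFuture;

@Configuration
public class RequestReplyConfig {

    // Listener container for the assumed "replies" topic.
    @Bean
    public ConcurrentMessageListenerContainer<String, String> repliesContainer(
            ConcurrentKafkaListenerContainerFactory<String, String> factory) {
        ConcurrentMessageListenerContainer<String, String> container = factory.createContainer("replies");
        container.getContainerProperties().setGroupId("request-reply-client"); // hypothetical group id
        return container;
    }

    @Bean
    public ReplyingKafkaTemplate<String, String, String> replyingTemplate(
            ProducerFactory<String, String> pf,
            ConcurrentMessageListenerContainer<String, String> repliesContainer) {
        return new ReplyingKafkaTemplate<>(pf, repliesContainer);
    }
}

// Usage in a service (the responding side answers via @KafkaListener + @SendTo):
// ProducerRecord<String, String> request = new ProducerRecord<>("requests", "order-4711-status?");
// RequestReplyFuture<String, String, String> future = replyingTemplate.sendAndReceive(request);
// ConsumerRecord<String, String> reply = future.get(10, TimeUnit.SECONDS);
```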

6. Storage for durability vs. true decoupling

TL;DR: JMS message brokers use a storage system to provide high availability. The storage system of Kafka is much more advanced and enables long-term storage, backpressure handling, and replayability of historical events.

Kafka storage is more than just the persistence feature you know from JMS

When I explain the Kafka storage system to experienced JMS developers, I almost always get the same response: “Our JMS message broker XYZ also has storage under the hood. I don’t see the benefit of using Kafka!”

JMS uses an ephemeral storage system, where messages are only persisted until they are processed. Long-term storage and replayability of messages are not a concept JMS was designed for.

The core Kafka principles of append-only logs, offsets, guaranteed ordering, retention time, compacted topics, and so on provide many additional benefits beyond the durability guarantees of a JMS. Backpressure handling, true decoupling between consumers, the replayability of historical events, and more are huge differentiators between JMS and Kafka.

Check the Kafka docs for a deep dive into the Kafka storage system. I don’t want to touch on how Tiered Storage for Kafka is changing the game even more by providing even better scalability and cost-efficient long-term storage within the Kafka log.
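
As a small illustration of replayability, any consumer can rewind to the beginning of the retained log or to a timestamp and reprocess historical events. This is a sketch with an assumed topic, partition, and time window:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-job"); // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign a partition so we can seek explicitly
            // (seekToBeginning() would replay everything still retained in the log).
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(List.of(partition));

            // Replay from a point in time, e.g. the last 24 hours.
            Map<TopicPartition, Long> query = new HashMap<>();
            query.put(partition, Instant.now().minus(Duration.ofHours(24)).toEpochMilli());
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
            OffsetAndTimestamp start = offsets.get(partition);
            if (start != null) {
                consumer.seek(partition, start.offset());
            }

            consumer.poll(Duration.ofSeconds(1)).forEach(record ->
                    System.out.printf("replayed offset=%d value=%s%n", record.offset(), record.value()));
        }
    }
}
```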

7. Server-side data-processing with JMS vs. decoupled continuous stream processing with Kafka

TL;DR: JMS message brokers provide simple server-side event processing, like filtering or routing based on the message content. Kafka brokers are dumb. The data processing is executed in decoupled applications/microservices.

Server-side JMS filtering and routing

Most JMS message brokers provide some features for server-side event processing. These features are handy for some workloads!

Just be careful that server-side processing usually comes with a cost. For instance:

  • JMS pre-filtering scalability issues: The broker has to handle so many things that this can kill the broker in a hidden fashion
  • JMS selectors (= routing) performance issues: They can kill 40-50% of performance

Again, sometimes the drawbacks are acceptable. In that case, server-side filtering and routing are great functionality.

Kafka – Dumb pipes and smart endpoints

Kafka intentionally does not provide server-side processing. The brokers are dumb. The processing happens at the smart endpoints. This is a very well-known design pattern: Dumb pipes and smart endpoints.

The drawback is that you need separate applications/microservices/data products to implement the logic. This is not a big issue in serverless environments (like using a ksqlDB process running in Confluent Cloud for data processing). It gets more complex in self-managed environments.

However, the massive benefit of this architecture is the true decoupling between applications/technologies/programming languages, separation of concerns between business units for building business logic and operations of infrastructure, and the much better scalability and elasticity.

Would I like to see a few server-side processing capabilities in Kafka, too? Yes, absolutely. Especially for small workloads, the performance and scalability impact should be acceptable! Though, the risk is that people misuse the features then. The future will show if Kafka will get there or not.
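
To illustrate “dumb pipes and smart endpoints”: the filtering and routing a JMS broker would execute server-side lives in a small, decoupled stream processing application instead. Here is a minimal Kafka Streams sketch with assumed topic names and a simplistic string-based filter:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FilterAndRouteApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-filter"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");

        // "Selector"-style filtering and routing, but in a decoupled application,
        // not inside the broker: high-priority orders go to a dedicated topic.
        orders.filter((key, value) -> value.contains("\"priority\":\"HIGH\""))
              .to("orders-high-priority");

        new KafkaStreams(builder.build(), props).start();
    }
}
```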

8. Complex operations vs. serverless cloud

TL;DR: Self-managed operations of scalable JMS message brokers or Kafka clusters are complex. Serverless offerings (should) take over the operations burden.

Operating a cluster is complex – no matter if JMS or Kafka

A basic JMS message broker is relatively easy to operate (including active/passive setups). However, this limits scalability and availability. The JMS API was designed to talk to a single broker or an active/passive pair for high availability. This concept covers a single application domain.

More than that (= clustering) is very complex with JMS message brokers. More advanced message broker clusters from commercial vendors are more powerful but much harder to operate.

Kafka is a powerful, distributed system. Therefore, operating a Kafka cluster is not easy by nature. Cloud-native tools like an operator for Kubernetes take over some burdens like rolling upgrades or handling fail-over.

Both JMS message brokers and Kafka clusters become more challenging to operate the more scale and reliability your SLAs demand. The JMS API is not specified for a central data hub (using a cluster). Kafka is intentionally built for the strategic enterprise architecture, not just for a single business application.

Fully managed serverless cloud to the rescue

As the JMS API was designed to talk to a single broker, it is hard to build a serverless cloud offering that provides scalability. Hence, in JMS cloud services, the consumer has to set up the routing and role-based access control to the specific brokers. Such a cloud offering is not serverless but cloud-washing! There is no other option, as the JMS API does not describe one big distributed cluster like Kafka does.

In Kafka, the situation is different. As Kafka is a scalable distributed system, cloud providers can build cloud-native serverless offerings. Building such a fully managed infrastructure is still super hard. Hence, evaluate the product, not just the marketing slogans!

Every Kafka cloud service is marketed as “fully managed” or “serverless” but most are NOT. Instead, most vendors just provision the infrastructure and let you operate the cluster and take over the support risk. On the other side, some fully managed Kafka offerings are super limited in functionality (like allowing a very limited number of partitions).

Some cloud vendors even exclude Kafka support from their Kafka cloud offerings. Insane, but true. Check the terms and conditions as part of your evaluation.

9. Java/JVM vs. any programming language

TL;DR: JMS focuses on the Java ecosystem for JVM programming languages. Kafka is independent of programming languages.

As the name JMS (= Java Message Service) says: JMS was officially written only for Java. Some broker vendors support their own APIs and clients. These are proprietary to that vendor. Almost all serious JMS projects I have seen in the past use Java code.

Apache Kafka also only provides a Java client. But vendors and the community provide other language bindings for almost every programming language, plus a REST API for HTTP communication for producing/consuming events to/from Kafka. For instance, check out the blog post “12 Programming Languages Walk into a Kafka Cluster” to see code examples in Java, Python, Go, .NET, Ruby, node.js, Groovy, etc.

The true decoupling of the Kafka backend enables very different client applications to speak with each other, no matter what programming languages one uses. This flexibility allows for building a proper domain-driven design (DDD) with a microservices architecture leveraging Kafka as the central nervous system.

10. Single JMS deployment vs. multi-region (including hybrid and multi-cloud) Kafka replication

TL;DR: The JMS API is a client specification for communication between the application and the broker. Kafka is a distributed system that enables various architectures for hybrid and multi-cloud use cases.

JMS is a client specification, while multi-data center replication is a broker function. I won’t go deep here and put it simply: JMS message brokers are not built for replication scenarios across regions, continents, or hybrid/multi-cloud environments.

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Various scenarios require multi-cluster Kafka solutions. Specific requirements and trade-offs need to be looked at.

Kafka technologies like MirrorMaker (open source) or Confluent Cluster Linking (commercial) enable use cases such as disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka deployments.

I covered hybrid cloud architectures in various other blog posts. “Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure” is a great example.

Slide deck and video recording

I created a slide deck and video recording if you prefer learning or sharing that kind of material instead of a blog post.


JMS and Kafka solve distinct problems!

The ten comparison criteria show that JMS and Kafka are very different things. While both overlap (e.g., messaging, real-time, mission-critical), they use different technical capabilities, features, and architectures to support additional use cases.

In short, use a JMS broker for simple and low-volume messaging from A to B. Kafka is usually a real-time data hub between many data sources and data sinks. Many people call it the central real-time nervous system of the enterprise architecture.

The data integration and data processing capabilities of Kafka at any scale with true decoupling and event replayability are the major differences from JMS-based MQ systems.

However, especially in the serverless cloud, don’t fear Kafka being too powerful (and complex). Serverless Kafka projects often start very cheaply at a very low volume, with no operations burden. Then it can scale with your growing business without the need to re-architect the application.

Understand the technical differences between a JMS-based message broker and data streaming powered by Apache Kafka. Evaluate both options to find the right tool for the problem. Within messaging or data streaming, do further detailed evaluations. Every message broker is different even though they all are JMS compliant. In the same way, all Kafka products and cloud services are different regarding features, support, and cost.

Do you use JMS-compliant message brokers? What are the use cases and limitations? When did you or do you plan to use Apache Kafka instead? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Comparison: JMS Message Queue vs. Apache Kafka appeared first on Kai Waehner.

]]>
Is Apache Kafka an iPaaS or is Event Streaming its own Software Category? https://www.kai-waehner.de/blog/2021/11/03/apache-kafka-cloud-native-ipaas-versus-mq-etl-esb-middleware/ Wed, 03 Nov 2021 06:57:54 +0000 https://www.kai-waehner.de/?p=3917 This post explores why Apache Kafka is the new black for integration projects, how Kafka fits into the discussion around cloud-native iPaaS solutions, and why event streaming is a new software category. A concrete real-world example shows the difference between event streaming and traditional integration platforms respectively iPaaS.

The post Is Apache Kafka an iPaaS or is Event Streaming its own Software Category? appeared first on Kai Waehner.

]]>
Enterprise integration is more challenging than ever before. The IT evolution requires the integration of more and more technologies. Applications are deployed across the edge, hybrid, and multi-cloud architectures. Traditional middleware such as MQ, ETL, ESB does not scale well enough or only processes data in batch instead of real-time. This post explores why Apache Kafka is the new black for integration projects, how Kafka fits into the discussion around cloud-native iPaaS solutions, and why event streaming is a new software category. A concrete real-world example shows the difference between event streaming and traditional integration platforms respectively iPaaS.

Apache Kafka as iPaaS or Event Streaming as new Software Category

What is iPaaS (Enterprise Integration Platform as a Service)?

iPaaS (Enterprise Integration Platform as a Service) is a term coined by Gartner. Here is the official Gartner definition: “Integration Platform as a Service (iPaaS) is a suite of cloud services enabling development, execution, and governance of integration flows connecting any combination of on-premises and cloud-based processes, services, applications, and data within individual or across multiple organizations.” The acronym eiPaaS (Enterprise Integration Platform as a Service) is used in some reports as a replacement for iPaaS.

The Gartner Magic Quadrant for iPaaS shows various vendors:

Gartner Magic Quadrant for iPaaS 2021

Three points stand out for me:

  • Many very different vendors provide a broad spectrum of integration solutions.
  • Many vendors (have to) list various products to provide an iPaaS; this means different technologies, codebases, support teams, etc.
  • No Kafka offering (like Confluent, Cloudera, Amazon MSK) is in the magic quadrant.

The last bullet point makes me wonder whether Kafka-based solutions should be considered an iPaaS or not.

Is Apache Kafka an iPaaS?

I don’t know. It depends on the definition of the term “iPaaS”. Yes, Kafka solutions fit into the iPaaS category, but that is just one piece of the event streaming success story.

Kafka is an event streaming platform. Use cases differ from traditional middleware like MQ, ETL, ESB, or iPaaS. Check out real-world Kafka deployments across industries if you don’t know the use cases yet.

Kafka does not directly compete with ETL tools like Talend or Informatica, MQ frameworks like IBM MQ or RabbitMQ, API Management platforms like MuleSoft, and cloud-based iPaaS like Boomi or TIBCO. At least not if people understand the differences between Kafka and traditional middleware. For that reason, many people (including me) think that Event Streaming should have its own Magic Quadrant.

Having said this, all these very different vendors are in the iPaaS Magic Quadrant. So, should Kafka respectively its vendors be in here? I think so because I have seen hundreds of customers leverage the Kafka ecosystem as a cloud-native, scalable, event-driven integration platform, often in hybrid and multi-cloud architectures. And that’s an iPaaS.

What’s different with Kafka as an Integration Platform?

If you are new to this discussion, check out the article “Apache Kafka vs. MQ, ETL, ESB” or the related slides and video. Here is my summary on why Kafka is unique for integration scenarios and therefore adopted everywhere:

Why Kafka as iPaaS instead of Traditional Middleware like MQ ETL ESB

A unique selling point for event streaming is the ability to leverage a single platform. In contrast, other iPaaS solutions require different products (including codebases, support teams, integration between the additional tech, etc.).

Kafka as Cloud-native and Serverless iPaaS

Fundamental differences exist between modern iPaaS solutions and traditional legacy middleware; this includes the software architecture, the scalability and operations of the platform, and data processing capabilities. On a high level, a “Kafka iPaaS” requires the following characteristics:

  • Cloud-native Infrastructure: Elasticity is vital for success in the modern IT world. The ability to scale up and down (= expanding and shrinking Kafka clusters) is mandatory. This capability enables starting with a small pilot project and scaling up or handling load spikes (like Christmas business in retail).
  • Automated Operations: Truly serverless SaaS should always be the preferred option if the software runs in a public cloud. Orchestration tools like Kubernetes and related tools (like a Kafka operator for Kubernetes) are the next best option in a self-managed data center or at the edge outside a data center.
  • Complete Platform: An integration platform requires real-time messaging and storage for backpressure handling and long-running processes. Data integration and continuous data processing are mandatory, too. Hence, a “Kafka iPaaS” is only possible if you have access to various pre-built Kafka-native connectors to open standards, legacy systems, and modern SaaS interfaces. Otherwise, Kafka needs to be combined with another middleware like an iPaaS or an ETL tool like Apache NiFi.
  • Single Solution: It sounds trivial, but most other middleware solutions use several codebases and products under the hood. Just look at stacks from traditional players such as IBM and Oracle or open-source-driven Cloudera. The complex software stack makes it much harder to provide end-to-end integration, 24/7 operations, and mission-critical support. Don’t get me wrong: Kafka-native solutions like Confluent Cloud also include different products with additional pricing (like a fully-managed connector or data governance add-ons), but all run on a single Kafka-native platform.

From this perspective, some Kafka solutions are modern, cloud-native, scalable iPaaS offerings. Having said this, would I be happy if you considered some Kafka solutions as just an iPaaS on your technology radar? No, not really!

Event Streaming is its own Software Category!

While some Kafka solutions can be used as iPaaS, this is only one of many usage scenarios for event streaming. However, as explained above, Kafka-based solutions differ greatly from other iPaaS solutions in the Gartner Magic Quadrant. Hence, event streaming deserves its own software category.

If you still wonder what I mean, check out event streaming use cases across industries to understand the difference between Kafka and traditional iPaaS, MQ, ETL, ESB, API tools. Here is a relatively old but still fantastic diagram that summarizes the broad spectrum of use cases for event streaming:

Use Cases for Apache Kafka and Event Streaming

TL;DR: Kafka provides capabilities for various use cases, not just integration scenarios. Many new innovative business applications are built with Kafka. It is not just an integration platform but a unique suite of capabilities for end-to-end data processing in real-time at scale.

New concepts like Data Mesh also prove the evolution. The basic principles are not unique: Domain-driven design, microservices, true decoupling of services, but now with much more focus on data as a product. The latter means it is turning from a cost center into a profit center and innovative new services. Event streaming is a critical component of a data mesh-powered enterprise architecture as real-time almost always beats slow data across use cases.

The Non-Existing Event Streaming Gartner Magic Quadrant or Forrester Wave

Unfortunately, a Gartner Magic Quadrant or Forrester Wave for Event Streaming does NOT exist today. While some event streaming solutions fit into some of these reports (like the Gartner iPaaS Magic Quadrant or the Forrester Wave for Streaming Analytics), it is still an apples-to-oranges comparison.

Event Streaming is its own category. Many software vendors built their entire business around this category. Confluent is the leader in this space – note that I am biased as a Confluent employee, but I guess there is no question around this statement 🙂 Many other companies emerge around Kafka, around the Kafka protocol, or around competitive event streaming offerings such as Amazon Kinesis or Apache Pulsar.

The following Event Streaming Landscape 2021 summarizes the current status:

Event Streaming Landscape with Apache Kafka Pulsar Kinesis Event Hubs

I hope to see a Gartner Magic Quadrant for Event Streaming and a Forrester Wave for Event Streaming soon, too.

Open Source vs. Partially Managed vs. Fully-Managed Event Streaming

One more aspect to point out: You might have noticed that I said, “some event streaming solutions can be considered an iPaaS”. The word “some” is a crucial detail. Just providing an open-source framework is not enough.

iPaaS requires a complete offering, ideally as fully-managed services. Many vendors for event streaming use Kafka, Pulsar, or another framework but do NOT provide a complete offering with operations tooling, commercial 24/7 support, user interfaces, etc. The following resources should help you learn more about the event streaming landscape in 2021:

TL;DR: Evaluate the various offerings. A lot of capabilities are just marketing! Many “fully-managed services” are only partially managed instead of serverless and come with very limited SLAs and support. Some other offerings provide plenty of cool features but are more an alpha version and overselling than a mature, battle-tested solution. A counterexample is the complexity in T-Mobile’s report about upgrading Amazon MSK. This shows the difference between “promoting and selling a fully managed service” and the “not at all fully-managed reality”. A truly fully-managed offering does NOT require the end user to upgrade the infrastructure.

Kafka as Event Streaming iPaaS at Deutsche Bahn (German Railway)

Let’s now look at a practicable example to understand why a traditional iPaaS cannot help in use cases that require event streaming and why this combination of capabilities in a single technology sets a new software category.

This section explores a real-world use case with the journey of Deutsche Bahn (German Railway) providing a great customer experience to their customers. This example uses Event Streaming as iPaaS (regarding my definition of these terms).

Use Case: Improve Customer Experience with Real-time Notifications

The use case sounds very simple: Improve the customer experience by providing real-time information and notifications to customers across various channels like websites, smartphones, and displays at train stations.

Delays and cancellations happen in a complex rail network like the one in Germany. Frequent travelers accept this downside. Nobody can do anything about lousy weather, suicides on the tracks, or technical defects.

However, customers at least expect real-time information and notifications so that they can wait in a coffee shop or lounge instead of freezing at the station platform for minutes or even hours. The reality at Deutsche Bahn was a different sequence of notifications: 10min delay, 20min delay, 30min delay, train canceled – take the next train.

The goal of the project Reisendeninformation (= traveler information system) was to improve the day-to-day experience of millions of travelers across Germany by delivering up-to-date, accurate, and consistent travel information in any location.

Initial Project: A mess of different Integration Technologies

Again, the use case sounds simple. Let’s send real-time notifications to customers if a train is delayed or canceled. Every magic black iPaaS box can do this:

Deutsche Bahn Reisendeninformation Kafka

Is this true or just the marketing of all the integration vendors? Can each black box integrate all the different systems to correlate events in real-time?

Deutsche Bahn started with an open-source messaging platform to send real-time notifications. Unfortunately, the team quickly found out that not every piece of information arrived in real-time. So, a caching and storage system was added to the project to handle the late-arriving information from some legacy batch or file-based systems. Now, the data from the legacy systems needed to be integrated. Hence, an integration framework was installed to connect to files, databases, and other applications. Next, the data needed to be processed, correlating real-time and non-real-time data from various systems. A stream processing engine can do this.

The pilot project included several different frameworks. A few skeptical questions came up:

  • How to scale this combination of very different technologies?
  • How to get end-to-end support across various platforms?
  • Is this cost-efficient?
  • What is the time-to-market for new features?

Deutsche Bahn re-evaluated their tech stack and found out that Apache Kafka provides all the required capabilities out-of-the-box within one platform.

The Migration to Cloud-native Kafka

The team at Deutsche Bahn re-architected their pilot project. Here is the new solution leveraging Kafka as the single point of truth between various systems, technologies, and communication paradigms:

A traditional iPaaS can implement this scenario. But it requires several codebases, technologies, and clusters, even if you select a single software vendor! Some iPaaS might even do well in the beginning but struggle to scale up. Only event streaming allows you to start small and scale up without the need to re-architect the infrastructure.

Today, the project is in production. Check your DB Navigator mobile app to get real-time updates about all trains in Germany. Living in Germany, I appreciate this new service to have a much better traveling experience.

Learn more about the project from the Deutsche Bahn team. They gave several public talks at different conferences and wrote on the Confluent Blog about their Kafka journey. Though, the journey did not end here 🙂 As described in their blog post, Deutsche Bahn is now evaluating the migration from a self-managed Confluent Platform deployment in the public cloud to the fully-managed, truly serverless Confluent Cloud offering to reduce TCO and improve time-to-market.

Complete Platform: Kafka is more than just Messaging

A project like the one described above is only possible with a complete platform. Many people still think about Kafka as an ingestion layer into a data lake or data warehouse, as this was one of the first prominent Kafka use cases. Data ingestion is still an excellent use case today. Many projects already use more than just the core of Kafka to implement this. Kafka Connect provides out-of-the-box connectivity between Kafka and the data store. If you are in the public cloud, you even get this integration in a fully-managed, serverless manner, whether you need to integrate with a 1st party cloud service like Amazon S3, Google Cloud BigQuery, or Azure Cosmos DB, or with 3rd party SaaS like MongoDB Atlas, Snowflake, or Databricks.

Continuous Kafka-native stream processing is the next level of a complete platform. For instance, Deutsche Bahn leverages Kafka Streams a lot for their data correlations in real-time at scale. Other companies use ksqlDB as a fully-managed function in Confluent Cloud. The enormous benefit is that you don’t need yet another platform or service for streaming analytics. A complete platform makes the architecture more cost-efficient, and end-to-end integration is easier from an SLA and support perspective.

A complete platform requires many additional services “on top”, like visualization of data flows, data governance, role-based access control, audit logs, self-service capabilities, user interfaces, visual coding/code generation tools, etc. Visual coding is the point where traditional middleware and iPaaS tools are stronger today than event streaming offerings.

3rd Party integration via Open API and non-Kafka Tools

So far, you learned why event streaming is its software category and how Deutsche Bahn is an excellent example to show this. However, event streaming is NOT the silver bullet for every problem! When exploring if Kafka and MQ/ETL/ESB are friends, enemies, or frenemies, I already pointed this out. For instance, MQ or an ESB can complement event streaming in an integration project, depending on your project requirements.

Let’s go back to Deutsche Bahn. As mentioned, their real-time traveler information platform is live, with Kafka as the single point of truth. Recently, Deutsche Bahn announced a partnership with Google and 3rd Party Integration with Google Maps:

Deutsche Bahn Google Maps Open API Integration

Real-time Schedule Updates to 3rd Party Google Maps API

The integration provides real-time train schedule updates to Google Maps users:

Google Maps Deutsche Bahn Real-time Integration

The integration allows to reach new people and expand the business. Users can buy train tickets via one click from the Google Maps page.

I don’t know what technology or product this 3rd party integration uses. The heart of Deutsche Bahn’s real-time infrastructure enables new innovative business models and collaboration with partners.

Likely, this integration between Deutsche Bahn and Google Maps does not directly use the Kafka protocol (even though this is done sometimes, for instance, see Here Technologies Open API for their mapping service).

Event Streaming is complementary to other services. In this example, the project team might have used an API Management platform to provide internal APIs to external consumers, including access control, billing, and reporting. The article “Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies?” explores the relationship between event streaming and API Management.

Event Streaming Everywhere – Cloud, Data Center, Edge

Real-time beats slow data everywhere. This is a new software category because we don’t just send events into another database via a messaging system. Instead, we use and correlate data from different data sources in real-time. That’s the real added value and game changer in innovative projects.

Hence, event streaming must be possible in every location. While cloud-first is a viable strategy for many IT projects, edge and hybrid scenarios are and will continue to be very relevant.

Think about a project related to the Deutsche Bahn example above (but being completely fictive): A hybrid architecture with real-time applications in the cloud and edge computing within the trains:

Hybrid Architecture with Kafka at the edge and in the cloud

I covered this in other articles, including “Edge Use Cases for Apache Kafka Across Industries“. TL;DR: Leverage the open architecture of event streaming for real-time data processing everywhere, including multi-cloud, data centers, and edge deployments (i.e., outside a data center). The enterprise architecture does not need various technologies and products to implement real-time data processing and integration with separate iPaaS, ETL tools, ESBs, MQ systems.

However, once again, it is crucial to understand how event streaming fits into the enterprise architecture. For instance, Kafka is often combined with IoT technologies such as MQTT for the last mile integration with IoT devices in these edge scenarios.

Slide Deck and Video for “Apache Kafka vs. Cloud-native iPaaS Middleware”

Here are the related slides for this topic:

And the video recording of this presentation:

Kafka is a cloud-native iPaaS, and much more!

Kafka is the new black for integration projects across industries because of its unique combination of capabilities. Some Kafka solutions are part of the iPaaS category, with trade-offs like any other integration platform.

However, event streaming is its own software category. Hence, iPaaS is just one usage of Kafka or other similar event streaming platforms. Real-time data beats slow data. For that reason, event streaming is the backbone for many projects to process data in motion (but also to integrate with other systems that store data at rest for reporting, model training, and other use cases).

How do you leverage event streaming and Kafka as an integration platform? What projects did you already work on or are in the planning? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Is Apache Kafka an iPaaS or is Event Streaming its own Software Category? appeared first on Kai Waehner.

]]>
Kappa Architecture is Mainstream Replacing Lambda https://www.kai-waehner.de/blog/2021/09/23/real-time-kappa-architecture-mainstream-replacing-batch-lambda/ Thu, 23 Sep 2021 06:39:43 +0000 https://www.kai-waehner.de/?p=3638 Real-time data beats slow data. That's true for almost every use case. Nevertheless, enterprise architects build new infrastructures with the Lambda architecture that includes separate batch and real-time layers. This blog post explores why a single real-time pipeline, called Kappa architecture, is the better fit. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter explore the benefits of Kappa but also show how batch processing fits into this discussion positively without the need for Lambda.

The post Kappa Architecture is Mainstream Replacing Lambda appeared first on Kai Waehner.

]]>
Real-time data beats slow data. That’s true for almost every use case. Nevertheless, enterprise architects build new infrastructures with the Lambda architecture that includes separate batch and real-time layers. This blog post explores why a single real-time pipeline, called Kappa architecture, is the better fit. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter explore the benefits of Kappa but also show how batch processing fits into this discussion positively without the need for Lambda.

This post is heavily inspired by Jay Kreps’ article “Questioning the Lambda Architecture” from 2014 (!) and maps his thoughts to the real-world situation in 2021. Today, almost every business solution, data storage and analytics provider, and business application leverages event streaming and asynchronous, truly decoupled event-based communication paradigms for data processing. For that reason, many move from Lambda to Kappa architectures.

Kappa Architecture vs Lambda Architecture for Apache Kafka Pulsar Data Lakes

A Modern Enterprise Architecture

A modern enterprise architecture offers cloud-native characteristics: Flexibility, elasticity, automation, true decoupling between different applications, and real-time capabilities (where needed).

Microservices, Data Mesh, and Domain-driven Design for True Decoupling

Let’s quickly explore the buzzwords to understand how most people build modern enterprise architectures today:

  • Domain-driven Design (DDD) enforces strict boundaries between service communication and a decentralized application landscape.
  • Microservices enable building flexible, decoupled applications with different programming languages and communication paradigms.
  • Data Mesh allows architecting services around data. Data is the product in a data mesh. Self-service capabilities and federation enable business units to focus on their business problem.

My blog post “Microservices, Apache Kafka, and Domain-Driven Design” explored this discussion in more detail (even though the buzzword “data mesh” did not exist at the time of writing). TL;DR: An event-driven streaming infrastructure such as Apache Kafka uniquely enables proper decoupling and real-time data processing, contrary to traditional web service / REST / HTTP-based microservice architectures and traditional messaging systems (MQ, ESB). The blog post about Kafka vs. MQ/ETL/ESB might also be helpful to learn more.

Real-time Data Beats Slow Data, but NOT Always!

Think about your industry, business units, problems you solve, and innovative new applications you build. Real-time data beats slow data. This statement is almost always true. Either to increase revenue, reduce cost, reduce risk, or improve the customer experience.

Data at Rest means to store data in a database, data warehouse, or data lake. This way, data is processed too late in many use cases – even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) work very well with this approach. But real-time beats batch in almost all other use cases.

I analyzed the relation between data at rest and data in motion and how this point of view regarding the enterprise architecture changed with the cloud-first strategy of most companies in the blog post “Serverless Kafka in a Cloud-native Data Lake Architecture“.

The de facto standard for real-time data processing is Apache Kafka. Hence, the covered real-world examples in this post use Kafka.

With this context in mind, let’s revisit Lambda architecture.

The Lambda Architecture

Nathan Marz coined the Lambda architecture: A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods.

Lambda architecture includes batch, speed, and serving layers. This approach enables processing data in real-time but also easy re-processing of batched static datasets. The problem with out-of-order data is also solved.

This approach attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data while simultaneously using real-time stream processing to provide views of online data. The rise of lambda architecture is correlated with the steady growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.

Two Options for a Lambda Architecture

The web discusses two different approaches to Lambda architecture.

The initial approach provided a unified serving layer. A unified serving layer joins the real-time and batch layer:

Lambda Architecture with Unified Serving Layer

Another alternative is two separate serving layers. One layer is for real-time consumption, the other one for batch consumption:

Lambda Architecture with Two Separate Serving Layers

I see the second option much more in the field. In the end, both have the same concept of building two separate layers for data ingestion and processing.

Issues with the Lambda Architecture

The Hadoop vendors heavily pitched the Lambda architecture to deploy and operate a super complex infrastructure with many big data frameworks. Today, I only hear the pain of enterprises complaining about this complexity and the missing business value. No surprise that most of these vendors did not survive or have a very confusing and unclear future product strategy.

Disney has summarized the concerns with the Lambda architecture on one slide:

Disney Concerns with the Lambda Architecture

The batch and streaming sides each require a different codebase that must be maintained and kept in sync so that processed data produces the same result from both paths. Additionally, with batch, speed, and serving layers, everything needs to be processed (at least) twice. That increases the cost and operations efforts of storage, network, and compute.

Jay Kreps had similar arguments when he proposed the Kappa architecture in 2014 (!), already: “The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be”.

So, what’s different in Kappa architecture?

The Kappa Architecture

The Kappa architecture is a software architecture that is event-based and able to handle all data at any scale in real-time for transactional AND analytical workloads.

The central premise behind the Kappa architecture is that you can perform both real-time and batch processing with a single technology stack. The heart of the infrastructure is a streaming architecture. First, the event streaming platform stores incoming data in its log. From there, a stream processing engine processes the data continuously in real-time, or the data is ingested into any other analytics database or business application via any communication paradigm and speed, including real-time, near real-time, batch, and request-response.

Kappa Architecture with one Pipeline for Real Time and Batch

Unlike the Lambda Architecture, in this approach, you only do re-processing when your processing code changes, and you need to recompute your results. And, of course, the job doing the re-computation is just an improved version of the same code, running on the same framework, taking the same input data.

Benefits of the Kappa architecture

The Kappa architecture has several benefits:

  • Handle all the use cases (streaming, batch, RPC) with a single architecture
  • One codebase that is always in sync
  • One set of infrastructure and technology
  • The heart of the infrastructure is real-time, scalable, and reliable
  • Improved data quality with guaranteed ordering and no mismatches
  • No need to re-architect for new use cases

TL;DR: The Kappa architecture leverages a single source of truth focusing on simplicity in the enterprise architecture. People can develop, test, debug, and operate their systems on a single processing framework for BOTH real-time and batch systems. To be clear: The leading system for some applications can still be another system. For instance, the leading system for ERP is still SAP, while the source of truth for consumers is the Kafka log.

Kappa for Transactional and Analytical Workloads

Contrary to a data lake, event-streaming-powered Kappa architectures enable transactional workloads in addition to analytical workloads.

Kappa Architecture for Transactional and Analytical Workloads powered by Apache Kafka

For instance, Kafka and its ecosystem support exactly-once semantics so that you can build your next payment platform for aftersales or customer interactions with mission-critical SLAs, low latency, and fault-tolerance built-in. Independently, the data science team consumes historical events for finding insights in a batch process using machine learning.

Kappa is NOT a free lunch!

The Kappa architecture sounds too good to be true? Well, a basic rule of thumb is still valid: Use the right tool for the job!

Event streaming is a paradigm shift. A big bang migration will not work. Here are a few lessons learned from Disney about introducing the Kappa architecture:

Challenges of the Kappa Architecture

As a big bang does not work, a good way is to rethink data and databases. Martin Kleppmann called it “turning the database inside out“. Let’s look at this approach and how it helps to leverage the Kappa architecture in combination with other databases and analytics platforms.

The Inside and Outside Perspective to Solve the Kappa Challenges

Turning the database inside out is a new way of thinking about the enterprise architecture. The heart of the infrastructure is event-based and real-time. Where needed, downstream applications consume the events in batch or store them in additional storage and analytics tools with their own concepts and paradigms.

The inner perspective of Kappa: The central nervous system

Think of an event streaming platform like Kafka:

  • Data availability/retention: Compacted Topics, Tiered Storage
  • Data consistency and fault-tolerance: Exactly-once semantics, Multi-Region Clusters, Cluster Linking
  • Handling late-arriving data: Event time and processing time are different. State management in the streaming application, proper data sinks, replay with guaranteed ordering, and timestamps.
  • Data reprocessing and backfill: Dynamic clusters (ideally a serverless cloud offering or at least a cloud-native self-managed cluster), stateful applications (Kafka Streams, ksqlDB, external stream processing framework like Apache Flink).
  • Data integration: Kafka Connect for sources and sinks, clients for any language, REST Proxy (real-time, but also batch and RPC)

An event streaming platform provides many characteristics to build a Kappa architecture. However, it is not a silver bullet. Additional databases and analytics tools are mandatory for some use cases. For instance, Kafka does not scale well for dynamic bursty workloads. Complex SQL queries and joins also need another database.

The outer perspective of Kappa: The applications and data stores

Think of any business application, data storage, or analytics platform:

  • Data Consumption: Consume the data from the central nervous system. Consume the data at your speed (real-time, near real-time, batch, RPC).
  • Data Storage: Store the data in your storage as long as you need it (in-memory, short-term storage, long-term storage).
  • Data Processing: Process the data for your use case (real-time notification, indexing into your query engine, a batch process for reporting or model training, etc.). Complex processing is not doable in the event streaming platform (e.g., complex joins, intensive compute with batch algorithms).

The discussion “Can Apache Kafka be used as a database?” is also helpful to understand both perspectives and the trade-offs of using Kafka as data storage.

Cost-Efficient and Scalable Kappa Architectures

A huge problem of realizing the Kappa architecture in the real world was storing vast volumes of data in an event streaming platform. This approach was costly and had scalability issues at the Terabyte or Petabyte scale. On the contrary, data lakes were designed for vast volumes from the beginning. Hadoop and HDFS were used on-premise in the early phases. The public cloud enabled the migration to fully-managed object storage such as AWS S3 or Google Cloud Storage to make data lakes even more scalable and cost-efficient for big data.

One approach is to reduce the data stored in the event streaming platform. Infinite retention leveraging log compaction is a viable approach to reduce the storage size. However, compacted topics shrink data sets and only store the latest value for each message key. Hence, this workaround is not applicable for every use case.
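
For completeness, log compaction is just a per-topic configuration. The following is a hedged sketch using the Java AdminClient; the topic name, partition count, and replication factor are placeholders:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A compacted topic keeps only the latest value per key, which bounds the
            // storage size but drops older values for the same key.
            NewTopic customerState = new NewTopic("customer-state", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(customerState)).all().get();
        }
    }
}
```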

Another workaround I have seen a lot in practice is building a “streaming data lake” with Kafka as a streaming layer and object storage for long-term storage. The bi-directional integration was built with Kafka Connect and sink and source connectors. This was actually the main reason why Confluent built an S3 Source connector for Kafka Connect in addition to its heavily used S3 Sink connector.

Tiered Storage for Event Streaming

The good news is that streaming platforms evolved. Tiered Storage allows decoupling storage from computing in event streaming platforms such as Kafka or Pulsar.

Tiered Storage is a game-changer for Kappa architectures. It manages the storage without a performance impact on real-time consumers. Additionally, this enables a very cost-efficient and elastic Kappa architecture without the need for a traditional data lake. Uber talks about the motivation and benefits of Tiered Storage for Kafka (KIP-405) in a recent Kafka Summit talk.

Kappa architectures are very flexible regarding the underlying storage technology. While Uber uses Hadoop’s HDFS as storage, Confluent went another way: Confluent Tiered Storage for Kafka is based on the S3 interface to leverage object storage and works for both public cloud provider object stores such as AWS S3 or GCS, and on-premise object stores such as PureStorage or MinIO for Kubernetes.

Confluent Tiered Storage for Kafka for Digital Forensics of Historical Data

In other words: Tiered Storage for Kafka can leverage the same modern data storage as modern cloud data lakes (or as AWS calls it today: Lake House). Hence, the Kappa architecture provides the best of both worlds: Real-time data processing plus cost-efficient and scalable long-term storage for replaying historical data.

Real-World Examples for a Kappa Architecture

The above was a lot of theory. Let’s recap: Real-time data beats slow data in most use cases. But batch processing is still needed and will not go away.

Let’s now look at a few real-world examples of Kappa architectures at Uber, Shopify, and Disney.

Kappa at Uber for Trillions of Messages and Petabytes per Day

Uber is a very prominent tech giant. They talk a lot about their software architectures and deployments regularly in public. Uber is one of the most significant Kafka users in the world. In the meantime, they process over 4 trillion messages and 3 petabytes per day.

As a perfect fit for this blog post, Uber presented at a recent Kafka Summit about their Kappa architecture:

Kappa instead of Lambda Architecture with Kafka at Uber

As you can see, Uber’s architecture evolved precisely to what I described in the above sections. The central nervous system is a Kafka-based real-time infrastructure. Uber still has batch pipelines. Uber also provides APIs (e.g., to mobile apps). And – no surprise – they also have traditional SQL and NoSQL databases, business intelligence reporting tools, dashboards, and much more.
Uber’s architecture shows the massive benefits of Kappa: The heart of the infrastructure is real-time, scalable, fault-tolerant, and reliable. A single pipeline for everything. No need for a Lambda architecture! Kappa enables transactional and analytical workloads. Each microservice in the data mesh can use its technology and communication paradigm for each application.

Kappa at Shopify for Stateless and Stateful Data Streaming

Shopify presented their Kappa architecture in a recent Kafka Summit talk: “It’s Time To Stop Using Lambda Architecture”. The session covered the concerns of Kappa architecture and how Shopify solved them with different building blocks. The three key components are the log (Kafka), the streaming framework (Kafka Streams and Apache Flink), and the data sinks (any real-time consumer or data store).

Here is one example of a stateful Kappa scenario at Shopify:

Kappa Architecture with Kafka at Shopify

 

Shopify discussed the core building blocks of their Kappa architecture:

The Log (Kafka)

  • Durability with Topic Compaction and Tiered Storage
  • Consistency via Exactly-Once Semantics (EOS)
  • Data Integration via Kafka Connect
  • Elasticity via dynamic Kafka clusters

Streaming Framework (Kafka Streams / Flink)

  • Reliability and scalability
  • Fault tolerance
  • State management

Data Sinks

  • Real-time consumers
  • Update/upsert for simplified design, for instance, RDBMS, NoSQL, Compacted Kafka Topics
  • Append-only storage (i.e., no update), for instance, regular Kafka Topics, Time Series databases

Kappa at Disney as Single Source of Truth

Disney’s Kafka Summit talk “Big Data Kappa” is very inspiring. It probably includes the most lessons learned and trade-offs of a real-world Kappa deployment. I encourage you to watch the on-demand video—many insights and guidance for building your own Kappa Architecture.

All data writes at Disney go through Kafka as the source of truth. The following screenshot shows the concept. The green box is the Kafka cluster, including Tiered Storage as the single source of truth. Any application consumes the data from Kafka for further processing and optional external storage.

Kappa Architecture with Apache Kafka at Disney

Disney solves the following problems with its Kafka-based Kappa architecture:

  • Keep it simple (KISS)
  • Reduce Code Duplication
  • Decreasing End To End Latency
  • Full System Immutability
  • Avoiding Data Discrepancies
  • Ability to move laterally between storage systems
  • Everyone wants their answers faster

Kappa at Twitter for Migration from Lambda Architecture

Twitter processes approximately 400 billion events in real-time and generates petabyte (PB) scale data every day. The on-premise architecture with Hadoop and Kafka using the Lambda architecture was not efficient enough:

Old Twitter Lambda Architecture with Hadoop and Kafka

Therefore, Twitter migrated to the cloud on GCP with Kafka using the Kappa architecture:

New Twitter Kappa Architecture with Kafka and GCP Dataflow

With the new hybrid architecture on both Twitter Data Center and Google Cloud Platform, they “are able to process billions of events in real-time and achieve low latency, high accuracy, stability, architecture simplicity, and reduced operation cost for engineers” as Twitter quotes in their detailed blog post about their Lambda to Kappa migration.

Example Project: Kappa for Machine Learning including Model Training, Scoring, and Monitoring

After real-world examples from Uber, Shopify, and Disney, I want to share one more practical code example: A technical demo connecting to 100,000 IoT devices to do streaming machine learning.

The use case is about integrating tens or hundreds of thousands of IoT devices and processing the data in real-time. The demo use case is predictive maintenance (i.e., anomaly detection) in a connected car infrastructure to predict motor engine failures:

Kappa Architecture with Apache Kafka MQTT Kubernetes and Tensorflow for Streaming Machine Learning

The implemented Kappa architecture provides a single real-time infrastructure for various very different use cases and processing paradigms:

  • Real-time data ingestion at high throughput from IoT devices via an MQTT proxy: Integration with millions of interfaces, in this case, simulated vehicles.
  • Batch processing for model training: The TensorFlow Python application from the data scientist consumes historical data from the Kafka log to train analytic models.
  • Real-time stream processing for model scoring: The Java-based streaming application is powered by Kafka Streams / ksqlDB and operated by the production engineer with mission-critical SLAs and low latency.
  • Near real-time ingestion into the digital twin for analytics: Kafka Connect ingests the data into different databases and applications, in this case, a MongoDB Atlas cloud service.
  • Synchronous request-response / RPC communication for mobile app integration and transactional workloads: The Confluent REST Proxy (or any other web / mobile proxy) sends real-time alerts to humans.

The whole infrastructure is cloud-native. It runs on Kubernetes and can be deployed in a data center or on any hyperscaler. The following blog post explains the demo in detail: IoT Live Demo – 100.000 Connected Cars with Kubernetes, Kafka, MQTT, TensorFlow. The code is available in the GitHub project.
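
To illustrate the stream-processing part of the demo, here is a simplified Python sketch of the scoring loop: consume sensor events from Kafka, score each event, and produce alerts to another topic. The topic names, threshold, and the score_event() stand-in are hypothetical placeholders for illustration; the actual demo implements model scoring with Kafka Streams / ksqlDB in Java and TensorFlow.

```python
import json
from confluent_kafka import Consumer, Producer

BOOTSTRAP = "localhost:9092"  # placeholder broker address

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "anomaly-scoring",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": BOOTSTRAP})
consumer.subscribe(["car-sensor-events"])  # placeholder topic name


def score_event(sensor_values):
    # Stand-in for the trained model in the demo; returns a fake anomaly score.
    return max(sensor_values) / 100.0


while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    score = score_event(event["sensor_values"])  # score each event as it arrives
    if score > 0.9:  # threshold chosen arbitrarily for this sketch
        producer.produce("engine-anomaly-alerts", key=str(event["car_id"]),
                         value=json.dumps({"car_id": event["car_id"], "score": score}))
    producer.poll(0)  # serve delivery callbacks
```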

Kappa under the Hood of Next Generation Software Products and SaaS Offerings

Software companies have the same challenges as end-users like Uber, Shopify, or Disney. Hence, no surprise that software vendors move to Kappa architectures and real-time capabilities as the heart of their infrastructures.

This section shows a few examples of software vendors that moved to event-based architectures, event streaming, and asynchronous external interfaces within their next-generation software offerings.

Once again: this does NOT mean that everything within these products is real-time or event-based. But where the related components provide real-time capabilities, you can expose a real-time interface to internal or external consumers.

Business Solutions (Salesforce, SAP, Slack, et al.)

Business solutions provide customer interactions, logistics, manufacturing, internal communication, and many other use cases. No surprise that real-time data beats slow data. For this reason, most modern business solutions moved from less flexible and less scalable communication paradigms to event-based interfaces. Instead of using files, web service APIs, or manual changes, communication happens via event-driven APIs internally and externally.

A few examples across different business solutions:

  • Salesforce: The internal “platform events” architecture heavily relies on Apache Kafka for decoupled real-time data processing at scale. External APIs like the integration with Salesforce’s proprietary sObject datatype moved from SOAP and REST web services to Streaming API PushTopics, Enterprise Messaging Platform Events, and Change Data Capture Events.
  • SAP: Instead of relying on its legacy proprietary interfaces such as BAPI and iDoc, SAP moved to event-based APIs in its next-generation SAP S/4HANA ERP platform. The blog “SAP integration options for Apache Kafka” shows the mess of numerous legacy interfaces and alternative modern event-based integration options.
  • Slack: Being a messaging platform by nature, it is no surprise that the heart of their core backend infrastructure leverages event streaming. Slack’s data streaming team focuses on providing Kafka as a Service for the company at the scale of trillions of messages per day across dozens of clusters in Amazon data centers. For the front-end, Slack’s current architecture leverages a service mesh built with Envoy and WebSockets.

Databases, Data Warehouses, Log Analytics

Data storage and analytics vendors are traditionally batch technologies for long-term storage, dashboards, reporting, and interactive queries. The heart of most solutions is still a batch system for analytics workloads. That’s the core business of these products and services.

Nevertheless, almost all of these vendors went into (near) real-time business due to customer demand. Hence, event-based integration capabilities and near real-time ingestion, processing, and analytics are becoming more prevalent. Some examples:

  • MongoDB: “Change Streams” allow applications to access real-time data changes from the document-based NoSQL datastore.
  • Snowflake: “Snowpipe” can help organizations seamlessly load continuously generated data into the cloud data warehouse.
  • Elasticsearch: “Data Streams” lets you store append-only time series data across multiple indices while giving you a single named resource for requests. Data streams are well-suited for logs, events, metrics, and other continuously generated data ingested into the Elasticsearch engine.

These solutions have in common that they move from batch to near real-time ingestion into their data store or data lake. Nevertheless, they still store and analyze data at rest. Hence, this is complementary but not an alternative to event streaming.
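
As a concrete example of the MongoDB capability mentioned above, here is a minimal PyMongo sketch that subscribes to Change Streams. The connection string, database, and collection names are assumptions for illustration; change streams require a replica set or sharded cluster.

```python
from pymongo import MongoClient

# Placeholder connection string pointing at a local replica set
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client["shop"]["orders"]

# watch() returns a cursor of change events (inserts, updates, deletes) in real-time
with orders.watch() as stream:
    for change in stream:
        print(change["operationType"], change.get("fullDocument"))
```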

New entrants into the market try to differentiate from the above data storage vendors by providing a real-time infrastructure at its core. A great example is Rockset, a scalable real-time analytics platform in the cloud. As it is a native real-time solution, Rockset natively integrates with event streaming platforms such as Apache Kafka.

Event Streaming

Event Streaming platforms are event-based by nature. They process data in motion continuously. Therefore, the central nervous system of a Kappa architecture has to be an event streaming platform. Period.

For a comparison of frameworks like Kafka and Pulsar, plus reviewing the differentiators from platform vendors and SaaS providers such as Confluent, Cloudera, Red Hat, Amazon MSK, Azure Event Hubs, etc., please check out this comparison of event streaming platforms.

Event streaming will be one serverless component in a cloud-native data lake architecture in many future enterprise architectures.

It is worth noting that event streaming and the above-discussed business solutions and data storage and analytics vendors are complementary, not competitive! For instance, Confluent partners with business solutions such as Salesforce, database vendors such as MongoDB and Elastic, data warehouses such as Snowflake, and cloud providers such as AWS or Azure to provide Source, Sink, and Change Data Capture (CDC) connectors powered by Kafka Connect. The fully managed Confluent Cloud service even provides the end-to-end integration as part of the serverless offering in the public cloud.
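
As an illustration of how such connectors are wired up, here is a sketch of registering a MongoDB sink connector via the Kafka Connect REST API. The Connect URL, connection string, and the exact connector config keys are assumptions for illustration; check the connector’s documentation for the authoritative settings.

```python
import requests

# Placeholder connector definition; adjust names and config keys to your environment
connector = {
    "name": "mongodb-sink",
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "topics": "engine-anomaly-alerts",
        "connection.uri": "mongodb+srv://user:password@cluster.example.mongodb.net",
        "database": "telemetry",
        "collection": "anomalies",
    },
}

# POST the connector config to a Kafka Connect worker (placeholder URL)
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print("Connector created:", resp.json()["name"])
```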

Video Recording: Kappa vs. Lambda Architecture

I covered the discussion around “Kappa vs. Lambda” in a 40-minute video recording, too. Enjoy:

Kappa is the New Black for the Enterprise Architecture

Real-time data beats slow data. After reading this article, think about your industry, business unit, and projects again. If real-time data processing improves your customer experiences, increases your revenue, or reduces your cost and risk, then why wait? The Kappa architecture provides enormous benefits and a much simpler infrastructure than the Lambda architecture.

Having said this, batch processing and other data storage and analytics services are not going away. Kappa and event streaming are complementary and no silver bullet for every problem. For more details, check out the article “Can Apache Kafka replace a database?” – that article emphasizes this statement and explores the trade-offs.

Event streaming is the foundation of Kappa architecture. There is no way around this. Apache Kafka is the de facto standard for event streaming and the choice in real-world Kappa architectures. If you still need or want to evaluate your own event streaming platform, continue with the Kafka vs. Pulsar comparison or the general comparison of competitive event streaming vendors and cloud solutions.

Did you already adopt the Kappa architecture? Or do you still rely on or even prefer Lambda architectures? What are your experiences and opinions? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kappa Architecture is Mainstream Replacing Lambda appeared first on Kai Waehner.

]]>
Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage https://www.kai-waehner.de/blog/2021/05/09/kafka-api-de-facto-standard-event-streaming-like-amazon-s3-object-storage/ Sun, 09 May 2021 14:32:49 +0000 https://www.kai-waehner.de/?p=3380 Real-time beats slow data in most use cases across industries. The rise of event-driven architectures and data in motion powered by Apache Kafka enables enterprises to build real-time infrastructure and applications. This blog post explores why the Kafka API became the de facto standard API for event streaming like Amazon S3 for object storage, and the tradeoffs of these standards and corresponding frameworks, products, and cloud services.

The post Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage appeared first on Kai Waehner.

]]>
Real-time beats slow data in most use cases across industries. The rise of event-driven architectures and data in motion powered by Apache Kafka enables enterprises to build real-time infrastructure and applications. This blog post explores why the Kafka API became the de facto standard API for event streaming like Amazon S3 for object storage, and the tradeoffs of these standards and corresponding frameworks, products, and cloud services.

De Facto Standard API - Amazon S3 for Object Storage and Apache Kafka for Event Streaming

Event-Driven Architecture: This Time It’s Not A Fad

The Forbes article “Event-Driven Architecture: This Time It’s Not A Fad” from April 2021 explained why enterprises are not just talking about event-driven real-time applications but finally building them. Here are some arguments:

  • REST limitations can limit your business strategy
  • Data needs to be fluid and real-time
  • Microservices and serverless need event-driven architectures

Real-time Data in Motion beats Slow Data

Use cases for event-driven architectures exist across industries. Some examples:

  • Transportation: Real-time sensor diagnostics, driver-rider match, ETA updates
  • Banking: Fraud detection, trading, risk systems, mobile applications/customer experience
  • Retail: Real-time inventory, real-time POS reporting, personalization
  • Entertainment: Real-time recommendations, a personalized news feed, in-app purchases
  • The list goes on across verticals…

Real-time data in motion beats data at rest in databases or data lakes in most scenarios. There are a few exceptions that require batch processing:

  • Reporting (traditional business intelligence)
  • Batch analytics (processing high volumes of data in a bundle, for instance, Hadoop and Spark’s map-reduce, shuffling, and other data processing only make sense in batch mode)
  • Model training as part of a machine learning infrastructure (while model scoring and monitoring often requires real-time predictions, the model training is batch in almost all currently available ML algorithms)

Beyond these exceptions, almost everything is better in real-time than batch.

Be aware that real-time data processing is more than just sending data from A to B in real-time (aka messaging or pub/sub). Real-time data processing requires integration and processing capabilities. If you send data into a database or data lake in real-time but have to wait until it is processed there in batch, it does not solve the problem.

With the ideas around real-time in mind, let’s explore what a de facto standard API is.

What is a (De Facto) Standard API?

The answer is longer than you might expect and needs to be separated into three sections:

  • API
  • Standard API
  • De facto standard API

What is an API?

An application programming interface (API) is an interface that defines interactions between multiple software applications or mixed hardware-software intermediaries. It defines the kinds of calls or requests that can be made, how to make them, the data formats that should be used, the conventions to follow, etc. It can also provide extension mechanisms so that users can extend existing functionality in various ways and to varying degrees.

An API can be entirely custom, specific to a component, or designed based on an industry-standard to ensure interoperability. Through information hiding, APIs enable modular programming, allowing users to use the interface independently of the implementation.

What is a Standard API?

Industry consortiums or other industry-neutral (often global) groups or organizations specify standard APIs. A few characteristics show the trade-offs:

  • Vendor-agnostic interfaces
  • Slow evolution and long specification process
  • Most vendors add proprietary features because a) the standardization process is too slow or, more often, b) they want to differentiate their commercial offering
  • Acceptance and success depend on the complexity and added value (this sounds obvious but is often the key blocker for success)

Examples for Standard APIs

Here are some examples of standard APIs. I also add my thoughts if I think they are successful or not (but I fully understand that there are good arguments against my opinion).

Generic Standards
  • SQL: Domain-specific language used in programming and designed for managing data held in a relational database management system. Successful, as almost every database somehow supports SQL or tries to build a similar syntax. A good example is ksqlDB, the Kafka-native streaming SQL engine. ksqlDB (like most other streaming SQL engines) is not ANSI SQL, but still easily understood by people who know SQL.
  • J2EE / Java EE / Jakarta EE: Successful, as most vendors adopted at least parts of it for Java frameworks. While early versions were very heavyweight and complex, the current APIs and implementations are much more lightweight and user-friendly. JMS is a great example where vendors added proprietary add-ons to add features and differentiate. No vendor lock-in is only true in theory!
  • HTTP: Successful as application layer protocol for distributed, collaborative, hypermedia information systems. While not 100% correct, people typically interpret HTTP as REST Web Services. HTTP is often misused for things it is not built for.
  • SOAP / WSDL: Partly successful in providing XML-based web service standard specifications. Some vendors built good tooling around it. However, this is typically only true for the basic standards such as SOAP and WSDL, not so much for all the other complex add-ons (often called WS-* hell).
Standards for a Specific Problem or Industry
  • OPC-UA for Industrial IoT (IIoT): Partly successful machine-to-machine communication protocol developed for industrial automation. Adopted by almost every vendor in the industrial space. The drawback (similar to HTTP) is that it is often misused. For instance, MQTT is a much better and more lightweight choice in some scenarios. OPC-UA is a great example where the core is successful, but the industry-specific add-ons are not prevalent and not supported by tools. Also, OPC-UA is too heavyweight for many of the use cases it is used in.
  • PMML for Machine Learning: Not successful as an XML-based predictive model interchange format. The idea is great: Train an analytic model once and then deploy it across platforms and programming languages. In practice, it did not work due to too many limitations and unnecessary complexity. Most real-world machine learning deployments I have seen in the wild avoid it and deploy models to production with a standard wrapper. ONNX and other successors have not become more prevalent yet either.

In summary, some standard APIs are successful and adopted well; many others are not. Contrary to these standards specified by consortiums, there is another category emerging: De Facto Standard APIs.

What is a De Facto Standard API?

De facto standard APIs originate from an existing successful solution (that can be an open-source framework, a commercial product, or a cloud service). These de facto standard APIs emerge in two ways:

  • Driven by a single vendor (often proprietary), for example: Amazon S3 for object storage.
  • Driven by a huge community around a successful open-source project, for example: Apache Kafka for event streaming.

No matter how a de facto standard API originated, they typically have a few characteristics in common:

  • Creation of a new category of software, something that did not exist before
  • Adoption by other frameworks, products, or cloud services as their API because it became the de facto standard
  • No complex, formal, long-running standard processes; hence innovation is possible in a relatively flexible and agile way
  • Practical processes and rules are in place to ensure good quality and consensus (either controlled by the owner company for a proprietary standard API or across the open source community)

Let’s now explore two de facto standard APIs: Amazon S3 and Apache Kafka. Both are very successful but very different regarding being a standard. Hence, the trade-offs are very different.

Amazon S3: De Facto Standard API for Object Storage

Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface in the public AWS cloud. It uses the same scalable storage infrastructure that Amazon.com uses to run its global e-commerce network. Amazon S3 can be employed to store any type of object, which allows for uses like storage for internet applications, backup and recovery, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage. Additionally, S3 on Outposts provides on-premises object storage for on-premises applications that require high-throughput local processing.

“Amazon CTO on Past, Present, Future of S3” is a great read about the evolution of this fully-managed cloud service. While the public API was kept stable, the internal backend architecture under the hood changed significantly several times. Plus, new features were developed on top of the API, for instance, AWS Athena for analytics and interactive queries using standard SQL. I really like how Werner Vogels describes his understanding of a good cloud service:

Vogels doesn’t want S3 users to even think for a moment about spindles or magnetic hardware. He doesn’t want them to care about understanding what’s happening in those data centers at all. It’s all about the services, the interfaces, and the flexibility of access, preferably with the strongest consistency and lowest latency when it really matters.

So, we are talking about a very successful proprietary cloud service by AWS. Hence, what’s the point?

Most Object Storage Vendors Support the Amazon S3 API

Many enterprises use the Amazon S3 API. Hence, it became the de facto standard. If other storage vendors want to sell object storage, supporting the S3 interface is often crucial to get through the evaluations and RFPs. If you don’t support the S3 API, it is much harder for companies to adopt the storage and implement the integration (as most companies already use Amazon S3 and have built tools, scripts, testing around this API).

For this reason, many applications have been built to support the Amazon S3 API natively. This includes applications that write data to Amazon S3 and Amazon S3-compatible object stores.

S3-compatible solutions include client backup, file browser, server backup, cloud storage, cloud storage gateway, sync & share, hybrid storage, on-premises storage, and more.

Many vendors sell S3-compatible products: Oracle, EMC, Microsoft, NetApp, Western Digital, MinIO, Pure Storage, and many more. Check out the Amazon S3 page on Wikipedia for a more detailed and complete list.
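
This is easy to see in code: the very same S3 client works against an S3-compatible store just by overriding the endpoint. Here is a minimal boto3 sketch; the endpoint URL, credentials, and bucket name are placeholders for illustration (e.g., a local MinIO instance).

```python
import boto3

# The same S3 client code works against any S3-compatible store by pointing it
# at a different endpoint. Endpoint and credentials below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # MinIO or another S3-compatible endpoint
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="backups")
s3.put_object(Bucket="backups", Key="2021-05-09/db-dump.gz", Body=b"...")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```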

So why has the S3 API become so ubiquitous?

The creation of a new software category is a dream for every vendor! Let’s understand how and why Amazon was successful in establishing S3 for object storage. The following is a quote from Chris Evans’ great article from 2016: “Has S3 become the de facto API standard?”

So why has the S3 API become so ubiquitous?  I suspect there are a number of reasons.  These include:

  • First to market – When S3 was launched in 2006, most enterprises were familiar with object storage as “content addressable storage” through EMC’s Centera platform.  Other than that, applications were niche and not widely adopted except for specific industries like High Performance Computing where those users were used to coding to and for the hardware.  S3 quickly became a platform everyone could use with very little investment.  That made it easy to consume and experiment with.  By comparison, even today the leaders in object storage (as ranked by the major analysts) still don’t make it easy (or possible) to download and evaluate their products, even though most are software only implementations.
  • Documentation – following on from the previous point, S3 has always been well documented, with examples on how to run API commands.  There’s a document history listing changes over the past 6-7 years that shows exactly how the API has evolved.
  • A Single Agenda – the S3 API was designed to fit a single agenda – that of storing and retrieving objects from S3.  As such, Amazon didn’t have to design by committee and could implement the features they required and evolve from there.  Contrast that with the CDMI (Cloud Data Management Interface) from SNIA.  The SNIA website is difficult to navigate, the standard itself is only on the 4th published iteration in six years, while the documentation runs to 264 pages! (Note that the S3 API runs into more pages, but is infinitely more consumable, with simple examples from page 11 onwards).

Cons of a Proprietary De Facto Standard like Amazon S3

Many people might say: “Better a proprietary standard than no standard.” I partly agree with this. The ability to learn one API and use it across multi-cloud and on-premise systems and vendors is great. However, Amazon S3 has several disadvantages as it is NOT an open standard:

  • Other vendors (have to) build their implementation on a best guess about the behavior of the API. There is no official standard specification they can rely on.
  • Customers cannot be sure what they buy. At the very least, they should not expect 3rd-party S3 implementations to behave exactly like Amazon S3 on AWS.
  • Amazon can change APIs and features as it likes. Other vendors need to “reverse engineer the API” and adjust their products.
  • Amazon could sue competitors for using S3 API branding – even though this is not likely to happen as the benefits are probably bigger (I am not a lawyer; hence this statement might be wrong and is just my personal opinion)

Let’s now look at an open-source de facto standard: Kafka.

Kafka API: De Facto Standard API for Event Streaming

Apache Kafka is mainstream today! The Kafka API became the de facto standard for event-driven architectures and event streaming. Two proof points:

The Kafka API (aka Kafka Protocol)

Kafka became the de facto event streaming API, similar to how the S3 API became the de facto standard for object storage. Actually, the situation is even better for the Kafka API, as the S3 API is a proprietary protocol from AWS. In contrast, the Kafka API and protocol are open source under the Apache 2.0 license.

The Kafka protocol covers the wire protocol implemented in Kafka. It defines the available requests, their binary format, and the proper way to make use of them to implement a client.

One of my favorite characteristics of the Kafka protocol is backward compatibility. Kafka has a “bidirectional” client compatibility policy. In other words, new clients can talk to old servers, and old clients can talk to new servers. This allows users to upgrade either clients or servers without experiencing any downtime or data loss. This makes Kafka ideal for microservice architectures and domain-driven design (DDD). Kafka really decouples the applications from each other, in contrast to web service/REST-based architectures.

Pros of an Open Source De Facto Standard like the Kafka API

The huge benefit of an open-source de facto standard API is that it is open and usually follows a collaborative standardized process to make changes to the API. This brings various benefits to the community and software vendors.

The following facts about the Kafka API make many developers and enterprises happy:

  • Changes occur in a visible process enforced by a committee. For Apache Kafka, the Apache Software Foundation (ASF) is the relevant organization. Apache projects are managed using a collaborative, consensus-based process with members from various countries and enterprises. Check out how it works if you don’t know it yet.
  • Frameworks and vendors can implement against the open protocol and validate the implementation. That is significantly different from proprietary de facto standards like Amazon S3. Having said this, not every product that claims to use the Kafka API is 100% compatible; some are limited in their feature set or provide different behavior.
  • Developers can test the underlying behavior against the same API. Hence, unit and performance tests for different implementations can use the same code.
  • The Apache 2.0 license makes sure that the user does not have to worry about infringing any patents by using the software.

Frameworks, Products, and Cloud Services using the Kafka API

Many frameworks and vendors adopted the Kafka API. Let’s take a look at a few very different alternatives available today that use the Kafka API:

  • Open-source Apache Kafka from the Apache website
  • Self-managed Kafka-based vendor solutions for on-premises or cloud deployments from Confluent, Cloudera, Red Hat
  • Partially managed Kafka-based cloud offerings from Amazon MSK, Red Hat, Azure HDInsight’s Kafka, Aiven, CloudKarafka, Instaclustr.
  • Fully managed Kafka cloud offerings such as Confluent Cloud – actually, there is no other serverless, fully compatible Kafka SaaS offering on the market today (even though many marketing departments try to sell it like this)
  • Partly protocol-compatible, self-managed solutions such as Apache Pulsar (with a simple, very limited Kafka wrapper class) or Redpanda for embedded / WebAssembly (WASM) use cases
  • Partly protocol-compatible, fully managed offerings like Azure Event Hubs

Just be aware that the devil is in the details. Many offerings only implement a fraction of the Kafka API. Additionally, many offerings only support the core messaging concept, but exclude key features such as Kafka Connect for data integration, Kafka Streams for stream processing, or exactly-once semantics (EOS) for building transactional systems.
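
The flip side of a de facto standard protocol is that, for the supported feature subset, client code stays the same and only the connection settings change. Here is a minimal sketch with the confluent-kafka Python client; the bootstrap servers, credentials, and topic are placeholders for illustration, and whether a given offering supports the full protocol (Connect, Streams, EOS, etc.) is exactly the detail you need to verify per vendor.

```python
from confluent_kafka import Producer

# The same producer code can talk to any broker that implements the Kafka protocol;
# only the connection and security settings change. All values below are placeholders.
conf = {
    "bootstrap.servers": "broker.example.com:9092",  # swap in your Kafka-compatible endpoint
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
}

producer = Producer(conf)
producer.produce("payments", key="customer-42", value='{"amount": 99.95}')
producer.flush()
```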

The Kafka API Dominates the Event Streaming Landscape

If you look at the current event streaming landscape, you see that more and more frameworks and products adopt the Kafka API. Even though the following is not a complete list (and other non-Kafka offerings exist), it is impressive:

Event Streaming Landscape with Apache Kafka Related Products and Cloud Services including Confluent Cloudera Red Hat IBM Amazon MSK Redpanda Pulsar Azure Event Hubs

If you want to learn more about the different Kafka offerings on the market, check out my Kafka vendor comparison. It is crucial to understand what Kafka offering is right for you. Do you want to focus on business logic and consume the Kafka infrastructure as a service? Or do you want to implement security, integration, monitoring, etc., by yourself?

The Kafka API is here to stay…

The Kafka API became the de facto standard API for event streaming. The usage of an open protocol creates huge benefits for corresponding frameworks, products, and cloud services leveraging the Kafka API.

Vendors can implement against the open standard and validate their implementation. End users can choose the best solution for their business problem. Migration between different Kafka services is also relatively easy – as long as each vendor is compliant with the Kafka protocol and implements it completely and correctly.

Are you using the Kafka API today? Open-source Kafka (“car engine”), a commercial self-managed offering (“complete car”), or the serverless Confluent Cloud (“self-driving car”) to focus on business problems? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage appeared first on Kai Waehner.

]]>
Comparison of Open Source Apache Kafka vs Vendors including Confluent, Cloudera, Red Hat, Amazon MSK https://www.kai-waehner.de/blog/2021/04/20/comparison-open-source-apache-kafka-vs-confluent-cloudera-red-hat-amazon-msk-cloud/ Tue, 20 Apr 2021 07:49:39 +0000 https://www.kai-waehner.de/?p=3233 Apache Kafka became the de facto standard for event streaming. Various vendors added Kafka and related tooling to their offerings or provide a Kafka cloud service. This blog post uses the car analogy - from the motor engine to the self-driving car - to explore the different Kafka offerings available on the market. The goal is not a feature-by-feature comparison. Instead, the intention is to educate about the different deployment models, product strategies, and trade-offs from the available options.

The post Comparison of Open Source Apache Kafka vs Vendors including Confluent, Cloudera, Red Hat, Amazon MSK appeared first on Kai Waehner.

]]>
Apache Kafka became the de facto standard for event streaming. The open-source community is huge. Various vendors added Kafka and related tooling to their offerings or provide a Kafka cloud service. This blog post uses the car analogy – from the motor engine to the self-driving car – to explore the different Kafka offerings available on the market. I also cover a few other vehicles, meaning (partly) Kafka-compatible technologies. The goal is not a feature-by-feature comparison (that would be outdated the day after the publication). Instead, the intention is to educate about the different deployment models, product strategies, and trade-offs from the available options.

What car would you choose

Disclaimer: I work for Confluent. However, the post is not about comparing features but explaining the concepts behind the alternatives. I talk to enterprises across the globe every week. I can assure you that many people I talk to are not aware of, or are misled about, what you read in the following sections. Hence, I hope that the following helps you to make the right decision: either choose to run open-source Apache Kafka or one of the various commercial Kafka offerings, or even a combination of both.

UPDATE (August 2022): This blog post was written before Amazon MSK Serverless was released. The below is still accurate and worth a read for comparing Kafka products and cloud services. Additionally, please check out the article “When NOT to choose Amazon MSK Serverless for Apache Kafka?”

Apache Kafka Components and Use Cases

The goal is not to introduce Kafka here. The minimum you should know is that Kafka is NOT just a messaging layer for data ingestion into a data lake. This is just a fraction of today’s usages.

Kafka is an open-source framework under the Apache 2.0 license. It provides a combination of messaging, storage, processing, and integration of high volumes of data at scale, in real-time, and fault-tolerant. That’s what makes Kafka unique compared to other MQ, ETL, ESB, and API platforms.

Kafka is deployed in production for various use cases across industries. This includes analytical and mission-critical workloads. Different deployments require different SLAs. You should always ask yourself what happens if the Kafka infrastructure is in trouble. What are your RTO (Recovery Time Objective) and RPO (Recovery Point Objective)? Or in other words: How much data is okay to lose? How much downtime is acceptable? Keep these questions in mind when you start your comparison of the options!
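
These RPO questions map directly to a handful of Kafka settings. Here is a minimal Python sketch of the durability-related knobs (replication factor, min.insync.replicas, acks, idempotence); the broker address and topic name are placeholders for illustration, not a recommendation for your specific SLAs.

```python
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

BOOTSTRAP = "localhost:9092"  # placeholder broker address

# A write must be replicated to at least 2 of the 3 replicas before it is acknowledged
admin = AdminClient({"bootstrap.servers": BOOTSTRAP})
futures = admin.create_topics([
    NewTopic("payments", num_partitions=12, replication_factor=3,
             config={"min.insync.replicas": "2"})
])
futures["payments"].result()

producer = Producer({
    "bootstrap.servers": BOOTSTRAP,
    "acks": "all",               # wait for all in-sync replicas before acknowledging
    "enable.idempotence": True,  # no duplicates on producer retries
})
producer.produce("payments", key="tx-1", value='{"amount": 42}')
producer.flush()
```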

Kafka is the De Facto Standard API for Event Streaming like S3 API for Object Storage

Apache Kafka is mainstream! The latest proof: Check out the new ThoughtWorks Technology Radar: “Kafka API without Kafka”:

Kafka API without Kafka - Thoughtworks Technology Radar

Kafka became the de facto event streaming API, similar to how the S3 API became the de facto standard for object storage. Actually, the situation is even better for the Kafka API, as the S3 API is a proprietary protocol from AWS. In contrast, the Kafka API and protocol are open source under the Apache 2.0 license.

De Facto Standard API - Amazon S3 for Object Storage and Apache Kafka for Event Streaming

Check out the blog “Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage” for more details.

Let’s take a look at a few very different Kafka alternatives available today:

That’s a lot of options. So, how do you make a Kafka comparison to choose the right one? Before we go into more detail, let’s explore how complex Kafka actually is and when you do have to care about this at all.

Should you care how complex or heavyweight your event streaming technology is?

Complexity matters (only) if you need to operate the infrastructure by yourself. The beauty of SaaS is that you just consume the service and focus on your business problems. For instance, the AWS S3 object storage is a simple API with a fully managed service under the hood. You do not need to worry about operations or monitoring. You just use the cloud service.

Having said this, it is a little bit strange that ThoughtWorks mentions the barriers and complexity of Kafka but then refers to the Pulsar wrapper. That is an (immature) single-class mapping implementation that only maps a small part of Kafka’s protocol (here is the producer wrapper as an example). Developers can use that wrapper to move data between Kafka clients and Pulsar brokers. However, Pulsar has a much more complex three-tier distributed architecture with ZooKeeper, BookKeeper, and Pulsar clusters. What is the benefit here? Is this really what you want to do in mission-critical workloads? Why? Please let me know if you seriously consider using such a wrapper architecture. Also, please read my post about the “Myths of Kafka vs. Pulsar”. A lot of arguments like the Kafka wrapper are simply just marketing and not usable for real-world projects!

Therefore, when I think about using the Kafka API without operating Kafka, then I have fully managed SaaS offerings such as Confluent Cloud or Azure Event Hubs in my mind.

Having said this, a fully managed cloud service is not always an option. For instance, Kafka at the edge is the new black. Plenty of use cases exist for a single broker or highly-available Kafka clusters at the edge.

But even if you want or need to operate Kafka by yourself: With KIP-500 and the removal of ZooKeeper, it gets easier and more lightweight than ever before. Many arguments for moving to a more “lightweight alternative” do not exist anymore. There might be good reasons to choose something like Redpanda. But the main argument of having a simpler and more lightweight deployment is no longer valid. Check out this video showing Kafka without ZooKeeper.

How to Choose the Right Kafka Distribution or Cloud Service?

So, how to make a comparison to find out which Kafka distribution or cloud service is the right one for your project?

The answer is simpler than you might think: The ultimate goal is to focus and solve your business problem.

How do you do that? By implementing business logic. Ideally, you don’t have to worry about infrastructure, operations, security, scalability, reliability, and non-business characteristics. Hence, SaaS with fully managed Kafka should be the first choice.

Unfortunately, SaaS is not always possible or the best option for many reasons:

  • Missing features
  • Technical limitations
  • Cost
  • Security requirements
  • On-premise or edge use cases

Therefore, we need to go one step back and understand what options you have to deploy and operate Kafka. Understanding these concepts without all the marketing fluff from the vendors is crucial to make the right decision!

The Kafka Car: An Analogy for a Product Comparison

It is often easier to compare technology by using an analogy from real life. Something everybody understands. No matter what industry you are in. No matter how technical you are. First, I thought I would use the analogy of pizza, including self-made pizza, pizza ingredients, restaurants, delivery services, and other related topics. But pizza is used so often in the IT world. This originated in the early days of Amazon. Jeff Bezos instituted a rule: Every internal team should be small enough to be fed with two pizzas.

Finally, I chose to use the analogy of a car because I think many of the arguments are less debatable this way. I guess we could never agree on what would be the best pizza option for most people… 🙂

Hence, let’s talk about car engines, car brands, self-driving, connected fleets, and vintage vehicles in the following sections.

How to choose the right Apache Kafka Offering - Confluent Cloudera Red Hat IBM Amazon AWS MSK

Give me a self-driving car, please!

Obviously, most people would prefer a self-driving car (if the price is right). It is safe, cost-efficient, and comfortable.

In the Kafka context, this means that the Kafka infrastructure would be

  • cloud-native (= elastic, scalable and automated, ideally fully managed)
  • complete (= entire set of security and operational features that enterprises require)
  • everywhere (= available in multiple public clouds, private cloud, on-premise, edge outside the data center)

Unfortunately, not every Kafka setup can be self-driving. We need to disassemble a car into its parts to understand what’s going on under the hood. Then we can choose the right car for our business problem.

Car Brands: Comparison of Confluent, Cloudera, Red Hat, Amazon MSK

Competition creates innovation. Hence, it is great to see many car brands and car models on the streets. Similarly, many competing companies fight for market share in the Kafka business. Let’s quickly think about the available car brands (= Kafka vendors) on the market.

Comparison of Apache Kafka Products and Services

I only focus on the most relevant ones that either care about the Kafka project and community, have a lot of market power, or ideally both.

The car brands are Confluent, Cloudera, Red Hat, and Amazon MSK. I have a section on other Kafka and non-Kafka streaming vendors at the end of the blog post to provide a more detailed comparison.

Again, the idea is NOT to have a feature-by-feature comparison or flame war. The following are a few facts about each vendor. I only focus on Kafka-related points. Hence, it is no surprise that Confluent looks best in the following list as they only focus on event streaming. But obviously, each vendor has strengths and weaknesses. For instance, if you want to discuss the overall cloud infrastructure capabilities and strategy, well, then AWS would look much stronger than all the other Kafka vendors… 🙂

Confluent – The Leading Apache Kafka Vendor

A few facts about Confluent:

  • Focus on event streaming
  • Original creators of Kafka
  • The main contributor to the Apache Kafka project with 80% of Kafka commits
  • Always the latest Kafka version (without limitations) and full support
  • Rich Kafka ecosystem (connectors, governance, security, etc.)
  • Hybrid architectures (including the only true fully-managed and complete Kafka service)
  • Partnership and 1st party integration into cloud providers (AWS, GCP, Azure) – e.g., you can use your cloud provider credits and account to consume Confluent Cloud
  • Certified for self-managed operations on cloud providers’ edge offerings (e.g., AWS Outpost including Wavelength, Google’s Anthos)

Cloudera – Big Data Analytics Suite

A few facts about Cloudera:

  • Focus on big data analytics
  • Provides a platform around tens of different big data frameworks for storage, batch, and real-time analytics
  • Kafka is part of the platform (Hadoop, Spark, Flume, Flink, many more) with tooling and support for the whole platform
  • Hybrid architectures (but no fully-managed Kafka service)
  • Partnership and 3rd party integration into cloud providers (AWS, GCP, Azure)

Red Hat (IBM) – Cloud-native PaaS Infrastructure

A few facts about Red Hat (IBM):

  • Focus on infrastructure (mainly around Linux and Kubernetes)
  • Kafka is available as part of the Red Hat AMQ product portfolio, combined with other open-source frameworks like ActiveMQ or Camel
  • OpenShift Streams for Apache Kafka provides integration with Kubernetes
  • Focus on open-source frameworks; working actively with the community (for Kafka, Red Hat contributes, e.g., to Debezium for CDC and the Strimzi Kubernetes Operator)
  • Hybrid Architectures (but no fully-managed Kafka service)
  • Partnership and 3rd party integration into cloud providers (AWS, GCP, Azure)

Interesting side notes for the relationship between Confluent, Red Hat, and IBM:

  • IBM acquired Red Hat in 2019.
  • Confluent and IBM announced a strategic partnership in 2020
  • IBM deprecated its own Kafka offering (IBM Streams) in March 2021.
  • Confluent is the way to go with IBM as part of the IBM Cloud Pak for Integration. Even IBM’s salespeople sell Confluent.

Amazon Web Services (AWS) – The Leading Cloud Provider

AWS focuses on cloud infrastructure and 1st party fully managed cloud services (S3, Kinesis, Lambda, etc.)
A few facts about Amazon MSK, the AWS offering for Kafka:
  • MSK misses several key Kafka features, such as Kafka Connect or Kafka Streams
  • Cloud-only (and only partially managed, not fully managed)
  • MSK is not cloud-native (like S3 or Kinesis) but just provisioned infrastructure
  • Obviously only available on AWS
  • For on-premise deployments (like AWS Outpost or AWS Wavelength), the recommended Kafka product is the certified Confluent Platform

Interesting side note about the commercial support and SLAs of AWS’s Kafka offering: Kafka is excluded from MSK support! Quote from the MSK SLAs: “The Service Commitment DOES NOT APPLY to any unavailability, suspension, or termination… caused by the underlying Apache Kafka or Apache ZooKeeper engine software that leads to request failures…”

Event Streaming Technology and Cloud-native Infrastructure are Complementary!

The above showed a few facts for the main Kafka vendors: Confluent, Cloudera, Red Hat, AWS. However, it is worth explicitly pointing out that these vendors are often complementary. For instance, most Confluent Platform deployments I see on Kubernetes on-premise are actually on Red Hat OpenShift. And with AWS’s huge market share, most self-managed Confluent deployments in the cloud are on AWS.

Also, Confluent Platform is certified on AWS Outpost and Google Anthos. Hence, you can even combine cloud-native technologies at the edge. A great example is smart factory 5G use cases leveraging Confluent Platform on AWS Wavelength. Consequently, a Kafka comparison typically does not eliminate all the other Kafka vendors from the project.

The following architecture depicts the combination of Confluent Cloud in AWS plus Confluent Platform on AWS Wavelength leveraging 5G Carrier networks:

Hybrid Cloud Architecture for Energy and Smart Grid with Apache Kafka and AWS Wavelength 5G

This is not just theory. The joint teams from AWS and Confluent are working on this example in the real world while I am writing this blog post.

Cloud-native? Complete? Everywhere? What Kafka should I buy?

After exploring different vendors, let’s now walk through the different deployment options and commercial offerings.

Again, I will not make a feature-by-feature comparison. Way more important is to understand the different concepts and architecture principles: First of all, you need to decide if a self-driving car (= fully managed Kafka) works for you. In that case, why bother at all about Kafka operations? Otherwise, project teams must evaluate partially managed (= complete car) or self-managed (car engine) Kafka offerings.

Here is an overview showing the event streaming landscape. It contains native Kafka offerings, (partly) Kafka-protocol compatible products, and a few relevant non-Kafka solutions:

Event Streaming Landscape with Apache Kafka Related Products and Cloud Services including Confluent Cloudera Red Hat IBM Amazon MSK Redpanda Pulsar Azure Event Hubs

Let’s now take a deeper look into these alternatives to find out how to choose the right one for your next project.

Car Engine: Self-managed Open Source Apache Kafka

The car engine is the heart of the car. It provides the power. It brings you from your source to your destination. However, a lot of work is needed around the motor engine. Tires, a steering wheel, brakes, and much more are required. Hence, this is a great solution for playing around, learning how a car works, or building a car by enthusiastic car fanatics.

If you download open-source Apache Kafka from the Apache website or related Docker images, then you can use it for free in all your projects. No limitations. You should be able to get it running quickly. However, be aware that, similar to the motor engine of a car, there is much more to do: operating and monitoring the ZooKeeper and Kafka clusters, rebalancing partitions, scaling up and down, managing storage, securing and encrypting the end-to-end communication between producers, the Kafka cluster, and consumers, and so much more.

If you can handle the operations burden and risk of downtime, open-source Apache Kafka might be a good option. Some tech giants from Silicon Valley do exactly this. They have hired masses of tech experts (or car fanatics to keep the analogy) to run huge Kafka clusters to process trillions of messages and gigabytes of data per second.

Free Kafka add-ons to build your car

Plenty of open-source Kafka add-ons originated through this. Just to name a few tools for very different purposes: kafkacat, Kafka Manager, Kafdrop, Burrow, Cruise Control, and so many more. Some are maintained well, others not at all. Of course, you will never get guarantees for a version upgrade or bug fix. They were often built by a tech giant for its specific scenario and are not easily usable outside that organization and without a big community.

Alternatively, there are well-maintained community projects like Confluent’s Schema Registry, REST Proxy, and ksqlDB, all under Confluent Community License (CCL). This is not open source but free to use if you are not a cloud provider like AWS. Confluent also provides some components under Apache 2.0 license, such as the widely used non-Java Kafka clients based on librdkafka or the parallel-consumer to integrate with non-scalable interfaces like web services in a scalable and performant way.
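
As a small example of these community components in action, here is a sketch that registers and looks up an Avro schema with Schema Registry, using the client that ships in the confluent-kafka Python package. The Schema Registry URL, subject name, and schema are placeholders for illustration.

```python
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

# Placeholder Schema Registry endpoint
sr = SchemaRegistryClient({"url": "http://localhost:8081"})

# Register a simple Avro schema under a subject (placeholder names)
avro_schema = Schema(
    '{"type": "record", "name": "Payment", "fields": ['
    '{"name": "amount", "type": "double"}]}',
    schema_type="AVRO",
)
schema_id = sr.register_schema("payments-value", avro_schema)

print("registered schema id:", schema_id)
print("subjects:", sr.get_subjects())
```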

Tuned Car Engine: Self-managed Kafka Product

If you want or need to self-manage your Kafka infrastructure, then you still have more options than just using open-source Apache Kafka and (well or not so well maintained) open-source add-ons:

  • Open-source Apache Kafka with additional commercial tools for operations and monitoring. For instance, Lenses or Conduktor.
  • Complete commercial platforms. For instance, Confluent Platform, Red Hat AMQ, Cloudera DataFlow.

These “tuned car engines” are based on top of Apache Kafka (or at least parts of it) and provide additional tooling for development, operations, monitoring, security, etc. Maturity of the tools, support SLAs, expertise, and consulting vary a lot between vendors. I recommend talking to your potential vendors. Ask the right questions to understand if they really understand what they seem to sell and support.

Shiny user interfaces attract many people. Just be careful. The underlying technology needs to work reliably and scale for your needs. The UI is nice to have on top of the infrastructure. Nevertheless, a good UI can improve the developer experience, increase time to market, and bring other benefits.

Should I use a (Tuned) Car Engine and Build my own Car?

All the explored options above are still self-managed. If you consider building your own car with a car engine, always evaluate the cost-benefit equation.

Remember, at the beginning of this post, I talked about solving business problems. Hence, don’t forget to consider all the impacts on:

  • Total Cost of Ownership (TCO)
  • Risk (downtime, data loss, security, governance, etc.)
  • Return on Investment (ROI)
  • Time-to-market (Increased developer velocity and increased business agility)

At Confluent, we do TCO assessments with our prospects and customers so that they understand the complete costs and risks of a Kafka project. Such an assessment should be part of every Kafka comparison!

Hence, don’t forget to evaluate other alternatives to open-source Apache Kafka seriously. If self-managed Apache Kafka still works for you after the evaluation, then do it! But be aware that even the tech giants from Silicon Valley consider and buy other options today. Many had to build Kafka infrastructure because there was no other alternative when they built it years ago.

Does my favorite open-source vendor really provide open-source?

Also, be careful: Some open-source solutions don’t provide an easy way to build the product. So, evaluate what exactly is available from a so-called “open source offering”: Only a binary download? Docker images? Or can you also build and deploy everything from scratch easily, following the documentation (not just in theory!), using Maven, Gradle, Terraform, Ansible, or similar build and automation tools?

Before building your own car with an available car engine, why not buy a car? Let’s consider next if a complete car might make more sense for you.

Complete Car: Kafka Products and Kubernetes-based PaaS

I am not a car fanatic. I want to buy a complete car that I can drive everywhere. You should also at least consider this option!

In Kafka terms, this means you get help from the product for running the Kafka infrastructure. Buzzwords from vendors include terms like “platform as a service”, “private cloud”, “fully managed”, and “cloud-native”. In the end, the products help you with provisioning, operating, and monitoring everything.

The main benefit compared to the (tuned) car engines is that these products give you a more elastic, scalable, and automated infrastructure. Two options exist today:

  • Kubernetes-based products that can run everywhere on-premise and across multiple cloud providers. Examples: Confluent Platform, Cloudera DataFlow (CDF), Red Hat AMQ respectively Red Hat OpenShift Streams for Apache Kafka.
  • Proprietary cloud offerings that are typically tied to the related hyperscaler. Example: Amazon MSK.

This is similar to buying a car: It comes preassembled. Although, of course, you are still responsible for operating and maintaining it.

In Kafka terms, for many scenarios, these self-managed Kafka products (= car) are a better choice than the self-managed Kafka (= car engine) because they partially reduce the operations burden, risk, and (hopefully) TCO.

A complete car is still not self-driving!

However, as you know, today’s cars still need a lot of manual work: driving, refueling, maintenance, and more. In Kafka terms: How much work do you still have to do by yourself? Do you have to handle rolling upgrades manually? Do you have to rebalance partitions on brokers? How do you scale up and down? Who fixes security issues and bugs? And so on.

Hence, it is really a pity that most vendors intentionally use incorrect marketing! None of the above solutions is fully managed. All of them require you to do work to operate the Kafka cluster. All of them! Confluent Platform. Cloudera DataFlow. Red Hat AMQ. Red Hat OpenShift Streams for Apache Kafka. Amazon MSK. None of these services is fully managed!

Each car brand and model is different. If you buy a Porsche, you probably have very different expectations than when buying a small, medium-priced car from another brand. The same is true for all the self-managed Kafka products on the market. Each product is very different: Confluent Platform. Cloudera DataFlow. Red Hat AMQ / OpenShift Streams for Apache Kafka. Amazon MSK. All of them have strengths and weaknesses. Make sure it fits your expectations so that you can solve your business problem within your required SLAs and budget.

Having said this, wouldn’t it be nice if you don’t have to worry about all these things? Let’s explore the self-driving Kafka car next.

Self-driving Car: Fully-managed Cloud Kafka Service

A self-driving car provides a complete solution. You just tell it where you want to go. It drives you automatically. It chooses the best route and lets you relax, read, play games, or do similar things. Of course, an autonomous car with level 5 automation is not mature yet (beyond some early stages, like Waymo operating in the desert around Phoenix, where rain and other weather or traffic issues rarely occur).

In Kafka terms, the solution needs to be fully managed by the vendor.

Fully managed means serverless (i.e., you don’t have to care about the Kafka brokers and don’t even get access to them at all). Mission-critical SLAs. Usage-based billing. And so on. Like you know it from other truly fully managed cloud offerings such as AWS S3 or AWS Kinesis. These are fully managed. Amazon MSK is not!

Checklist to compare partially managed and fully-managed Kafka cloud services

Please compare different Kafka cloud offerings by yourself. Here are some bullet points to check:

Infrastructure management

  • Upgrades (latest stable version of Kafka)
  • Patching
  • Maintenance

Kafka-specific management

  • Sizing (retention, latency, throughput, storage, etc.)
  • Data balancing for optimal performance
  • Performance tuning for real-time and latency requirements
  • Fixing Kafka bugs
  • Uptime monitoring and proactive remediation of issues
  • Recovery support from data corruption

Scaling

  • Scaling the cluster as needed
  • Data balancing the cluster as nodes are added
  • Support for any Kafka issue with less than 60-minute response time

Most “Kafka as a service” offerings are only partially managed. That’s like a self-driving car which you actually have to control by yourself (more like level 3, not level 5 in autonomous driving terminology).
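
To make the checklist more concrete: with a partially managed service, tasks like sizing, retention tuning, and uptime monitoring remain your job. Here is a small AdminClient sketch of that kind of day-2 work; the broker address and topic name are placeholders for illustration.

```python
from confluent_kafka.admin import AdminClient, ConfigResource

# Placeholder broker address; on a partially managed service this is the kind of
# inspection and tuning work (see checklist above) that still stays with you.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Which topics and partitions exist, and how many brokers serve them?
metadata = admin.list_topics(timeout=10)
print(f"{len(metadata.brokers)} brokers, {len(metadata.topics)} topics")

# What are the current retention and durability settings of a topic?
resource = ConfigResource(ConfigResource.Type.TOPIC, "payments")
configs = list(admin.describe_configs([resource]).values())[0].result()
for name in ("retention.ms", "cleanup.policy", "min.insync.replicas"):
    print(name, "=", configs[name].value)
```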

At this point, I have to do marketing for my employer. However, it is not an advertisement, but reality: Confluent Cloud is the only offering on the market that provides a complete, fully-managed Kafka SaaS offering. And it is available everywhere – on all major cloud providers (AWS, Azure, GCP). All the other Kafka offerings are NOT fully managed – even though most vendors claim it!

Other Vehicles on the Street: Comparison of Kafka-Compatible and Non-Kafka Offerings

On the street, we don’t see just one car brand or car model. Plenty of different ones exist. Nevertheless, they have to drive on the same streets. Competition creates innovation and tackles different markets and personal interests. That’s great. The same is true for Kafka!

I focused on the “mainstream Kafka vendors” in the above sections. Namely, Confluent, Cloudera, Red Hat, Amazon MSK. Obviously, more Kafka offerings exist on the market. Some are really good for some use cases. Others are more like an April Fools’ Joke, in my opinion. Let’s quickly walk through a few other offerings.

  • A few more car brands: Azure HDInsight’s Kafka, Aiven, CloudKarafka, Instaclustr. These Kafka-native PaaS vendors provision Kafka clusters for you, similar to Amazon MSK. These offerings slightly differ from each other. In summary, they typically ask you to do storage management, scalability configuration, performance tuning, etc., by yourself. This is definitely not self-driving!
  • A self-driving car: Azure Event Hubs is a SaaS offering from Microsoft supporting the Kafka protocol. It has several limitations regarding support of the Kafka API and infrastructure. A solid product. Contrary to Confluent Cloud, you don’t get additional capabilities such as fully-managed connectors, Schema Registry, RBAC, Audit Logs, and much more. And obviously, this product is only available on the Azure cloud.
  • A vintage car: TIBCO focuses on its legacy messaging solutions like TIBCO EMS. They (try to) provide support for Kafka (and Pulsar) to sell their proprietary technologies, with zero expertise or interest in Kafka. They even provide Kafka as a Windows .exe file, even though this does not work well in reality. If you need to run Kafka brokers on Windows (e.g., for development), only use Kafka Docker containers and the Windows Subsystem for Linux 2 (WSL 2).

Non-Kafka offerings

  • Self-driving scooters: AWS Kinesis, GCP Pub/Sub, etc., are solid SaaS offerings that work well if you don’t need to be vendor-agnostic and if the feature set, scalability, and pricing work for you.
  • A few bicycles, motorbikes, and cars: Non-Kafka solutions, including message queuing (IBM MQ, RabbitMQ, NATS), stream processing (Flink, Spark Streaming), event streaming (Pulsar, Pravega), integration middleware (many open-source/proprietary and self-managed/SaaS). These are solid frameworks and products that you can compare to Kafka. There is no silver bullet! Make sure to understand the differences between MQ/ETL/ESB and Kafka when you do your evaluation.

Connected Car Fleet: Multiple Kafka Clusters and Hybrid Integration

The digital transformation around connected vehicles is a real game-changer. Vehicles talk to each other (V2V), their infrastructure like traffic lights (V2I), and to many other backend systems (V2X).

As a side note: If you are interested in the relation of Kafka and connected vehicles/mobility services, I covered use cases for connected vehicles and V2X in my blog series about Kafka and MQTT in more detail.

Today, we usually have to drive by ourselves. This is expected to change in the next five to ten years. However, even if Waymo, Tesla, and the like successfully deploy level 4 and level 5 cars on the street (including legal approval), we will still only see a fraction of all cars driving themselves. It will be a connected fleet of regular cars and self-driving cars for at least a few decades. Not even sure if self-driving cars can ever go to India 🙂

The same is true for Kafka. Self-managed open-source Kafka is still mainstream today. Many enterprises move to Kubernetes and private or public cloud infrastructure, though. In parallel, most new Kafka clusters in the cloud are consumed from vendors that provide partially or fully managed services so that the enterprise can focus on their business problems.

Kafka is deployed across infrastructures. Often, new projects have a cloud-first strategy. But there are still a lot of data centers out there. Not just for legacy reasons. For instance, in Russia, there is no public cloud provider at all. Kafka has to be deployed on-premise. And there is the trend of deploying Kafka at the edge (i.e., outside a data center).

Architectures for Hybrid Kafka (SaaS + PaaS + Self-Managed)

Hence, a connected car fleet with various brands and operation types is required. Most enterprises use different vendors and cloud providers. Most enterprises have their own data centers and a multi-cloud strategy. Hybrid Kafka includes various architectures. This includes:

  • Kafka in one or multiple clouds. There is no Azure or GCP in China, only Alibaba and Tencent Cloud. This is why Audi built their connected car infrastructure in the cloud, but with Kafka instead of proprietary cloud services. They need to deploy globally.
  • Kafka at the edge outside the data center, e.g., in a smart factory, oil rigs, ships, retail stores, etc. Often deployed as a single broker on very lightweight hardware, without high availability.
  • Kafka stretched across regions, i.e., one single cluster operating across US West, East, and Central. Confluent's Multi-Region Clusters (MRC) is mainly used for this architecture.
  • Replication between different Kafka clusters. Use cases include aggregation, disaster recovery, global deployments, and more. Kafka-native technologies such as MirrorMaker 2, Confluent Replicator, or Confluent Cluster Linking enable these architectures.

“Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” explores this topic in more detail.

Focus on the Business Problem when making your Kafka Comparison!

This blog post explored the different deployment options for Kafka. Several open-source and commercial options exist.

If you want to remember one thing from this post: A fully managed Kafka service (= real SaaS) takes over all the operations complexity and risk for you, just as a self-driving car handles all the actions on the street. However, most services available today only provide partially managed Kafka clusters. Fully managed is often only a marketing term.

A hybrid architecture is the norm in most enterprises. A combination of fully-managed Kafka in the public cloud with self-managed Kafka on premise or at the edge works very well and is the way to go for most enterprises across industries.

What Kafka car do you drive today? What is your plan for the future? Maybe you are already planning to migrate to a self-driving car to focus on your business problems, and consequently to reduce cost and risk, too? Let's connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Comparison of Open Source Apache Kafka vs Vendors including Confluent, Cloudera, Red Hat, Amazon MSK appeared first on Kai Waehner.

Pulsar vs Kafka – Comparison and Myths Explored https://www.kai-waehner.de/blog/2020/06/09/apache-kafka-versus-apache-pulsar-event-streaming-comparison-features-myths-explored/ Tue, 09 Jun 2020 15:32:29 +0000 https://www.kai-waehner.de/?p=2338 Pulsar vs Kafka – which one is better? This blog post explores pros and cons, popular myths, and…

The post Pulsar vs Kafka – Comparison and Myths Explored appeared first on Kai Waehner.

Pulsar vs Kafka – which one is better? This blog post explores pros and cons, popular myths, and non-technical criteria to find the best tool for your business problem.

My discussions are usually around Apache Kafka and its ecosystem as I work for Confluent. The only questions I got about Pulsar in recent years came from Pulsar committers and contributors. They asked me deep technical questions so that they could explain where Kafka sucks and why Pulsar is the much better option. Discussions about this topic on platforms like Reddit are typically very opinionated, often inaccurate, and brutal. The following is my point of view based on years of experience with open source streaming platforms.

Tech comparisons are the new black: Kafka vs. Middleware, Event Streaming and API Platforms

Tech comparisons are meant to guide people to choose the right solution and architecture for their business problem. There is no all-rounder, and there should be no bias. Choose the right tool for the problem.

However, technical comparisons are almost always biased. Even if the author does not work for a vendor and is an “independent” consultant, he or she is still likely to have a biased opinion from past experiences and knowledge, whether purposely or unknowingly. Still, comparisons from different perspectives are useful, and we've seen Apache Pulsar discussed in a few places on the internet, so I wanted to share my personal view of how Kafka and Pulsar compare. I work for Confluent, the leading experts behind Apache Kafka and its ecosystem, so keep that in mind. The aim of this post, though, is not to push an opinion but to weigh facts against myths.

Technical comparisons of open source frameworks and commercial software products happen all the time. I did several comparisons in the past on my blog or other platforms like InfoQ, including a Comparison of integration frameworks, Choosing the right ESB for your integration needs, Kafka vs. ETL / ESB / MQ, Kafka vs. Mainframe and Apache Kafka and API Management / API Gateway. All these comparisons were done because customers wanted to understand when to use which tool.

For Pulsar vs. Kafka, the situation is a little bit different.

Why compare Pulsar and Kafka?

Talking to prospects or customers, I rarely get asked about Pulsar. To be fair, this increased slightly in the last months. I guess the question comes up in every ~15th or ~20th meeting due to the overlapping feature set and use cases. However, this seems to be mostly due to a few posts on the internet that claim Pulsar is in some ways better than Kafka. There is no fact-checking and very little material, if any, for the opposing view.

I have not talked to a single organization that seriously considered deploying Pulsar in production, although I know there are a large number of users out there in the world who need a distributed messaging technology like Kafka or Pulsar. But I also think that Pulsar's alleged reference users do not hold up to closer scrutiny.

For example, their flagship user is Tencent, a large Chinese tech company, but Tencent is a huge Kafka user, whereas Pulsar's use is limited to just one project. Tencent processes ten trillion messages per day (in digits: 10,000,000,000,000) with Kafka. As it turns out, Tencent uses Kafka 1000x more than Pulsar (ten trillion msg/day vs. tens of billions of msg/day). The Tencent team discussed their Kafka deployment in more detail: How Tencent PCG Uses Apache Kafka to Handle 10 Trillion+ Messages Per Day.

Comparison of two competitive open source frameworks

Apache Kafka and Apache Pulsar are two exciting and competing technologies. Therefore, it makes a lot of sense to compare them. Period.

Both Apache Kafka and Apache Pulsar have very similar feature sets. I recommend that you evaluate both frameworks for available features, maturity, market adoption, open source tools and projects, training material, availability of local meetups, videos, blog posts, etc. Reference use cases from your industry or for your business problems help you make the right decision.

Confluent published such a comparison of “Kafka vs. Pulsar vs. RabbitMQ: Performance, Architecture, and Features Compared“. I was involved in creating this comparison. So we have that comparison already…

What is this blog post here about then?

I want to explore the myths from some ‘Kafka vs. Pulsar’ arguments which I see regularly in blog posts and forum discussions. Afterwards, I will give a more comprehensive comparison beyond just technical aspects because most Pulsar discussions focus purely on tech features.

Apache Kafka vs Apache Pulsar Comparison and Myths Explored

Kafka vs Pulsar – Technology myths explored

The following discusses some myths I have come across. I agree with some of them, but also counter some others with hard facts. Of course, different opinions can exist for some of these statements. Again, this is totally fine. The following is my point of view.

Myth 1: “Pulsar has differentiating built-in features compared to Kafka”?

True.

If you compare Apache Kafka to Apache Pulsar, features like Pulsar's tiered architecture, queuing, and multi-tenancy are mentioned as differentiators.

But:

Kafka has many differentiating features, too:

  • Half as many servers to run
  • Data saved to disk only once
  • Data cached in memory only once
  • Battle-tested replication protocol
  • Zero copy performance
  • Transactions
  • Built-in stream processing
  • Long term storage
  • In the works: ZooKeeper removal (KIP-500), which makes Kafka even simpler to operate and deploy than Pulsar (which has a four-component architecture of Pulsar, ZooKeeper, BookKeeper, and RocksDB), in addition to making Kafka more scalable, more resilient, and so on.
  • In the works: Tiered Storage (KIP-405), which makes Kafka more elastic and cost-efficient.

Also ask yourself: Should you really compare just the open source frameworks or products and vendors with their complete offering?

It is easy to add new features if you don’t have to provide mission-critical support for it. Don’t just evaluate features in a checklist, but also evaluate how they are battle-tested in production scenarios. How many “differentiating features” are low-quality and implemented quickly vs. high-quality implementations?

For instance: It took a few years to implement and battle-test Kafka Streams as the Kafka-native stream processing engine. Do you really want to compare this to Pulsar Functions? The latter is a feature to add user-defined functions (UDFs), without any relation to “real stream processing”. Or is it more like Single Message Transforms (SMT), a core feature of Kafka Connect? Just be sure to a) compare apples to apples (instead of apples to oranges) and b) think about the maturity of a feature. The more powerful and critical, the more mature it should be…

The Kafka community spends a large amount of effort on improving the core project and its ecosystem. Confluent alone has over 200 full-time engineers working on the Kafka project, additional community components, commercial products, and the SaaS offering on major cloud providers.

Myth 2: “Pulsar has a few very big users like Tencent in China”?

True.

But: Tencent actually uses Kafka more than Pulsar. The billing department, which uses Pulsar, is only a small part of Tencent, whereas a large portion of the core business is using Kafka, with a Global-Kafka-like architecture that combines 1000+ brokers into a single logical cluster.

Always be cautious with open source projects. Check out the success at “normal companies”. Just because a tech giant uses it does not mean it will work well for your company. How many Fortune 2000 companies shared their success stories around Pulsar in the past?

Look for proof points beyond tech giants!

Proof points beyond the tech giants are helpful to get insights and lessons learned from other people, not from the software vendors. The Kafka website gives many examples of mission-critical deployments. Even more impressive: At past Kafka Summit conferences in San Francisco, New York, and London, various enterprises from different industries presented their use cases and success stories every year, including Fortune 2000 companies, mid-size enterprises, and startups.

Just to give you one specific example in the Kafka world: Various different implementations exist for replication of data in real time between separate Kafka clusters, including MirrorMaker 1 (part of the Apache Kafka project), MirrorMaker 2 (part of the Apache Kafka project), Confluent Replicator (built by Confluent and only available as part of Confluent Platform or Confluent Cloud), uReplicator (open sourced by Uber), Mirus (open sourced by Salesforce), Brooklin (open sourced by LinkedIn).

In practice, only two options are reasonable if you don’t want to maintain and improve the code by yourself: MirrorMaker 2 (very new, not mature yet, but a great option mid and long term) and Confluent Replicator (battle-tested in many mission-critical deployments, but not open source). All the other options work, too. But who maintains the projects? Who solves bugs and security issues? Who do you call when you have a problem in production? Deployment in production for mission-critical deployments is different from evaluating and trying out an open source project.

Myth 3: “Pulsar provides message queuing and event streaming in a single solution”?

Partly.

Message queues are used for point-to-point communication. They provide an asynchronous communication protocol, meaning that the sender and receiver of a message do not need to interact with the message queue at the same time.

Pulsar has only limited support for message queuing, and limited support for event streaming. If it wants to compete in either area, it still has a long way to go for two reasons:

1) Pulsar has only limited support for message queuing because it misses popular messaging features like message XA transactions, routing, message filtering, etc. that are commonly used with messaging systems like IBM MQ, RabbitMQ, and ActiveMQ. Pulsar’s “adapters” for messaging systems are similarly limited. While they may look nice on paper, they are less useful in practice.

2) Pulsar has only limited support for event streaming. For example, it does not support exactly-once delivery and processing semantics, which disqualifies it for most use cases in practice – you would never implement, say, a payment processing system with Pulsar as it may cause duplicate payments, or lose payments. It also lacks functionality to perform stream processing with features like joins, aggregations, windowing, fault-tolerant state management, and event-time based processing. Pulsar's “topics” functionality is also different from Kafka's, and suffers from BookKeeper's origins: it was conceived and designed in 2008 as a write-ahead log for Hadoop's HDFS NameNode, with only short-lived data storage in mind.

Side note: Pulsar’s “Kafka adapter”, like its messaging siblings, is similarly limited. While it may look nice on paper, it is less useful in practice because it supports only a small subset of Kafka functionality.

Like Pulsar, Kafka has only limited support for message queuing.

In Kafka, different workarounds can be used to realize “real queuing” behavior (see the consumer sketch after this list). If you want to use separate message queues instead of shared Kafka topics for:

  • Security? => Use Kafka’s ACLs (and optional tools like Confluent’s role-based access control aka RBAC).
  • Semantics (i.e. separate applications)? => Use Kafka’s consumer groups.
  • Load balancing? => Use Kafka’s partitions.
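
To make this concrete, here is a minimal sketch of queue-style consumption with the plain Kafka Java client. Every instance started with the same group.id shares the topic's partitions, so each record is processed by exactly one consumer in the group, which is the point-to-point behavior a queue would give you. The broker address, topic name, and group id are hypothetical placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class QueueLikeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
        props.put("group.id", "order-processors");         // all instances with this group.id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");           // commit only after successful processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));           // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // each record is handled by exactly one consumer in the group (point-to-point semantics)
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync();                       // acknowledge the processed batch
            }
        }
    }
}
```

Run several instances of this class and Kafka balances the partitions across them; stop one instance, and its partitions are reassigned to the remaining ones.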

I typically ask customers what exactly they want to do with queuing. Often, Kafka provides out-of-the-box solutions for use cases which simply require thinking of the solution in new terms. Also, the number of high throughput use cases that need queuing is relatively small.

Having explained all these workarounds and limitations of Pulsar and Kafka for messaging, let’s be clear: Neither Kafka nor Pulsar provide a “real messaging solution”.

If you really need a messaging solution, shouldn't you rather choose a “real messaging framework” like RabbitMQ or NATS for your messaging problem anyway?

There is no ‘yes or no’ answer to this. I see many customers replacing existing messaging systems like IBM MQ with Kafka (for scalability and cost reasons). Know the options, their trade-offs, and do an evaluation to solve your problem the best way…

Myth 4: “Pulsar provides stream processing”?

False.

Or to be fair: It depends on your definition of stream processing. Is it only rudimentary features, or full-fledged stream processing?

In one sentence, I typically explain stream processing as continuous consumption, processing, and aggregation of events from different data sources. In real time. At scale. And, of course, in a fault-tolerant manner, including (and especially) for any stateful processing operations.

Event Streaming for Continuous Analysis of Data while it is Hot

Pulsar provides only rudimentary functionality for stream processing, using its Pulsar Functions interface. This is suited for simple callbacks, but it isn't a true stream processing offering like you get with Kafka Streams or ksqlDB for building streaming applications that include stateful information, sliding windows, and other stream processing concepts. Use cases exist in every industry. For instance, check out the Kafka Streams website for examples from the New York Times, Pinterest, Trivago, Zalando, and others.
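
To show what “full-fledged” means in practice, here is a minimal Kafka Streams sketch of stateful, windowed processing: counting page views per key in five-minute tumbling windows, with the state managed fault-tolerantly by Kafka itself. The application id, broker address, and topic name are hypothetical placeholders.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.TimeWindows;

public class PageViewCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-counts");   // hypothetical application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker address

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("pageviews", Consumed.with(Serdes.String(), Serdes.String())) // hypothetical input topic
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))                    // 5-minute tumbling windows
               .count()                                                              // fault-tolerant count per key and window
               .toStream()
               .foreach((windowedKey, count) ->
                       System.out.printf("%s @ %s -> %d views%n",
                               windowedKey.key(), windowedKey.window().startTime(), count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

A plain callback interface like Pulsar Functions gives you a per-message hook, but not the windowing, state stores, and repartitioning that a topology like this relies on.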

Streaming analytics examples with Pulsar typically use Pulsar in conjunction with another “proper” stream processing framework like Apache Spark or Apache Flink, which of course means you now need to operate even more pieces of distributed infrastructure and understand their complex interactions.

Myth 5: “Pulsar provides exactly-once semantics like Kafka”?

False.

Pulsar provides a deduplication feature that ensures that a message will not be stored in the Pulsar broker twice, but nothing prevents a consumer from reading this message multiple times. This is insufficient for any form of stream processing use case where both input and output are from Pulsar.

Also, unlike Kafka's Transactions feature, it is not possible to tie committed messages to the state recorded inside a stream processor.

Exactly-Once Semantics (EOS) have been available since Kafka 0.11 (released three years ago) and are used in many production deployments. Kafka's EOS supports the whole Kafka ecosystem, including Kafka Connect, Kafka Streams, ksqlDB, and clients in Java, C, C++, Go, or Python. Kafka Summit had several talks about Kafka's EOS functionality, including this great intro for everybody, with slides and video recording.
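
At the core of Kafka's EOS is the transactions API. The following is a minimal, hedged sketch of a transactional producer that writes to two topics atomically: either both records become visible to read_committed consumers or neither does. The broker address, transactional id, and topic names are hypothetical placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;

public class TransactionalPaymentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // hypothetical broker address
        props.put("transactional.id", "payments-producer-1");  // stable id enables exactly-once across restarts
        props.put("enable.idempotence", "true");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            // both writes commit or abort together - no duplicate and no lost payments
            producer.send(new ProducerRecord<>("payments", "order-42", "charged"));   // hypothetical topics
            producer.send(new ProducerRecord<>("audit-log", "order-42", "charged"));
            producer.commitTransaction();
        } catch (KafkaException e) {
            producer.abortTransaction();  // roll back both writes on failure
        } finally {
            producer.close();
        }
    }
}
```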

Myth 6: “Pulsar’s performance is much better than Kafka’s”?

False.

I am not a fan of most “benchmarks” of performance and throughput. Benchmarks are almost always opinionated and configured for a specific problem (no matter if a vendor, independent consultant, or researcher conducts them).

For example, there is one benchmark published by GIGAOM, which compares the latency and performance of Kafka versus Pulsar. But this benchmark deliberately slowed Kafka down by forcing it to synchronize to disk on every single message by setting the Kafka config ‘flush.messages = 1’ (this makes every request cause an fsync). The benchmark also forces the Kafka consumer to acknowledge synchronously while the Pulsar consumer acknowledges asynchronously. Unsurprisingly, this benchmark setup makes Pulsar the seemingly clear “winner”. But the benchmark does not mention or explain this significant configuration difference in the setup and measurements. This is what some people call an apples-to-oranges comparison.
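
For context, this is roughly what such a setup change looks like; a hedged sketch that uses the AdminClient to force an fsync per message on a topic. Kafka's default is to rely on replication and the OS page cache for durability instead, which is exactly the behavior the benchmark disabled. The broker address and topic name are hypothetical placeholders.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ForceFsyncPerMessage {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "benchmark-topic"); // hypothetical topic
            // flush.messages=1 forces an fsync for every single record,
            // defeating Kafka's default reliance on replication plus the page cache
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("flush.messages", "1"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```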

Pulsar’s architecture actually requires higher network utilization (due to the Pulsar broker tier which acts as a proxy in front of BookKeeper bookies) as well as twice the I/O (as BookKeeper writes data to a write ahead log as well as to the main segment).

Confluent did some benchmarks, too. More of an apples-to-apples comparison. Not surprisingly, the results were different. But should you really care about these benchmark fights between software vendors?

Think about your performance requirements. Do a proof of concept (POC) with Kafka and Pulsar, if you must. I bet that in 99% of scenarios, both will show acceptable performance for your use case. Don’t trust opinionated benchmarks from others! Your use case will have different requirements and characteristics anyway, and typically performance is just one of many evaluation dimensions.

Myth 7: “Pulsar is easier to operate than Kafka”?

False.

Both Kafka and Pulsar are hard to operate if you don’t use additional tooling.

Kafka includes two distributed systems: Kafka itself and Apache ZooKeeper.

But: Pulsar includes three distributed systems and an additional storage technology: Pulsar, ZooKeeper, and Apache BookKeeper. Like Pulsar, BookKeeper uses ZooKeeper, too. And lastly, RocksDB is used for certain storage tasks. This means that Pulsar is significantly more complex to understand, tweak, and tune than Kafka. Additionally, Pulsar also has more configuration parameters than Kafka.

Kafka is firmly going in the opposite direction and is removing ZooKeeper (see KIP-500) so that you have just one distributed system to deploy, operate, scale, and monitor:

Apache Kafka ZooKeeper Removal KIP 500

ZooKeeper is Kafka's biggest scalability bottleneck and comes with operational challenges. This is true for Kafka, but even more so for Pulsar!

One of the key issues of my customers is how to run ZooKeeper in mission-critical deployments at scale. Therefore I am really looking forward to Kafka’s simplified architecture, where you will deploy Kafka brokers only. This also establishes a unified security model, as ZooKeeper’s security no longer needs to be separately configured. This is a huge benefit, especially for larger organizations and regulated industries. Compliance and information security departments will thank you for this simplified architecture.

Operations is NOT just about Architecture!

Kafka is significantly better documented, has a tremendously larger community of experts, and a vast array of supporting tooling that make operations easier.

Additionally, there are many options for local and online Kafka training, including online courses, books, meetups, and conferences. You won’t find much for Pulsar, unfortunately.

Myth 8: “An architecture with three tiers is better than two tiers”?

It depends.

Personally, I am skeptical that Pulsar’s three tier architecture (using Pulsar brokers, ZooKeeper and BookKeeper) is an advantage for most projects. It is a trade-off!

Twitter described their move away from BookKeeper + DistributedLog (the latter a system very similar to Pulsar, with comparable architecture and design) just over a year ago, citing the advantages of Kafka’s single-tier architecture, such as cost efficiency and better performance, over a two-tier architecture that decouples storage and serving.

Like Pulsar, DistributedLog is built on top of BookKeeper and adds streaming-like functionality with an architecture and concepts similar to Pulsar (e.g., using decoupled storage and serving tiers). DistributedLog was originally a standalone project but eventually became a sub-project of BookKeeper, though nowadays it appears to be no longer actively developed (only a few commits in the past 12 months). The main reasons Twitter cited for switching to Kafka were (1) significant cost savings and performance gains and (2) Kafka’s huge community and adoption. For example, they concluded: “For single consumer use cases, we saw a 68% resource savings, and for fanout cases with multiple consumers, we saw a 75% resource savings.”

There are benefits from a three-tier architecture for building scalable infrastructure. But the extra layer also increases network utilization by (at least) 33%, and data held in Pulsar's brokers must additionally be cached in both layers for equivalent performance, and is also written to disk twice because the storage format of BookKeeper is not based on a log.

On the cloud, where most Kafka deployments are being run, the best backing storage tier is in fact not a niche technology like BookKeeper, but a widely used and battle-tested object store like AWS S3 or GCP GCS.

Tiered Storage in Confluent Platform, which is backed by the likes of AWS S3 and GCP GCS, provides the same benefits without Pulsar’s extra layer of BookKeeper and the resulting extra network transfer cost and latency that this architecture incurs. It took Confluent two years to build and make Tiered Storage for Kafka generally available, including global 24/7 support for your most mission-critical data. Tiered Storage is not available yet for open source Apache Kafka, but Confluent is working with the rest of the Kafka community (including some major tech companies like Uber) on KIP-405 to add Tiered Storage to Kafka with different storage options.

There are always pros and cons for both architectures. Personally, I think that 95% of projects do not need a complex three-tier architecture. Where a three-tier architecture does make sense, it is to add the advantages of external, cost-efficient storage. You should care about 24/7 service level agreements (SLAs), scalability, and throughput, plus integration into your ecosystem as well as security, management tooling, and support. If your requirements demand a three-tier architecture, then of course give it a go!

Sub-Myth: “Pulsar is better for lagging consumers because of its caching layer and storage layer”?

False.

The main problem with lagging consumers is that they exhaust the page cache, i.e., recent messages are already cached. Reads from older segments replace these, reducing the performance of consumers reading from the head of the log.

Pulsar's architecture is actually worse in this regard. It retains the same issue around cache flushing, but now the reads must do an extra network hop and extra I/O rather than just reading from the local media.

Myth 9: “Kafka does not scale as well as Pulsar”?

False.

This is one of the key arguments by the Pulsar community. As I said before, this always depends on the chosen benchmark. For example, I have seen tests with equivalent computing resources where Kafka did significantly better at high throughputs than Pulsar. Here is a “Pulsar vs. Kafka benchmark” where Kafka is much faster than Pulsar:

Pulsar vs Kafka Benchmark

Scalability is not a problem for most use cases. You can easily scale up Kafka to process several gigabytes per second, as you can see in a demo to “Scale Apache Kafka to 10+ GB Per Second in Confluent Cloud“:

Scalability 10 GB per second with Confluent Cloud

Honestly speaking, less than 1% of users should be worried about this discussion at all. If you have requirements like Netflix (processing petabytes per day) or LinkedIn (processing trillions of messages), let's discuss the best architecture, hardware, and configuration for such a deployment. For anybody else, don't be worried.

Sub-Myth: “Kafka’s current approach means it can only store ~ 500K partitions per cluster”?

True.

Kafka today does not yet have the best architecture for large-scale deployments with hundreds of thousands of Kafka topics and partitions.

But: Pulsar, too, does not allow for unlimited scale. It just has different limits.

Kafka's partition limit is imposed by ZooKeeper. Removing ZooKeeper from Kafka through the work in KIP-500 removes this upper bound.

As a side note:

The right design of your architecture is critical for success!

Most of the customers I have seen in trouble with Kafka partition counts and scalability got there because they designed their architecture and applications the wrong way (they'd run into the same issues if they were using Pulsar)!

Kafka is an event streaming platform, and not the next IBM MQ. If you try to recreate your favorite MQ solution and architecture with Kafka, you will likely fail. I have seen several customers failing here and then succeeding by re-architecting their setup with our help.

Chances are very high that you will not have any issues with partition numbers and scalability, even today with Kafka's usage of ZooKeeper, if you design your use case right and understand Kafka's basic concepts. This customer experience is a common theme for any technology, like Kafka, that introduces a new technology level and paradigm well beyond what was done before (a prime example is the adoption hurdles faced by companies when they first began to move their use cases to the cloud).

Sub-Myth: “Pulsar supports a practically infinite number of partitions”?

False.

BookKeeper has the same 1-file-per-ledger limitation Kafka has, but there are multiple ledgers in one partition. Pulsar's broker layer groups partitions into bundles, but its storage layer, BookKeeper, stores data in segments, with many segments for each partition.

Like for Kafka, the metadata for these segments is stored in ZooKeeper, which imposes a limit on the total number that can be stored. Kafka is removing this dependency, thus allowing it to scale significantly further. I am really looking forward to seeing KIP-500 implemented by around the end of 2020. “Apache Kafka Needs No Keeper: Removing the Apache ZooKeeper Dependency” walks you through the implementation details and planned timelines.

Sub-Myth: “Kafka scaling needs to be defined when creating a Kafka Topic”?

Partly true.

If more scalability is needed, Kafka topics can either be over-partitioned (i.e., you configure a topic with more partitions than you initially need for a use case; see Streams and Tables in Apache Kafka: Topics, Partitions, and Storage Fundamentals), or they can be re-configured to use more partitions if there are requirements to scale in the future (see the sketch below). This is not perfect, but a consequence of how distributed event streaming works (and why it scales much better than traditional messaging systems like IBM MQ).
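
As a hedged illustration of both options, here is a minimal AdminClient sketch that creates an over-partitioned topic up front and later grows its partition count. The broker address, topic name, and partition counts are hypothetical placeholders.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicScaling {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // over-partition up front: 30 partitions even if far fewer consumers are needed today
            admin.createTopics(List.of(new NewTopic("clickstream", 30, (short) 3))).all().get(); // hypothetical topic

            // later, if required: grow to 60 partitions
            // (key-based ordering changes for records produced after the increase)
            admin.createPartitions(Map.of("clickstream", NewPartitions.increaseTo(60))).all().get();
        }
    }
}
```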

Best practices for creating topics and procedures for changing topic configurations during production are available. So no worries!

But: Pulsar topics have this restriction, too!

Write throughput is based on the number of partitions allocated in a Pulsar topic in the exact same way it is in a Kafka topic, so Pulsar topics must be over-provisioned for exactly the same reasons. That’s because, for each partition, only a single ledger (of the partition’s potentially many ledgers) is writable at the same time. Also, increasing the number of partitions dynamically impacts message ordering just like it does in Kafka (i.e. the message order is lost).

Both Kafka and Pulsar scale like crazy. This is sufficient for almost all use cases!

If you need even more extreme scale, I think a ZooKeeper-free implementation is the best choice. KIP-500 is thus the most anticipated Kafka change I see in the community and in Confluent’s customer base.

Myth 10: “Pulsar recovers from machine failure instantly but Kafka has to reload data”?

True and false.

Killing a Pulsar broker is indeed seamless, but (in contrast to a Kafka broker) the Pulsar broker doesn't store any data; it is only a proxy fronting the actual storage layer, BookKeeper. So highlighting that a Pulsar broker failure can easily be resolved is a marketing distraction, because actually one must talk about what happens when a BookKeeper node (a “bookie”) fails.

Killing and restarting a BookKeeper bookie requires the same redistribution of data seen in Kafka’s case. This is the nature of distributed systems, with concepts like replication and partitions.

Elastic Kafka is here already!

Elasticity is important. Confluent’s founder Jay Kreps has recently blogged about this topic: Elastic Apache Kafka Clusters in Confluent Cloud. In a SaaS cloud service like Confluent Cloud, the end user shouldn’t have to care at all about machine failure. 24/7 uptime is expected and should be guaranteed with 99.xx SLAs. Consumption-based pricing (i.e., pay as you go) means you do not have to worry about issues like broker management, sizing broker nodes, expanding or shrinking clusters, etc. under the hood at all.

Self-managed Kafka clusters also need similar capabilities. Tiered Storage for Kafka is huge because most of the data is not stored on the broker anymore, which allows almost instant recovery from failures. In conjunction with tools like Self-Balancing Kafka (a Confluent feature coming in Q3 and discussed in the blog post linked above), users don't have to worry about elasticity in their self-managed clusters at all.

Unfortunately, if you are looking for such a modern offering for Pulsar, there is none available.

Myth 11: “Pulsar has better Inter-Cluster (Geo) Replication than Kafka”?

False.

Every distributed system has to solve problems like the CAP theorem and quorum in distributed computing. The quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system. A quorum-based technique is implemented to enforce consistent operation in a distributed system.

Kafka requires ZooKeeper to solve the quorum problem. Even after KIP-500 and ZooKeeper removal, the universal laws of real-world physics are still the same: There are latency issues deploying a distributed system over regions like the US East, Central and West or even globally. That’s because the speed of light, though very high, does have a limit.

Various deployment options exist to work around this problem, including real time replication tools like Apache Kafka’s MirrorMaker 2, Confluent’s Replicator or Confluent’s Multi-Region-Clusters. Check out “Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” for various different deployment options and best practices:

Global Event Streaming and Replication between Cloud and Data Center

There is no single pattern or implementation to provide global replication AND zero downtime + zero data loss! For the most critical applications, Confluent’s Multi-Region-Clusters allows RTO=0 and RPO=0 (i.e. zero downtime and zero data loss) with automatic disaster recovery and client fail-over even if a complete data center or cloud region goes down.

Here, Pulsar's architecture requires even more complexity than a “basic” Pulsar deployment. That's because, for geo-replication, Pulsar requires an additional “global” ZooKeeper cluster, which makes Pulsar inappropriate for geo-distribution over large distances. There is a workaround, but the problems around the CAP theorem and physics do not go away.

No matter if you use Kafka or Pulsar, you need a battle-tested design to fight the laws of physics in your global deployments!

Myth 12: “Pulsar is compatible with Kafka’s interface and API”?

Partially True.

Pulsar provides a very basic implementation that is compatible with only minor parts of the Kafka v2.0 protocol.

Pulsar has a converter for basic parts of the Kafka protocol.

So, while alleged “Kafka compatibility” sounds nice on paper, one shouldn't seriously consider it for migrating a running Kafka infrastructure to Pulsar. I doubt anyone will take the risk…

We have seen “Kafka compatibility” claims in other examples such as the much more mature Azure Event Hubs service. Check out the limiting factors of their Kafka API, and be surprised! No support for core Kafka features like transactions (and thus exactly-once semantics), compression, or log compaction.

As it is not Kafka under the hood, also expect further diverging and unexpected behaviors when you connect your existing Kafka applications to such a “compatible” setup. No matter if Azure Event Hubs, Pulsar, or any other wrapper.

Kafka vs. Pulsar – Comprehensive Comparison

The last sections explored various technology myths we find in many other blog posts. I think I brought some clarity into these discussions.

Now, let’s not forget to take a look beyond the technical details of Kafka and Pulsar. Non-functional aspects are as important when choosing a technology.

I will cover three critical aspects in the following: Market traction, enterprise support and cloud offerings.

Market Traction of Apache Kafka and Apache Pulsar

Taking a look at Google Trends over the last five years confirms my personal experience: the interest in Apache Pulsar is very limited compared to Apache Kafka:

Apache Kafka versus Apache Pulsar Comparison and Trends

The picture looks very similar when you take a look at Stack Overflow and similar platforms, number and size of supporting vendors, the open ecosystem (tool integrations, wrapper frameworks like Spring Kafka), and similar characteristics for technology trends.

Job openings are another very good indicator of technology adoption. Few job openings for Pulsar means few companies are using it. Search in your favorite job search engine. If you search globally, you will find fewer than 100 job openings for Pulsar, but thousands of jobs for Kafka. Additionally, most of the postings mentioning Pulsar say something like “looking for experience with Kafka, Pulsar, Kinesis or similar technologies”.

In most cases, these characteristics are much more relevant for the success of your next project than the subtle technical differences. The key goal is to solve your business problem, isn’t it?

So with the lack of adoption, why is Pulsar coming up in conversations at all? One reason is that independent consulting companies, research analysts, and bloggers (including me) need to talk about new cutting-edge technologies to keep their audience interested… And to be honest, it makes a good story.

Enterprise Support for Kafka and Pulsar

There is enterprise support for Kafka and Pulsar!

However, the situation is not what you might expect. Here are the vendors you can call and ask for a meeting to discuss the potential next steps for working together on your Pulsar journey:

  • Streamlio (now acquired by Splunk), the former company behind Apache Pulsar. Splunk has not yet announced a future Pulsar strategy to support people working on their own Pulsar-based projects. Splunk is well-known for its widely adopted analytics platform. That's their core business (~ $1.8B in 2019). The only thing people complain about with Splunk is the pricing. Splunk is a heavy Kafka user under the hood and now incorporates Pulsar into their Splunk Data Stream Processor (DSP). It is very doubtful that Splunk will jump on the open source bandwagon to support your next standalone Pulsar project (but a broader-scope DSP might be coming, of course). The future will tell…
  • StreamNative, founded by one of the original developers of Apache Pulsar, provides an event streaming platform based on Pulsar. At the time of writing this in June 2020, StreamNative has 13 (!) employees on LinkedIn. I am not sure if this is the right scale to support your next mission-critical deployment in 2020, but they do offer it.
  • TIBCO announced support for Pulsar in December 2019. Their core strategy moved from integration to analytics in recent years. TIBCO's middleware customers are migrating away in high numbers. Their middleware team had to make some desperate strategy decisions: supporting other platforms despite having zero contributions to and experience with those projects. You are right, this might be a myth. But hey, a fact is that TIBCO also does the same for Kafka. And here is a nice piece of trivia: TIBCO provides Kafka and ZooKeeper to you on Windows! Something nobody else does – because others know that this is not stable and creates inconsistencies all the time. But hey, TIBCO can support you now with Kafka and Pulsar. Why evaluate these two frameworks if one single vendor allows you to use both? Even on Windows, with a .exe download and .bat scripts for starting the server components:

tibco kafka support on windows

The number of vendors supporting Kafka grows every quarter!

Kafka has gained incredibly broad market adoption in the meantime. The best proof of this is when the biggest software vendors provide support and tools around it. IBM, Oracle, Amazon, Microsoft, and many other software companies support Kafka and build integration capabilities and their own products around it.

The latest “wake-up call” for me was at Oracle OpenWorld 2019 in San Francisco, where I attended a roadmap session from the Oracle product manager for GoldenGate (Oracle's well-known, great, but also very expensive CDC tool). Most of the talk focused on opening GoldenGate up to make it the data integration platform for everything. Half the talk was about event streaming, Kafka, and how GoldenGate will provide integration with different databases / data lakes and Kafka in both directions.

Fully-Managed Cloud Offerings for Kafka and Pulsar

Let’s take a look at the cloud offerings available for Kafka and Pulsar.

There is a cloud service available for Apache Pulsar. It has a very innovative name:

Kafkaesque.

No kidding. Check the link… [Update: On ~June 17th, they rebranded the service: KAFKAESQUE is now KESQUE – probably they realized how embarrassing the name was.]

Maybe you also check out the various cloud offerings for Apache Kafka to find out which offering fits you better:

  • Confluent Cloud (SaaS) is a fully-managed service providing consumption-based pricing, 24/7 SLAs and elastic, serverless characteristics for Apache Kafka and its ecosystem (e.g. Schema Registry, Kafka Connect connectors and ksqlDB for stream processing).
  • Amazon MSK (PaaS) provisions ZooKeeper and Kafka brokers so that the end user can operate them, fix bugs, do rolling upgrades, etc. One important fact everybody should be aware of: AWS excludes Kafka issues from its 99.95% SLA and support!
  • Azure Event Hubs (SaaS) provides a Kafka endpoint (with a proprietary implementation under the hood) to interact with Kafka applications. It is very scalable and performant. As it is not really Kafka, but just an emulation, it misses several core features of Kafka like exactly-once semantics, log compaction, and compression, not to mention the surrounding capabilities like Kafka Connect and Kafka Streams.
  • Big Blue (IBM) and Big Red (Oracle) have cloud offerings around Kafka and its APIs. I have no idea if anyone is using them or how good they are. I have never seen them in the wild myself.
  • Plenty of smaller players like Aiven, CloudKarafka, Instaclustr, and others.

As you can see, the current cloud offerings show relatively clearly what the market adoption of Kafka and Pulsar looks like.

Conclusion – Apache Kafka or Apache Pulsar?

TL;DR: Pulsar is still a long way from Kafka’s level of maturity in terms of being proven for high scale use cases and building a community.

You should also question whether Pulsar is actually better.

Evaluate Kafka and Pulsar if you are going the purely open source way. In your evaluation, include the technical feature set, maturity, vendors, developer community, and other relevant factors. Which one fits your situation best?

If you need an enterprise solution that covers much more than what both of these two open source systems offer, Kafka is the only option: Choose a Kafka-based offering from one of the various vendors or a suitable cloud offering. Pulsar, unfortunately, is not ready for this today and the foreseeable future.

How do you think about Apache Kafka vs. Apache Pulsar? What is your strategy? Let’s connect on LinkedIn and discuss! Stay informed about new blog posts by subscribing to my newsletter.

The post Pulsar vs Kafka – Comparison and Myths Explored appeared first on Kai Waehner.
