Comparison Archives - Kai Waehner

Databricks and Confluent Leading Data and AI Architectures – What About Snowflake, BigQuery, and Friends?

Kai Waehner — Thu, 15 May 2025 09:57:25 +0000

The modern data landscape is shaped by platforms that excel in different but increasingly overlapping domains. Confluent leads in data streaming with enterprise-grade infrastructure for real-time data movement and processing. Databricks and Snowflake dominate the lakehouse and analytics space—each with unique strengths. Databricks is known for scalable AI and machine learning pipelines, while Snowflake stands out for its simplicity, governed data sharing, and performance in cloud-native analytics.

This final blog in the series brings together everything covered so far and highlights how these technologies power real customer innovation. At Erste Bank, Confluent and Databricks are combined to build an event-driven architecture for Generative AI use cases in customer service. At Siemens, Confluent and Snowflake support a shift-left architecture to drive real-time manufacturing insights and medical AI—using streaming data not just for analytics, but also to trigger operational workflows across systems.

Together, these examples show why so many enterprises adopt a multi-platform strategy—with Confluent as the event-driven backbone, and Databricks or Snowflake (or both) as the downstream platforms for analytics, governance, and AI.

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Blog 1: The Past, Present and Future of Confluent (The Kafka Company) and Databricks (The Spark Company)
Blog 2: Confluent Data Streaming Platform vs. Databricks Data Intelligence Platform for Data Integration and Processing
Blog 3: Shift-Left Architecture for AI and Data Warehousing with Confluent and Databricks
Blog 4: Databricks and Confluent in Enterprise Software Environments (with SAP as Example)
Blog 5 (THIS ARTICLE): Databricks and Confluent Leading Data and AI Architectures – and How They Compare to Competitors

Learn how these platforms will affect data use in businesses in future articles. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. You can also download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks and Snowflake.

The Broader Data Streaming and Lakehouse Landscape

The data streaming and lakehouse space continues to expand, with a variety of platforms offering different capabilities for real-time processing, analytics, and storage.

Data Streaming Market

On the data streaming side, Confluent is the leader. Other cloud-native services like Amazon MSK, Azure Event Hubs, and Google Cloud Managed Kafka provide Kafka-compatible offerings, though they vary in protocol support, ecosystem maturity, and operational simplicity. StreamNative, based on Apache Pulsar, competes with the Kafka offerings, while Decodable and DeltaStream take a complementary approach and leverage Apache Flink for real-time stream processing. Startups such as AutoMQ and Bufstream pitch a reimagined Kafka infrastructure with improved scalability and cost-efficiency for cloud-native architectures.

The data streaming landscape is growing year by year. The Data Streaming Landscape 2025 provides the latest overview of the data streaming market.

Lakehouse Market

In the lakehouse and analytics platform category, Databricks leads with its cloud-native model combining compute and storage, enabling modern lakehouse architectures. Snowflake is another leading cloud data platform, praised for its ease of use, strong ecosystem, and ability to unify diverse analytical workloads. Microsoft Fabric aims to unify data engineering, real-time analytics, and AI on Azure under one platform. Google BigQuery offers a serverless, scalable solution for large-scale analytics, while platforms like Amazon Redshift, ClickHouse, and Athena serve both traditional and high-performance OLAP use cases.

The Forrester Wave for Lakehouses analyzes the vendor options and names Databricks, Snowflake, and Google as the leaders. Unfortunately, the report may not be reposted here, so you need to download it from a vendor.

Confluent + Databricks

This blog series highlights Databricks and Confluent because they represent a powerful combination at the intersection of data streaming and the lakehouse paradigm. Together, they enable real-time, AI-driven architectures that unify operational and analytical workloads across modern enterprise environments.

Each platform in the data streaming and lakehouse space has distinct advantages, but none offers the same combination of real-time capabilities, open architecture, and end-to-end integration as Confluent and Databricks.

It’s also worth noting that open source remains a big – if not the biggest – competitor to all of these vendors. Many enterprises still rely on open-source data lakes built on Elastic, legacy Hadoop, or open table formats such as Apache Hudi—favoring flexibility and cost control over fully managed services.

Confluent: The Leading Data Streaming Platform (DSP)

Confluent is the enterprise-standard platform for data streaming, built on Apache Kafka and extended for cloud-native, real-time operations at global scale. The data streaming platform (DSP) delivers a complete and unified platform with multiple deployment options to meet diverse needs and budgets:

Confluent Cloud – Fully managed, serverless Kafka and Flink service across AWS, Azure, and Google Cloud
Confluent Platform – Self-managed software for on-premises, private cloud, or hybrid environments
WarpStream – Kafka-compatible, cloud-native infrastructure optimized for BYOC (Bring Your Own Cloud) using low-cost object storage like S3

Together, these options offer cost efficiency and flexibility across a wide range of streaming workloads:

Small-volume, mission-critical use cases such as payments or fraud detection, where zero data loss, strict SLAs, and low latency are non-negotiable
High-volume, analytics-driven use cases like clickstream processing for real-time personalization and recommendation engines, where throughput and scalability are key

Confluent supports these use cases with:

Cluster Linking for real-time, multi-region and hybrid cloud data movement
100+ prebuilt connectors for seamless integration with enterprise systems and cloud services
Apache Flink for rich stream processing at scale
Governance and observability with Schema Registry, Stream Catalog, role-based access control, and SLAs
Tableflow for native integration with Delta Lake, Apache Iceberg, and modern lakehouse architectures

While other providers offer fragments—such as Amazon MSK for basic Kafka infrastructure or Azure Event Hubs for ingestion—only Confluent delivers a unified, cloud-native data streaming platform with consistent operations, tooling, and security across environments.

Confluent is trusted by over 6,000 enterprises and backed by deep experience in large-scale streaming deployments, hybrid architectures, and Kafka migrations. It combines industry-leading technology with enterprise-grade support, expertise, and consulting services to help organizations turn real-time data into real business outcomes—securely, efficiently, and at any scale.

Databricks: The Leading Lakehouse for AI and Analytics

Databricks is the leading platform for unified analytics, data engineering, and AI—purpose-built to help enterprises turn massive volumes of data into intelligent, real-time decision-making. Positioned as the Data Intelligence Platform, Databricks combines a powerful lakehouse foundation with full-spectrum AI capabilities, making it the platform of choice for modern data teams.

Its core strengths include:

Delta Lake + Unity Catalog – A robust foundation for structured, governed, and versioned data at scale
Apache Spark – Distributed compute engine for ETL, data preparation, and batch/stream processing
MosaicML – End-to-end tooling for efficient model training, fine-tuning, and deployment of custom AI models
AI/ML tools for data scientists, ML engineers, and analysts—integrated across the platform
Native connectors to BI tools (like Power BI, Tableau) and MLOps platforms for model lifecycle management

Databricks directly competes with Snowflake, especially in the enterprise AI and analytics space. While Snowflake shines with simplicity and governed warehousing, Databricks differentiates by offering a more flexible and performant platform for large-scale model training and advanced AI pipelines.

The platform supports:

Batch and near-real-time (micro-batch) streaming analytics
ML model training and inference on shared data
GenAI use cases, including RAG (Retrieval-Augmented Generation) with unstructured and structured sources
Data sharing and collaboration across teams and organizations with open formats and native interoperability

Databricks is trusted by thousands of organizations for AI workloads, offering not only powerful infrastructure but also integrated governance, observability, and scalability—whether deployed on a single cloud or across multi-cloud environments.

Combined with Confluent’s real-time data streaming capabilities, Databricks completes the AI-driven enterprise architecture by enabling organizations to analyze, model, and act on high-quality, real-time data at scale.

Stronger Together: A Strategic Alliance for Data and AI with Tableflow and Delta Lake

Confluent and Databricks are not trying to replace each other. Their partnership is strategic and product-driven.

Recent innovation: Tableflow + Delta Lake – this feature enables bi-directional data exchange between Kafka and Delta Lake.

Direction 1: Confluent streams → Tableflow → Delta Lake (via Unity Catalog)
Direction 2: Databricks insights → Tableflow → Kafka → Flink or other operational systems

This simplifies architecture, reduces cost and latency, and removes the need for Spark jobs to manage streaming data.

Source: Confluent

Confluent becomes the operational data backbone for AI and analytics. Databricks becomes the analytics and AI engine fed with data from Confluent.

Where needed, operational or analytical real-time AI predictions can be done within Confluent’s data streaming platform: with embedded or remote model inference, native integration for search with vector databases, and built-in models for common predictive use cases such as forecasting.

Erste Bank: Building a Foundation for GenAI with Confluent and Databricks

Erste Group Bank AG, one of the largest financial services providers in Central and Eastern Europe, is leveraging Confluent and Databricks to transform its customer service operations with Generative AI. Recognizing that successful GenAI initiatives require more than just advanced models, Erste Bank first focused on establishing a solid foundation of real-time, consistent, and high-quality data leveraging data streaming and an event-driven architecture.

Using Confluent, Erste Bank connects real-time streams, batch workloads, and request-response APIs across its legacy and cloud-native systems in a decoupled way while ensuring data consistency through Kafka. This architecture ensures that operational and analytical data (whether from core banking platforms, digital channels, mobile apps, or CRM systems) flows reliably and consistently across the enterprise. By integrating event streams, historical data, and API calls into a unified data pipeline, Confluent enables Erste Bank to create a live, trusted digital representation of customer interactions.

With this real-time foundation in place, Erste Bank leverages Databricks as its AI and analytics platform to build and scale GenAI applications. At the Data in Motion Tour 2024 in Frankfurt, Erste Bank presented a pilot project where customer service chatbots consume contextual data flowing through Confluent into Databricks, enabling intelligent, personalized responses. Once a customer request is processed, the chatbot triggers a transaction back through Kafka into the Salesforce CRM, ensuring seamless, automated follow-up actions.

Source: Erste Group Bank AG
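
The request and follow-up flow described above can be sketched in a few lines of Python. This is only an illustration of the event-driven pattern, not Erste Bank's implementation: the in-memory broker stands in for Kafka, and the topic names, the event fields, and the answer_request function (a placeholder for the GenAI call) are all assumptions made for this example.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a Kafka cluster: one FIFO queue per topic."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def produce(self, topic, event):
        self.topics[topic].append(event)

    def poll(self, topic):
        queue = self.topics[topic]
        return queue.popleft() if queue else None


def answer_request(request, context):
    """Placeholder for the GenAI call; a real system would invoke a hosted model."""
    name = context.get(request["customer_id"], {}).get("name", "customer")
    return f"Hello {name}, we received your request: {request['text']}"


def run_chatbot_once(broker, context):
    """Consume one request, answer it, and emit a follow-up event for the CRM."""
    request = broker.poll("customer-requests")  # hypothetical topic name
    if request is None:
        return None
    reply = answer_request(request, context)
    broker.produce("crm-follow-ups", {  # hypothetical topic name
        "customer_id": request["customer_id"],
        "action": "create_case",
        "summary": request["text"],
    })
    return reply


broker = InMemoryBroker()
customer_context = {"c-1": {"name": "Anna"}}
broker.produce("customer-requests", {"customer_id": "c-1", "text": "Please block my card"})

reply = run_chatbot_once(broker, customer_context)
follow_up = broker.poll("crm-follow-ups")
```

The chatbot never calls the CRM directly. It only writes an event, so the CRM integration can be changed or replaced without touching the chatbot.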

By combining Confluent’s real-time data streaming capabilities with Databricks’ powerful AI infrastructure, Erste Bank is able to:

Deliver highly contextual, real-time customer service interactions
Automate CRM workflows through real-time event-driven architectures
Build a scalable, resilient platform for future AI-driven applications

This architecture positions Erste Bank to continue expanding GenAI use cases across financial services, from customer engagement to operational efficiency, powered by consistent, trusted, and real-time data.

Confluent: The Neutral Streaming Backbone for Any Data Stack

Confluent is not tied to a monolithic compute engine within a cloud provider. This neutrality is a strength:

Bridges operational systems (mainframes, SAP) with modern data platforms (AI, lakehouses, etc.)
Feeds multiple lakehouses at once through an event-driven architecture built with a data streaming platform
Works across all major cloud providers, including AWS, Azure, and GCP
Operates at the edge, on-prem, in the cloud and in hybrid scenarios
One size doesn’t fit all – follow best practices from microservices architectures and data mesh to tailor your architecture with purpose-built solutions.

The flexibility makes Confluent the best platform for data distribution—enabling decoupled teams to use the tools and platforms best suited to their needs.

Confluent’s Tableflow also supports Apache Iceberg to enable seamless integration from Kafka into lakehouses beyond Delta Lake and Databricks—such as Snowflake, BigQuery, Amazon Athena, and many other data platforms and analytics engines.

Example: A global enterprise uses Confluent as its central nervous system for data streaming. Customer interaction events flow in real time from web and mobile apps into Confluent. These events are then:

Streamed into Databricks once for multiple GenAI and analytics use cases
Written to an operational PostgreSQL database to update order status and customer profiles
Pushed into a customer-facing analytics engine like StarTree (powered by Apache Pinot) for live dashboards and real-time customer behavior analytics
Shared with Snowflake through a lift-and-shift M&A use case to unify analytics from an acquired company

This setup shows the power of Confluent’s neutrality and flexibility: enabling real-time, multi-directional data sharing across heterogeneous platforms, without coupling compute and storage.
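
A minimal sketch of why this fan-out works: every downstream system reads the same log at its own pace. The EventLog class below is a toy model written for this example, not a Kafka client. It keeps one read offset per consumer group, which is the core idea behind Kafka consumer groups, and the group names are made up.

```python
class EventLog:
    """Toy append-only log with one read offset per consumer group."""

    def __init__(self):
        self.events = []
        self.offsets = {}

    def append(self, event):
        self.events.append(event)

    def poll(self, group, max_events=10):
        """Return the next batch for this group and advance only its offset."""
        start = self.offsets.get(group, 0)
        batch = self.events[start:start + max_events]
        self.offsets[group] = start + len(batch)
        return batch


log = EventLog()
for order_id in range(5):
    log.append({"order_id": order_id, "status": "created"})

# Each sink reads independently; a slow consumer does not block a fast one.
lakehouse_batch = log.poll("lakehouse-sink")
database_batch = log.poll("postgres-sink", max_events=2)
database_rest = log.poll("postgres-sink")
```

Because the events stay in the log after they are read, a new consumer group can be added later and still start from the first event.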

Snowflake: A Cloud-Native Companion to Confluent – Natively Integrated with Apache Iceberg and Polaris Catalog

Snowflake pairs naturally with Confluent to power modern data architectures. As a cloud-native SaaS from the start, Snowflake has earned broad adoption across industries thanks to its scalability, simplicity, and fully managed experience.

Together, Confluent and Snowflake unlock high-impact use cases:

Near real-time ingestion and enrichment: Stream operational data into Snowflake for immediate analytics and action.
Unified operational and analytical workloads: Combine Confluent’s Tableflow with Snowflake’s Apache Iceberg support through its open source Polaris catalog to bridge operational and analytical data layers.
Shift-left data quality: Improve reliability and reduce costs by validating and shaping data upstream, before it hits storage.
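
To make the shift-left idea more concrete, here is a small sketch of validating records upstream and routing bad ones to a dead-letter list before anything reaches storage. The field names and the REQUIRED_FIELDS contract are assumptions for this example. In a real deployment, a schema registry and a stream processor would typically take over this job.

```python
# Hypothetical data contract: field name -> expected Python type.
REQUIRED_FIELDS = {"sensor_id": str, "temperature_c": float}


def validate(record):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


def route(records):
    """Split records into clean ones and a dead-letter list with reasons."""
    clean, dead_letter = [], []
    for record in records:
        problems = validate(record)
        if problems:
            dead_letter.append({"record": record, "problems": problems})
        else:
            clean.append(record)
    return clean, dead_letter


incoming = [
    {"sensor_id": "s-1", "temperature_c": 21.5},
    {"sensor_id": "s-2"},
    {"sensor_id": "s-3", "temperature_c": "hot"},
]
clean, dead_letter = route(incoming)
```

Only the clean records would continue to the analytics platform. The dead-letter entries keep the reason for rejection, so the producing team can fix the problem at the source.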

With Confluent as the streaming backbone and Snowflake as the analytics engine, enterprises get a cloud-native stack that’s fast, flexible, and built to scale. Many enterprises use Confluent as the data ingestion platform for Databricks, Snowflake, and other analytical and operational downstream applications.

Shift Left at Siemens: Real-Time Innovation with Confluent and Snowflake

Siemens is a global technology leader operating across industry, infrastructure, mobility, and healthcare. Its portfolio includes industrial automation, digital twins, smart building systems, and advanced medical technologies—delivered through units like Siemens Digital Industries and Siemens Healthineers.

To accelerate innovation and reduce operational costs, Siemens is embracing a shift-left architecture to enrich data early in the pipeline before it reaches Snowflake. This enables reusable, real-time data products in the data streaming platform leveraging an event-driven architecture for data sharing with analytical and operational systems beyond Snowflake.

Siemens Digital Industries applies this model to optimize manufacturing and intralogistics, using streaming ETL to feed real-time dashboards and trigger actions like automated inventory replenishment—while continuing to use Snowflake for historical analysis, reporting, and long-term data modeling.

Source: Siemens Digital Industries

Siemens Healthineers embeds AI directly in the stream processor to detect anomalies in medical equipment telemetry, improving response time and avoiding costly equipment failures—while leveraging Snowflake to power centralized analytics, compliance reporting, and cross-device trend analysis.

Source: Siemens Healthineers
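
As a rough illustration of anomaly detection inside a stream processor, the sketch below flags telemetry readings that deviate strongly from a rolling window of recent values. The rolling z-score is a simple technique chosen for this example; it is not the model Siemens Healthineers uses, and the window size, threshold, and readings are made up.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags values that lie far from the mean of the most recent readings."""

    def __init__(self, window_size=5, threshold=3.0):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, value):
        """Return True if the value is an anomaly relative to the window."""
        is_anomaly = False
        if len(self.window) == self.window.maxlen:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            # Keep anomalies out of the window so they do not skew the baseline.
            self.window.append(value)
        return is_anomaly


detector = RollingAnomalyDetector()
readings = [40.0, 41.0, 39.0, 40.0, 41.0, 40.0, 75.0, 40.0]
flags = [detector.observe(reading) for reading in readings]
```

The detector needs only a small amount of state per device, which is why this kind of check fits well into a stream processor that handles each event as it arrives.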

These success stories are part of The Data Streaming Use Case Show, my new industry webinar series. Learn more about Siemens’ usage of Confluent and Snowflake and watch the video recording about “shift left”.

Open Outlook: Agentic AI with Model Context Protocol (MCP) and Agent2Agent Protocol (A2A)

While data and AI platforms like Databricks and Snowflake play a key role, some Agentic AI projects will likely rely on emerging, independent SaaS platforms and specialized tools. Flexibility and open standards are key for future success.

What better way to close a blog series on Confluent and Databricks (and Snowflake) than by looking ahead to one of the most exciting frontiers in enterprise AI: Agentic AI.

As enterprise AI matures, there is growing interest in bi-directional interfaces between operational systems and AI agents. Google’s Agent2Agent (A2A) protocol reinforces this shift by highlighting how intelligent agents can autonomously communicate, coordinate, and act across distributed systems.

Confluent + Databricks is an ideal combination to support these emerging Agentic AI patterns, where event-driven agents continuously learn from and act upon streaming data. Models can be embedded directly in Flink for low-latency applications or hosted and orchestrated in Databricks for more complex inference workflows.

The Model Context Protocol (MCP) is gaining traction as a design blueprint for standardized interaction between services, models, and event streams. In this context, Confluent and Databricks are well positioned to lead:

Confluent: Event-driven delivery of context, inputs, and actions
Databricks: Model hosting, training, inference, and orchestration
Jointly: Closed feedback loops between business systems and AI agents

Together with protocols like A2A and MCP, this architecture will shape the next generation of intelligent, real-time enterprise applications.

Confluent + Databricks: The Future-Proof Data Stack for AI and Analytics

Databricks and Confluent are not just partners. They are leaders in their respective domains. Together, they enable real-time, intelligent data architectures that support operational excellence and AI innovation.

Other AI and data platforms are part of the landscape, and many bring valuable capabilities. As explored in this blog series, true decoupling through an event-driven architecture with Apache Kafka makes it possible to combine any vendors and cloud services. I see many enterprises using both Databricks and Snowflake integrated with Confluent. However, the alignment between Confluent and Databricks stands out due to its combination of strengths:

Confluent’s category leadership in data streaming, powering thousands of production deployments across industries
Databricks’ strong position in the lakehouse and AI space, with broad enterprise adoption for analytics and machine learning
Shared product vision and growing engineering and go-to-market alignment across technical and field organizations

For enterprises shaping a long-term data and AI strategy, this combination offers a proven, scalable foundation—bridging real-time operations with analytical depth, without forcing trade-offs between speed, flexibility, or future-readiness.

Stay tuned for deep dives into how these platforms are shaping the future of data-driven enterprises.


The Data Streaming Landscape 2025

Kai Waehner — Wed, 04 Dec 2024 13:49:37 +0000

Data streaming is a new software category. It has grown from niche adoption to becoming a fundamental part of modern data architecture. With real-time data processing transforming industries, the ecosystem of tools, platforms, and services has evolved significantly. This blog post explores the data streaming landscape of 2025, analyzing key players, trends, and market dynamics shaping this space.

The data streaming landscape of 2025 categorizes solutions by their adoption and completeness as fully-featured data streaming platforms, as well as their deployment models, which range from self-managed setups to BYOC (Bring Your Own Cloud) and fully managed cloud services like PaaS and SaaS. While Apache Kafka remains the backbone of this ecosystem, the landscape also includes stream processing engines like Apache Flink and competing technologies such as Apache Pulsar and Redpanda, many of which support the Kafka protocol.

This blog also explores the latest market trends and provides an outlook for 2025 and beyond, highlighting potential new entrants and evolving use cases. By the end, you’ll gain a clear understanding of the data streaming platform landscape and its trajectory in the years to come.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free ebook about data streaming use cases and industry-specific success stories.

Data Streaming in 2025: The Rise of a New Software Category

Real-time data beats slow data. That’s true across almost all use cases in any industry. Event-driven applications powered by data streaming continuously process data from any data source. This approach increases business value by increasing revenue, reducing cost, reducing risk, or improving the customer experience. An event-driven design also keeps the architecture future-ready.

Even top researchers and advisory firms such as Forrester, Gartner, and IDG recognize data streaming as a new software category. In December 2023, Forrester released “The Forrester Wave: Streaming Data Platforms, Q4 2023,” highlighting Microsoft, Google, and Confluent as leaders, followed by Oracle, Amazon, and Cloudera.

Data Streaming is NOT just another data integration tool. Plenty of software categories and related data platforms exist to process and analyze data. In a few dedicated blog series, I explored how data streaming differs from them.

The Business Value of Data Streaming

A new software category opens use cases and adds business value across all industries:

Source: Lyndon Hedderly (Confluent)

Adding business value is crucial for any enterprise. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products.

Search my blog for your favorite industry to find plenty of case studies and architectures. Or read about use cases for Apache Kafka across industries to get started.

The Data Streaming Landscape of 2025

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category. Several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Various SaaS startups have emerged in this category in the last few years.

It all began with the open-source framework Apache Kafka, and today, the Kafka protocol is widely adopted across various implementations, including proprietary ones. What truly matters now is leveraging the capabilities of a complete data streaming platform—one that is fully compatible with the Kafka protocol. This includes built-in features like connectors, stream processing, security, data governance, and the elimination of self-management, reducing risks and operational effort.

The Kafka Protocol is the De Facto Standard of Data Streaming

Most software vendors use Kafka (or its protocol) at the core of their data streaming platforms. Apache Kafka has become the de facto standard for data streaming.

Additionally, “benchmarketing” (i.e., picking a sweet spot or niche scenario where you perform better than your competitor) is the favorite marketing technique to “prove” differentiators to the real Apache Kafka. Some vendors also present misleading cost-efficiency comparisons by excluding critical cloud costs such as data transfer or storage, giving an incomplete picture of the true expenses.

Apache Kafka vs. Data Streaming Platform

Many still use Kafka merely as a dumb ingestion pipeline, overlooking its potential to power sophisticated, real-time data streaming use cases. One reason is that Kafka alone lacks the full capabilities of a comprehensive data streaming platform.

A complete solution requires more than “just” Kafka. Apache Flink is becoming the de facto standard for stream processing. Data integration capabilities (connectors, clients, APIs), data governance, security, and critical 24/7 SLAs and support are important for many data streaming projects.

The Data Streaming Landscape 2025 summarizes the current status of relevant products and cloud services, focusing on deployment models and the adoption/completeness of the data streaming platforms.

Data Streaming Vendors and Categories for the 2025 Landscape

The data streaming landscape changed this year. As most solutions evolve, I do not distinguish anymore between Kafka, non-Kafka, and stream processing as categories. Instead, I look at the adoption and completeness to assess the maturity of a data streaming solution from an open-source framework to a complete platform.

The deployment models also changed in the 2025 landscape. Instead of categorizing it into Self Managed, Partially Managed, and Fully Managed, I sort as follows: Self Managed, Bring Your Own Cloud (BYOC), and Cloud. The Cloud category is separated into PaaS (Platform as a Service) and SaaS (Software as a Service) to indicate that many Kafka cloud offerings are still NOT fully managed!

Please note: Intentionally, this data streaming landscape is not a complete list of frameworks, cloud services, or vendors. It is also not official research, and there is no statistical evidence behind it. If your favorite technology is missing from this diagram, I did not see it in my conversations with customers, prospects, partners, analysts, or the broader data streaming community.

Also, note that I focus on general data streaming infrastructure. Brilliant solutions exist for using and analyzing streaming data for specific scenarios, like time-series databases, machine learning engines, observability platforms, or purpose-built IoT solutions. These are usually complementary, often connected out of the box via a Kafka connector, or even built on top of a data streaming platform (invisible for the end user).

Adoption and Completeness of Data Streaming (X-Axis)

Data streaming is adopted more and more across all industries. The concept is not new. In “The Past, Present and Future of Stream Processing”, I explored how the data streaming journey started decades ago with research and the first purpose-built proprietary products for specific use cases like stock trading.

Open source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Does anyone still remember (or even still use) Apache Storm?

Today, most enterprises are realizing the value of data streaming for both analytical and operational use cases across industries. The cloud has brought a transformative shift, enabling businesses to start streaming and processing data with just a click, using fully managed SaaS solutions and intuitive UIs. Complete data streaming platforms now offer many built-in features that users previously had to develop themselves, including connectors, encryption, access control, governance, data sharing, and more.

Capabilities of a Complete Data Streaming Platform

Data streaming vendors are on the way to building a complete Data Streaming Platform (DSP). Capabilities include:

Messaging (“Streaming”): Transfer messages in real-time and persist for durability, decoupling, and slow consumers (near real-time, batch, API, file).
Data Integration: Connect to any legacy and cloud-native sources and sinks.
Stream Processing: Correlate events with stateless and stateful transformation or business logic.
Data Governance: Ensure security, observability, data sovereignty, and compliance.
Developer Tooling: Enable flexibility for different personas such as software engineers, data scientists, and business analysts by providing different APIs (such as Java, Python, SQL, REST/HTTP), graphical user interfaces, and dashboards.
Operations Tooling and SaaS: Ease infrastructure management on-premises, or take over the entire operations burden in the public cloud with serverless offerings.
Uptime SLAs and Support: Provide the required guarantees and expertise for critical use cases.
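
The difference between stateless and stateful stream processing from the list above can be shown with a small sketch. It uses plain Python instead of a real engine such as Apache Flink or Kafka Streams, and the event fields and the 60-second tumbling window are assumptions for this example.

```python
from collections import Counter


def stateless_filter_map(events):
    """Stateless: each event is handled on its own, with no memory of others."""
    for event in events:
        if event["amount"] > 0:
            yield {**event, "amount_cents": round(event["amount"] * 100)}


def tumbling_window_counts(events, window_seconds=60):
    """Stateful: counts events per (window start, user) across many events."""
    counts = Counter()
    for event in events:
        window_start = event["ts"] - event["ts"] % window_seconds
        counts[(window_start, event["user"])] += 1
    return dict(counts)


events = [
    {"ts": 5, "user": "a", "amount": 1.50},
    {"ts": 30, "user": "a", "amount": 0.0},
    {"ts": 61, "user": "a", "amount": 2.25},
    {"ts": 62, "user": "b", "amount": 4.00},
]
payments = list(stateless_filter_map(events))
per_window = tumbling_window_counts(events)
```

The stateless step can be scaled out freely because no event depends on another. The windowed count has to keep state per key, and managing that state reliably (checkpointing, recovery, rebalancing) is where a complete platform saves the most effort.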

Evolution from Open Source Adoption to a Data Streaming Organization

Modern data streaming is not just about adopting a product; it’s about transforming the way organizations operate and derive value from their data. Hence, the adoption goes beyond product features:

From open source and self-operations to enterprise-grade products and SaaS.
From human scale to automated, serverless elasticity with consumption-based pricing.
From dumb ingestion pipes to complex data pipelines and business applications.
From analytical workloads to critical transactional (and analytical) use cases.
From a single data streaming cluster to a powerful edge, hybrid, and multi-cloud architecture, including integration, migration, aggregation, and disaster recovery scenarios.
From wild adoption across business units with uncontrolled growth using various frameworks, cloud services, and integration tools to a center of excellence (CoE) with a strategic approach with standards, best practices, and knowledge sharing in an internal community.
From effortful and complex human management to enterprise-wide data governance, automation, and self-service APIs.

Data Streaming Deployment Models: Self-Managed vs. BYOC vs. Cloud (Y-Axis)

Different data streaming categories exist regarding the deployment model:

Self-Managed: Operate nodes like Kafka Broker, Kafka Connect, and Schema Registry by yourself with your favorite scripts and tools. This can be on-premise or in the public cloud in your VPC. Reduce the operations burden via a cloud-native platform (usually Kubernetes) and related operator tools that automate operations tasks like rolling upgrades or rebalancing Kafka Partitions.
Bring Your Own Cloud (BYOC): Allow organizations to host Kafka within their own cloud VPC. BYOC combines some of the benefits of cloud flexibility with enhanced security and control, while it outsources most of Kafka’s management to specialized vendors. The data plane is still customer-managed, but in contrast to self-managed Kafka, the customer does not need to worry about the complexity under the hood (like rebalancing, rolling upgrades, backups). Cloud-native object storage and the vendor’s BYOC control plane service take over these tasks.
Cloud (PaaS or SaaS): Both PaaS and SaaS solutions operate within the cloud provider’s VPC. Fully managed SaaS for data streaming takes over all operational responsibilities, including scaling, failover handling, upgrades, and performance tuning, allowing users to focus solely on integration and business logic. In contrast, partially managed PaaS reduces the operational burden by automating certain tasks like rolling upgrades and rebalancing, but still requires some level of user involvement in managing the infrastructure. Fully managed SaaS typically provides critical SLAs for support and uptime while partially managed PaaS cannot provide such guarantees.

Most organizations prefer SaaS for data streaming when business and technical requirements allow, as it minimizes operational complexity and maximizes scalability. Other deployment models are chosen when specific constraints or needs require them.

The Evolution of BYOC Kafka Cloud Services

Cloud and On-Premise deployment options are typically well understood, but BYOC (Bring Your Own Cloud) often requires more explanation due to its unique operating model and varying implementations across vendors.

In last year’s data streaming landscape 2024, I wrote the following about BYOC for Kafka:

“I do NOT believe in this approach as too many questions and challenges exist with BYOC regarding security, support, and SLAs in the case of P1 and P2 tickets and outages. Hence, I put this in the category of self-managed. That is what it is, even though the vendor touches your infrastructure. In the end, it is your risk because you have to and want to control your environment.”

This statement made sense because BYOC vendors at that time required access to the customer VPC and offered a shared support model. While this is still true for some BYOC solutions, my mind changed with the innovation of BYOC by one emerging vendor: WarpStream.

WarpStream’s BYOC Operating Model with Stateless Agents in the Customer VPC

WarpStream published a new operating model for BYOC: The customer only deploys stateless agents in its VPC and provides an object storage bucket to store the data. The control plane and metadata store are fully managed by the vendor as SaaS and the vendor takes over all the complexity.

Source: Confluent

With this innovation, BYOC is now a worthwhile third deployment option besides a self-managed and fully managed data streaming platform. It brings several benefits:

No access is needed by the BYOC cloud vendor to the customer VPC. The data plane (i.e., the “brokers” in the customer VPC) is stateless. The metadata/consensus is in the control plane (i.e., the cloud service in the WarpStream VPC).
The architecture solves sovereignty challenges and is a great fit for security and compliance requirements.
The BYOC offering is cheaper than self-managed Apache Kafka because it is built with cloud-native concepts and technologies in mind (e.g., zero disks and zero interzone networking fees, leveraging cloud object storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage).
The stateless architecture in the customer VPC makes autoscaling and elasticity very easy to implement/configure.
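
The split between a stateless data plane and a vendor-managed control plane can be illustrated with a toy model. The following is a minimal Python sketch, not WarpStream’s implementation: the class names ObjectStore, ControlPlane, and Agent are hypothetical, and an in-memory dictionary stands in for the object storage bucket. It only shows why any agent can serve any request when the agents hold no state themselves.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Stand-in for the customer's bucket (hypothetical; real systems use S3, GCS, etc.)."""
    blobs: dict = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        self.blobs[key] = data

    def get(self, key: str) -> bytes:
        return self.blobs[key]


@dataclass
class ControlPlane:
    """Stand-in for the vendor-managed metadata store: tracks ordering, never payloads."""
    log: dict = field(default_factory=dict)  # topic -> list of object keys in offset order

    def commit(self, topic: str, key: str) -> int:
        entries = self.log.setdefault(topic, [])
        entries.append(key)
        return len(entries) - 1  # offset assigned to the record

    def lookup(self, topic: str, offset: int) -> str:
        return self.log[topic][offset]


class Agent:
    """Stateless agent: keeps no records locally, only references to the two services."""

    def __init__(self, store: ObjectStore, control: ControlPlane) -> None:
        self.store = store
        self.control = control

    def produce(self, topic: str, payload: bytes) -> int:
        key = f"{topic}/{len(self.store.blobs)}"
        self.store.put(key, payload)            # payload stays in the customer's bucket
        return self.control.commit(topic, key)  # only the object key reaches the control plane

    def consume(self, topic: str, offset: int) -> bytes:
        return self.store.get(self.control.lookup(topic, offset))


store, control = ObjectStore(), ControlPlane()
writer, reader = Agent(store, control), Agent(store, control)
offset = writer.produce("orders", b"order-1")
# A different agent serves the read because no agent owns any data.
record = reader.consume("orders", offset)
```

Because the agents carry no state in this model, adding or removing one requires no data movement, which is the property that makes autoscaling simple.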

When to use BYOC?

WarpStream introduced an innovative share-nothing operating model that makes BYOC practical, secure, and cost-efficient. With that being said, I still recommend only looking at BYOC options for Apache Kafka in the public cloud if a fully managed and serverless data streaming platform does NOT work for you because of cost, security, or compliance reasons! When it comes to simplicity and ease of operation, nothing beats a fully managed cloud service.

And please keep in mind that NOT every BYOC cloud service provides these characteristics and benefits. Make sure to make a proper evaluation of your favorite solutions. For more details, look at my blog post: “Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud)“.
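
The rule of thumb above can be written down as a small decision helper. This is a simplified Python sketch under stated assumptions: the Requirements fields and the function recommend_deployment are hypothetical names, and mapping non-public-cloud workloads to self-managed is my reading of the deployment model descriptions, not a formal rule. A real evaluation covers far more criteria.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    public_cloud: bool           # can the workload run in a public cloud at all?
    saas_meets_compliance: bool  # does a fully managed service pass security/compliance review?
    saas_within_budget: bool     # is the fully managed service affordable for this workload?


def recommend_deployment(req: Requirements) -> str:
    """Return 'self-managed', 'byoc', or 'saas' as a first option to evaluate."""
    if not req.public_cloud:
        # Assumption: on-premise or edge workloads fall back to self-managed.
        return "self-managed"
    if req.saas_meets_compliance and req.saas_within_budget:
        # Fully managed wins whenever requirements allow it.
        return "saas"
    # SaaS is ruled out by cost, security, or compliance: look at BYOC.
    return "byoc"
```

A result of "byoc" only means the category is worth evaluating; each BYOC offering still needs its own assessment, since not every vendor provides the same operating model.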

Changes in the Data Streaming Landscape from 2024 to 2025

My goal is NOT a growing landscape with tens or even hundreds of vendors and cloud services. Plenty of these pictures exist. Instead, I focus on a few technologies, vendors, and cloud offerings that I see in the field, with adoption by the broader open-source and cloud community.

I already discussed the important conceptual changes in the data streaming landscape:

Deployment Model: From self-managed, partially managed, and fully managed to self-managed, BYOC and cloud.
Streaming Categories: From different streaming categories to a single category for all data streaming platforms sorted by adoption and completeness.

Additionally, as every year, I modified the list of solutions compared to the data streaming landscape 2024 published one year ago.

Additions to the Data Streaming Landscape 2025

The following data streaming services were added:

Alibaba (Cloud): Confluent Data Streaming Service on Alibaba Cloud is an OEM partnership between Alibaba Cloud and Confluent to offer a fully managed SaaS in Mainland China. The service was announced at the end of 2021 and is gaining traction in Asia. Alibaba is the contractor and first-level support for the end user.
Google Managed Service for Kafka (Cloud): Google announced this Kafka PaaS recently. The strategy looks very similar to Amazon’s MSK. Even the shortcut is the same: MSK. I explored when (not) to choose Google’s Kafka cloud service after the announcement. The service is still in preview, but available to a huge customer base already.
Oracle Streaming with Apache Kafka (Cloud): A partially managed Apache Kafka PaaS on Oracle Cloud Infrastructure (OCI). The service is in early access, but available to a huge customer base already.
WarpStream (BYOC): WarpStream was acquired by Confluent. It is still included with its logo as Confluent continues to keep the brand and solution separated (at least for now).

Removals from the Data Streaming Landscape 2025

There are NO REMOVALS this year, BUT I was close to removing two technologies:

Apache Pulsar and StreamNative: I was close to removing Apache Pulsar as I see zero traction in the market. Around 2020, Pulsar had some traction but focused too much on Kafka FUD instead of building a vibrant community. While Kafka simplified its architecture (ZooKeeper removal), Pulsar still includes three distributed systems (ZooKeeper or alternatives like etcd, BookKeeper, and Pulsar Broker). It also pivots to the Kafka protocol trying to get some more traction again. But it seems to be too late.
ksqlDB (formerly called KSQL): The product is feature complete. While it is still supported by Confluent, it will not get any new features. ksqlDB is still a great piece of software for some (!) Kafka-native stream processing projects but might be removed in the future. Confluent co-founder and Kafka co-creator Jay Kreps commented on X (formerly Twitter): “Confluent went through a set of experiments in this area. What we learned is that for *platform* layers you want a clean separation. We learned this the hard way: our source available stream processing layer KSQL, lost to open-source Apache Flink. We pivoted to Flink.”

Vendor Overview for Data Streaming Platforms

All vendors of the landscape do some kind of data streaming. However, the offerings differ a lot in adoption, completeness, and vision. And many solutions are not available everywhere but only in one public cloud or only as self-managed. For detailed product information and experiences, the vendor websites and other blogs/conference talks are the best and most up-to-date resources. The following is just a summary to get an overview.

Before we do the deep dive, here again, the entire data streaming landscape for 2025:

Self-Managed Data Streaming with Open Source and Proprietary Products

The following list describes the open-source frameworks and proprietary products for self-managed data streaming deployments (in order of adoption and completeness):

Apache Pulsar: A competitor to Apache Kafka. Similar story and use cases, but different architecture. Kafka is a single distributed cluster – after removing the ZooKeeper dependency in 2022. Pulsar is three (!) distributed clusters: Pulsar brokers, ZooKeeper, and BookKeeper. Pulsar vs. Kafka explored the differences. And Kafka catches up to some missing features like Queues for Kafka.
StreamNative: The primary vendor behind Apache Pulsar. Not much market traction.
ksqlDB (usually called KSQL, even after Confluent’s rebranding): An abstraction layer on top of Kafka Streams to provide stream processing with streaming SQL. Hence, also Kafka-native. It comes with a Confluent Community License and is free to use. Sweet spot: Streaming ETL.
Redpanda: Implements the Kafka protocol with C++. Trying out different market strategies to define Redpanda as an alternative to a Kafka-native offering. Still in the early stage in the maturity curve. Adding tons of (immature) product features in parallel to find the right market fit in a growing Kafka market. Recently acquired Benthos to provide connectivity to data sources and sinks (similar to Kafka Connect).
Ververica: Well-known Flink company. Acquired by Chinese tech giant Alibaba in 2019. Not much traction in Europe and the US. Sweet spot: Flink in Mainland China.
Apache Flink: Becoming the de facto standard for stream processing. Open-source implementation. Provides advanced features including a powerful scalable compute engine, freedom of choice for developers between SQL, Java, and Python, APIs for Complex Event Processing (CEP), and unified APIs for stream and batch workloads.
Spark Streaming: The streaming part of the open-source big data processing framework Apache Spark. The enormous installed base of Spark clusters in enterprises broadens adoption thanks to solutions from Cloudera, Databricks, and the cloud service providers. Sweet spot: Analytics in (micro)batches with data stored at rest in the data lake/lakehouse.
Apache Kafka: The de facto standard for data streaming. Open-source implementation with a vast community. Almost all vendors rely on (parts of) this project. Often underestimated: Kafka includes data integration and stream processing capabilities with Kafka Connect and Kafka Streams, making even the open-source Apache download already more powerful than many other data streaming frameworks and products.
IBM / Red Hat AMQ Streams: Provides Kafka as self-managed Kafka on Kubernetes via OpenShift. Kafka is part of the integration portfolio that includes other open-source frameworks like Apache Camel. Sweet spot: Existing IBM customers.
Cloudera: Provides Kafka, Flink, and other open-source data and analytics frameworks as a self-managed offering. The main strategy is offering one product with a vast combination of many open-source frameworks that can be deployed on any infrastructure. Sweet spot: Analytics.
Confluent Platform: Focuses on a complete data streaming platform including Kafka and Flink, and various advanced data streaming capabilities for operations, integration, governance, and security. Sweet spot: Unifying operational and analytical workloads, and combination with the fully managed cloud service.

Data Streaming with Bring Your Own Cloud (BYOC)

BYOC is an emerging category and is mainly used for specific challenges such as strict data security and compliance requirements. The following vendors provide dedicated BYOC offerings for data streaming (in order of adoption and completeness):

WarpStream (Confluent): A new entrant into the data streaming market. The cloud service is a Kafka-compatible data streaming platform built directly on top of S3. Innovated the BYOC model to enable secure and cost-effective data streaming for workloads that don’t have strong latency requirements.
Redpanda: The first BYOC offering on the market for data streaming. The biggest concern is the shared responsibility model of this solution because the vendor requires access to the customer VPC for operations and support. This is against the key principles of BYOC regarding security and compliance and why organizations (have to) look for BYOC instead of SaaS solutions.
Databricks: Cloud-based data platform that provides a collaborative environment for data engineering, data science, and machine learning, built on top of Apache Spark. Data Streaming is enabled by Spark Streaming and focuses mainly on analytical workloads that are optimized from batch to near real-time.

Partially Managed Data Streaming Cloud Platforms (PaaS)

Here is an overview of relevant PaaS data streaming cloud services (in order of adoption and completeness):

Google Cloud Managed Service for Apache Kafka (MSK): Initially branded as Google Managed Kafka for BigQuery (likely for a better marketing push), the service enables data ingestion into lakehouses on GCP such as Google BigQuery.
Amazon Managed Service for Apache Flink (MSF): A partially managed service by AWS that allows customers to transform and analyze streaming data in real-time with Apache Flink. It still has some (costly) gaps in auto-scaling and is not truly serverless. Supports all Flink interfaces, i.e., SQL, Java, and Python. And non-Kafka connectors, too. Only available on AWS.
Oracle OCI Streaming with Apache Kafka: The service is still in early access, but available to a huge customer base already on Oracle’s cloud infrastructure.
Microsoft Azure HDInsight: A piece of Azure’s Hadoop infrastructure. Not intended for other use cases beyond data ingestion for batch analytics.
Instaclustr: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies. Instaclustr also offers a (semi-)managed offering for on-premise infrastructure.
Amazon Kinesis: Data ingestion into AWS data stores. Mature product for a specific problem. Only available on AWS.
Aiven: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies.
IBM / Red Hat AMQ Streams: Provides Kafka as a partially managed cloud offering on OpenShift (through Red Hat). Sweet spot: Existing IBM customers.
Amazon Managed Service for Apache Kafka (MSK): AWS has hundreds of cloud services, and Kafka is part of that broad spectrum. MSK is only available in public AWS cloud regions; not on Outposts, Local Zones, Wavelength, etc. MSK is likely the largest partially managed Kafka service across all clouds. It evolved with new features like support for Kafka Connect and Tiered Storage. But lacks connectivity outside the AWS ecosystem and a data governance narrative.

Fully Managed Data Streaming Cloud Services (SaaS)

Here is an overview of relevant SaaS data streaming cloud services (in order of adoption and completeness):

Decodable: A relatively new cloud service for Apache Flink in the early stage. Huge potential if it is combined with existing Kafka infrastructures in enterprises. But also provides pre-built connectors for non-Kafka systems. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
StreamNative Cloud: The primary vendor behind Apache Pulsar. Offers self-managed and fully managed solutions. StreamNative Cloud for Kafka is still in a very early stage of maturity, not sure if it will ever strengthen.
Ververica: Stream processing as a service powered by Apache Flink on all major cloud providers. Huge potential if it is combined with existing Kafka infrastructures in enterprises. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
Redpanda Cloud: Redpanda provides its data streaming as a serverless offering. Not much information is available on the website about this part of the vendor’s product portfolio.
Amazon MSK Serverless: Different functionalities and limitations than Amazon MSK. MSK Serverless still does not get much traction because of its limitations. Both MSK offerings exclude Kafka support in their SLAs (please read the terms and conditions).
Google Cloud DataFlow: Fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. Mature product for a specific problem. Only available on GCP.
Azure Event Hubs: A mature, fully managed cloud service. The service does one thing, and that is done very well: Data ingestion via the Kafka protocol. Hence, it is not a complete streaming platform but is more comparable to Amazon Kinesis or Google Cloud PubSub. The limited compatibility with the Kafka protocol and the high cost of the service for higher throughput are the two major blockers that come up regularly in conversations.
Confluent Cloud: A complete data streaming platform including Kafka and Flink as a fully managed offering. In addition to deep integration, the platform provides connectivity, security, data governance, self-service portal, disaster recovery tooling, and much more to be the most complete DSP on all major public clouds.

Potential for the Data Streaming Landscape 2026

Data streaming is a journey. So is the development of event streaming platforms and cloud services. Several established software and cloud vendors might get more traction with their data streaming offerings. And some startups might grow significantly. The following shows a few technologies that might evolve and see growing adoption in 2025:

New startups around the Kafka protocol emerge. The list of new frameworks and cloud services is growing every quarter. A few names I saw in some social network posts (but not much beyond in the real world): AutoMQ, S2, Astradot, Bufstream, Responsive, tansu, Tektite, Upstash. While some focus on the messaging/streaming part, others focus on a particular piece such as building database capabilities.
Streaming databases like Materialize or RisingWave might become a new software category. My feeling: Very early stage of the hype cycle. We will see in 2025 if and where this technology gets more broadly adopted and what the use cases are. It is hard to answer how these will compete with emerging real-time analytics databases like Apache Druid, Apache Pinot, ClickHouse, Timeplus, Tinybird, et al. I know there are differences, but the broader community and companies need to a) understand these differences and b) find business problems for it.
Stream Processing SaaS startups emerge: Quix and Bytewax provide stream processing with Python. Quix now also offers a hosted offering based on Kafka Streams; as does Responsive. DeltaStream provides Apache Flink as SaaS. And many more startups emerge these days. Let’s see which of these gets traction in the market with an innovative product and business model.
Traditional data management vendors like MongoDB or Snowflake try to get deeper into the data streaming business. I am still a fan of separation of concerns; so I think these should keep their sweet spot and (only) provide streaming ingestion and CDC as use cases, but not (try to) compete with data streaming vendors.

Fun fact: The business model of almost all emerging startups is fully managed cloud services, not selling licenses for on-premise deployments. Many are based on open-source or open-core, and others only provide a proprietary implementation.

Although they are not aiming to be full data streaming platforms (and thus are not part of the platform landscape), other complementary tools are gaining momentum in the data streaming ecosystem. For instance, Conduktor is developing a proxy for Kafka clusters, and Lenses, though relatively quiet since its acquisition by Celonis, has recently introduced updates to its user-friendly management and developer tools. These tools address gaps that some data streaming platforms leave unfilled.

Data Streaming: A Journey, Not a Sprint

Data streaming isn’t a sprint—it’s a journey! Adopting event-driven architectures with technologies like Apache Kafka or Apache Flink requires rethinking how applications are designed, developed, deployed, and monitored. Modern data strategies involve legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud environments.

The data streaming landscape in 2025 highlights the emergence of a new software category, though it’s still in its early stages. Building such a category takes time. In discussions with customers, partners, and the community, a common sentiment emerges: “We understand the value but are just starting with the first data streaming pipelines and have a long-term plan to implement advanced stream processing and build a strategic data streaming organization.”

The Forrester Wave: Streaming Data Platforms, Q4 2023, and other industry reports from Gartner and IDC signal that this category is progressing through the hype cycle.

Last but not least, check out my Top Data Streaming Trends for 2025 to understand how the data streaming landscape fits into emerging trends:

Democratization of Kafka: Apache Kafka has transitioned from a specialized tool to a foundational platform in modern data infrastructure.
Kafka Protocol as the Standard: Vendors adopt the Kafka protocol over the framework, enabling flexibility with compatibility and performance trade-offs.
BYOC Deployment Model: Bring Your Own Cloud gains traction for balancing security, compliance, and managed services.
Flink Becomes the Standard for Stream Processing: Apache Flink rises as the premier framework for stream processing, building integration pipelines and business applications.
Data Streaming for Real-Time Predictive AI and GenAI: Real-time model inference drives predictive and generative AI applications.
Data Streaming Organizations: Companies unify real-time data strategies to standardize tools, governance, and collaboration.

Which are your favorite open-source frameworks or cloud services for data streaming? What are your most relevant and exciting trends around Apache Kafka and Flink in 2024 to set data in motion? What does your enterprise landscape for data streaming look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. Make sure to download my free ebook about data streaming use cases and industry examples.

The post The Data Streaming Landscape 2025 appeared first on Kai Waehner.

When (Not) to Choose Google Managed Service for Apache Kafka?

Kai Waehner — Wed, 10 Apr 2024 20:47:00 +0000

Google announced its Managed Service for Apache Kafka cloud service at its conference Google Cloud Next 2024 in Las Vegas. Welcome to the data streaming club joining Amazon, Microsoft, IBM, Oracle, Confluent, and others. This blog post explores this new managed Kafka offering for GCP, reviews the current status of the data streaming landscape, and shares some criteria to evaluate when Kafka in general and Google Apache Kafka in particular should (not) be used.

Welcome Google Apache Kafka to the Data Streaming Club

Better late than never… Google announced a brand new Apache Kafka cloud service for GCP at Google Cloud Next 2024. All other leading cloud providers already have one, including AWS, Azure, Oracle, IBM, and Alibaba. Various other software vendors provide Kafka services, including Confluent, Aiven, Redpanda, WarpStream, and many more. Most leverage the open source Kafka project as their core component; others re-implement the Kafka protocol.

Apache Kafka and Apache Flink dominate the open source data streaming ecosystem. Vendors and cloud solutions provide cloud-native offerings. Some developers, data engineers and business people still struggle with a paradigm shift: Continuous data processing enables better data quality, reduced cost, and faster time to market with innovative new applications. Kafka and Flink are a match made in heaven for data streaming.

Use Cases for data streaming exist across all industries. Google Managed Service for Apache Kafka is potentially a good fit for some of them, but not for others.

Google Managed Service for Apache Kafka (MSK) – What is it?

UPDATE August 2024: Google actually released this service as “Google Apache Kafka for BigQuery”. They rebranded it to “Google Managed Service for Kafka” which makes more sense. It is also the same shortcut as for Amazon’s Kafka offering: MSK.

What is Google Cloud Managed Service For Apache Kafka? Quoting Google’s website: “Google Managed Service for Apache Kafka is a managed service that operates highly available Apache Kafka clusters. It is compatible with open source versions of Apache Kafka and includes first-party Google Cloud IAM, monitoring, logging, key management, organization policy, networking, and more.”

Here are a few more thoughts:

Asynchronous messaging with true decoupling of producers and consumers using the publish/subscribe pattern is possible with GCP’s proprietary service Google Pub/Sub. Why did Google now introduce a Kafka service? Limitations of Google Pub/Sub or because Kafka became the standard (e.g., to migrate on-premise Kafka workloads from customers)? I guess a bit of both.
Google re-uses open source Kafka instead of re-implementing the Kafka protocol (like Microsoft Azure’s Event Hubs). I like this approach as a new implementation always creates several new challenges like missing completeness, delays of new features, and unexpected behaviour. The compatibility with open source Kafka is mentioned several times. My personal assumption is that Google’s main strategic goal for the new Kafka service is to migrate existing on-premise workloads into Google Cloud.
I really like that the service is secure out of the box. It is integrated with and supports Google Cloud IAM, customer-managed encryption keys (CMEK), and Virtual Private Cloud (VPC) from the beginning. This is important as most workloads at enterprises require this.
Including the term ‘BigQuery’ is only a marketing strategy: “Data engineers often rely on Apache Kafka to build pipelines that stream data into BigQuery and other analytics systems. Apache Kafka for BigQuery can be used for real-time and batch use cases”. There is no requirement for using BigQuery for analytics. Google’s Kafka service is usable with other analytics platforms, too. EDIT: That’s why Google Cloud rebranded the service a few months after its announcement to Google Cloud Managed Service for Apache Kafka (MSK).
Google emphasizes analytics use cases everywhere around its Kafka service; NOT transactional workloads. This approach is similar to Amazon MSK. Hopefully, the Google terms and conditions don’t exclude Kafka support when the service is GA (that’s what MSK does – unfortunately, too many people don’t read T&C and just use a cloud service in production).

Data Streaming is a NEW Software Category

Data streaming represents a new software category that revolutionizes the way businesses harness and process data in real time. Unlike traditional batch processing methods, data streaming enables continuous ingestion, analysis, and processing of data as it flows through systems. I explored this topic in the past when many people wanted to put Apache Kafka and its vendors into the integration platform category.

The Data Streaming Landscape 2024

Many software companies have emerged in the data streaming category in the last few years. And several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Most software vendors use Kafka for their data streaming platforms. However, there is more than just solutions powered by open source Kafka. Some vendors only use the Kafka protocol (e.g., Azure Event Hubs) or utterly different APIs (like Amazon Kinesis).

The following Data Streaming Landscape 2024 summarizes the current status of relevant products and cloud services for data streaming around Kafka and additional stream processing engines.

Forrester Wave for Streaming Data and IDC MarketScape for Stream Processing

Apache Kafka became the de facto standard for data streaming, similar to how Amazon S3 became the de facto standard for object storage.

In December 2023, the research company Forrester published “The Forrester Wave: Streaming Data Platforms, Q4 2023“. Get free access to the report here. The leaders are Microsoft, Google and Confluent, followed by Oracle, Amazon, Cloudera, and a few others.

In April 2024, IDC named Confluent a leader in the IDC MarketScape for Worldwide Analytic Stream Processing 2024.

It would not be a surprise if we see a Gartner Magic Quadrant for Data Streaming soon, too. Gartner reports mention Kafka and related vendors more and more year by year.

When NOT to Choose Google Managed Service for Apache Kafka (MSK)?

Qualifying out a technology is often the easier option. Why evaluate a service if it does not solve the requirements? Let’s explore when NOT to use Kafka at all, and specifically when the Google Apache Kafka service is probably NOT the right choice for you.

When NOT to use Apache Kafka?

Apache Kafka has overlaps with technologies like a message broker (like IBM MQ, TIBCO or RabbitMQ), other streaming analytics platforms, and it actually is a database, too. But Apache Kafka is not an allrounder to solve every problem.

Apache Kafka is NOT:

A replacement for your favorite database, data warehouse or data lake. Instead it complements and integrates with these platforms.
An analytics platform for AI/ML model training, even though model scoring is often done within the streaming platform for critical or low-latency use cases.
A proxy for thousands of clients in bad networks.
An API Management solution, even though you can connect REST/HTTP producers and consumers against Kafka.
An IoT gateway, even though direct integration with IoT protocols like MQTT or OPC-UA is possible.
Hard real-time for safety-critical embedded workloads.

Read the thorough analysis “When NOT to use Apache Kafka?” for more details.

When to Choose ANOTHER Kafka instead of Google’s?

If Apache Kafka is the right choice for your project, you still have plenty of options.

Here are a few criteria that let you easily qualify out Google Cloud Managed Service for Apache Kafka (MSK):

Non-GCP: If your use case requires on-premise, multi-cloud, hybrid cloud or edge deployments, then you need another offering.
Critical SLAs: If you need 24/7 critical support and consulting expertise, a dedicated Kafka vendor like Confluent is the better choice. Kafka is not just for analytics, but shines for transactional workloads, too. Google’s Managed Apache Kafka service is not GA yet. This will probably happen in the second half of 2024. Hence, don’t even consider it for critical applications before GA.
Serverless: A managed service is not always a truly managed service. The future will show where Google goes with Kafka. But right now, Google Apache Kafka is not serverless like, for example, Confluent Cloud. Pricing is capacity-based, and cluster capacity management is required. Amazon even created a second service, Amazon MSK Serverless, to handle this issue with its traditional MSK offering. Learn why Amazon MSK is NOT always the right choice and apply the learnings to Google.
Complete Data Streaming Platform: A data streaming platform requires more than just messaging: data integration with first and third party systems, stream processing for continuous data correlation, flexible (long-term) retention with Tiered Storage, data governance, and more. The future will show us where Google’s Kafka service goes. A detailed article compares Kafka offerings as car engine, complete car and self-driving car level 5. Google is a car, but not (yet) a Porsche (complete luxury car) and not yet a Google Waymo (self-driving car level 5). Google Apache Kafka even misses basic features for data streaming best practices, like defining data contracts in schemas for building data products in good data quality.
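
To make the data contract point concrete, here is a minimal Python sketch of checking a record against a contract before it is produced. The contract format, the field names, ORDER_CONTRACT, and the validate function are hypothetical and only illustrate the idea; production setups typically rely on a schema registry with Avro, Protobuf, or JSON Schema rather than hand-written checks.

```python
# Hypothetical data contract: field name -> required Python type.
ORDER_CONTRACT = {
    "order_id": str,
    "amount_cents": int,
    "currency": str,
}


def validate(record: dict, contract: dict) -> list:
    """Return a list of contract violations; an empty list means the record conforms."""
    errors = []
    for name, expected_type in contract.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"wrong type for {name}: expected {expected_type.__name__}")
    for name in record:
        if name not in contract:
            errors.append(f"unexpected field: {name}")
    return errors


good = {"order_id": "A-1", "amount_cents": 1999, "currency": "EUR"}
bad = {"order_id": "A-2", "amount_cents": "19.99"}

good_errors = validate(good, ORDER_CONTRACT)
bad_errors = validate(bad, ORDER_CONTRACT)
```

Rejecting the second record at the producer keeps the bad data out of the topic, so downstream consumers do not each have to defend against it.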

The Evolution of Data Streaming is NOT Stopping…

If you have not yet qualified out Kafka in general or Google Apache Kafka in particular, that’s great. Start evaluating Google Cloud’s Managed Service for Apache Kafka (MSK) and compare it against self-managed open source Kafka and other semi-managed or fully-managed Kafka cloud services on GCP.

As we look ahead, the future possibilities for data streaming are boundless, promising more agile, intelligent, and real-time insights into the ever-increasing streams of data. I recently looked at the past, present and future of stream processing.

I often get asked whether I am worried about the emerging competition, as I work for Confluent, where we “only do data streaming”.

No, I am not! Actually, the new Google Apache Kafka cloud service is great news for the industry! Data Streaming established itself as a new software category. Research analysts like Forrester and IDG already created dedicated waves and comparisons. What could be better than working with the people that invented Kafka and the company that created this software category across all industries and continents? And competition is always good for innovation, too.

Real-time data beats slow data. That’s true in almost every use case. At Confluent, we are now ~3000 people working only on one thing: Data Streaming. I think we should celebrate this Google announcement and look forward to more mass adoption of data streaming around the world.

And as a strategic Google partner, customers can

leverage GCP credits to consume Confluent Cloud
leverage GCP’s security and private networking infrastructure
integrate via fully managed connectors into various GCP services like Google Big Query or Google Cloud Storage and third party cloud solutions like MongoDB, Snowflake or Databricks.

Are you excited about the new Google Cloud Managed Service for Apache Kafka (MSK) service? Or do you still plan to use open source Kafka or another cloud service like Confluent Cloud? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When (Not) to Choose Google Managed Service for Apache Kafka? appeared first on Kai Waehner.

When NOT to choose Amazon MSK Serverless for Apache Kafka?

Kai Waehner — Tue, 30 Aug 2022 05:54:21 +0000

Apache Kafka became the de facto standard for data streaming. Various cloud offerings emerged and improved in the last years. Amazon MSK Serverless is the latest Kafka product from AWS. This blog post looks at its capabilities to explore how it relates to “the normal” partially managed Amazon MSK, when the serverless version is a good choice, and when other fully-managed cloud services like Confluent Cloud are the better option.

Disclaimer: I work for Confluent. While AWS is a strong strategic partner, it also offers the competitive offering Amazon MSK. This post is not about comparing every feature but explaining the concepts behind the alternatives. Read articles and docs from the different vendors to make your own evaluation and decision. View this post as a list of criteria to not forget important aspects in your cloud service selection.

What is a fully-managed Kafka cloud offering?

Before we get started looking at Kafka cloud services like Amazon MSK and Confluent Cloud, it is important to understand what fully managed actually means:

Make sure to evaluate the technical solution. Most Kafka cloud solutions market their offering as “fully managed”. However, almost all Kafka cloud offerings are only partially managed! In most cases, the customer must operate the Kafka infrastructure, fix bugs, and optimize scalability and performance.

Cloud-native Apache Kafka re-engineered for the cloud

Operating Apache Kafka as a fully-managed offering in the cloud requires several additional components to the core of open-source Kafka. A cloud-native Kafka SaaS has features like:

The latest stable version with non-disruptive rolling upgrades
Elastic scale (up and down!)
Self-balancing clusters that take over the complexity and risk of rebalancing partitions across Kafka brokers
Tiered storage for cost-efficient long-term storage and better scalability (as the cold storage does not require a rebalancing of partitions and other complex operations tasks)
Complete solution “on top of the infrastructure”, including connectors, stream processing, security, and data governance – all in a single fully-managed SaaS

To learn more about building a cloud-native Kafka service, I highly recommend reading the following paper: ”The Cloud-Native Chasm: Lessons Learned from Reinventing Apache Kafka as a Cloud-Native, Online Service”.

Comparison of Apache Kafka products and cloud services

Apache Kafka became the de facto standard for data streaming. The open-source community is vast. Various vendors added Kafka and related tooling to their offerings or provide a Kafka cloud service. I wrote a blog post in 2021: “Comparison of Open Source Apache Kafka vs. Vendors including Confluent, Cloudera, Red Hat, Amazon MSK“:

The article uses a car analogy – from the motor engine to the self-driving car – to explore the different Kafka offerings available on the market. I also cover a few other vehicles, meaning (partly) Kafka-compatible technologies. The goal is not a feature-by-feature comparison (that would be outdated the day after the publication). Instead, the intention is to educate about the different deployment models, product strategies, and trade-offs from the options.

The above post is worth reading to understand how comparing different Kafka solutions makes sense. However, products develop and innovate… Tech comparisons get outdated quickly. In the meantime, AWS released a new product: Amazon MSK Serverless. This blog post explores what it is, when to use it, and how it differs from other Kafka products. It compares especially Amazon MSK (the partially managed service) and Confluent Cloud (a fully-managed competitor to Amazon MSK Serverless).

How does Amazon MSK Serverless fit into the Kafka portfolio?

Keeping the car analogy of my previous post, I wonder: Is it a self-driving car, a complete car you drive by yourself, or just a car engine to build your own car? Interestingly, you can argue for all three. Let’s explore this in the following sections.

Introducing Amazon MSK Serverless

Amazon MSK Serverless is a cluster type for Amazon MSK to run Apache Kafka without having to manage and scale cluster capacity. MSK Serverless automatically provisions and scales compute and storage resources. Thus, you can use Apache Kafka on demand and pay for the data you stream and retain.

Amazon MSK is one of the hundreds of cloud services that AWS provides. AWS is a one-stop shop for all cloud needs. That’s definitely a key strength of AWS (and similar to Azure and GCP).

Amazon MSK Serverless is built to solve the problems that come with Amazon MSK (the partially managed Kafka service that is marketed as a fully-managed solution even though it is not): A lot of hidden ops, infra, and downtime costs. This AWS podcast has a great episode that introduces Amazon MSK Serverless and when to use it as a replacement for Amazon MSK.

What Amazon does NOT tell you about MSK Serverless

AWS has great websites, documentation, and videos for its cloud services. This is not different for Amazon MSK. However, a few important details are not obvious… Let’s explore a few key points to make sure everybody understands what Amazon MSK Serverless is and what it is not.

Amazon MSK Serverless is incomplete Kafka

If you follow my blogs, then this might be boring. Despite that, too many people think about Kafka as a message queue and data transportation pipeline. That’s what it is, but Kafka is much more:

Real-time messaging at any scale
Data integration with Kafka Connect
Data processing (aka stream processing) with Kafka Streams (or 3rd party Kafka-native components like KSQL)
True decoupling (the most underestimated feature of Kafka because of its built-in storage capabilities) and replayability of events with flexible retention times
Data governance with service contracts using Schema Registry (to be fair, this is not part of open source Kafka, but a separate component and accessible from GitHub or by vendors like Confluent or Red Hat – but it is used in almost all serious Kafka projects)
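
To make the data governance point a bit more tangible, here is a minimal Python sketch of a data contract check on the producer side. It does not use Schema Registry or any Kafka client. The `ORDER_CONTRACT` schema, its field names, and the `validate` helper are hypothetical and only illustrate the idea: an event that violates the agreed contract is rejected before it reaches independent downstream consumers.

```python
# Hypothetical data contract: field name -> required Python type.
ORDER_CONTRACT = {"order_id": str, "amount": float, "currency": str}


def validate(event: dict, contract: dict) -> list:
    """Return a list of contract violations (an empty list means the event is valid)."""
    errors = []
    for field, expected_type in contract.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    for field in event:
        if field not in contract:
            errors.append(f"unexpected field: {field}")
    return errors


good = {"order_id": "A-1", "amount": 19.99, "currency": "EUR"}
bad = {"order_id": 42, "amount": 19.99}

print(validate(good, ORDER_CONTRACT))  # no violations
print(validate(bad, ORDER_CONTRACT))   # wrong type and a missing field
```

In a real project, this check would typically live in a schema-aware serializer instead of hand-written code, so that every producer and consumer enforces the same contract.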

As I won’t repeat myself, here are a few articles explaining why Kafka is more than a message queue like you find it in Amazon MSK Serverless:

TL;DR: AWS provides a cloud service for every problem. You can glue them together to build a solution. However, similar to a monolithic application that provides inflexibility in a single package, a mesh of too many independent glued services using different technologies is also very hard to operate and maintain. And the cost for so many services plus networking and data transfer will bring up many surprises in such a setup.

You should ask yourself a few questions:

How do you implement data integration and business logic with Amazon MSK Serverless?
What’s the consequence regarding cost, SLAs, end-to-end latency, and delivery guarantees of combining Amazon MSK Serverless with various other products like Amazon Kinesis Data Analytics, AWS Glue, AWS Data Pipeline, or a 3rd party integration tool?
What is your security and data governance strategy around streaming data? How do you build an event-based data hub that enforces compliant communication between data producers and independent downstream consumers?

Spoilt for Choice: Amazon MSK and Amazon MSK Serverless are different products

Amazon MSK is NOT fully-managed. It is partially managed. After provisioning the brokers, you need to deploy, operate and monitor Kafka brokers and Kafka Connect connectors, and handle rebalancing with the open source tool Cruise Control. Check out AWS’ latest MSK sizing post: “Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost”. Seriously? A ten-page, very technical article explaining how to operate a “fully-managed cloud Kafka service”?

You might think that Amazon MSK Serverless is the successor of Amazon MSK to solve these problems. However, there are now two products to choose from: Amazon MSK and Amazon MSK Serverless.

Amazon does NOT recommend using Amazon MSK Serverless for all use cases! It is recommended if you don’t know the workloads or if they often change in volume.
Amazon recommends “the normal” Amazon MSK for predictable workloads as it is more cost-effective (and because MSK Serverless is not workable for them because of its many tough limitations). MSK Connect is also not supported yet with MSK Serverless and is coming at some point in the future.

It is totally okay to provide different products for different use cases. Confluent also has different offerings for different SLAs and functional requirements in its cloud offering. Multi-tenant basic clusters and dedicated clusters are available, but you never have to self-manage the cluster or fix bugs or performance issues yourself.

You should ask yourself a few questions:

Which projects require Amazon MSK and which require Amazon MSK Serverless?
How will the project scale as your workload grows?
What’s the migration/upgrade plan if your workload exceeds MSK Serverless partition/retention limits?
What is the total cost of ownership (TCO) for MSK plus all the other cloud services I need to combine it with?

Amazon MSK Serverless excludes Kafka support

Amazon MSK service level agreements say: “The Service Commitment DOES NOT APPLY to any unavailability, suspension or termination … caused by the underlying Apache Kafka or Apache Zookeeper engine software that leads to request failures …”

Amazon MSK Serverless is part of the Amazon MSK product and has the same limitation. Excluding Kafka support from the MSK offering is (or should be) a blocker for any serious data streaming project!

Not much more to add here… Do you really want to buy a specific product that excludes the support for its core capability? Please also ask your manager if he agrees and takes the risk.

You should ask yourself a few questions:

Who is responsible and takes the risk if you hit a Kafka issue in your project using Amazon MSK or Amazon MSK Serverless?
How do you react to security incidents related to the Apache Kafka open source project?
How do you fix performance or scalability issues (on both the client and server side)?

When NOT to use Amazon MSK Serverless?

Let’s go back to the car analogy. Is Amazon MSK Serverless a self-driving car?

Obviously, Amazon MSK Serverless is self-driving. That’s what a serverless product is. Similar to Amazon S3 for object storage or AWS Lambda for serverless functions.

However, Amazon MSK Serverless is NOT a complete car! It does not provide enterprise support for its functionality. And it does not provide more than just the core of data streaming.

Therefore, Amazon MSK Serverless is a great self-driving AWS product for some use cases. But you should evaluate the following facts before deciding for or against this cloud service.

24/7 Enterprise support for the product

AWS excludes Kafka support from its product Amazon MSK. Amazon MSK Serverless is part of Amazon MSK and uses its SLAs.

I am amazed at how many enterprises use Amazon MSK without reading the SLAs. Most people are not aware that Kafka support is excluded from the product.

This makes Amazon MSK Serverless a car engine, not a complete car, right? Do you really want to build your own car and take over the burden and risk of failures while driving on the street?

If you need to deploy mission-critical workloads with 24/7 SLAs, you can stop reading and qualify out Amazon MSK (including Amazon MSK Serverless) until AWS adds serious SLAs to this product in the future.

Complete data streaming platform

AWS has a service for everything. You can glue them together. In our car analogy, it would be many cars or vehicles in your enterprise architecture. Most of us learned the hard way that distributed microservices are no free lunch.

The monolithic data lake (now pitched as lakehouse from vendors like Databricks and Snowflake) is no better approach. Use the right technology for a problem or data product. Finding the right mix between focus and independence is crucial. Kafka’s role is the central or decentralized real-time data hub to transport events. This includes data integration and processing, and decoupling systems from each other.

A modern data flow requires a simple, reliable and governed way to integrate and process data. Leveraging Kafka’s ecosystem like Kafka Connect and Kafka Streams enables mission-critical end-to-end latency and SLAs in a cost-efficient infrastructure. Development, operations, and monitoring are much harder and more costly if you glue together several services to build a real-time data hub.

However, Kafka is not a silver bullet. Hence, you need to understand when NOT to use Kafka and how it relates to data warehouses, data lakes, and other applications.

After a long introduction to this aspect, long story short: If you use Amazon MSK Serverless, it is the data ingestion component in your enterprise architecture. There are no fully managed components other than Kafka, and no native integrations with other 1st party AWS cloud services like S3 or Redshift or with 3rd party cloud services like Snowflake, Databricks, or MongoDB. You must combine Amazon MSK Serverless with several other AWS services for event processing and storage. Additionally, connectivity needs to be implemented and operated by your project team (using Kafka Connect connectors, another 1st or 3rd party ETL tool, or custom glue code).

Amazon MSK Serverless only supports AWS Identity and Access Management (IAM) authentication, which limits you to Java clients only. There is no way to use the open source clients for other programming languages. Python, C++, .NET, Go, JavaScript, etc. are not supported with Amazon MSK Serverless.

MSK Connect allows deploying Kafka Connect connectors (that are available open source, licensed from other vendors, or self-built) into this platform. Similar to Amazon MSK, this is not a fully-managed product. You deploy, operate, and monitor the connectors by yourself. Look at the fully managed connectors in Confluent Cloud to understand the difference. Also, note that AWS will only support Connect workers. But it will not support the connectors themselves even if running on MSK Connect.

Event-driven architecture with true decoupling between the microservices

An event-driven architecture powered by data streaming is great as a single integration infrastructure. However, the story goes far beyond this. Modern enterprise architectures leverage principles like microservices, domain-driven design, and data mesh to build decentralized applications and data products.

A streaming data exchange enables such a decentralized architecture with real-time data sharing. A critical capability for such a strategic enterprise component is long-term data storage. It

decouples independent applications
handles backpressure for slow consumers (like batch systems or request-response web services)
enables replayability of historical events (e.g., for a Python consumer in the machine learning platform from data engineers).

The storage capability of Kafka is a key differentiator against message queues like IBM MQ, RabbitMQ, or Amazon SQS. Retention time is an important feature to set the right storage options per Kafka topic. Confluent makes Kafka even better by providing Tiered Storage to separate storage from compute for a much more cost-efficient and scalable solution with infinite storage capabilities.
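
The following Python sketch illustrates why storage matters for decoupling. It is a toy in-memory log, not the Kafka API: the `TopicLog` class, its methods, and the consumer group names are hypothetical. Each consumer group tracks its own offset, so a slow consumer simply lags behind instead of blocking the producer, and any consumer can rewind to replay historical events.

```python
class TopicLog:
    """Append-only in-memory log; a toy stand-in for one Kafka topic partition."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer group -> next offset to read

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def poll(self, group, max_events=10):
        """Return the next events for a group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group's offset, e.g. back to 0 to replay history."""
        self._offsets[group] = offset

    def lag(self, group):
        """Number of events the group has not read yet."""
        return len(self._events) - self._offsets.get(group, 0)


log = TopicLog()
for i in range(5):
    log.append({"sensor": "s1", "value": i})

fast = log.poll("realtime-app")             # reads all five events
slow = log.poll("batch-job", max_events=2)  # slow consumer reads only two
print(len(fast), len(slow), log.lag("batch-job"))

log.seek("realtime-app", 0)                 # replay from the beginning
replayed = log.poll("realtime-app")
print(replayed == fast)
```

A message queue that deletes a message once it is consumed cannot offer the last step. That is the difference retention makes.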

Amazon MSK Serverless has a limited retention time of 24 hours. This is good enough for many data ingestion use cases but not for building a real-time data mesh across business units or even across organizations. Another tough requirement of Amazon MSK Serverless is the limitation of 120 partitions. Not really a limit that allows building a strategic platform around it.

As Amazon MSK Serverless is a new product, expect the limitations to change and improve over time. Check the docs for updates. UPDATE Q1 2023: Amazon MSK Serverless added unlimited retention time and support for more partitions. That’s excellent news for this service. With this update, if retention time is your critical criterion, Amazon MSK Serverless has been a stronger option since 2023. However, check the storage costs and compare different cloud offerings for this.
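
Because such limits change, it helps to keep them as explicit parameters when you evaluate a service. Here is a small Python sketch of such a check. The `ServiceLimits`, `Workload`, and `check_fit` names and the two example workloads are hypothetical; the default limits are the 2022 values quoted in this post and are only placeholders, so replace them with the quotas from the current documentation.

```python
from dataclasses import dataclass


@dataclass
class ServiceLimits:
    # Defaults are the 2022 values quoted in this post; treat them as
    # placeholders and replace them with the currently documented quotas.
    max_partitions: int = 120
    max_retention_hours: float = 24.0


@dataclass
class Workload:
    name: str
    partitions: int
    retention_hours: float


def check_fit(workload: Workload, limits: ServiceLimits) -> list:
    """Return the reasons why a workload does not fit (empty list = it fits)."""
    problems = []
    if workload.partitions > limits.max_partitions:
        problems.append(
            f"{workload.partitions} partitions exceed the limit of {limits.max_partitions}"
        )
    if workload.retention_hours > limits.max_retention_hours:
        problems.append(
            f"{workload.retention_hours}h retention exceeds the limit of {limits.max_retention_hours}h"
        )
    return problems


ingestion = Workload("clickstream-ingestion", partitions=48, retention_hours=12)
data_mesh = Workload("cross-domain-data-mesh", partitions=300, retention_hours=24 * 30)

for w in (ingestion, data_mesh):
    print(w.name, check_fit(w, ServiceLimits()))
```

Passing `ServiceLimits(max_retention_hours=float("inf"))` models a service without a retention limit; the partition check still applies in that case.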

But anyway, these limitations prove how hard it is to build a fully-managed Kafka offering (like Confluent Cloud) compared to a partially managed Kafka offering (like “the normal” Amazon MSK).

Hybrid AWS and multi-cloud Kafka deployments

The most obvious point: Amazon MSK Serverless is only a reasonable discussion if you run your apps in the public AWS cloud. For anything else, like multi-cloud with Azure or GCP, AWS edge offerings like AWS Outposts or Wavelength, hybrid environments, or edge deployments like a factory or retail store, Amazon MSK Serverless is no option.

If you need to deploy outside the public AWS cloud, check my comparison of Kafka offerings, including Confluent, IBM, Red Hat, and Cloudera.

I want to emphasize that no product or service is 100% cloud agnostic. For instance, building Confluent Cloud on AWS, Azure, and GCP includes unique challenges under the hood. Confluent Cloud is built on Kubernetes. Hence, the template and many automation mechanisms can be reused across cloud vendors. But storage, compute, pricing, support, and many other characteristics and features differ at each cloud service provider.

Having said this, when you leverage a SaaS like Confluent Cloud, you have no knowledge of or access to the technical infrastructure, so you don’t see these issues under the hood. On the developer level, you produce and consume messages with the Kafka API and configure additional features like fully-managed connectors, data governance, or cluster linking. All the operations complexity is handled by the vendor, no matter which cloud you run on.

Coopetition: The winners are AWS and Confluent Cloud

The reason for this post was the evolution of the Amazon MSK product line. Hence, if you read this a year later, the product features and limitations might look completely different once again. Use blog posts like this to understand how to evaluate different solutions and SaaS offerings. Then do your own accurate research before making a product decision.

Amazon MSK Serverless is a great new AWS service that helps AWS customers with some types of projects. But it has tough limitations for some other projects. Additionally, Amazon MSK (including Amazon MSK Serverless) excludes Kafka support! And it is not a complete data streaming platform. Be careful not to create a mess of glue code between tens of serverless cloud services and applications. Confluent Cloud is the much more sophisticated fully-managed Kafka cloud offering (on AWS and everywhere). I am not saying this because I am a Confluent employee but because almost everybody agrees on this. And it is not really a surprise, as Confluent focuses only on data streaming with 2000 employees and employs many full-time committers to the Apache Kafka open source project. Amazon has zero, by the way.

By the way: Did you know you can use your AWS credits to consume Confluent Cloud like any other native AWS service? This is because of the strong partnership between Confluent and AWS. Yes, there is coopetition. That’s what the world looks like today…

Confluent Cloud provides a complete cloud-native platform including 99.99% SLA, fully managed connectors and stream processing, and maybe most interesting to readers of this post, integration with AWS services (S3, Redshift, Lambda, Kinesis, etc.) plus AWS security and networking (VPC Peering, Private Link, Transit Gateway, etc.). Confluent and AWS work closely together on hybrid deployments, leveraging AWS edge services like AWS Wavelength for 5G scenarios.

Which Kafka cloud service do you use today? What are your pros and cons? Do you plan a migration – e.g., from Amazon MSK to Confluent Cloud or from open source Kafka to Amazon MSK Serverless? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When NOT to choose Amazon MSK Serverless for Apache Kafka? appeared first on Kai Waehner.

Comparison: JMS Message Queue vs. Apache Kafka

Kai Waehner — Thu, 12 May 2022 05:13:19 +0000

Comparing JMS-based message queue (MQ) infrastructures and Apache Kafka-based data streaming is a widespread topic. Unfortunately, the battle is an apple-to-orange comparison that often includes misinformation and FUD from vendors. This blog post explores the differences, trade-offs, and architectures of JMS message brokers and Kafka deployments. Learn how to choose between JMS brokers like IBM MQ or RabbitMQ and open-source Kafka or serverless cloud services like Confluent Cloud.

Motivation: The battle of apples vs. oranges

I have to discuss the differences and trade-offs between JMS message brokers and Apache Kafka every week in customer meetings. What annoys me most are the common misunderstandings and (sometimes) intentional FUD in various blogs, articles, and presentations about this discussion.

I recently discussed this topic with Clement Escoffier from Red Hat in the “Coding over Cocktails” Podcast: JMS vs. Kafka: Technology Smackdown. A great conversation with more agreement than you might expect from such an episode where I picked the “Kafka proponent” while Clement took over the role of the “JMS proponent”.

These aspects motivated me to write a blog series about “JMS, Message Queues, and Apache Kafka”:

THIS POST – 10 Comparison Criteria for JMS Message Broker vs. Apache Kafka Data Streaming
Alternatives for a Dead Letter Queue (DLQ) in Apache Kafka
Implementing the Request-Reply Pattern with Apache Kafka
UPCOMING – A Decision Tree for Choosing the Right Messaging System (JMS vs. Apache Kafka)
UPCOMING – From JMS Message Broker to Apache Kafka: Integration, Migration, and/or Replacement

I will link the other posts here as soon as they are available. Please follow my newsletter to get updated in real-time about new posts. (no spam or ads)

Special thanks to my colleague and long-term messaging and data streaming expert Heinz Schaffner for technical feedback and review of this blog series. He has worked for TIBCO, Solace, and Confluent for 25 years.

10 comparison criteria: JMS vs. Apache Kafka

This blog post explores ten comparison criteria. The goal is to explain the differences between message queues and data streaming, clarify some misunderstandings about what an API or implementation is, and give some technical background to do your evaluation to find the right tool for the job.

The list of products and cloud services is long for JMS implementations and Kafka offerings. A few examples:

JMS implementations of the JMS API (open source and commercial offerings): Apache ActiveMQ, Apache Qpid (using AMQP), IBM MQ (formerly MQSeries, then WebSphere MQ), JBoss HornetQ, Oracle AQ, RabbitMQ, TIBCO EMS, Solace, etc.
Apache Kafka products, cloud services, and rewrites (beyond the valid option of using just open-source Kafka): Confluent, Cloudera, Amazon MSK, Red Hat, Redpanda, Azure Event Hubs, etc.

Here are the criteria for comparing JMS message brokers vs. Apache Kafka and its related products/cloud services:

Message broker vs. data streaming platform
API Specification vs. open-source protocol implementation
Transactional vs. analytical workloads
Push vs. pull message consumption
Simple vs. powerful and complex API
Storage for durability vs. true decoupling
Server-side data-processing vs. decoupled continuous stream processing
Complex operations vs. serverless cloud
Java/JVM vs. any programming language
Single deployment vs. multi-region (including hybrid and multi-cloud) replication

Let’s now explore the ten comparison criteria.

1. Message broker vs. data streaming platform

TL;DR: JMS message brokers provide messaging capabilities to produce and consume messages. Apache Kafka is a data streaming platform that combines messaging, storage, data integration, and stream processing capabilities.

The most important aspect first: The comparison of JMS and Apache Kafka is an apple-to-orange comparison for several reasons. I would even go further and say that they cannot both be fruit, as they are so different from each other.

JMS API (and implementations like IBM MQ, RabbitMQ, et al)

JMS (Java Message Service) is a Java application programming interface (API) that provides generic messaging models. The API handles the producer-consumer problem, which can facilitate the sending and receiving of messages between software systems.

Therefore, the central capability of JMS message brokers (that implement the JMS API) is to send messages from a source application to another destination in real-time. That’s it. And if that’s what you need, then JMS is the right choice for you! But keep in mind that projects must use additional tools for data integration and advanced data processing tasks.

Apache Kafka (open source and vendors like Confluent, Cloudera, Red Hat, Amazon, et al)

Apache Kafka is an open-source protocol implementation for data streaming. It includes:

Apache Kafka is the core for distributed messaging and storage. High throughput, low latency, high availability, secure.
Kafka Connect is an integration framework for connecting external sources/destinations to Kafka.
Kafka Streams is a simple Java library that enables streaming application development within the Kafka framework.

This combination of capabilities enables the building of end-to-end data pipelines and applications. That’s much more than what you can do with a message queue.
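
As a rough illustration of such an end-to-end pipeline, here is a minimal Python sketch. It uses plain generators as stand-ins for the building blocks and does not call Kafka Connect or Kafka Streams; the function names `source_connector`, `running_total`, and `sink_connector` and the sample data are hypothetical. The point is the shape of the flow: integrate from a source, process continuously, and integrate into a sink.

```python
from collections import defaultdict


def source_connector(rows):
    """Integration step: turn rows from an external system into events."""
    for row in rows:
        yield {"key": row["customer"], "value": row["amount"]}


def running_total(events):
    """Stream processing step: continuously aggregate amounts per key."""
    totals = defaultdict(float)
    for event in events:
        totals[event["key"]] += event["value"]
        yield {"key": event["key"], "total": totals[event["key"]]}


def sink_connector(events, target):
    """Integration step: write processed events to a downstream system."""
    for event in events:
        target[event["key"]] = event["total"]


payments = [
    {"customer": "alice", "amount": 10.0},
    {"customer": "bob", "amount": 5.0},
    {"customer": "alice", "amount": 2.5},
]
dashboard = {}
sink_connector(running_total(source_connector(payments)), dashboard)
print(dashboard)
```

With a pure message broker, each of these three steps would need a separate tool around the broker. In the Kafka ecosystem, they run on the same infrastructure.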

2. JMS API specification vs. Apache Kafka open-source protocol implementation

TL;DR: JMS is a specification that vendors implement and extend in their opinionated way. Apache Kafka is the open-source implementation of the underlying specified Kafka protocol.

It is crucial to clarify the terms first before you evaluate JMS and Kafka:

Standard API: Specified by industry consortiums or other industry-neutral (often global) groups or organizations. Requires compliance tests for all features and complete certifications to become standard-compliant. Example: OPC-UA.
De facto standard API: Originates from an existing successful solution (an open-source framework, a commercial product, or a cloud service). Examples: Amazon S3 (proprietary from a single vendor). Apache Kafka (open source from the vibrant community).
API Specification: A specification document to define how vendors can implement a related product. There are no complete compliance tests or complete certifications for the implementation of all features. The consequence is a “standard API” but no portability between implementations. Example: JMS. Specifically for JMS, note that in order to be able to use the compliance suite for JMS, a commercial vendor has to sign up to very onerous reporting requirements towards Oracle.

The alternative kinds of standards have trade-offs. If you want to learn more, check out how Apache Kafka became the de facto standard for data streaming in the last few years.

Portability and migrations became much more relevant in hybrid and multi-cloud environments than in the past decades where you had your workloads in a single data center.

JMS is a specification for message-oriented middleware

JMS is a specification currently maintained under the Java Community Process as JSR 343. The latest (not yet released) version JMS 3.0 is under early development as part of Jakarta EE and rebranded to Jakarta Messaging API. Today, JMS 2.0 is the specification used in prevalent message broker implementations. Nobody knows where JMS 3.0 will go at all. Hence, this post focuses on the JMS 2.0 specification to solve real-world problems today.

I often use the term “JMS message broker” in the following sections as JMS (i.e., the API) does not specify or implement many features you know in your favorite JMS implementation. Usually, when people talk about JMS, they mean JMS message broker implementations, not the JMS API specification.

JMS message brokers and the JMS portability myth

The JMS specification was developed to provide a common Java library to access different messaging vendor’s brokers. It was intended to act as a wrapper to the messaging vendor’s proprietary APIs in the same way JDBC provided similar functionality for database APIs.

Unfortunately, this simple integration turned out not to be the case. The migration of the JMS code from one vendor’s broker to another is quite complex for several reasons:

Not all JMS features are mandatory (security, topic/queue labeling, clustering, routing, compression, etc.)
There is no JMS specification for transport
No specification to define how persistence is implemented
No specification to define how fault tolerance or high availability is implemented
Different interpretations of the JMS specification by different vendors result in potentially other behaviors for the same JMS functions
No specification for security
There is no specification for value-added features in the brokers (such as topic to queue bridging, inter-broker routing, access control lists, etc.)

Therefore, simple source code migration and interoperability between JMS vendors is a myth! This sounds crazy, doesn’t it?

Vendors provide a great deal of unique functionality within the broker (such as topic-to-queue mapping, broker routing, etc.) that provide architectural functionality to the application but are part of the broker functionality and not the application or part of the JMS specification.
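
Here is a small Python sketch of the portability problem. It is not JMS code. The `MessageProducer` interface and the two imaginary vendors are hypothetical and only show the pattern: both brokers implement the same API, but behavior that the specification leaves open (here, whether unknown destinations are created on first use) differs, so the same application code succeeds against one broker and fails against the other.

```python
from abc import ABC, abstractmethod


class MessageProducer(ABC):
    """Hypothetical stand-in for a specification-level API such as JMS."""

    @abstractmethod
    def send(self, destination: str, body: str) -> None: ...


class VendorABroker(MessageProducer):
    """Imaginary vendor A: creates unknown destinations on first use."""

    def __init__(self):
        self.queues = {}

    def send(self, destination, body):
        self.queues.setdefault(destination, []).append(body)


class VendorBBroker(MessageProducer):
    """Imaginary vendor B: destinations must be created by an admin first."""

    def __init__(self, preconfigured):
        self.queues = {name: [] for name in preconfigured}

    def send(self, destination, body):
        if destination not in self.queues:
            raise LookupError(f"unknown destination: {destination}")
        self.queues[destination].append(body)


def application_code(producer: MessageProducer) -> str:
    """The same 'portable' application code, run against both vendors."""
    try:
        producer.send("orders", "order-1")
        return "sent"
    except LookupError as err:
        return f"failed: {err}"


print(application_code(VendorABroker()))
print(application_code(VendorBBroker(preconfigured=["payments"])))
```

The application compiles against the shared interface in both cases. The migration effort hides in the broker-side behavior and configuration, not in the API calls.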

Apache Kafka is an open-source protocol implementation for data streaming

Apache Kafka is an implementation to do reliable and scalable data streaming in real-time. The project is open-source and available under Apache 2.0 license, and is driven by a vast community.

Apache Kafka is NOT a standard like OPC-UA or a specification like JMS. However, Kafka at least provides the source code reference implementation, protocol and API definitions, etc.

Kafka established itself as the de facto standard for data streaming. Today, over 100,000 organizations use Apache Kafka. The Kafka API became the de facto standard for event-driven architectures and event streaming. Use cases span all industries and infrastructures (edge, hybrid, multi-cloud) and include various kinds of transactional and analytic workloads. I collected a few examples across verticals that use Apache Kafka to show the prevalence across markets.

Now, hold on. I used the term Kafka API in the above section. Let’s clarify this: As discussed, Apache Kafka is an implementation of a distributed data streaming platform including the server-side and client-side and various APIs for producing and consuming events, configuration, security, operations, etc. The Kafka API is relevant, too, as Kafka rewrites like Azure Event Hubs and Redpanda use it.

Portability of Apache Kafka – yet another myth?

If you use Apache Kafka as an open-source project, this is the complete Kafka implementation. Some vendors use the full Apache Kafka implementation and build a more advanced product around it.

Here, the migration is super straightforward, as Kafka is not just a specification that each vendor implements differently. Instead, it is the same code, libraries, and packages.

For instance, I have seen several successful migrations from Cloudera to Confluent deployments or from self-managed Apache Kafka open-source infrastructure to serverless Confluent Cloud.

The Kafka API – Kafka rewrites like Azure Event Hubs, Redpanda, Apache Pulsar

With the global success of Kafka, some vendors and cloud services did not build a product on top of the Apache Kafka implementation. Instead, they built their own implementation on top of the Kafka API. The underlying implementation is proprietary (like in Azure’s cloud service Event Hubs) or open-source (like Apache Pulsar’s Kafka bridge or Redpanda’s rewrite in C++).

Be careful and analyze if vendors integrate the whole Apache Kafka project or rewrote the complete API. Contrary to the battle-tested Apache Kafka project, a Kafka rewrite using the Kafka API is a completely new implementation!

Many vendors even exclude some components or APIs (like Kafka Connect for data integration or Kafka Streams for stream processing) completely or exclude critical features like exactly-once semantics or long-term storage in their support terms and conditions.

It is up to you to evaluate the different Kafka offerings and their limitations. Recently, I compared Kafka vendors such as Confluent, Cloudera, Red Hat, or Amazon MSK and related technologies like Azure Event Hubs, AWS Kinesis, Redpanda, or Apache Pulsar.

Just battle-test the requirements by yourself. If you find a Kafka-to-XYZ bridge with less than a hundred lines of code, or if you find a .exe Windows Kafka server download from a middleware vendor, be skeptical!

All that glitters is not gold. Some frameworks or vendors sound too good to be true. Just saying you support the Kafka API, you provide a fully managed serverless Kafka offering, or you scale much better is not trustworthy if you constantly have to spread fear, uncertainty, and doubt (FUD) about Kafka to claim that you are much better. For instance, I was annoyed by Pulsar always trying to be better than Kafka by creating a lot of FUD and myths in the open-source community. I responded in my Apache Pulsar vs. Kafka comparison two years ago. FUD is the wrong strategy for any vendor. It does not work. For that reason, Kafka’s adoption still grows like crazy while Pulsar grows much slower percentage-wise (even though the download numbers are on a much lower level anyway).

3. Transactional vs. analytical workloads

TL;DR: A JMS message broker provides transactional capabilities for low volumes of messages. Apache Kafka supports low and high volumes of messages supporting transactional and analytical workloads.

JMS – Session and two-phase commit (XA) transactions

Most JMS message brokers have good support for transactional workloads.

A transacted session supports a single series of transactions. Each transaction groups a set of produced messages and a set of consumed messages into an atomic unit of work.

Two-phase commit transactions (XA transactions) work on a limited scale. They are used to integrate with other systems like Mainframe CICS / DB2 or Oracle database. But they are hard to operate and cannot scale beyond a few transactions per second.

It is important to note that support for XA transactions is not mandatory with the JMS 2.0 specification. This differs from the session transaction.

Kafka – Exactly-once semantics and transaction API

Kafka is a distributed, fault-tolerant system that is resilient by nature (if you deploy and operate it correctly). No downtime and no data loss can be guaranteed, like in your favorite database, mainframe, or other core platforms.

And even better: Kafka’s Transaction API, i.e., Exactly-Once Semantics (EOS), has been available since Kafka 0.11 (GA’ed many years ago). EOS makes building transactional workloads even easier as you don’t need to handle duplicates anymore.

Kafka supports atomic writes across multiple partitions through the transactions API. This allows a producer to send a batch of messages to multiple partitions. Either all messages in the batch are eventually visible to any consumer, or none are ever visible to consumers.

Kafka transactions work very differently than JMS transactions. But the goal is the same: Each consumer receives the produced event exactly once. Find more details in the blog post “Analytics vs. Transactions in Data Streaming with Apache Kafka”.
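The all-or-nothing visibility described above can be illustrated with a small toy model. This is only a sketch in plain Python, not the Kafka client API: the class `ToyTransactionalLog` and its methods are hypothetical names invented for illustration. It assumes a simplified design in which records are appended immediately but readers only see records whose transaction has been committed, loosely mirroring what a consumer with the `read_committed` isolation level observes.

```python
from collections import defaultdict


class ToyTransactionalLog:
    """Toy in-memory model of atomic writes across partitions (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._partitions = defaultdict(list)  # partition -> list of (txn_id, value)
        self._committed = set()               # ids of committed transactions
        self._next_txn_id = 0

    def begin(self):
        """Start a new transaction and return its id."""
        txn_id = self._next_txn_id
        self._next_txn_id += 1
        return txn_id

    def send(self, txn_id, partition, value):
        """Append a record; it stays invisible to readers until the transaction commits."""
        self._partitions[partition].append((txn_id, value))

    def commit(self, txn_id):
        """Make all records of the transaction visible at once."""
        self._committed.add(txn_id)

    def read_committed(self, partition):
        """Return only the values that belong to committed transactions."""
        return [value for txn_id, value in self._partitions[partition]
                if txn_id in self._committed]


log = ToyTransactionalLog()

# One transaction writes to two different partitions.
txn = log.begin()
log.send(txn, 0, "debit account A")
log.send(txn, 1, "credit account B")
visible_before = [log.read_committed(0), log.read_committed(1)]  # nothing visible yet
log.commit(txn)
visible_after = [log.read_committed(0), log.read_committed(1)]   # both visible together

# A transaction that is never committed stays invisible to readers.
abandoned = log.begin()
log.send(abandoned, 0, "debit account C")
visible_final = log.read_committed(0)
```

In this sketch, consumers never observe a state in which only one of the two partitions contains the record of the first transaction, which is the property that matters for transactional workloads.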

4. Push vs. pull message consumption

TL;DR: JMS message brokers push messages to consumer applications. Kafka consumers pull messages providing true decoupling and backpressure handling for independent consumer applications.

Pushing messages seems to be the obvious choice for a real-time messaging system like JMS-based message brokers. However, push-based messaging has various drawbacks regarding decoupling and scalability.

JMS expects the broker to provide back pressure and implement a “pre-fetch” capability, but this is not mandatory. If used, the broker controls the backpressure, which you cannot control.

With Kafka, the consumer controls the backpressure. Each Kafka consumer consumes events in real-time, batch, or only on demand – in the way the particular consumer supports and can handle the data stream. This is an enormous advantage for many inflexible and non-elastic environments.
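A minimal sketch of consumer-controlled backpressure, written as a toy model in plain Python rather than with a real Kafka client. The names `ToyLog`, `PullConsumer`, and `max_poll_records` are assumptions made for this illustration. Each consumer keeps its own offset and decides how many records to fetch per poll, so a slow consumer simply lags behind without slowing down the producer or other consumers.

```python
class ToyLog:
    """Toy append-only log shared by all consumers (hypothetical, not Kafka's API)."""

    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)


class PullConsumer:
    """Consumer that pulls at its own pace and tracks its own position."""

    def __init__(self, log, max_poll_records):
        self.log = log
        self.max_poll_records = max_poll_records  # upper bound per poll, chosen by the consumer
        self.offset = 0                           # next position to read

    def poll(self):
        """Fetch at most max_poll_records records starting at the current offset."""
        batch = self.log.records[self.offset:self.offset + self.max_poll_records]
        self.offset += len(batch)
        return batch

    def lag(self):
        """Number of records this consumer has not read yet."""
        return len(self.log.records) - self.offset


log = ToyLog()
for i in range(10):
    log.append(f"event-{i}")  # the producer is never blocked by consumers

fast = PullConsumer(log, max_poll_records=100)
slow = PullConsumer(log, max_poll_records=3)

fast_batch = fast.poll()  # reads everything in one poll
slow_batch = slow.poll()  # reads only what it can handle right now
```

After one poll each, the fast consumer has no lag, while the slow consumer still has seven records waiting in the log and can fetch them whenever it is ready.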

So while JMS has some kind of backpressure, the producer stops if the queue is full. In Kafka, you control the backpressure on the consumer. There is no way to scale a producer with JMS (as there are no partitions in a JMS queue or topic).

JMS consumers can be scaled, but then you lose guaranteed ordering. Guaranteed ordering in JMS message brokers only works via a single producer, single consumer, and transaction.
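For contrast, partitions let Kafka scale consumption while keeping the order of events that share a key, because records with the same key land in the same partition. The following toy sketch shows the idea in plain Python. It is an illustration under assumptions: `partition_for` is a hypothetical helper, and `zlib.crc32` is only a stand-in for the hash function a real Kafka producer uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (crc32 is a stand-in, not Kafka's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

# Hypothetical events: (key, value) pairs arriving interleaved.
events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-2", "shipped"),
    ("order-1", "shipped"),
]

for key, value in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, value))


def per_key_order(key):
    """Return the values for one key in the order they appear in its partition."""
    partition = partitions[partition_for(key, NUM_PARTITIONS)]
    return [value for k, value in partition if k == key]


order_1_history = per_key_order("order-1")
order_2_history = per_key_order("order-2")
```

Each partition can be read by a different consumer in parallel, and the events of a single order are still seen in the order they were produced.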

5. Simple JMS API vs. powerful and complex Kafka API

TL;DR: The JMS API provides simple operations to produce and consume messages. Apache Kafka has a more granular API that brings additional power and complexity.

JMS vendors hide all the cool stuff in the implementation under the spec. You only get the 5% (no control, built by the vendor). You need to make the rest by yourself. On the other side, Kafka exposes everything. Most developers only need 5%.

In summary, be aware that JMS message brokers are built to send messages from a data source to one or more data sinks. Kafka is a data streaming platform that provides many more capabilities, features, event patterns, and processing options; and a much larger scale. With that in mind, it is no surprise that the APIs are very different and have different complexity.

If your use case requires just sending a few messages per second from A to B, JMS is the right choice and simple to use! If you need a streaming data hub at any scale, including data integration and data processing, that’s only Kafka.

Asynchronous request-reply vs. data in motion

One of the most common wishes of JMS developers is to use a request-response function in Kafka. Note that this design pattern is different in messaging systems from an RPC (remote procedure call) as you know it from legacy tools like CORBA or web service standards like SOAP/WSDL or HTTP. Request-reply in messaging brokers is an asynchronous communication that leverages a correlation ID.

Asynchronous messaging to get events from a producer (say a mobile app) to a consumer (say a database) is a very traditional workflow. No matter if you do fire-and-forget or request-reply. You put data at rest for further processing. JMS supports request-reply out-of-the-box. The API is very simple.

Data in motion with event streaming continuously processes data. The Kafka log is durable. The Kafka application maintains and queries the state in real-time or in batch. Data streaming is a paradigm shift for most developers and architects. The design patterns are very different. Don’t try to reimplement your JMS application within Kafka using the same pattern and API. That is likely to fail! That is an anti-pattern.

Request-reply is inefficient and can suffer a lot of latency depending on the use case. HTTP or better gRPC is suitable for some use cases. Request-reply is replaced by the CQRS (Command and Query Responsibility Segregation) pattern with Kafka for streaming data. CQRS is not possible with JMS API, since JMS provides no state capabilities and lacks event sourcing capability.

A Kafka example for the request-response pattern

CQRS is the better design pattern for many Kafka use cases. Nevertheless, the request-reply pattern can be implemented with Kafka, too. But differently. Trying to do it like in a JMS message broker (with temporary queues etc.) will ultimately kill the Kafka cluster (because it works differently).

The Spring project shows how you can do better. The Spring Kafka template libraries have a great example of the request-reply pattern built with Kafka.

Check out “org.springframework.kafka.requestreply.ReplyingKafkaTemplate”. It creates request/reply applications using the Kafka API easily. The example is interesting since it implements the asynchronous request/reply, which is more complicated to write if you are using, for example, the JMS API. Another nice DZone article talks about synchronous request/reply using Spring Kafka templates.

The Spring documentation for Kafka Templates has a lot of details about the Request/Reply pattern for Kafka. So if you are using Spring, the request/reply pattern is pretty simple to implement with Kafka. If you are not using Spring, you can learn how to do request-reply with Kafka in your framework.
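If you are not using Spring, the core idea of asynchronous request-reply with a correlation ID is framework-independent. Here is a minimal sketch in plain Python with in-memory queues standing in for a request topic and a reply topic. All names (`send_request`, `run_responder`, `collect_replies`) are hypothetical and not part of Spring or the Kafka client. The responder deliberately answers in reverse order to show that the correlation ID, not the arrival order, matches replies to requests.

```python
import uuid
from collections import deque

request_topic = deque()  # stand-in for a request topic
reply_topic = deque()    # stand-in for a long-lived reply topic (no temporary queues)
pending = {}             # correlation id -> reply payload (None while waiting)


def send_request(payload):
    """Publish a request and remember its correlation id."""
    correlation_id = str(uuid.uuid4())
    pending[correlation_id] = None
    request_topic.append({"correlation_id": correlation_id, "payload": payload})
    return correlation_id


def run_responder():
    """Process all requests and reply in reverse order, copying the correlation id."""
    requests = list(request_topic)
    request_topic.clear()
    for request in reversed(requests):
        reply_topic.append({
            "correlation_id": request["correlation_id"],
            "payload": request["payload"].upper(),
        })


def collect_replies():
    """Match replies to pending requests by correlation id."""
    while reply_topic:
        reply = reply_topic.popleft()
        if reply["correlation_id"] in pending:
            pending[reply["correlation_id"]] = reply["payload"]


first_id = send_request("ping")
second_id = send_request("hello")
run_responder()
collect_replies()
```

After the replies are collected, `pending[first_id]` holds the answer to the first request even though its reply arrived last.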

6. Storage for durability vs. true decoupling

TL;DR: JMS message brokers use a storage system to provide high availability. The storage system of Kafka is much more advanced to enable long-term storage, back-pressure handling and replayability of historical events.

Kafka storage is more than just the persistence feature you know from JMS

When I explain the Kafka storage system to experienced JMS developers, I almost always get the same response: “Our JMS message broker XYZ also has storage under the hood. I don’t see the benefit of using Kafka!”

JMS uses an ephemeral storage system, where messages are only persisted until they are processed. Long-term storage and replayability of messages are not a concept JMS was designed for.

The core Kafka principles of append-only logs, offsets, guaranteed ordering, retention time, compacted topics, and so on provide many additional benefits beyond the durability guarantees of a JMS. Backpressure handling, true decoupling between consumers, the replayability of historical events, and more are huge differentiators between JMS and Kafka.
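To make append-only logs, offsets, replayability, and compacted topics more tangible, here is a toy model in plain Python. It is a sketch under assumptions, not how Kafka implements its storage: the class `AppendOnlyLog` and the sample keys and values are made up for illustration.

```python
class AppendOnlyLog:
    """Toy append-only log with offsets (hypothetical, not Kafka's storage engine)."""

    def __init__(self):
        self._records = []  # list of (key, value); the list index is the offset

    def append(self, key, value):
        """Append a record and return its offset. Existing records are never modified."""
        self._records.append((key, value))
        return len(self._records) - 1

    def read_from(self, offset):
        """Replay all records starting at the given offset as (offset, key, value)."""
        return [(i, key, value)
                for i, (key, value) in enumerate(self._records)
                if i >= offset]

    def compacted(self):
        """Keep only the latest record per key, preserving the original offsets."""
        latest = {}
        for i, (key, value) in enumerate(self._records):
            latest[key] = (i, value)
        return sorted((i, key, value) for key, (i, value) in latest.items())


log = AppendOnlyLog()
log.append("sensor-1", "20.5")
log.append("sensor-2", "18.0")
log.append("sensor-1", "21.0")

# Reading does not delete anything: a new consumer can replay from the beginning,
# while an existing consumer continues from its own offset.
full_replay = log.read_from(0)
tail_only = log.read_from(2)
latest_per_key = log.compacted()
```

The key difference to a queue is visible in `read_from`: consuming a record does not remove it, so any number of consumers can read the same history independently, at different times and different speeds.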

Check the Kafka docs for a deep dive into the Kafka storage system. I don’t want to touch on how Tiered Storage for Kafka is changing the game even more by providing even better scalability and cost-efficient long-term storage within the Kafka log.

7. Server-side data-processing with JMS vs. decoupled continuous stream processing with Kafka

TL;DR: JMS message brokers provide simple server-side event processing, like filtering or routing based on the message content. Kafka brokers are dumb. With Kafka, data processing is executed in decoupled applications/microservices.

Server-side JMS filtering and routing

Most JMS message brokers provide some features for server-side event processing. These features are handy for some workloads!

Just be careful that server-side processing usually comes with a cost. For instance:

JMS Pre-filtering scalability issues: The broker has to handle so many things. This can kill the broker in a hidden fashion
JMS Selectors (= routing) performance issues: It kills 40-50% of performance

Again, sometimes, the drawbacks are acceptable. Then this is a great functionality.

Kafka – Dumb pipes and smart endpoints

Kafka intentionally does not provide server-side processing. The brokers are dumb. The processing happens at the smart endpoints. This is a very well-known design pattern: Dumb pipes and smart endpoints.

The drawback is that you need separate applications/microservices/data products to implement the logic. This is not a big issue in serverless environments (like using a ksqlDB process running in Confluent Cloud for data processing). It gets more complex in self-managed environments.

However, the massive benefit of this architecture is the true decoupling between applications/technologies/programming languages, separation of concerns between business units for building business logic and operations of infrastructure, and the much better scalability and elasticity.
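The dumb pipes and smart endpoints pattern can be sketched in a few lines of plain Python. This is a toy illustration with made-up event data and a hypothetical `consume` helper, not a Kafka client: the shared log hands every record to every consumer, and each endpoint applies its own filter logic.

```python
# Hypothetical events in a shared log; the "broker" does no filtering or routing.
shared_log = [
    {"type": "payment", "amount": 250},
    {"type": "login", "user": "alice"},
    {"type": "payment", "amount": 20},
]


def consume(log, keep):
    """The pipe delivers every record; the endpoint decides what to keep."""
    return [record for record in log if keep(record)]


# Two independent endpoints with their own logic, reading the same unchanged log.
large_payments = consume(
    shared_log,
    lambda record: record["type"] == "payment" and record["amount"] >= 100,
)
logins = consume(shared_log, lambda record: record["type"] == "login")
```

Changing the filter of one endpoint requires no change on the broker side and has no effect on the other endpoint.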

Would I like to see a few server-side processing capabilities in Kafka, too? Yes, absolutely. Especially for small workloads, the performance and scalability impact should be acceptable! Though, the risk is that people misuse the features then. The future will show if Kafka will get there or not.

8. Complex operations vs. serverless cloud

TL;DR: Self-managed operations of scalable JMS message brokers or Kafka clusters are complex. Serverless offerings (should) take over the operations burden.

Operating a cluster is complex – no matter if JMS or Kafka

A basic JMS message broker is relatively easy to operate (including active/passive setups). However, this limits scalability and availability. The JMS API was designed to talk to a single broker or active/passive for high availability. This concept covers the application domain.

More than that (= clustering) is very complex with JMS message brokers. More advanced message broker clusters from commercial vendors are more powerful but much harder to operate.

Kafka is a powerful, distributed system. Therefore, operating a Kafka cluster is not easy by nature. Cloud-native tools like an operator for Kubernetes take over some burdens like rolling upgrades or handling fail-over.

Both JMS message brokers and Kafka clusters become more challenging to operate the more scale and reliability your SLAs demand. The JMS API is not specified for a central data hub (using a cluster). Kafka is intentionally built for the strategic enterprise architecture, not just for a single business application.

Fully managed serverless cloud to the rescue

As the JMS API was designed to talk to a single broker, it is hard to build a serverless cloud offering that provides scalability. Hence, in JMS cloud services, the consumer has to set up the routing and role-based access control to the specific brokers. Such a cloud offering is not serverless but cloud-washing! But there is no other option as the JMS API is not like Kafka with one big distributed cluster.

In Kafka, the situation is different. As Kafka is a scalable distributed system, cloud providers can build cloud-native serverless offerings. Building such a fully managed infrastructure is still super hard. Hence, evaluate the product, not just the marketing slogans!

Every Kafka cloud service is marketed as “fully managed” or “serverless” but most are NOT. Instead, most vendors just provision the infrastructure and let you operate the cluster and take over the support risk. On the other side, some fully managed Kafka offerings are super limited in functionality (like allowing a very limited number of partitions).

Some cloud vendors even exclude Kafka support from their Kafka cloud offerings. Insane, but true. Check the terms and conditions as part of your evaluation.

9. Java/JVM vs. any programming language

TL;DR: JMS focuses on the Java ecosystem for JVM programming languages. Kafka is independent of programming languages.

As the name JMS (= Java Message Service) says: JMS was officially written only for Java. Some broker vendors support their own APIs and clients. These are proprietary to that vendor. Almost all serious JMS projects I have seen in the past use Java code.

Apache Kafka also only provides a Java client. But vendors and the community provide other language bindings for almost every programming language, plus a REST API for HTTP communication for producing/consuming events to/from Kafka. For instance, check out the blog post “12 Programming Languages Walk into a Kafka Cluster” to see code examples in Java, Python, Go, .NET, Ruby, node.js, Groovy, etc.

The true decoupling of the Kafka backend enables very different client applications to speak with each other, no matter what programming languages one uses. This flexibility allows for building a proper domain-driven design (DDD) with a microservices architecture leveraging Kafka as the central nervous system.

10. Single JMS deployment vs. multi-region (including hybrid and multi-cloud) Kafka replication

TL;DR: The JMS API is a client specification for communication between the application and the broker. Kafka is a distributed system that enables various architectures for hybrid and multi-cloud use cases.

JMS is a client specification, while multi-data center replication is a broker function. I won’t go deep here and put it simply: JMS message brokers are not built for replication scenarios across regions, continents, or hybrid/multi-cloud environments.

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Various scenarios require multi-cluster Kafka solutions. Specific requirements and trade-offs need to be looked at.

Kafka technologies like MirrorMaker (open source) or Confluent Cluster Linking (commercial) enable use cases such as disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka deployments.

I covered hybrid cloud architectures in various other blog posts. “Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure” is a great example.

Slide deck and video recording

I created a slide deck and video recording if you prefer learning or sharing that kind of material instead of a blog post:

JMS and Kafka solve distinct problems!

The ten comparison criteria show that JMS and Kafka are very different things. While both overlap (e.g., messaging, real-time, mission-critical), they use different technical capabilities, features, and architectures to support additional use cases.

In short, use a JMS broker for simple and low-volume messaging from A to B. Kafka is usually a real-time data hub between many data sources and data sinks. Many people call it the central real-time nervous system of the enterprise architecture.

The data integration and data processing capabilities of Kafka at any scale with true decoupling and event replayability are the major differences from JMS-based MQ systems.

However, especially in the serverless cloud, don’t fear Kafka being too powerful (and complex). Serverless Kafka projects often start very cheaply at a very low volume, with no operations burden. Then it can scale with your growing business without the need to re-architect the application.

Understand the technical differences between a JMS-based message broker and data streaming powered by Apache Kafka. Evaluate both options to find the right tool for the problem. Within messaging or data streaming, do further detailed evaluations. Every message broker is different even though they all are JMS compliant. In the same way, all Kafka products and cloud services are different regarding features, support, and cost.

Do you use JMS-compliant message brokers? What are the use cases and limitations? When did you or do you plan to use Apache Kafka instead? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Comparison: JMS Message Queue vs. Apache Kafka appeared first on Kai Waehner.

Is Apache Kafka an iPaaS or is Event Streaming its own Software Category?

Kai Waehner — Wed, 03 Nov 2021 06:57:54 +0000

Enterprise integration is more challenging than ever before. The IT evolution requires the integration of more and more technologies. Applications are deployed across the edge, hybrid, and multi-cloud architectures. Traditional middleware such as MQ, ETL, ESB does not scale well enough or only processes data in batch instead of real-time. This post explores why Apache Kafka is the new black for integration projects, how Kafka fits into the discussion around cloud-native iPaaS solutions, and why event streaming is a new software category. A concrete real-world example shows the difference between event streaming and traditional integration platforms or iPaaS.

What is iPaaS (Enterprise Integration Platform as a Service)?

iPaaS (Enterprise Integration Platform as a Service) is a term coined by Gartner. Here is the official Gartner definition: “Integration Platform as a Service (iPaaS) is a suite of cloud services enabling development, execution, and governance of integration flows connecting any combination of on-premises and cloud-based processes, services, applications, and data within individual or across multiple organizations.” The acronym eiPaaS (Enterprise Integration Platform as a Service) is used in some reports as a replacement for iPaaS.

The Gartner Magic Quadrant for iPaaS shows various vendors:

Three points stand out for me:

Many very different vendors provide a broad spectrum of integration solutions.
Many vendors (have to) list various products to provide an iPaaS; this means different technologies, codebases, support teams, etc.
No Kafka offering (like Confluent, Cloudera, Amazon MSK) is in the magic quadrant.

The last bullet point makes me wonder if Kafka-based solutions should be considered iPaaS or not?

Is Apache Kafka an iPaaS?

I don’t know. It depends on the definition of the term “iPaaS”. Yes, Kafka solutions fit into the iPaaS category, but that is just one piece of the event streaming success story.

Kafka is an event streaming platform. Use cases differ from traditional middleware like MQ, ETL, ESB, or iPaaS. Check out real-world Kafka deployments across industries if you don’t know the use cases yet.

Kafka does not directly compete with ETL tools like Talend or Informatica, MQ frameworks like IBM MQ or RabbitMQ, API Management platforms like MuleSoft, and cloud-based iPaaS like Boomi or TIBCO. At least not if people understand the differences between Kafka and traditional middleware. For that reason, many people (including me) think that Event Streaming should have its own Magic Quadrant.

Having said this, all these very different vendors are in the iPaaS Magic Quadrant. So, should Kafka, or rather its vendors, be in here? I think so because I have seen hundreds of customers leverage the Kafka ecosystem as a cloud-native, scalable, event-driven integration platform, often in hybrid and multi-cloud architectures. And that’s an iPaaS.

What’s different with Kafka as an Integration Platform?

If you are new to this discussion, check out the article “Apache Kafka vs. MQ, ETL, ESB” or the related slides and video. Here is my summary on why Kafka is unique for integration scenarios and therefore adopted everywhere:

A unique selling point for event streaming is the ability to leverage a single platform. In contrast, other iPaaS solutions require different products (including codebases, support teams, integration between the additional tech, etc.).

Kafka as Cloud-native and Serverless iPaaS

Fundamental differences exist between modern iPaaS solutions and traditional legacy middleware; this includes the software architecture, the scalability and operations of the platform, and data processing capabilities. On a high level, a “Kafka iPaaS” requires the following characteristics:

Cloud-native Infrastructure: Elasticity is vital for success in the modern IT world. The ability to scale up and down (= expanding and shrinking Kafka clusters) is mandatory. This capability enables starting with a small pilot project and scaling up or handling load spikes (like Christmas business in retail).
Automated Operations: Truly serverless SaaS should always be the preferred option if the software runs in a public cloud. Orchestration tools like Kubernetes and related tools (like a Kafka operator for Kubernetes) are the next best option in a self-managed data center or at the edge outside a data center.
Complete Platform: An integration platform requires real-time messaging and storage for backpressure handling and long-running processes. Data integration and continuous data processing are mandatory, too. Hence, a “Kafka iPaaS” is only possible if you have access to various pre-built Kafka-native connectors to open standards, legacy systems, and modern SaaS interfaces. Otherwise, Kafka needs to be combined with another middleware like an iPaaS or an ETL tool like Apache NiFi.
Single Solution: It sounds trivial, but most other middleware solutions use several codebases and products under the hood. Just look at stacks from traditional players such as IBM and Oracle or open-source-driven Cloudera. The complex software stack makes it much harder to provide end-to-end integration, 24/7 operations, and mission-critical support. Don’t get me wrong: Kafka-native solutions like Confluent Cloud also include different products with additional pricing (like a fully-managed connector or data governance add-ons), but all run on a single Kafka-native platform.

From this perspective, some Kafka solutions are modern, cloud-native, scalable iPaaS. Having said this, would I be happy if you consider some Kafka solutions as an iPaaS on your technology radar? No, not really!

Event Streaming is its own Software Category!

While some Kafka solutions can be used as iPaaS, this is only one of many usage scenarios for event streaming. However, as explained above, Kafka-based solutions differ greatly from other iPaaS solutions in the Gartner Magic Quadrant. Hence, event streaming deserves its own software category.

If you still wonder what I mean, check out event streaming use cases across industries to understand the difference between Kafka and traditional iPaaS, MQ, ETL, ESB, API tools. Here is a relatively old but still fantastic diagram that summarizes the broad spectrum of use cases for event streaming:

TL;DR: Kafka provides capabilities for various use cases, not just integration scenarios. Many new innovative business applications are built with Kafka. It is not just an integration platform but a unique suite of capabilities for end-to-end data processing in real-time at scale.

New concepts like Data Mesh also prove the evolution. The basic principles are not unique: Domain-driven design, microservices, true decoupling of services, but now with much more focus on data as a product. The latter means data is turning from a cost center into a profit center that enables innovative new services. Event streaming is a critical component of a data mesh-powered enterprise architecture as real-time almost always beats slow data across use cases.

The Non-Existing Event Streaming Gartner Magic Quadrant or Forrester Wave

Unfortunately, a Gartner Magic Quadrant or Forrester Wave for Event Streaming does NOT exist today. While some event streaming solutions fit into some of these reports (like the Gartner iPaaS Magic Quadrant or the Forrester Wave for Streaming Analytics), it is still an apples-to-oranges comparison.

Event Streaming is its own category. Many software vendors built their entire business around this category. Confluent is the leader in this space – note that I am biased as a Confluent employee, but I guess there is no question around this statement. Many other companies emerge around Kafka, or in a related way using the Kafka protocol, or with competitive event streaming offerings such as Amazon Kinesis or Apache Pulsar.

The following Event Streaming Landscape 2021 summarizes the current status:

I hope to see a Gartner Magic Quadrant for Event Streaming and a Forrester Wave for Event Streaming soon, too.

Open Source vs. Partially Managed vs. Fully-Managed Event Streaming

One more aspect to point out: You might have noticed that I said, “some event streaming solutions can be considered an iPaaS”. The word “some” is a crucial detail. Just providing an open-source framework is not enough.

iPaaS requires a complete offering, ideally as fully-managed services. Many vendors for event streaming use Kafka, Pulsar, or another framework but do NOT provide a complete offering with operations tooling, commercial 24/7 support, user interfaces, etc. The following resources should help you learn more about the event streaming landscape in 2021:

TL;DR: Evaluate the various offerings. A lot of capabilities are just marketing! Many “fully-managed services” are only partially managed instead of serverless and come with very limited SLAs and support. Some other offerings provide plenty of cool features but are closer to an oversold alpha version than to a mature, battle-tested solution. A counterexample is the complexity in T-Mobile’s report about upgrading Amazon MSK. This shows the difference between “promoting and selling a fully managed service” and the “not at all fully-managed reality”. A truly fully-managed offering does NOT require the end user to upgrade the infrastructure.

Kafka as Event Streaming iPaaS at Deutsche Bahn (German Railway)

Let’s now look at a practical example to understand why a traditional iPaaS cannot help in use cases that require event streaming and why this combination of capabilities in a single technology sets a new software category.

This section explores a real-world use case with the journey of Deutsche Bahn (German Railway) providing a great customer experience to their customers. This example uses Event Streaming as iPaaS (according to my definition of these terms).

Use Case: Improve Customer Experience with Real-time Notifications

The use case sounds very simple: Improve the customer experience by providing real-time information and notifications to customers across various channels like websites, smartphones, and displays at train stations.

Delays and cancellations happen in a complex rail network like in Germany. Frequent travelers accept this downside. Nobody can do anything about lousy weather, suicides involving a moving train, or technical defects.

However, customers at least expect real-time information and notifications so that they can wait in a coffee shop or lounge instead of freezing at the station platform for minutes or even hours. The reality at Deutsche Bahn was a different sequence of notifications: 10min delay, 20min delay, 30min delay, train canceled – take the next train.

The goal of the project Reisendeninformation (= traveler information system) was to improve the day-to-day experience of millions of travelers across Germany by delivering up-to-date, accurate, and consistent travel information in any location.

Initial Project: A mess of different Integration Technologies

Again, the use case sounds simple. Let’s send real-time notifications to customers if a train is delayed or canceled. Every magic black iPaaS box can do this:

Is this true or just the marketing of all the integration vendors? Can each black box integrate all the different systems to correlate events in real-time?

Deutsche Bahn started with an open-source messaging platform to send real-time notifications. Unfortunately, the team quickly found out that not every piece of information was coming in in real-time. So, a caching and storage system was added to the project to handle the late-arriving information from some legacy batch or file-based systems. Now, the data from the legacy systems needed to be integrated. Hence, an integration framework was installed to connect to files, databases, and other applications. Now, the data needed to be processed, correlating real-time and non-real-time data from various systems. A stream processing engine can do this.
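The correlation step mentioned above can be illustrated with a small sketch. This is not Deutsche Bahn’s implementation. It only shows, in plain Python without any Kafka dependency, how a real-time delay event might be enriched with reference data that arrived earlier from a batch or file-based source. All names (`TrainDelay`, `Notification`, `correlate`, the train IDs) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


@dataclass(frozen=True)
class TrainDelay:
    """A real-time event (hypothetical shape)."""
    train_id: str
    delay_minutes: int


@dataclass(frozen=True)
class Notification:
    """What gets pushed to the customer (hypothetical shape)."""
    train_id: str
    delay_minutes: int
    platform: Optional[str]  # None if no batch record is known for this train


def correlate(
    delays: Iterable[TrainDelay],
    platform_by_train: Dict[str, str],
) -> Iterator[Notification]:
    """Join each real-time delay event with the latest known batch record."""
    for event in delays:
        yield Notification(
            train_id=event.train_id,
            delay_minutes=event.delay_minutes,
            platform=platform_by_train.get(event.train_id),
        )


# Reference data, e.g. loaded from a (hypothetical) nightly file export.
platforms = {"ICE 123": "7"}

# Real-time events arriving from the messaging layer.
events = [TrainDelay("ICE 123", 10), TrainDelay("RE 456", 5)]

notifications = list(correlate(events, platforms))
```

In a Kafka-based design, this kind of lookup is roughly what a stream processing library does continuously, at scale, and with fault tolerance, instead of in a single in-memory loop.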

The pilot project included several different frameworks. A few skeptical questions came up:

How to scale this combination of very different technologies?
How to get end-to-end support across various platforms?
Is this cost-efficient?
What is the time-to-market for new features?

Deutsche Bahn re-evaluated their tech stack and found out that Apache Kafka provides all the required capabilities out-of-the-box within one platform.

The Migration to Cloud-native Kafka

The team at Deutsche Bahn re-architected their pilot project. Here is the new solution leveraging Kafka as the single point of truth between various systems, technologies, and communication paradigms:

A traditional iPaaS can implement this scenario, but only with several codebases, technologies, and clusters, even if you select a single software vendor! Some iPaaS offerings might even do well in the beginning but struggle to scale up. Only event streaming allows you to start small and scale up with no need to re-architect the infrastructure.

Today, the project is in production. Check your DB Navigator mobile app to get real-time updates about all trains in Germany. Living in Germany, I appreciate this new service to have a much better traveling experience.

Learn more about the project from the Deutsche Bahn team. They gave several public talks at different conferences and wrote on the Confluent Blog about their Kafka journey. However, the journey did not end there. As described in their blog post, Deutsche Bahn is now evaluating the migration from a self-managed Confluent Platform deployment in the public cloud to the fully-managed, truly serverless Confluent Cloud offering to reduce TCO and improve time-to-market.

Complete Platform: Kafka is more than just Messaging

A project like the one described above is only possible with a complete platform. Many people still think about Kafka as an ingestion layer into a data lake or data warehouse, as this was one of the first prominent Kafka use cases. Data ingestion is still an excellent use case today. Many projects already use more than just the core of Kafka to implement this. Kafka Connect provides out-of-the-box connectivity between Kafka and the data store. If you are in the public cloud, you even get this integration in a fully-managed, serverless manner, whether you need to integrate with a 1st party cloud service like Amazon S3, Google Cloud BigQuery, or Azure Cosmos DB, or with a 3rd party SaaS like MongoDB Atlas, Snowflake, or Databricks.
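To make the Kafka Connect part more concrete, here is a minimal sketch of a connector definition for an S3 sink, built as a Python dictionary and serialized to JSON. A self-managed Kafka Connect worker accepts a payload of this shape via `POST /connectors` on its REST API. The connector name, topic, and bucket are hypothetical, and the property names are given as I recall them from the Confluent S3 sink connector documentation. Verify them against the connector version you deploy. Fully managed connectors in a cloud service are typically configured through the provider’s UI or API and may use different field names.

```python
import json

# Hypothetical connector definition for a self-managed Kafka Connect worker.
connector = {
    "name": "s3-sink-example",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "train-delays",            # hypothetical topic
        "s3.bucket.name": "example-bucket",  # hypothetical bucket
        "s3.region": "eu-central-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",  # records written per S3 object
        "tasks.max": "1",
    },
}

# Kafka Connect expects all config values as strings.
payload = json.dumps(connector, indent=2)
```

The point is that the integration is declarative configuration rather than custom integration code.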

Continuous Kafka-native stream processing is the next level of a complete platform. For instance, Deutsche Bahn leverages Kafka Streams a lot for their data correlations in real-time at scale. Other companies use ksqlDB as a fully-managed function in Confluent Cloud. The enormous benefit is that you don’t need yet another platform or service for streaming analytics. A complete platform makes the architecture more cost-efficient, and end-to-end integration is easier from an SLA and support perspective.

A complete platform requires many additional services “on top”, like visualization of data flows, data governance, role-based access control, audit logs, self-service capabilities, user interfaces, visual coding/code generation tools, etc. Visual coding is the point where traditional middleware and iPaaS tools are stronger today than event streaming offerings.

3rd Party integration via Open API and non-Kafka Tools

So far, you learned why event streaming is its own software category and how Deutsche Bahn is an excellent example to show this. However, event streaming is NOT the silver bullet for every problem! When exploring if Kafka and MQ/ETL/ESB are friends, enemies, or frenemies, I already pointed this out. For instance, MQ or an ESB can complement event streaming in an integration project, depending on your project requirements.

Let’s go back to Deutsche Bahn. As mentioned, their real-time traveler information platform is live, with Kafka as the single point of truth. Recently, Deutsche Bahn announced a partnership with Google and 3rd Party Integration with Google Maps:

Real-time Schedule Updates to 3rd Party Google Maps API

The integration provides real-time train schedule updates to Google Maps users:

The integration allows Deutsche Bahn to reach new people and expand the business. Users can buy train tickets with one click from the Google Maps page.

I don’t know what technology or product this 3rd party integration uses. The heart of Deutsche Bahn’s real-time infrastructure enables new innovative business models and collaboration with partners.

Likely, this integration between Deutsche Bahn and Google Maps does not directly use the Kafka protocol (even though this is done sometimes, for instance, see Here Technologies Open API for their mapping service).

Event Streaming is complementary to other services. In this example, the project team might have used an API Management platform to provide internal APIs to external consumers, including access control, billing, and reporting. The article “Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies?” explores the relationship between event streaming and API Management.

Event Streaming Everywhere – Cloud, Data Center, Edge

Real-time beats slow data everywhere. This is a new software category because we don’t just send events into another database via a messaging system. Instead, we use and correlate data from different data sources in real-time. That’s the real added value and game changer in innovative projects.

Hence, event streaming must be possible in every location. While cloud-first is a viable strategy for many IT projects, edge and hybrid scenarios are and will continue to be very relevant.

Think about a project related to the Deutsche Bahn example above (but completely fictional): A hybrid architecture with real-time applications in the cloud and edge computing within the trains:

I covered this in other articles, including “Edge Use Cases for Apache Kafka Across Industries“. TL;DR: Leverage the open architecture of event streaming for real-time data processing everywhere, including multi-cloud, data centers, and edge deployments (i.e., outside a data center). The enterprise architecture does not need various technologies and products to implement real-time data processing and integration with separate iPaaS, ETL tools, ESBs, MQ systems.

However, once again, it is crucial to understand how event streaming fits into the enterprise architecture. For instance, Kafka is often combined with IoT technologies such as MQTT for the last mile integration with IoT devices in these edge scenarios.

Slide Deck and Video for “Apache Kafka vs. Cloud-native iPaaS Middleware”

Here are the related slides for this topic:

And the video recording of this presentation:

Kafka is a cloud-native iPaaS, and much more!

Kafka is the new black for integration projects across industries because of its unique combination of capabilities. Some Kafka solutions are part of the iPaaS category, with trade-offs like any other integration platform.

However, event streaming is its own software category. Hence, iPaaS is just one usage of Kafka or other similar event streaming platforms. Real-time data beats slow data. For that reason, event streaming is the backbone for many projects to process data in motion (but also integrate with other systems that store data at rest for reporting, model training, and other use cases).

How do you leverage event streaming and Kafka as an integration platform? What projects did you already work on or are in the planning? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Is Apache Kafka an iPaaS or is Event Streaming its own Software Category? appeared first on Kai Waehner.

Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage

Kai Waehner — Sun, 09 May 2021 14:32:49 +0000

Real-time beats slow data in most use cases across industries. The rise of event-driven architectures and data in motion powered by Apache Kafka enables enterprises to build real-time infrastructure and applications. This blog post explores why the Kafka API became the de facto standard API for event streaming like Amazon S3 for object storage, and the tradeoffs of these standards and corresponding frameworks, products, and cloud services.

Event-Driven Architecture: This Time It’s Not A Fad

The Forbes article “Event-Driven Architecture: This Time It’s Not A Fad” from April 2021 explained why enterprises are not just talking about event-driven real-time applications, but finally building them. Here are some arguments:

REST limitations can limit your business strategy
Data needs to be fluid and real-time
Microservices and serverless need event-driven architectures

Real-time Data in Motion beats Slow Data

Use cases for event-driven architectures exist across industries. Some examples:

Transportation: Real-time sensor diagnostics, driver-rider match, ETA updates
Banking: Fraud detection, trading, risk systems, mobile applications/customer experience
Retail: Real-time inventory, real-time POS reporting, personalization
Entertainment: Real-time recommendations, a personalized news feed, in-app purchases
The list goes on across verticals…

Real-time data in motion beats data at rest in databases or data lakes in most scenarios. There are a few exceptions that require batch processing:

Reporting (traditional business intelligence)
Batch analytics (processing high volumes of data in a bundle, for instance, Hadoop and Spark’s map-reduce, shuffling, and other data processing only make sense in batch mode)
Model training as part of a machine learning infrastructure (while model scoring and monitoring often requires real-time predictions, the model training is batch in almost all currently available ML algorithms)

Beyond these exceptions, almost everything is better in real-time than batch.

Be aware that real-time data processing is more than just sending data from A to B in real-time (aka messaging or pub/sub). Real-time data processing requires integration and processing capabilities. If you send data into a database or data lake in real-time but have to wait until it is processed there in batch, it does not solve the problem.

With the ideas around real-time in mind, let’s explore what a de facto standard API is.

What is a (De Facto) Standard API?

The answer is longer than you might expect and needs to be separated into three sections:

API
Standard API
De facto standard API

What is an API?

An application programming interface (API) is an interface that defines interactions between multiple software applications or mixed hardware-software intermediaries. It defines the kinds of calls or requests that can be made, how to make them, the data formats that should be used, the conventions to follow, etc. It can also provide extension mechanisms so that users can extend existing functionality in various ways and to varying degrees.

An API can be entirely custom, specific to a component, or designed based on an industry-standard to ensure interoperability. Through information hiding, APIs enable modular programming, allowing users to use the interface independently of the implementation.

What is a Standard API?

Industry consortiums or other industry-neutral (often global) groups or organizations specify standard APIs. A few characteristics show the trade-offs:

Vendor-agnostic interfaces
Slow evolution and long specification process
Most vendors add proprietary features, either a) because the standard specification process is too slow or, more often, b) to differentiate their commercial offering
Acceptance and success depend on the complexity and added value (this sounds obvious but is often the key blocker for success)

Examples for Standard APIs

Here are some examples of standard APIs. I also add my thoughts if I think they are successful or not (but I fully understand that there are good arguments against my opinion).

Generic Standards

SQL: Domain-specific language used in programming and designed for managing data held in a relational database management system. Successful as almost every database somehow supports SQL or tries to build a similar syntax. A good example is ksqlDB, the Kafka-native streaming SQL engine. ksqlDB (like most other streaming SQL engines) is not ANSI SQL, but still understood easily by people that know SQL.
J2EE / Java EE / Jakarta EE: Successful as most vendors adopted at least parts of it for Java frameworks. While early versions were very heavyweight and complex, the current APIs and implementations are much more lightweight and user-friendly. JMS is a great example where vendors added proprietary add-ons to add features and differentiate. “No vendor lock-in” is only true in theory!
HTTP: Successful as application layer protocol for distributed, collaborative, hypermedia information systems. While not 100% correct, people typically interpret HTTP as REST Web Services. HTTP is often misused for things it is not built for.
SOAP / WSDL: Partly successful in providing XML-based web service standard specifications. Some vendors built good tooling around it. However, this is typically only true for the basic standards such as SOAP and WSDL, not so much for all the other complex add-ons (often called WS-* hell).

Standards for a Specific Problem or Industry

OPC-UA for Industrial IoT (IIoT): Partly successful machine-to-machine communication protocol for industrial automation. Adopted by almost every vendor in the industrial space. The drawback (similarly to HTTP) is that it is often misused. For instance, MQTT is a much better and more lightweight choice in some scenarios. OPC-UA is a great example where the core is successful, but the industry-specific add-ons are not prevalent and not supported by tools. Also, OPC-UA is too heavyweight for many of the use cases it is used in.
PMML for Machine Learning: Not successful as an XML-based predictive model interchange format. The idea is great: Train an analytic model once and then deploy it across platforms and programming languages. In practice, it did not work. Too many limitations and unnecessary complexity for a project. Most real-world machine learning deployments I have seen in the wild avoid it and deploy models to production with a standard wrapper. ONNX and other successors are not more prevalent yet either.

In summary, some standard APIs are successful and adopted well; many others are not. Contrary to these standards specified by consortiums, there is another category emerging: De Facto Standard APIs.

What is a De Facto Standard API?

De Facto standard APIs originate from an existing successful solution (that can be an open-source framework, a commercial product, or a cloud service). Two ways exist how these de facto standard APIs emerge:

Driven by a single vendor (often proprietary), for example: Amazon S3 for object storage.
Driven by a huge community around a successful open-source project, for example: Apache Kafka for event streaming.

No matter how a de facto standard API originated, they typically have a few characteristics in common:

Creation of a new category of software, something that did not exist before
Adoption by other frameworks, products, or cloud services as the API became the de facto standard
No complex, formal, long-running standard processes; hence innovation is possible in a relatively flexible and agile way
Practical processes and rules are in place to ensure good quality and consensus (either controlled by the owner company for a proprietary standard API or across the open source community)

Let’s now explore two de facto standard APIs: Amazon S3 and Apache Kafka. Both are very successful but very different regarding being a standard. Hence, the trade-offs are very different.

Amazon S3: De Facto Standard API for Object Storage

Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface in the public AWS cloud. It uses the same scalable storage infrastructure that Amazon.com uses to run its global e-commerce network. Amazon S3 can be employed to store any type of object, which allows for uses like storage for internet applications, backup and recovery, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage. Additionally, S3 on Outposts provides on-premises object storage for on-premises applications that require high-throughput local processing.

“Amazon CTO on Past, Present, Future of S3” is a great read about the evolution of this fully-managed cloud service. While the public API was kept stable, the internal backend architecture under the hood changed several times significantly. Plus, new features were developed on top of the API, for instance, AWS Athena for analytics and interactive queries using standard SQL. I really like how Werner Vogels describes his understanding of a good cloud service:

Vogels doesn’t want S3 users to even think for a moment about spindles or magnetic hardware. He doesn’t want them to care about understanding what’s happening in those data centers at all. It’s all about the services, the interfaces, and the flexibility of access, preferably with the strongest consistency and lowest latency when it really matters.

So, we are talking about a very successful proprietary cloud service by AWS. Hence, what’s the point?

Most Object Storage Vendors Support the Amazon S3 API

Many enterprises use the Amazon S3 API. Hence, it became the de facto standard. If other storage vendors want to sell object storage, supporting the S3 interface is often crucial to get through the evaluations and RFPs. If you don’t support the S3 API, it is much harder for companies to adopt the storage and implement the integration (as most companies already use Amazon S3 and have built tools, scripts, testing around this API).

For this reason, many applications have been built to support the Amazon S3 API natively. This includes applications that write data to Amazon S3 and Amazon S3-compatible object stores.

S3 compatible solutions include client backup, file browser, server backup, cloud storage, cloud storage gateway, sync&share, hybrid storage, on-premises storage, and more.

Many vendors sell S3-compatible products: Oracle, EMC, Microsoft, NetApp, Western Digital, MinIO, Pure Storage, and many more. Check out the Amazon S3 site from Wikipedia for a more detailed and complete list.

So why has the S3 API become so ubiquitous?

The creation of a new software category is a dream for every vendor! Let’s understand how and why Amazon was successful in establishing S3 for object storage. The following is a quote from Chris Evans’ great article from 2016: “Has S3 become the de facto API standard?”

So why has the S3 API become so ubiquitous? I suspect there are a number of reasons. These include:

First to market – When S3 was launched in 2006, most enterprises were familiar with object storage as “content addressable storage” through EMC’s Centera platform. Other than that, applications were niche and not widely adopted except for specific industries like High Performance Computing where those users were used to coding to and for the hardware. S3 quickly became a platform everyone could use with very little investment. That made it easy to consume and experiment with. By comparison, even today the leaders in object storage (as ranked by the major analysts) still don’t make it easy (or possible) to download and evaluate their products, even though most are software only implementations.
Documentation – following on from the previous point, S3 has always been well documented, with examples on how to run API commands. There’s a document history listing changes over the past 6-7 years that shows exactly how the API has evolved.
A Single Agenda – the S3 API was designed to fit a single agenda – that of storing and retrieving objects from S3. As such, Amazon didn’t have to design by committee and could implement the features they required and evolve from there. Contrast that with the CDMI (Cloud Data Management Interface) from SNIA. The SNIA website is difficult to navigate, the standard itself is only on the 4th published iteration in six years, while the documentation runs to 264 pages! (Note that the S3 API runs into more pages, but is infinitely more consumable, with simple examples from page 11 onwards).

Cons of a Proprietary De Facto Standard like Amazon S3

Many people might say: “Better a proprietary standard than no standard.” I partly agree with this. The possibility to learn one API and use it across multi-cloud and on-premise systems and vendors is great. However, Amazon S3 has several disadvantages as it is NOT an open standard:

Other vendors (have to) build their implementation on a best guess about the behavior of the API. There is no official standard specification they can rely on.
Customers cannot be sure what they buy. At least, they should not expect the same behavior of 3rd party S3 implementations that they get from their experiences using Amazon S3 on AWS.
Amazon can change APIs and features as it likes. Other vendors need to “reverse engineer the API” and adjust their products.
Amazon could sue competitors for using S3 API branding – even though this is not likely to happen as the benefits are probably bigger (I am not a lawyer; hence this statement might be wrong and is just my personal opinion)

Let’s now look at an open-source de facto standard: Kafka.

Kafka API: De Facto Standard API for Event Streaming

Apache Kafka is mainstream today! The Kafka API became the de facto standard for event-driven architectures and event streaming. Two proof points:

Use cases across all industries and infrastructure. Including various kinds of transactional and analytics workloads. Edge, hybrid, multi-cloud. I collected a few examples across verticals that use Apache Kafka to show the prevalence across markets.
Adoption by various open-source frameworks and many software/cloud vendors. Check out my blog post if you are interested in a comparison of Kafka vendors such as Confluent, Cloudera, Red Hat or Amazon MSK and related technologies like Azure Event Hubs, AWS Kinesis, RedPanda, or Apache Pulsar.

The Kafka API (aka Kafka Protocol)

Kafka became the de facto event streaming API, similar to how the S3 API became the de facto standard for object storage. Actually, the situation is even better for the Kafka API as the S3 API is a proprietary protocol from AWS. In contrast, the Kafka API and protocol are open source under Apache 2.0 license.

The Kafka protocol covers the wire protocol implemented in Kafka. It defines the available requests, their binary format, and the proper way to make use of them to implement a client.
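To give a feeling for what “binary format” means here, the following sketch encodes an `ApiVersions` request frame with the Python standard library only. It follows my understanding of the Kafka protocol guide for the older, non-flexible request header: a 4-byte big-endian size prefix, followed by `api_key` (INT16), `api_version` (INT16), `correlation_id` (INT32), and `client_id` (an INT16 length plus UTF-8 bytes). Version 0 of `ApiVersions` (API key 18) has an empty body. Treat this as an illustration, not as a replacement for a real client library. The function name and the client ID are hypothetical.

```python
import struct

API_KEY_API_VERSIONS = 18  # API key of the ApiVersions request
API_VERSION = 0            # version 0 has an empty request body


def encode_api_versions_request(correlation_id: int, client_id: str) -> bytes:
    """Encode a size-prefixed ApiVersions v0 request (hypothetical helper)."""
    client_bytes = client_id.encode("utf-8")
    # Header: api_key (int16), api_version (int16),
    # correlation_id (int32), client_id length (int16), all big-endian.
    header = struct.pack(
        ">hhih",
        API_KEY_API_VERSIONS,
        API_VERSION,
        correlation_id,
        len(client_bytes),
    ) + client_bytes
    # Every request is prefixed with its size as a big-endian int32.
    return struct.pack(">i", len(header)) + header


frame = encode_api_versions_request(correlation_id=1, client_id="demo")
```

Because the format is publicly documented at this level of detail, anybody can implement a client or a compatible server against it.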

One of my favorite characteristics of the Kafka protocol is backward compatibility. Kafka has a “bidirectional” client compatibility policy. In other words, new clients can talk to old servers, and old clients can talk to new servers. This allows users to upgrade either clients or servers without experiencing any downtime or data loss. This makes Kafka ideal for microservice architectures and domain-driven design (DDD). Kafka really decouples the applications from each other, in contrast to web service/REST-based architectures.
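The idea behind bidirectional compatibility can be sketched as a simple version negotiation: each side supports a range of versions for a given request type, and the client uses the highest version both sides understand. The sketch below is a simplified illustration of that principle, not the code of any actual Kafka client. The function name and the version numbers are made up.

```python
from typing import Optional, Tuple

# (min_version, max_version), both inclusive
VersionRange = Tuple[int, int]


def negotiate(client: VersionRange, broker: VersionRange) -> Optional[int]:
    """Return the highest version supported by both sides, or None."""
    lowest = max(client[0], broker[0])
    highest = min(client[1], broker[1])
    return highest if lowest <= highest else None


# A new client talking to an old broker falls back to the broker's maximum.
new_client_old_broker = negotiate(client=(0, 9), broker=(0, 5))

# An old client talking to a new broker keeps using its own maximum.
old_client_new_broker = negotiate(client=(0, 3), broker=(0, 9))

# Only ranges without any overlap cannot communicate.
no_overlap = negotiate(client=(4, 9), broker=(0, 3))
```

As long as the supported ranges overlap, clients and servers can be upgraded independently of each other.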

Pros of an Open Source De Facto Standard like the Kafka API

The huge benefit of an open-source de facto standard API is that it is open and usually follows a collaborative standardized process to make changes to the API. This brings various benefits to the community and software vendors.

The following facts about the Kafka API make many developers and enterprises happy:

Changes occur in a visible process enforced by a committee. For Apache Kafka, the Apache Software Foundation (ASF) is the relevant organization. Apache projects are managed using a collaborative, consensus-based process with members from various countries and enterprises. Check out how it works if you don’t know it yet.
Frameworks and vendors can implement against the open protocol and validate the implementation. That is significantly different from proprietary de facto standards like Amazon S3. Having said this, not every product that says it uses the Kafka API is 100% compatible and consequently is limited in the feature set and provides different behavior.
Developers can test the underlying behavior against the same API. Hence, unit and performance tests for different implementations can use the same code.
The Apache 2.0 license makes sure that the user does not have to worry about infringing any patents by using the software.

Frameworks, Products, and Cloud Services using the Kafka API

Many frameworks and vendors adopted the Kafka API. Let’s take a look at a few very different alternatives available today that use the Kafka API:

Open-source Apache Kafka from the Apache website
Self-managed Kafka-based vendor solutions for on-premises or cloud deployments from Confluent, Cloudera, Red Hat
Partially managed Kafka-based cloud offerings from Amazon MSK, Red Hat, Azure HDInsight’s Kafka, Aiven, cloudkarafka, Instaclustr.
Fully managed Kafka cloud offerings such as Confluent Cloud – actually, there is no other serverless, fully compatible Kafka SaaS offering on the market today (even though many marketing departments try to sell it like this)
Partly protocol-compatible, self-managed solutions such as Apache Pulsar (with a simple, very limited Kafka wrapper class) or RedPanda for embedded / WebAssembly (WASM) use cases
Partly protocol-compatible, fully managed offerings like Azure EventHubs

Just be aware that the devil is in the details. Many offerings only implement a fraction of the Kafka API. Additionally, many offerings only support the core messaging concept, but exclude key features such as Kafka Connect for data integration, Kafka Streams for stream processing, or exactly-once semantics (EOS) for building transactional systems.

The Kafka API Dominates the Event Streaming Landscape

If you look at the current event streaming landscape, you see that more and more frameworks and products adopt the Kafka API. Even though the following is not a complete list (and other non-Kafka offerings exist), it is imposing:

If you want to learn more about the different Kafka offerings on the market, check out my Kafka vendor comparison. It is crucial to understand what Kafka offering is right for you. Do you want to focus on business logic and consume the Kafka infrastructure as a service? Or do you want to implement security, integration, monitoring, etc., by yourself?

The Kafka API is here to stay…

The Kafka API became the de facto standard API for event streaming. The usage of an open protocol creates huge benefits for corresponding frameworks, products, and cloud services leveraging the Kafka API.

Vendors can implement against the open standard and validate their implementation. End users can choose the best solution for their business problem. Migration between different Kafka services is also possible relatively easily – as long as each vendor is compliant with the Kafka protocol and implements it completely and correctly.
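What “relatively easy migration” can look like from the client side is sketched below: the client configuration stays the same, and only the endpoint and the credentials change. The property names follow librdkafka-based clients such as confluent-kafka-python (the Java client uses `sasl.jaas.config` instead of separate username and password properties). The host names and credentials are hypothetical, and the security settings a given service requires may differ from this example.

```python
from typing import Dict


def client_config(bootstrap_servers: str, username: str, password: str) -> Dict[str, str]:
    """Build a client configuration for a Kafka-compatible service."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",  # assumption: the service accepts SASL/PLAIN
        "sasl.username": username,
        "sasl.password": password,
    }


# Two hypothetical Kafka-compatible services.
service_a = client_config("broker.service-a.example:9092", "user-a", "secret-a")
service_b = client_config("broker.service-b.example:9092", "user-b", "secret-b")

# Only the endpoint and the credentials differ between the two services.
changed_keys = sorted(k for k in service_a if service_a[k] != service_b[k])
```

The application code that produces and consumes events does not change at all, provided both services implement the parts of the protocol the application relies on.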

Are you using the Kafka API today? Open source Kafka (“car engine”), a commercial self-managed offering (“complete car”), or the serverless Confluent Cloud (“self-driving car”) to focus on business problems? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage appeared first on Kai Waehner.

Comparison of Open Source Apache Kafka vs Vendors including Confluent, Cloudera, Red Hat, Amazon MSK

Kai Waehner — Tue, 20 Apr 2021 07:49:39 +0000

Apache Kafka became the de facto standard for event streaming. The open-source community is huge. Various vendors added Kafka and related tooling to their offerings or provide a Kafka cloud service. This blog post uses the car analogy – from the motor engine to the self-driving car – to explore the different Kafka offerings available on the market. I also cover a few other vehicles, meaning (partly) Kafka-compatible technologies. The goal is not a feature-by-feature comparison (that would be outdated the day after the publication). Instead, the intention is to educate about the different deployment models, product strategies, and trade-offs from the available options.

Disclaimer: I work for Confluent. However, the post is not about comparing features but explaining the concepts behind the alternatives. I talk to enterprises across the globe every week. I can assure you that many people I talk to are not aware of, or are misled about, what you read in the following sections. Hence, I hope that the following helps you to make the right decision. Either choose to run open-source Apache Kafka or one of the various commercial Kafka offerings, or even a combination of both.

UPDATE (August 2022): This blog post was written before Amazon MSK Serverless was released. The below is still accurate and worth a read for comparing Kafka products and cloud services. Additionally, please check out the article “When NOT to choose Amazon MSK Serverless for Apache Kafka?“

Apache Kafka Components and Use Cases

The goal is not to introduce Kafka here. The minimum you should know is that Kafka is NOT just a messaging layer for data ingestion into a data lake. This is just a fraction of today’s usages.

Kafka is an open-source framework under Apache 2.0 license. It provides a combination of messaging, storage, processing, and integration for high volumes of data, at scale, in real-time, and in a fault-tolerant way. That’s what makes Kafka unique compared to other MQ, ETL, ESB, and API platforms.

Kafka is deployed in production for various use cases across industries. This includes analytical and mission-critical workloads. Different deployments require different SLAs. You should always ask yourself what happens if the Kafka infrastructure is in trouble. What are your RTO (Recovery Time Objective) and RPO (Recovery Point Objective)? Or in other words: How much data is okay to lose? How much downtime is acceptable? Start your Kafka projects with these questions in mind when you start your comparison of the options!
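These two questions can be turned into a simple check per workload. The sketch below is a deliberately simplified model with illustrative numbers: it assumes that the failover time of a deployment must stay within the RTO, and that the replication lag (the window of data that could be lost in a disaster) must stay within the RPO. All names and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    rto_seconds: float  # how much downtime is acceptable
    rpo_seconds: float  # how much data (measured in time) is okay to lose


def meets_objectives(
    objectives: RecoveryObjectives,
    failover_seconds: float,
    replication_lag_seconds: float,
) -> bool:
    """Check a deployment's characteristics against the objectives."""
    return (
        failover_seconds <= objectives.rto_seconds
        and replication_lag_seconds <= objectives.rpo_seconds
    )


# Illustrative objectives for two very different workloads.
payments = RecoveryObjectives(rto_seconds=60, rpo_seconds=0)
log_analytics = RecoveryObjectives(rto_seconds=3600, rpo_seconds=300)

# A hypothetical setup with asynchronous replication between two regions.
async_setup = {"failover_seconds": 120, "replication_lag_seconds": 5}

payments_ok = meets_objectives(payments, **async_setup)
analytics_ok = meets_objectives(log_analytics, **async_setup)
```

In this made-up example, the same deployment is good enough for the analytics workload but not for the payment workload, which is exactly why the SLA discussion should come before the product comparison.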

Kafka is the De Facto Standard API for Event Streaming like S3 API for Object Storage

Apache Kafka is mainstream! The latest proof: Check out the new ThoughtWorks Technology Radar: “Kafka API without Kafka“:

Kafka became the de facto event streaming API, similar to how the S3 API became the de facto standard for object storage. Actually, the situation is even better for the Kafka API as the S3 API is a proprietary protocol from AWS. In contrast, the Kafka API and protocol are open source under Apache 2.0 license.

Check out the blog “Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage” for more details.

Let’s take a look at a few very different Kafka alternatives available today:

Open-source Apache Kafka from the Apache website under Apache 2.0 license
Self-managed vendor offerings from Confluent, Cloudera, Red Hat, Amazon MSK, and many more
Fully-managed cloud offerings such as Confluent Cloud
Partly protocol-compatible products such as RedPanda for embedded / WebAssembly (WASM) use cases
Partly protocol-compatible SaaS offerings like Azure Event Hubs

That’s a lot of options. So, how do you make a Kafka comparison to choose the right one? Before we go into more detail, let’s explore how complex Kafka actually is and when you do have to care about this at all.

Should you care how complex or heavyweight your event streaming technology is?

Complexity matters (only) if you need to operate the infrastructure by yourself. The beauty of SaaS is that you just consume the service and focus on your business problems. For instance, the AWS S3 object storage is a simple API with a fully managed service under the hood. You do not need to worry about operations or monitoring. You just use the cloud service.

Having said this, it is a little bit strange that ThoughtWorks mentions the barriers and complexity of Kafka but then refers to the Pulsar wrapper. That is an (immature) single class mapping implementation that only maps a small part of Kafka’s protocol (here is the producer wrapper as an example). Developers can use that wrapper to move data between Kafka clients and Pulsar brokers. However, Pulsar has a much more complex three-tier distributed architecture with ZooKeeper, BookKeeper, and Pulsar clusters. What is the benefit here? Is this really what you want to do in mission-critical workloads? Why? Please let me know if you seriously consider using such a wrapper architecture. Also, please read my post about the “Myths of Kafka vs. Pulsar“. A lot of arguments like the Kafka wrapper are simply just marketing and not usable for real-world projects!

Therefore, when I think about using the Kafka API without operating Kafka, then I have fully managed SaaS offerings such as Confluent Cloud or Azure Event Hubs in my mind.

Having said this, a fully managed cloud service is not always an option. For instance, Kafka at the edge is the new black. Plenty of use cases exist for a single broker or highly-available Kafka clusters at the edge.

But even if you want or need to operate Kafka by yourself: With KIP-500 and the removal of ZooKeeper, it gets easier and more lightweight than ever before. Many of the arguments for moving to a more “lightweight alternative” no longer apply. There might be good reasons to choose something like RedPanda. But the main argument of a simpler and more lightweight deployment no longer holds. Check out this video showing Kafka without ZooKeeper.

How to Choose the Right Kafka Distribution or Cloud Service?

So, how to make a comparison to find out which Kafka distribution or cloud service is the right one for your project?

The answer is simpler than you might think: The ultimate goal is to focus and solve your business problem.

How do you do that? By implementing business logic. Ideally, you don’t have to worry about infrastructure, operations, security, scalability, reliability, and non-business characteristics. Hence, SaaS with fully managed Kafka should be the first choice.

Unfortunately, SaaS is not always possible or the best option for many reasons:

Missing features
Technical limitations
Cost
Security requirements
On-premise or edge use cases

Therefore, we need to go one step back and understand what options you have to deploy and operate Kafka. Understanding these concepts without all the marketing fluff from the vendors is crucial to make the right decision!

The Kafka Car: An Analogy for a Product Comparison

It is often easier to compare technology by using an analogy from real life. Something everybody understands. No matter what industry you are in. No matter how technical you are. At first, I thought about using the analogy of pizza, including self-made pizza, pizza ingredients, restaurants, delivery services, and other related topics. But pizza is used so often in the IT world. This originated in the early days of Amazon. Jeff Bezos instituted a rule: Every internal team should be small enough to be fed with two pizzas.

In the end, I chose the analogy of a car because I think many of the arguments are less debatable this way. I guess we could never agree on what would be the best pizza option for most people…

Hence, let’s talk about car engines, car brands, self-driving, connected fleets, and vintage vehicles in the following sections.

Give me a self-driving car, please!

Obviously, most people would prefer a self-driving car (if the price is right). It is safe, cost-efficient, and comfortable.

In the Kafka context, this means that the Kafka infrastructure would be

cloud-native (= elastic, scalable and automated, ideally fully managed)
complete (= entire set of security and operational features that enterprises require)
everywhere (= available in multiple public clouds, private cloud, on-premise, edge outside the data center)

Unfortunately, not every Kafka setup can be self-driving. We need to disassemble a car into its parts to understand what’s going on under the hood. Then we can choose the right car for our business problem.

Car Brands: Comparison of Confluent, Cloudera, Red Hat, Amazon MSK

Competition creates innovation. Hence, it is great to see many car brands and car models on the streets. Similarly, many competing companies fight for market share around Kafka business. Let’s quickly think about the available car brands (= Kafka vendors) on the market.

I only focus on the most relevant ones that either care about the Kafka project and community, have a lot of market power, or ideally both.

The car brands are Confluent, Cloudera, Red Hat, and Amazon MSK. I have a section on other Kafka and non-Kafka streaming vendors at the end of the blog post to provide a more detailed comparison.

Again, the idea is NOT to have a feature-by-feature comparison or flame war. The following are a few facts about each vendor. I only focus on Kafka-related points. Hence, it is no surprise that Confluent looks best in the following list as they only focus on event streaming. But obviously, each vendor has strengths and weaknesses. For instance, if you want to discuss the overall cloud infrastructure capabilities and strategy, well, then AWS would look much stronger than all the other Kafka vendors…

Confluent – The Leading Apache Kafka Vendor

A few facts about Confluent:

Focus on event streaming
Original creators of Kafka
The main contributor to the Apache Kafka project with 80% of Kafka commits
Always the latest Kafka version (without limitations) and full support
Rich Kafka ecosystem (connectors, governance, security, etc.)
Hybrid architectures (including the only true fully-managed and complete Kafka service)
Partnership and 1st party integration into cloud providers (AWS, GCP, Azure) – e.g., you can use your cloud provider credits and account to consume Confluent Cloud
Certified for self-managed operations on cloud providers’ edge offerings (e.g., AWS Outposts including Wavelength, Google’s Anthos)

Cloudera – Big Data Analytics Suite

A few facts about Cloudera:

Focus on big data analytics
Provides a platform around tens of different big data frameworks for storage, batch, and real-time analytics
Kafka is part of the platform (Hadoop, Spark, Flume, Flink, many more) with tooling and support for the whole platform
Hybrid architectures (but no fully-managed Kafka service)
Partnership and 3rd party integration into cloud providers (AWS, GCP, Azure)

Red Hat (IBM) – Cloud-native PaaS Infrastructure

A few facts about Red Hat (IBM):

Focus on infrastructure (mainly around Linux and Kubernetes)
Kafka is available as part of the Red Hat AMQ product portfolio, combined with other open-source frameworks like ActiveMQ or Camel
OpenShift Streams for Apache Kafka provides integration with Kubernetes
Focus on open source frameworks; working actively with the community (for Kafka, Red Hat, e.g., contributes to Debezium for CDC and the Strimzi Kubernetes Operator)
Hybrid Architectures (but no fully-managed Kafka service)
Partnership and 3rd party integration into cloud providers (AWS, GCP, Azure)

Interesting side notes for the relationship between Confluent, Red Hat, and IBM:

IBM acquired Red Hat in 2019.
Confluent and IBM announced a strategic partnership in 2020.
IBM deprecated its own Kafka offering (IBM Streams) in March 2021.
Confluent is the way to go with IBM as part of the IBM Cloud Pak for Integration. Even IBM’s salespeople sell Confluent.

Amazon Web Services (AWS) – The Leading Cloud Provider

AWS focuses on cloud infrastructure and 1st party fully managed cloud services (S3, Kinesis, Lambda, etc.)

A few facts about Amazon MSK, the AWS offering for Kafka:

MSK misses several key Kafka features, including Kafka Connect or Kafka Streams
Cloud-only (but only self-managed, not fully managed)
MSK is not cloud-native (like S3 or Kinesis) but just provisioned infrastructure
Obviously only available on AWS
For on-premise deployments (like AWS Outposts or AWS Wavelength), the recommended Kafka product is the certified Confluent Platform

Interesting side note about the commercial support and SLAs of AWS’s Kafka offering: Kafka is excluded from MSK support! Quote from the MSK SLAs: “The Service Commitment DOES NOT APPLY to any unavailability, suspension, or termination… caused by the underlying Apache Kafka or Apache ZooKeeper engine software that leads to request failures…”

Event Streaming Technology and Cloud-native Infrastructure are Complementary!

The above showed a few facts for the main Kafka vendors: Confluent, Cloudera, Red Hat, AWS. However, it is worth explicitly pointing out that these vendors are often complementary. For instance, most Confluent Platform deployments I see on Kubernetes on-premise are actually on Red Hat OpenShift. And with AWS’s huge market share, most self-managed Confluent deployments in the cloud are on AWS.

Also, Confluent Platform is certified on AWS Outposts and Google Anthos. Hence, you can even combine cloud-native technologies at the edge. A great example is smart factory 5G use cases leveraging Confluent Platform on AWS Wavelength. Consequently, a Kafka comparison typically does not eliminate all the other Kafka vendors from the project.

The following architecture depicts the combination of Confluent Cloud in AWS plus Confluent Platform on AWS Wavelength leveraging 5G Carrier networks:

This is not just theory. The joint teams from AWS and Confluent are working on this example in the real world while I am writing this blog post.

Cloud-native? Complete? Everywhere? What Kafka should I buy?

After exploring different vendors, let’s now walk through the different deployment options and commercial offerings.

Again, I will not make a feature-by-feature comparison. Way more important is to understand the different concepts and architecture principles: First of all, you need to decide if a self-driving car (= fully managed Kafka) works for you. In that case, why bother at all about Kafka operations? Otherwise, project teams must evaluate partially managed (= complete car) or self-managed (car engine) Kafka offerings.

Here is an overview showing the event streaming landscape. It contains native Kafka offerings, (partly) Kafka-protocol compatible products, and a few relevant non-Kafka solutions:

Let’s now take a deeper look into these alternatives to find out how to choose the right one for your next project.

Car Engine: Self-managed Open Source Apache Kafka

The car engine is the heart of the car. It provides the power. It brings you from your source to your destination. However, a lot of work is needed around the engine. Tires, steering wheel, brakes, and much more are required. Hence, this is a great solution for playing around, learning how a car works, or for enthusiastic car fanatics who want to build their own car.

If you download open-source Apache Kafka from the Apache website or related Docker images, then you can use it for free in all your projects. No limitations. You should be able to get it running quickly. However, be aware that similar to the engine of a car, there is much more to do: Operating and monitoring the ZooKeeper and Kafka clusters, rebalancing partitions, scaling up and down, managing storage, securing and encrypting the end-to-end communication between producers, the Kafka cluster, and consumers, and so much more.

If you can handle the operations burden and risk of downtime, open-source Apache Kafka might be a good option. Some tech giants from Silicon Valley do exactly this. They have hired masses of tech experts (or car fanatics to keep the analogy) to run huge Kafka clusters to process trillions of messages and gigabytes of data per second.

Free Kafka add-ons to build your car

Plenty of open-source Kafka add-ons originated through this. Just to name a few tools for very different purposes: kafkacat, Kafka Manager, Kafdrop, burrow, cruise control, and so many more. Some are maintained well, others not at all. Of course, you will never get guarantees to get a version upgrade or bug fix. Often built by a tech giant for their specific scenario. Not easily usable outside that organization and without a big community.

Alternatively, there are well-maintained community projects like Confluent’s Schema Registry, REST Proxy, and ksqlDB, all under Confluent Community License (CCL). This is not open source but free to use if you are not a cloud provider like AWS. Confluent also provides some components under Apache 2.0 license, such as the widely used non-Java Kafka clients based on librdkafka or the parallel-consumer to integrate with non-scalable interfaces like web services in a scalable and performant way.

Tuned Car Engine: Self-managed Kafka Product

If you want or need to self-manage your Kafka infrastructure, then you still have more options than just using open-source Apache Kafka and (well or not so well maintained) open-source add-ons:

Open-source Apache Kafka with additional commercial tools for operations and monitoring. For instance, Lenses or Conduktor.
Complete commercial platforms. For instance, Confluent Platform, Red Hat AMQ, Cloudera DataFlow.

These “tuned car engines” are built on top of Apache Kafka (or at least parts of it) and provide additional tooling for development, operations, monitoring, security, etc. Maturity of the tools, support SLAs, expertise, and consulting vary a lot between vendors. I recommend talking to your potential vendors. Ask the right questions to find out if they really understand what they sell and support.

Shiny user interfaces attract many people. Just be careful. The underlying technology needs to work reliably and scale for your needs. The UI is nice to have on top of the infrastructure. Nevertheless, a good UI can improve the developer experience, shorten time to market, and bring other benefits.

Should I use a (Tuned) Car Engine and Build my own Car?

All the explored options above are still self-managed. If you consider building your own car with a car engine, always evaluate the cost-benefit equation.

Remember, at the beginning of this post, I talked about solving business problems. Hence, don’t forget to consider all the impacts on:

Total Cost of Ownership (TCO)
Risk (downtime, data loss, security, governance, etc.)
Return on Investment (ROI)
Time-to-market (Increased developer velocity and increased business agility)

At Confluent, we do TCO assessments with our prospects and customers so that they understand the complete costs and risks of a Kafka project. Such an assessment should be part of every Kafka comparison!

Hence, don’t forget to evaluate other alternatives to open-source Apache Kafka seriously. If self-managed Apache Kafka still works for you after the evaluation, then do it! But be aware that even the tech giants from Silicon Valley consider and buy other options today. Many had to build Kafka infrastructure because there was no other alternative when they built it years ago.

Does my favorite open-source vendor really provide open-source?

Also, be careful: Some open source solutions don’t provide an easy way to build the product. So, evaluate what exactly is available from a so-called “open source offering”: Only a binary download? Docker images? Or can you also build and deploy everything from scratch, easily and with proper documentation (not just in theory!), using Maven, Gradle, Terraform, Ansible, or similar build and automation tools?

Before building your own car with an available car engine, why not buy a car? Let’s consider next if a complete car might make more sense for you.

Complete Car: Kafka Products and Kubernetes-based PaaS

I am not a car fanatic. I want to buy a complete car that I can drive everywhere. You should also at least consider this option!

In Kafka terms, this means you get help from the product for running the Kafka infrastructure. Buzzwords from vendors include terms like “platform as a service”, “private cloud”, “fully managed”, and “cloud-native”. In the end, the products help you with provisioning, operating, and monitoring everything.

The main benefit compared to the (tuned) car engines is that these products give you a more elastic, scalable, and automated infrastructure. Two options exist today:

Kubernetes-based products that can run everywhere on-premise and across multiple cloud providers. Examples: Confluent Platform, Cloudera DataFlow (CDF), Red Hat AMQ, or Red Hat OpenShift Streams for Apache Kafka.
Proprietary cloud offerings that are typically tied to the related hyperscaler. Example: Amazon MSK.

This is similar to buying a car: It comes preassembled. Although, of course, you are still responsible for operating and maintaining it.

In Kafka terms, for many scenarios, these self-managed Kafka products (= car) are a better choice than the self-managed Kafka (= car engine) because they partially reduce the operations burden, risk, and (hopefully) TCO.

A complete car is still not self-driving!

However, as you know, today’s cars still need a lot of manual work: Driving, refueling, maintenance, and more. In Kafka terms: How much work do you still have to do by yourself? Do you have to handle rolling upgrades manually? Do you have to rebalance partitions on brokers? How do you scale up and down? Who fixes security issues and bugs? And so on.

Hence, it is really a pity that most vendors intentionally use incorrect marketing! None of the above solutions is fully managed. All of them require you to do work to operate the Kafka cluster. All of them! Confluent Platform. Cloudera DataFlow. Red Hat AMQ. Red Hat OpenShift Streams for Apache Kafka. Amazon MSK. None of these services is fully managed!

Each car brand and model is different. If you buy a Porsche, you probably have very different expectations than buying a small medium-priced car from another brand. The same is true for all the self-managed Kafka products on the market. Each product is very different: Confluent Platform. Cloudera DataFlow. Red Hat AMQ / OpenShift Streams for Apache Kafka. Amazon MSK. All of them have strengths and weaknesses. Make sure it fits your expectations so that you can solve your business problem within your required SLAs and budget.

Having said this, wouldn’t it be nice if you don’t have to worry about all these things? Let’s explore the self-driving Kafka car next.

Self-driving Car: Fully-managed Cloud Kafka Service

A self-driving car provides a complete solution. You just tell it where you want to go. It drives you automatically. Chooses the best route. Allows relaxing, reading, playing games, or similar things. Of course, an autonomous car with level 5 automation is not mature yet (beyond some early stages, like Waymo operating in the desert in Phoenix where no rain and other weather or traffic issues occur).

In Kafka terms, the solution needs to be fully managed by the vendor.

Fully managed means serverless (i.e., you don’t have to care about the Kafka brokers and don’t even get access to them). Mission-critical SLAs. Usage-based billing. And so on. Just as you know it from other truly fully managed cloud offerings such as AWS S3 or AWS Kinesis. These are fully managed. Amazon MSK is not!

Checklist to compare partially managed and fully-managed Kafka cloud services

Please compare different Kafka cloud offerings by yourself. Here are some bullet points to check:

Infrastructure management

Upgrades (latest stable version of Kafka)
Patching
Maintenance

Kafka-specific management

Sizing (retention, latency, throughput, storage, etc.)
Data balancing for optimal performance
Performance tuning for real-time and latency requirements
Fixing Kafka bugs
Uptime monitoring and proactive remediation of issues
Recovery support from data corruption

Scaling

Scaling the cluster as needed
Data balancing the cluster as nodes are added
Support for any Kafka issue with less than 60-minute response time
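One way to use the checklist above in an evaluation is to record, per item, whether the vendor or your own team does the work. The following Python sketch is only an illustration: the item names are shortened from the bullet points, the helper functions are hypothetical, and the sample offering is invented rather than a statement about any real product.

```python
from typing import Dict, List, Set

# Checklist items shortened from the bullet points above, grouped by area.
CHECKLIST: Dict[str, List[str]] = {
    "Infrastructure management": ["Upgrades", "Patching", "Maintenance"],
    "Kafka-specific management": [
        "Sizing",
        "Data balancing",
        "Performance tuning",
        "Fixing Kafka bugs",
        "Uptime monitoring and remediation",
        "Recovery support from data corruption",
    ],
    "Scaling": [
        "Scaling the cluster",
        "Rebalancing as nodes are added",
        "Support response under 60 minutes",
    ],
}


def vendor_managed_share(handled_by_vendor: Set[str]) -> Dict[str, float]:
    """Return, per area, the fraction of checklist items the vendor takes over."""
    return {
        area: sum(item in handled_by_vendor for item in items) / len(items)
        for area, items in CHECKLIST.items()
    }


def is_fully_managed(handled_by_vendor: Set[str]) -> bool:
    """An offering counts as fully managed only if the vendor covers every item."""
    return all(share == 1.0 for share in vendor_managed_share(handled_by_vendor).values())


# Invented example: an offering that only takes care of the infrastructure layer.
sample_offering = {"Upgrades", "Patching", "Maintenance"}
print(vendor_managed_share(sample_offering))
print(is_fully_managed(sample_offering))
```

The strict `all(...)` check mirrors the argument of this section: an offering that leaves even part of the Kafka-specific work to you is partially managed, no matter what the marketing material says.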

Most “Kafka as a service” offerings are only partially managed. That’s like a self-driving car which you actually have to control by yourself (more like level 3, not level 5 in autonomous driving terminology).

At this point, I have to do marketing for my employer. However, it is not an advertisement, but reality: Confluent Cloud is the only offering on the market that provides a complete, fully-managed Kafka SaaS offering. And it is available everywhere – on all major cloud providers (AWS, Azure, GCP). All the other Kafka offerings are NOT fully managed – even though most vendors claim it!

Other Vehicles on the Street: Comparison of Kafka-Compatible and Non-Kafka Offerings

On the street, we don’t see just one car brand or car model. Plenty of different ones exist. Nevertheless, they have to drive on the same streets. Competition creates innovation and tackles different markets and personal interests. That’s great. The same is true for Kafka!

I focused on the “mainstream Kafka vendors” in the above sections. Namely, Confluent, Cloudera, Red Hat, Amazon MSK. Obviously, more Kafka offerings exist on the market. Some are really good for some use cases. Others are more like an April Fools’ Joke, in my opinion. Let’s quickly walk through a few other offerings.

A few more car brands: Azure HDInsight’s Kafka, Aiven, cloudkarafka, Instaclustr. These Kafka-native PaaS vendors provision Kafka clusters for you, similar to Amazon MSK. These offerings differ slightly from each other. In summary, they typically ask you to do storage management, scalability configuration, performance tuning, etc., by yourself. This is definitely not self-driving!
A self-driving car: Azure Event Hubs is a SaaS offering from Microsoft supporting the Kafka protocol. It has several limitations regarding support of the Kafka API and infrastructure. A solid product. Contrary to Confluent Cloud, you don’t get additional capabilities such as fully-managed connectors, Schema Registry, RBAC, Audit Logs, and much more. And obviously, this product is only available on the Azure cloud.
A vintage car: TIBCO focuses on their legacy messaging solutions like TIBCO EMS. They (try to) provide support for Kafka (and Pulsar) to sell their proprietary technologies. Zero expertise or interest in Kafka. They even provide Kafka as .exe Windows file even though this does not work well in reality. If you need to run Kafka brokers on Windows (e.g., for development), only use Kafka Docker containers and the Windows Subsystem for Linux 2 (WSL 2).

Non-Kafka offerings

Self-driving scooters: AWS Kinesis, GCP Pub/Sub, etc., are solid SaaS offerings that work well if you don’t need to be vendor-agnostic and if the feature set, scalability, and pricing work for you.
A few bicycles, motorbikes, and cars: Non-Kafka solutions, including message queuing (IBM MQ, RabbitMQ, NATS), stream processing (Flink, Spark Streaming), event streaming (Pulsar, Pravega), integration middleware (many open-source/proprietary and self-managed/SaaS). These are solid frameworks and products that you can compare to Kafka. There is no silver bullet! Make sure to understand the differences between MQ/ETL/ESB and Kafka when you do your evaluation.

Connected Car Fleet: Multiple Kafka Clusters and Hybrid Integration

The digital transformation around connected vehicles is a real game-changer. Vehicles talk to each other (V2V), their infrastructure like traffic lights (V2I), and to many other backend systems (V2X).

As a side note: If you are interested in the relation of Kafka and connected vehicles/mobility services, I covered use cases for connected vehicles and V2X in my blog series about Kafka and MQTT in more detail.

Today, we usually have to drive by ourselves. This is expected to change in the next five to ten years. However, even if Waymo, Tesla, and the like successfully deploy level 4 and level 5 cars to the street (including legal allowance), we will still only see a fraction of all cars driving themselves. It will be a connected fleet with regular cars and self-driving cars for at least a few decades. I am not even sure if self-driving cars can ever go to India.

The same is true for Kafka. Self-managed open-source Kafka is still mainstream today. Many enterprises move to Kubernetes and private or public cloud infrastructure, though. In parallel, most new Kafka clusters in the cloud are consumed from vendors that provide partially or fully managed services so that the enterprise can focus on their business problems.

Kafka is deployed across infrastructures. Often, new projects have a cloud-first strategy. But there are still a lot of data centers out there. Not just for legacy reasons. For instance, in Russia, there is no public cloud provider at all. Kafka has to be deployed on-premise. And there is the trend of deploying Kafka at the edge (i.e., outside a data center).

Architectures for Hybrid Kafka (SaaS + PaaS + Self-Managed)

Hence, a connected car fleet with various brands and operation types is required. Most enterprises use different vendors and cloud providers. Most enterprises have their own data centers and a multi-cloud strategy. Hybrid Kafka includes various architectures. This includes:

Kafka in one or multiple clouds. There is no Azure or GCP in China, only Alibaba and Tencent Cloud. This is why Audi built their connected car infrastructure in the cloud, but with Kafka instead of proprietary cloud services. They need to deploy globally.
Kafka at the edge outside the data center, e.g., in a smart factory, oil rigs, ships, retail stores, etc. Often deployed as a single broker on very lightweight hardware, without high availability.
Kafka stretched across regions, i.e., one single cluster operating across the US west, east, central. Confluent’s Multi-Region Clusters (MRC) is mainly used for this architecture.
Replication between different Kafka clusters. Use cases include aggregation, disaster recovery, global deployments, and more. Kafka-native technologies such as MirrorMaker 2, Confluent Replicator, or Confluent Cluster Linking enable these architectures.

“Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” explores this topic in more detail.

Focus on the Business Problem when making your Kafka Comparison!

This blog post explored the different deployment options for Kafka. Several open-source and commercial options exist.

If you want to remember one thing from this post: A fully-managed Kafka service (= real SaaS) takes over all the operations complexity and risk for you, similar to how a self-driving car handles all the actions on the street. However, most services available today only provide self-managed Kafka clusters. Fully managed is often only a marketing term.

A hybrid architecture is the norm in most enterprises. A combination of fully-managed Kafka in the public cloud with self-managed Kafka on premise or at the edge works very well and is the way to go for most enterprises across industries.

What Kafka car do you drive today? What is your plan for the future? Maybe you are already planning to migrate to a self-driving car to focus on your business problems – and consequently reducing cost and risk this way, too? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Comparison of Open Source Apache Kafka vs Vendors including Confluent, Cloudera, Red Hat, Amazon MSK appeared first on Kai Waehner.

Pulsar vs Kafka – Comparison and Myths Explored

Kai Waehner — Tue, 09 Jun 2020 15:32:29 +0000

Pulsar vs Kafka – which one is better? This blog post explores pros and cons, popular myths, and non-technical criteria to find the best tool for your business problem.

My discussions are usually around Apache Kafka and its ecosystem as I work for Confluent. The only questions I got about Pulsar in the last years came from Pulsar committers and contributors. They asked me deep technical questions so as to be able to explain where Kafka sucks and why Pulsar is the much better option. Discussions about this topic on platforms like Reddit are typically very opinionated, often inaccurate, and brutal. The following is my point of view based on years of experience with open source streaming platforms.

Tech comparisons are the new black: Kafka vs. Middleware, Event Streaming and API Platforms

Tech comparisons are meant to guide people to choose the right solution and architecture for their business problem. There is no all-rounder, and there should be no bias. Choose the right tool for the problem.

However, technical comparisons are almost always biased. Even if the author does not work for a vendor and is an “independent” consultant, he or she is still likely to have a biased opinion from past experiences and knowledge, whether purposely or unknowingly. Still, comparisons from different perspectives are useful, and we’ve seen Apache Pulsar discussed in a few places on the internet, so I wanted to share my personal views of how Kafka and Pulsar compare. I work for Confluent, the leading experts behind Apache Kafka and its ecosystem, so keep that in mind, but the aim of this post is not to provide opinion, it’s to weigh up facts rather than myths.

Technical comparisons of open source frameworks and commercial software products happen all the time. I did several comparisons in the past on my blog or other platforms like InfoQ, including a Comparison of integration frameworks, Choosing the right ESB for your integration needs, Kafka vs. ETL / ESB / MQ, Kafka vs. Mainframe and Apache Kafka and API Management / API Gateway. All these comparisons were done because customers wanted to understand when to use which tool.

For Pulsar vs. Kafka, the situation is a little bit different.

Why compare Pulsar and Kafka?

Talking to prospects or customers, I rarely get asked about Pulsar. To be fair, this increased slightly in the last months. I guess the question comes up in every ~15th or ~20th meeting due to the overlapping feature set and use cases. However, this seems to be mostly due to a few posts on the internet that claim Pulsar is in some ways better than Kafka. There is no fact-checking and very little material, if any, for the opposing view.

I have not talked to a single organization that seriously considered deploying Pulsar in production, although I know there are a large number of users out there in the world who need a distributed messaging technology like Kafka or Pulsar. But I also think that Pulsar’s alleged reference users are not particularly accurate.

For example, their flagship user is Tencent, a large Chinese tech company, but Tencent is a huge Kafka user, whereas Pulsar’s use is limited to just one project. Tencent processes ten trillion messages per day (in digits: 10,000,000,000,000) with Kafka. As it turns out, Tencent uses Kafka roughly 1000x more than Pulsar (ten trillion msg/day vs. tens of billions of msg/day). The Tencent team discussed their Kafka deployment in more detail: How Tencent PCG Uses Apache Kafka to Handle 10 Trillion+ Messages Per Day.
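The order-of-magnitude comparison can be checked with a few lines of Python. This is plain arithmetic on the figures quoted above; the only assumption is that "tens of billions" is taken at its lower bound of ten billion messages per day, so the real ratio may be somewhat smaller.

```python
SECONDS_PER_DAY = 24 * 60 * 60

kafka_msgs_per_day = 10_000_000_000_000  # ten trillion, as quoted above
pulsar_msgs_per_day = 10_000_000_000     # assumption: lower bound of "tens of billions"

# How many times more messages go through Kafka than through Pulsar.
ratio = kafka_msgs_per_day / pulsar_msgs_per_day

# Average Kafka message rate, assuming an even spread over the day.
kafka_msgs_per_sec = kafka_msgs_per_day / SECONDS_PER_DAY

print(f"Kafka vs. Pulsar volume: {ratio:,.0f}x")
print(f"Average Kafka rate: {kafka_msgs_per_sec:,.0f} msg/s")
```

Ten trillion messages per day works out to an average of well over one hundred million messages per second, which gives a sense of the scale behind the comparison.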

Comparison of two competitive open source frameworks

Apache Kafka and Apache Pulsar are two exciting and competing technologies. Therefore, it makes a lot of sense to compare them. Period.

Both Apache Kafka and Apache Pulsar have very similar feature sets. I recommend that you evaluate both frameworks for available features, maturity, market adoption, open source tools and projects, training material, availability of local meetups, videos, blog posts, etc. Reference use cases from your industry or business problems help in making the right decision.

Confluent published such a comparison of “Kafka vs. Pulsar vs. RabbitMQ: Performance, Architecture, and Features Compared“. I was involved in creating this comparison. So we have that comparison already…

What is this blog post here about then?

I want to explore the myths from some ‘Kafka vs. Pulsar’ arguments which I see regularly in blog posts and forum discussions. Afterwards, I will give a more comprehensive comparison beyond just technical aspects because most Pulsar discussions focus purely on tech features.

Kafka vs Pulsar – Technology myths explored

The following discusses some myths I have come across. I agree with some of them, but also counter some others with hard facts. Of course, different opinions can exist for some of these statements. Again, this is totally fine. The following is my point of view.

Myth 1: “Pulsar has differentiating built-in features compared to Kafka”?

True.

If you compare Apache Kafka to Apache Pulsar, features like its tiered architecture, queuing, and multi-tenancy are mentioned as differentiators.

But:

Kafka has many differentiating features, too:

Half as many servers to run
Data saved to disk only once
Data cached in memory only once
Battle-tested replication protocol
Zero copy performance
Transactions
Built-in stream processing
Long term storage
In the works: ZooKeeper removal (KIP-500), which makes Kafka even simpler to operate and deploy than Pulsar (which has a four-component architecture of Pulsar, ZooKeeper, BookKeeper, and RocksDB), apart from making Kafka more scalable, more resilient, etc.
In the works: Tiered Storage (KIP-405), which makes Kafka more elastic and cost-efficient.

Also ask yourself: Should you really compare just the open source frameworks or products and vendors with their complete offering?

It is easy to add new features if you don’t have to provide mission-critical support for it. Don’t just evaluate features in a checklist, but also evaluate how they are battle-tested in production scenarios. How many “differentiating features” are low-quality and implemented quickly vs. high-quality implementations?

For instance: It took a few years to implement and battle-test Kafka Streams as a Kafka-native stream processing engine. Do you really want to compare this to Pulsar Functions? The latter is a feature to add user-defined functions (UDF), without any relation to “real stream processing”. Or is this more like Single Message Transformations (SMT), a core feature of Kafka Connect? Just be sure to a) compare apples to apples (instead of apples to oranges) and b) think about the maturity of a feature. The more powerful and critical, the more mature it should be…

The Kafka community puts a large amount of effort into improving the core project and its ecosystem. Confluent alone has over 200 full-time engineers working on the Kafka project, additional community components, commercial products and the SaaS offering on major cloud providers.

Myth 2: “Pulsar has a few very big users like Tencent in China”?

True.

But: Tencent actually uses Kafka more than Pulsar. The billing department, which uses Pulsar, is only a small fraction at Tencent, whereas a large portion of the core business is using Kafka, and they have a Global-Kafka like architecture that combines 1000+ brokers into a single logical cluster.

Always be cautious with open source projects. Check out the success at “normal companies”. Just because a tech giant uses it, does not mean it will work for your company well. How many Fortune 2000 companies shared their success stories around Pulsar in the past?

Look for proof points beyond tech giants!

Proof points beyond the tech giants are helpful to get insights and lessons learned from other people, not from the software vendors. The Kafka website gives many examples of mission-critical deployments. Even more impressive: At the past Kafka Summit conferences in San Francisco, New York and London, every year various enterprises from different industries present their use cases and success stories, including Fortune 2000 companies, mid-size enterprises and startups.

Just to give you one specific example in the Kafka world: Various different implementations exist for replication of data in real time between separate Kafka clusters, including MirrorMaker 1 (part of the Apache Kafka project), MirrorMaker 2 (part of the Apache Kafka project), Confluent Replicator (built by Confluent and only available as part of Confluent Platform or Confluent Cloud), uReplicator (open sourced by Uber), Mirus (open sourced by Salesforce), Brooklin (open sourced by LinkedIn).

In practice, only two options are reasonable if you don’t want to maintain and improve the code by yourself: MirrorMaker 2 (very new, not mature yet, but a great option mid and long term) and Confluent Replicator (battle-tested in many mission-critical deployments, but not open source). All the other options work, too. But who maintains the projects? Who solves bugs and security issues? Who do you call when you have a problem in production? Deployment in production for mission-critical deployments is different from evaluating and trying out an open source project.

Myth 3: “Pulsar provides message queuing and event streaming in a single solution”?

Partly.

Message queues are used for point-to-point communication. They provide an asynchronous communications protocol, meaning that the sender and receiver of the message do not need to interact with the message queue at the same time.

Pulsar has only limited support for message queuing, and limited support for event streaming. If it wants to compete in either area, it still has a long way to go for two reasons:

1) Pulsar has only limited support for message queuing because it misses popular messaging features like message XA transactions, routing, message filtering, etc. that are commonly used with messaging systems like IBM MQ, RabbitMQ, and ActiveMQ. Pulsar’s “adapters” for messaging systems are similarly limited. While they may look nice on paper, they are less useful in practice.

2) Pulsar has only limited support for event streaming. For example, it does not support exactly-once delivery and processing semantics, which disqualifies it for most use cases in practice – you would never implement, say, a payment processing system with Pulsar as it may cause duplicate payments, or lose payments. It also lacks functionality to perform stream processing with features like joins, aggregations, windowing, fault-tolerant state management, and event-time based processing. Pulsar’s “topics” functionality is also different to Kafka’s, and suffers from BookKeeper’s origins, as it was conceived and designed in 2008 as a write ahead log for Hadoop’s HDFS namenode, with only short-lived data storage in mind.

Side note: Pulsar’s “Kafka adapter”, like its messaging siblings, is similarly limited. While it may look nice on paper, it is less useful in practice because it supports only a small subset of Kafka functionality.

Like Pulsar, Kafka has only limited support for message queuing.

In Kafka, different workarounds can be used to realize “real queuing” behavior. If you want to use separate message queues instead of shared Kafka topics for:

Security? => Use Kafka’s ACLs (and optional tools like Confluent’s role-based access control aka RBAC).
Semantics (i.e. separate applications)? => Use Kafka’s consumer groups.
Load balancing? => Use Kafka’s partitions.
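To make the consumer group and partition points above concrete, here is a minimal sketch in plain Python. It is not the Kafka client API and not Kafka's actual assignment strategy; the function `assign_partitions` and the group and consumer names are made up for illustration. It only shows the idea: consumers inside one group split the partitions of a topic between them (load balancing), while a second group independently sees every partition (separate semantics per application).

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Round-robin the partitions of a topic over the consumers of ONE group.

    Simplified stand-in for a group assignment strategy (hypothetical helper).
    """
    assignment = defaultdict(list)
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)


partitions = list(range(6))

# Load balancing: three consumers in the same group share the six partitions.
billing_group = assign_partitions(partitions, ["billing-1", "billing-2", "billing-3"])

# Separate application: a second group gets its own full view of the topic.
audit_group = assign_partitions(partitions, ["audit-1"])

print(billing_group)
print(audit_group)
```

Within a group, each partition has exactly one reader, which is what gives queue-like work distribution; across groups, every application reads the full stream.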

I typically ask customers what exactly they want to do with queuing. Often, Kafka provides out-of-the-box solutions for use cases which simply require thinking of the solution in new terms. Also, the number of high throughput use cases that need queuing is relatively small.

Having explained all these workarounds and limitations of Pulsar and Kafka for messaging, let’s be clear: Neither Kafka nor Pulsar provides a “real messaging solution”.

If you really need a messaging solution, shouldn’t you better choose a “real messaging framework” like RabbitMQ or NATS for a messaging problem anyway?

There is no ‘yes or no’ answer to this. I see many customers replacing existing messaging systems like IBM MQ with Kafka (for scalability and cost reasons). Know the options, their trade-offs, and do an evaluation to solve your problem the best way…

Myth 4: “Pulsar provides stream processing”?

False.

Or to be fair: It depends on your definition of stream processing. Is it only rudimentary features, or full-fledged stream processing?

In one sentence, I typically explain stream processing as continuous consumption, processing, and aggregation of events from different data sources. In real time. At scale. And, of course, in a fault-tolerant manner, including (and especially) for any stateful processing operations.

Pulsar provides only rudimentary functionality for stream processing, using its Pulsar Functions interface. This is suited for simple callbacks, but it isn’t a true stream processing offering like you get with Kafka Streams or ksqlDB for building streaming applications that include stateful information, sliding windows, and other stream processing concepts. Use cases exist in every industry. For instance, check out the Kafka Streams website for examples from the New York Times, Pinterest, Trivago, Zalando, and others.
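For readers unfamiliar with what “stateful” and “windowed” mean here, the following is a small illustrative sketch in plain Python. It is not Kafka Streams or ksqlDB code; the function name and the sample events are invented. It counts events per key in fixed (tumbling) windows based on event time, so a late-arriving event still lands in the window it belongs to.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_ms):
    """Count events per (key, window start), using event time rather than arrival order."""
    counts = defaultdict(int)  # this dictionary is the "state" of the computation
    for event_time_ms, key in events:
        window_start = (event_time_ms // window_ms) * window_ms
        counts[(key, window_start)] += 1
    return dict(counts)


events = [
    (1_000, "user-a"),
    (4_000, "user-b"),
    (9_500, "user-a"),
    (12_000, "user-a"),
    (3_000, "user-a"),  # arrives late, but belongs to the first window
]

result = tumbling_window_counts(events, window_ms=10_000)
print(result)
```

The sketch keeps its state in memory and loses it on a crash. The hard part of a real stream processing engine is exactly what is missing here: keeping that state fault-tolerant and consistent at scale.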

Streaming analytics examples with Pulsar typically use Pulsar in conjunction with another “proper” stream processing framework like Apache Spark or Apache Flink, which of course means you now need to operate even more additional pieces of distributed infrastructure and to understand their complex interactions.

Myth 5: “Pulsar provides exactly-once semantics like Kafka”?

False.

Pulsar provides a deduplication feature that ensures that a message will not be stored in the Pulsar broker twice, but nothing prevents a consumer from reading this message multiple times. This is insufficient for any form of stream processing use case where both input and output are from Pulsar.

Also, unlike Kafka’s Transactions feature, it is not possible to accurately tie messages committed to state recorded inside a stream processor.
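The difference between write-side deduplication and end-to-end exactly-once processing can be sketched with a toy model. Everything below is hypothetical (the `Broker` class, the `process` function, the payment example) and mirrors neither Pulsar's nor Kafka's API. The broker drops a retried write, yet a consumer that crashes after its side effect but before committing its offset will still process the same message twice.

```python
class Broker:
    """Toy broker that ignores duplicate writes, keyed by (producer_id, sequence)."""

    def __init__(self):
        self.log = []
        self.seen = set()

    def append(self, producer_id, sequence, value):
        if (producer_id, sequence) in self.seen:
            return False  # duplicate write is dropped
        self.seen.add((producer_id, sequence))
        self.log.append(value)
        return True


def process(broker, committed_offset, crash_before_commit):
    """Read from the committed offset, apply a side effect, then commit the offset."""
    side_effects = []
    offset = committed_offset
    for value in broker.log[committed_offset:]:
        side_effects.append(f"charged {value}")
        if crash_before_commit:
            return side_effects, committed_offset  # the offset was never committed
        offset += 1
    return side_effects, offset


broker = Broker()
broker.append("producer-1", 0, "payment-42")
broker.append("producer-1", 0, "payment-42")  # producer retry, deduplicated on write

charges = []
committed = 0
effects, committed = process(broker, committed, crash_before_commit=True)
charges += effects
effects, committed = process(broker, committed, crash_before_commit=False)  # after restart
charges += effects

print(len(broker.log), charges)
```

The log holds the payment once, but it is charged twice. Avoiding this requires that the processing result and the consumed position are committed together, which is the role the text above assigns to Kafka's Transactions feature.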

Exactly-Once Semantics (EOS) are available since Kafka 0.11 (released three years ago) and used in many production deployments. Kafka’s EOS supports the whole Kafka ecosystem, including Kafka Connect, Kafka Streams, ksqlDB and clients like Java, C, C++, Go or Python. Kafka Summit had several talks about Kafka’s EOS functionality, including this great intro for everybody, with slides and video recording.

Myth 6: “Pulsar’s performance is much better than Kafka’s”?

False.

I am not a fan of most “benchmarks” of performance and throughput. Benchmarks are almost always opinionated and configured for a specific problem (no matter if a vendor, independent consultant, or researcher conducts them).

For example, there is one benchmark published by GIGAOM, which compares the latency and performance of Kafka versus Pulsar. But this benchmark deliberately slowed Kafka down by forcing it to synchronize to disk on every single message by setting the Kafka config ‘flush.messages = 1’ (this makes every request cause an fsync). The benchmark also forces the Kafka Consumer to acknowledge synchronously while the Pulsar consumer acknowledges asynchronously. Unsurprisingly, this benchmark setup makes Pulsar the seemingly clear “winner”. But this benchmark does not mention or explain this significant configuration difference in the setup and measurements. This is what some people call an apples-to-oranges comparison.
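To get a feeling for why a per-message fsync matters, here is a small standalone Python sketch. It does not touch Kafka at all; it simply appends records to a temporary file and forces them to disk either after every record or once per batch. The function name `write_messages` and the message sizes are arbitrary choices for illustration.

```python
import os
import tempfile


def write_messages(messages, flush_every):
    """Append messages to a temp file, forcing an fsync after every `flush_every` messages."""
    fsync_calls = 0
    with tempfile.NamedTemporaryFile() as log_file:
        for i, message in enumerate(messages, start=1):
            log_file.write(message)
            if i % flush_every == 0:
                log_file.flush()             # push Python's buffer to the OS
                os.fsync(log_file.fileno())  # force the OS to write to the device
                fsync_calls += 1
    return fsync_calls


messages = [b"x" * 100 for _ in range(200)]

per_message = write_messages(messages, flush_every=1)    # comparable in spirit to flush.messages = 1
batched = write_messages(messages, flush_every=200)      # one sync for the whole batch

print(per_message, batched)
```

The first variant issues 200 fsync calls for the same data that the second variant persists with a single one. Timing both on your own hardware is a quick way to see how much a single setting like this can skew a benchmark.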

Pulsar’s architecture actually requires higher network utilization (due to the Pulsar broker tier which acts as a proxy in front of BookKeeper bookies) as well as twice the I/O (as BookKeeper writes data to a write ahead log as well as to the main segment).

Confluent did some benchmarks, too. These were more of an apples-to-apples comparison. Not surprisingly, the results were different. But should you really care about these benchmark fights from software vendors?

Think about your performance requirements. Do a proof of concept (POC) with Kafka and Pulsar, if you must. I bet that in 99% of scenarios, both will show acceptable performance for your use case. Don’t trust opinionated benchmarks from others! Your use case will have different requirements and characteristics anyway, and typically performance is just one of many evaluation dimensions.

Myth 7: “Pulsar is easier to operate than Kafka”?

False.

Both Kafka and Pulsar are hard to operate if you don’t use additional tooling.

Kafka includes two distributed systems: Kafka itself and Apache ZooKeeper.

But: Pulsar includes three distributed systems and an additional storage technology: Pulsar, ZooKeeper, and Apache BookKeeper. Like Pulsar, BookKeeper uses ZooKeeper, too. And lastly, RocksDB is used for certain storage tasks. This means that Pulsar has a significantly higher complexity to understand, tweak, and tune than Kafka. Additionally, Pulsar also has more configuration parameters than Kafka.

Kafka is firmly going into the opposite direction and is removing ZooKeeper (see KIP-500) so that you have just one distributed system to deploy, operate, scale and monitor:

ZooKeeper is Kafka’s biggest scalability bottleneck and comes with operational challenges — This is true for Kafka but even more so for Pulsar!

One of the key issues of my customers is how to run ZooKeeper in mission-critical deployments at scale. Therefore I am really looking forward to Kafka’s simplified architecture, where you will deploy Kafka brokers only. This also establishes a unified security model, as ZooKeeper’s security no longer needs to be separately configured. This is a huge benefit, especially for larger organizations and regulated industries. Compliance and information security departments will thank you for this simplified architecture.

Operations is NOT just about Architecture!

Kafka is significantly better documented, has a tremendously larger community of experts, and a vast array of supporting tooling that make operations easier.

Additionally, there are many options for local and online Kafka training, including online courses, books, meetups, and conferences. You won’t find much for Pulsar, unfortunately.

Myth 8: “An architecture with three tiers is better than two tiers”?

It depends.

Personally, I am skeptical that Pulsar’s three tier architecture (using Pulsar brokers, ZooKeeper and BookKeeper) is an advantage for most projects. It is a trade-off!

Twitter described their move away from BookKeeper + DistributedLog (the latter a system very similar to Pulsar, with comparable architecture and design) just over a year ago, citing the advantages of Kafka’s single-tier architecture, such as cost efficiency and better performance, over a two-tier architecture that decouples storage and serving.

Like Pulsar, DistributedLog is built on top of BookKeeper and adds streaming-like functionality with an architecture and concepts similar to Pulsar (e.g., using decoupled storage and serving tiers). DistributedLog was originally a standalone project but eventually became a sub-project of BookKeeper, though nowadays it appears to be no longer actively developed (only a few commits in the past 12 months). The main reasons Twitter cited for switching to Kafka were (1) significant cost savings and performance gains and (2) Kafka’s huge community and adoption. For example, they concluded: “For single consumer use cases, we saw a 68% resource savings, and for fanout cases with multiple consumers, we saw a 75% resource savings.”

There are benefits from a three-tier architecture to build a scalable infrastructure. But the extra layer also increases network utilization by (at least) 33%, and data held in Pulsar’s brokers must additionally be cached in both layers for equivalent performance, and also written to disk twice because the storage format of BookKeeper is not based on a log.
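The text does not spell out where the 33% comes from. One plausible reading, stated here purely as an assumption, is to count write-path network transfers when three copies of each message are kept: in Kafka the producer sends to the leader, which replicates to two followers, while in Pulsar the producer sends to a broker, which then writes to three bookies. The helper names below are made up for this back-of-the-envelope calculation.

```python
def kafka_transfers(copies):
    # Assumption: producer -> leader, then leader -> each of the other replicas.
    return 1 + (copies - 1)


def pulsar_transfers(copies):
    # Assumption: producer -> broker, then broker -> each bookie storing a copy.
    return 1 + copies


copies = 3
extra = pulsar_transfers(copies) / kafka_transfers(copies) - 1
print(f"Kafka: {kafka_transfers(copies)} transfers, Pulsar: {pulsar_transfers(copies)} transfers")
print(f"Additional network transfers: {extra:.0%}")
```

Under these assumptions the extra hop costs four transfers instead of three. Consumer reads and other traffic are left out, so treat this as a rough model rather than a measurement.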

On the cloud, where most Kafka deployments are being run, the best backing storage tier is in fact not a niche technology like BookKeeper, but a widely used and battle-tested object store like AWS S3 or GCP GCS.

Tiered Storage in Confluent Platform, which is backed by the likes of AWS S3 and GCP GCS, provides the same benefits without Pulsar’s extra layer of BookKeeper and the resulting extra network transfer cost and latency that this architecture incurs. It took Confluent two years to build and make Tiered Storage for Kafka generally available, including global 24/7 support for your most mission-critical data. Tiered Storage is not available yet for open source Apache Kafka, but Confluent is working with the rest of the Kafka community (including some major tech companies like Uber) on KIP-405 to add Tiered Storage to Kafka with different storage options.

There are always pros and cons for both architectures. Personally, I think that 95% of projects do not need a complex three-tier architecture. And where they make sense it is to add the advantages of external, price-efficient storage. You should care about 24/7 service level agreements (SLA), scalability, and throughput. Plus integration into your ecosystem as well as security, management tooling, and support. If your requirements call for a three-tier architecture, then of course give it a go!

Sub-Myth: “Pulsar is better for lagging consumers because of its caching layer and storage layer”?

False.

The main problem with lagging consumers is that they exhaust the page cache: recent messages are already cached, but reads from older segments replace them, reducing the performance of consumers reading from the head of the log.

Pulsar’s architecture is actually worse in this regard. It retains the same issue around cache-flushing, but now the reads must do an extra network hop and I/O rather than just reading from the local media.
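The cache effect can be illustrated with a tiny least-recently-used cache standing in for the page cache. This is a deliberately crude model with invented names and sizes, not a description of how any operating system or broker actually manages memory. A writer adds the newest segment, an optional lagging consumer reads old segments, and then a consumer at the head of the log tries to read the newest segment.

```python
from collections import OrderedDict


class PageCache:
    """Tiny LRU cache standing in for the OS page cache (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.segments = OrderedDict()

    def _touch(self, segment):
        self.segments[segment] = True
        self.segments.move_to_end(segment)
        if len(self.segments) > self.capacity:
            self.segments.popitem(last=False)  # evict the least recently used segment

    def write(self, segment):
        self._touch(segment)

    def read(self, segment):
        hit = segment in self.segments
        self._touch(segment)  # on a miss the segment is loaded from disk into the cache
        return hit


def simulate(lagging_reads_per_step, steps=10, capacity=4):
    cache = PageCache(capacity)
    head_misses = 0
    old_segment = 0
    for head_segment in range(100, 100 + steps):
        cache.write(head_segment)                # producer appends new data
        for _ in range(lagging_reads_per_step):  # lagging consumer reads old data
            cache.read(old_segment)
            old_segment += 1
        if not cache.read(head_segment):         # consumer at the head of the log
            head_misses += 1
    return head_misses


print(simulate(lagging_reads_per_step=0), simulate(lagging_reads_per_step=4))
```

Without the lagging consumer, the head consumer always hits the cache. With it, every read of fresh data in this model goes to disk, and in a design with a separate storage tier that miss additionally costs a network hop.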

Myth 9: “Kafka does not scale as well as Pulsar”?

False.

This is one of the key arguments by the Pulsar community. As I said before, this always depends on the chosen benchmark. For example, I have seen tests with equivalent computing resources where Kafka did significantly better at high throughputs than Pulsar. Here is a “Pulsar vs. Kafka benchmark” where Kafka is much faster than Pulsar:

Scalability is not a problem for most use cases. You can easily scale up Kafka to process several gigabytes per second, as you can see in a demo to “Scale Apache Kafka to 10+ GB Per Second in Confluent Cloud“:

Honestly speaking, less than 1% of users should be worried about this discussion at all. If you have requirements like Netflix (processing Petabytes per day) or LinkedIn (processing trillions of messages), let’s talk about and discuss the best architecture, hardware, and configuration for such a deployment. For anybody else, don’t be worried.

Sub-Myth: “Kafka’s current approach means it can only store ~ 500K partitions per cluster”?

True.

Kafka today does not yet have the best architecture for large-scale deployments with hundreds of thousands of Kafka topics and partitions.

But: Pulsar, too, does not allow for unlimited scale. It just has different limits.

Kafka’s partition limit is imposed by ZooKeeper. Removing ZooKeeper from Kafka through the work in KIP-500 removes this upper bound.

As a side note:

The right design of your architecture is critical for success!

Most of the customers I have seen in trouble with Kafka partition counts and scalability got there because they designed their architecture and applications in the wrong way (they’d run into the same issues if they were using Pulsar)!

Kafka is an event streaming platform, and not the next IBM MQ. If you try to recreate your favorite MQ solution and architecture with Kafka, you will likely fail. I have seen several customers failing here and then succeeding by re-architecting their setup with our help.

Chances are very high that you will not have any issues with partition numbers and scalability, even today with Kafka’s usage of ZooKeeper, if you design your use case right and understand Kafka’s basic concepts. This experience of customers is a common theme for any technology, like Kafka, that introduces a new technology level and paradigm well beyond what was done before (a prime example is the adoption hurdles faced by companies when they first began to move their use cases to the cloud).

Sub-Myth: “Pulsar supports a practically infinite number of partitions”?

False.

BookKeeper has the same 1-file-per-ledger limitation Kafka has, but there are multiple ledgers in one partition. Pulsar’s broker layer groups partitions into bundles, but its storage layer, BookKeeper, stores data in segments with many segments for each partition.

Like for Kafka, the metadata for these segments is stored in ZooKeeper, which imposes a limit on the total number that can be stored. Kafka is removing this dependency, thus allowing it to scale significantly further. I am really looking forward to seeing KIP-500 implemented by around the end of 2020. “Apache Kafka Needs No Keeper: Removing the Apache ZooKeeper Dependency” walks you through the implementation details and planned timelines.

Sub-Myth: “Kafka scaling needs to be defined when creating a Kafka Topic”?

Partly true.

If more scalability is needed, Kafka topics can either be over-partitioned (i.e., you configure a topic with more partitions than you initially need for a use case; see Streams and Tables in Apache Kafka: Topics, Partitions, and Storage Fundamentals), or they can be re-configured to use more partitions if there are requirements to scale in the future. This is not perfect, but a consequence of how distributed event streaming works (and why it scales much better than traditional messaging systems like IBM MQ).

Best practices for creating topics and procedures for changing topic configurations during production are available. So no worries!

But: Pulsar topics have this restriction, too!

Write throughput is based on the number of partitions allocated in a Pulsar topic in the exact same way it is in a Kafka topic, so Pulsar topics must be over-provisioned for exactly the same reasons. That’s because, for each partition, only a single ledger (of the partition’s potentially many ledgers) is writable at the same time. Also, increasing the number of partitions dynamically impacts message ordering just like it does in Kafka (i.e. the message order is lost).
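Why adding partitions later disturbs ordering can be shown with a few lines of Python. The sketch assumes key-based partitioning of the form hash(key) modulo partition count. It uses CRC32 as a stand-in hash (Kafka's default partitioner for keyed records uses murmur2), and the key names and partition counts are arbitrary.

```python
import zlib


def partition_for(key, num_partitions):
    # Stand-in hash for illustration; NOT the hash function Kafka uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"customer-{i}" for i in range(1000)]

# Grow the topic from 6 to 8 partitions and see which keys change partition.
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 8)]

print(f"{len(moved)} of {len(keys)} keys now map to a different partition")
```

Every key that moves has its older messages in one partition and its newer messages in another, so per-key ordering across the change is no longer guaranteed. That is the reason for over-partitioning up front.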

Both Kafka and Pulsar scale like crazy. This is sufficient for almost all use cases!

If you need even more extreme scale, I think a ZooKeeper-free implementation is the best choice. KIP-500 is thus the most anticipated Kafka change I see in the community and in Confluent’s customer base.

Myth 10: “Pulsar recovers from machine failure instantly but Kafka has to reload data”?

True and false.

Killing a Pulsar broker is indeed seamless, but (in contrast to a Kafka broker) the Pulsar broker doesn’t store any data but is only a proxy fronting the actual storage layer, which is BookKeeper. So highlighting that a Pulsar broker failure can easily be resolved is a marketing distraction, because actually one must talk about what happens when a BookKeeper node (a “bookie”) fails.

Killing and restarting a BookKeeper bookie requires the same redistribution of data seen in Kafka’s case. This is the nature of distributed systems, with concepts like replication and partitions.

Elastic Kafka is here already!

Elasticity is important. Confluent’s founder Jay Kreps has recently blogged about this topic: Elastic Apache Kafka Clusters in Confluent Cloud. In a SaaS cloud service like Confluent Cloud, the end user shouldn’t have to care at all about machine failure. 24/7 uptime is expected and should be guaranteed with 99.xx SLAs. Consumption-based pricing (i.e., pay as you go) means you do not have to worry about issues like broker management, sizing broker nodes, expanding or shrinking clusters, etc. under the hood at all.

Self-managed Kafka clusters also need similar capabilities. Tiered Storage for Kafka is huge because most of the data is not stored on the broker anymore, which allows almost instant recovery from failures. In conjunction with tools like Self-Balancing Kafka (a Confluent feature coming in Q3 and discussed in the blog post linked above), users don’t have to worry about elasticity in their self-managed clusters at all.

Unfortunately, if you are looking for such a modern offering for Pulsar, there is none available.

Myth 11: “Pulsar has better Inter-Cluster (Geo) Replication than Kafka”?

False.

Every distributed system has to solve problems like the CAP theorem and quorum in distributed computing. The quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system. A quorum-based technique is implemented to enforce consistent operation in a distributed system.
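As a quick worked example of the quorum definition, assuming a simple majority quorum, the required number of votes and the number of node failures a cluster can survive work out as follows. The function names are illustrative.

```python
def quorum_size(nodes):
    """Smallest number of votes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


for n in (3, 4, 5):
    print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that four nodes tolerate no more failures than three, which is why such ensembles are usually sized with an odd number of nodes.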

Kafka requires ZooKeeper to solve the quorum problem. Even after KIP-500 and ZooKeeper removal, the universal laws of real-world physics are still the same: There are latency issues deploying a distributed system over regions like the US East, Central and West or even globally. That’s because the speed of light, though very high, does have a limit.
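The physical lower bound is easy to estimate. The sketch below assumes that light in optical fiber travels at roughly two thirds of its speed in vacuum and uses made-up, round distances; real networks add routing, queuing and processing delays on top.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
FIBER_FACTOR = 2 / 3  # assumption: signal speed in fiber is roughly 2/3 of c


def min_round_trip_ms(distance_km):
    """Lower bound for one network round trip over the given cable distance."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


# Illustrative distances, not measurements of any specific cloud regions.
for label, km in [("same metro area", 50), ("across a continent", 4_000), ("between continents", 10_000)]:
    print(f"{label}: at least {min_round_trip_ms(km):.1f} ms per round trip")
```

A synchronous quorum write that needs even one such round trip between distant regions pays tens of milliseconds on every operation, no matter which technology is used.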

Various deployment options exist to work around this problem, including real time replication tools like Apache Kafka’s MirrorMaker 2, Confluent’s Replicator or Confluent’s Multi-Region-Clusters. Check out “Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” for various different deployment options and best practices:

There is no single pattern or implementation to provide global replication AND zero downtime + zero data loss! For the most critical applications, Confluent’s Multi-Region-Clusters allows RTO=0 and RPO=0 (i.e. zero downtime and zero data loss) with automatic disaster recovery and client fail-over even if a complete data center or cloud region goes down.

Here, Pulsar’s architecture requires even more complexity than a “basic” Pulsar deployment. That’s because, for geo-replication, Pulsar requires an additional “global” ZooKeeper cluster, which makes Pulsar inappropriate for geo-distribution over large distances. There is a workaround, but the problems around the CAP theorem and physics do not go away.

No matter if you use Kafka or Pulsar, you need a battle-tested design to fight the laws of physics in your global deployments!

Myth 12: “Pulsar is compatible with Kafka’s interface and API”?

Partially True.

Pulsar provides a very basic implementation that is compatible with only minor parts of the Kafka v2.0 protocol.

Pulsar has a converter for basic parts of the Kafka protocol.

So, while alleged “Kafka compatibility” sounds nice on paper, should you seriously consider it for migrating your running Kafka infrastructure to Pulsar? I doubt anyone will take the risk…

We have seen “Kafka compatibility” claims in other examples such as the much more mature Azure Event Hubs service. Check out the limiting factors of their Kafka API, and be surprised! No support for core Kafka features like transactions (and thus exactly-once semantics), compression, or log compaction.

As it is not Kafka under the hood, also expect further diverging and unexpected behaviors when you connect your existing Kafka applications against such a “compatible” setup. No matter if Azure Event Hubs, Pulsar, or any other wrapper.

Kafka vs. Pulsar – Comprehensive Comparison

The last sections explored various technology myths we find in many other blog posts. I think I brought some clarity into these discussions.

Now, let’s not forget to take a look beyond the technical details of Kafka and Pulsar. Non-functional aspects are as important when choosing a technology.

I will cover three critical aspects in the following: Market traction, enterprise support and cloud offerings.

Market Traction of Apache Kafka and Apache Pulsar

Taking a look at Google Trends from the last five years confirms my personal experience: interest in Apache Pulsar is very limited compared to Apache Kafka.

The picture looks very similar when you take a look at Stack Overflow and similar platforms, number and size of supporting vendors, the open ecosystem (tool integrations, wrapper frameworks like Spring Kafka), and similar characteristics for technology trends.

Job openings are another very good indicator of technology adoption. Not many job openings for Pulsar means not many companies are using it. Search in your favorite job search engine. If you search globally, you will find <100 job openings for Pulsar, but thousands of jobs for Kafka. Additionally, most of the ones showing Pulsar say something like “looking for experience with Kafka, Pulsar, Kinesis or similar technologies”.

In most cases, these characteristics are much more relevant for the success of your next project than the subtle technical differences. The key goal is to solve your business problem, isn’t it?

So with the lack of adoption, why is Pulsar coming up in conversations at all? One reason is that independent consulting companies, research analysts, and bloggers (including me) need to talk about new cutting-edge technologies to keep their audience interested… And to be honest, it makes a good story.

Enterprise Support for Kafka and Pulsar

There is enterprise support for Kafka and Pulsar!

Though, the situation is not what you might expect. Here are the vendors you can call and ask for a meeting to discuss the potential next steps for working together on your Pulsar journey:

Streamlio (now acquired by Splunk), the former company behind Apache Pulsar. Splunk did not yet announce a future Pulsar strategy to support people working on their own Pulsar-based projects. Splunk is well-known for their widely-adopted analytics platform. That’s their core business (~ $1.8B in 2019). The only thing people complain about Splunk is the pricing. Splunk is a heavy Kafka user under the hood and now incorporates Pulsar into their Splunk Data Stream Processor (DSP). It is very doubtful that Splunk will jump on the open source bandwagon to support your next standalone Pulsar project (but a broader-scope DSP might be coming, of course). The future will show us…
StreamNative, founded by one of the original developers of Apache Pulsar, provides an event streaming platform based on Pulsar. At the time of writing this in June 2020, StreamNative has 13 (!) employees on LinkedIn. I am not sure if this is the right scale to support your next mission-critical deployment in 2020 but they do offer it.
TIBCO announced support for Pulsar in December 2019. Their core strategy moved from integration to analytics in the last years. TIBCO’s middleware customers are migrating away in high numbers. Their middleware team had to make some desperate strategy decisions: Support other platforms despite having zero contributions to, and no experience with, the projects. You are right, this might be a myth. But hey, a fact is that TIBCO also does the same for Kafka. And here is a nice piece of trivia: TIBCO provides Kafka and ZooKeeper to you on Windows! Something nobody else does – because others know that this is not stable and creates inconsistencies all the time. But hey, TIBCO can support you now with Kafka and Pulsar. Why evaluate these two frameworks if one single vendor allows you to use both? Even on Windows, with .exe download and .bat scripts for starting the server components.

The number of vendors supporting Kafka grows every quarter!

Kafka has reached incredibly broad market adoption by now. The best proof of this is that the biggest software vendors provide support and tools around it. IBM, Oracle, Amazon, Microsoft and many other software companies support Kafka and build integration capabilities and their own products around it.

The latest “wake-up call” for me was at Oracle OpenWorld 2019 in San Francisco, where I attended a roadmap session from the Oracle product manager for GoldenGate (Oracle’s well-known, great, but also very expensive CDC tool). Most of the talk focused on opening GoldenGate to make it the data integration platform for everything. Half the talk was about event streaming, Kafka, and how GoldenGate will provide integration with different databases / data lakes and Kafka in both directions.

Fully-Managed Cloud Offerings for Kafka and Pulsar

Let’s take a look at the cloud offerings available for Kafka and Pulsar.

There is a cloud service available for Apache Pulsar. It has a very innovative name:

Kafkaesque.

No kidding. Check the link… [Update: On ~June 17th, they rebranded the service: KAFKAESQUE is now KESQUE – probably they realized how embarrassing the name was.]

You might also check out the various cloud offerings for Apache Kafka to find out which one fits you better:

Confluent Cloud (SaaS) is a fully-managed service providing consumption-based pricing, 24/7 SLAs and elastic, serverless characteristics for Apache Kafka and its ecosystem (e.g. Schema Registry, Kafka Connect connectors and ksqlDB for stream processing).
Amazon MSK (PaaS) provisions ZooKeeper and Kafka brokers so that the end user can operate them, fix bugs, do rolling upgrades, etc. One important fact everybody should be aware of: AWS excludes Kafka issues from its 99.95% SLA and support!
Azure Event Hubs (SaaS) provides a Kafka endpoint (with a proprietary implementation under the hood) to interact with Kafka applications. It is very scalable and performant. As it is not really Kafka, but just an emulation, it misses several core features of Kafka like exactly-once semantics, log compaction, and compression. Not to mention the surrounding capabilities like Kafka Connect and Kafka Streams.
Big Blue (IBM) and Big Red (Oracle) have cloud offerings around Kafka and its APIs. I have no idea if anyone is using them or how good they are. I have never seen them in the wild myself.
Plenty of smaller players like Aiven, CloudKarafka, Instaclustr, and others.
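
To make "a Kafka endpoint" concrete: from the client's point of view, switching between such services is mostly a matter of connection properties. Here is a minimal sketch in Python, assuming librdkafka-style property names (as used by the confluent-kafka client). The namespace, bootstrap address, API key, and secret are hypothetical placeholders, the helper function names are my own, and nothing here connects to any service.

```python
def event_hubs_kafka_config(namespace: str, connection_string: str) -> dict:
    """Client properties for the Kafka endpoint of an Event Hubs namespace."""
    return {
        # Event Hubs exposes its Kafka endpoint on port 9093 of the namespace host.
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # The user name is the literal string "$ConnectionString";
        # the password is the namespace connection string itself.
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }


def confluent_cloud_config(bootstrap_servers: str, api_key: str, api_secret: str) -> dict:
    """Client properties for a Confluent Cloud cluster (API key and secret as credentials)."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.username": api_key,
        "sasl.password": api_secret,
    }


def differing_keys(a: dict, b: dict) -> list:
    """Return the sorted property names whose values differ between two configs."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    eh = event_hubs_kafka_config("my-namespace", "Endpoint=sb://my-namespace.example/")
    cc = confluent_cloud_config("pkc-xxxxx.example.confluent.cloud:9092", "my-key", "my-secret")
    print(differing_keys(eh, cc))
```

Only the endpoint and the credentials change. What such a configuration cannot tell you is which Kafka features the service behind the endpoint actually implements, which is exactly where the offerings above differ.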

As you can see, the current cloud offerings show relatively clearly what the market adoption of Kafka and Pulsar looks like.

Conclusion – Apache Kafka or Apache Pulsar?

TL;DR: Pulsar is still a long way from Kafka’s level of maturity in terms of being proven for high scale use cases and building a community.

You should also question whether Pulsar is actually better.

Evaluate Kafka and Pulsar if you are going the purely open source way. Find out which fits you best. In your evaluation, include the technical feature set, maturity, vendors, developer community, and other relevant factors. Which one fits your situation best?

If you need an enterprise solution that covers much more than what either of these two open source systems offers, Kafka is the only option: Choose a Kafka-based offering from one of the various vendors or a suitable cloud offering. Pulsar, unfortunately, is not ready for this today or in the foreseeable future.

What do you think about Apache Kafka vs. Apache Pulsar? What is your strategy? Let’s connect on LinkedIn and discuss! Stay informed about new blog posts by subscribing to my newsletter.

The post Pulsar vs Kafka – Comparison and Myths Explored appeared first on Kai Waehner.