Amazon MSK Archives - Kai Waehner https://www.kai-waehner.de/blog/category/amazon-msk/ Technology Evangelist - Big Data Analytics - Middleware - Apache Kafka

The Importance of Focus: Why Software Vendors Should Specialize Instead of Doing Everything (Example: Data Streaming) https://www.kai-waehner.de/blog/2025/04/07/the-importance-of-focus-why-software-vendors-should-specialize-instead-of-doing-everything-example-data-streaming/ Mon, 07 Apr 2025 03:31:55 +0000 As real-time technologies reshape IT architectures, software vendors face a critical decision: specialize deeply in one domain or build a broad, general-purpose stack. This blog examines why a focused approach—particularly in the world of data streaming—delivers greater innovation, scalability, and reliability. It compares leading platforms and strategies, from specialized providers like Confluent to generalist cloud ecosystems, and highlights the operational risks of fragmented tools. With data streaming emerging as its own software category, enterprises need clarity, consistency, and deep expertise. In this post, we argue that specialization—not breadth—is what powers mission-critical, real-time applications at global scale.

The post The Importance of Focus: Why Software Vendors Should Specialize Instead of Doing Everything (Example: Data Streaming) appeared first on Kai Waehner.

As technology landscapes evolve, software vendors must decide whether to specialize in a core area or offer a broad suite of services. Some companies take a highly focused approach, investing deeply in a specific technology, while others attempt to cover multiple use cases by integrating various tools and frameworks. Both strategies have trade-offs, but history has shown that specialization leads to deeper innovation, better performance, and stronger customer trust. This blog explores why focus matters in the context of data streaming software, the challenges of trying to do everything, and how companies that prioritize one thing—data streaming—can build best-in-class solutions that work everywhere.

The Importance of Focus for Software and Cloud Vendors - Data Streaming with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including customer stories across all industries.

Specialization vs. Generalization: Why Data Streaming Requires a Focused Approach

Data streaming enables real-time processing of continuous data flows, allowing businesses to act instantly rather than relying on batch updates. This shift from traditional databases and APIs to event-driven architectures has become essential for modern IT landscapes.
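To make the contrast with batch updates concrete, here is a minimal, self-contained Python sketch. It simulates a continuous event source with a generator (a stand-in for a real Kafka topic — no broker involved) and handles each event the moment it arrives instead of waiting for a periodic batch job:

```python
import time
from typing import Iterator

def event_stream() -> Iterator[dict]:
    """Simulate a continuous stream of payment events (a stand-in for a Kafka topic)."""
    for i in range(5):
        yield {"order_id": i, "amount": 20.0 + i, "ts": time.time()}

def process(event: dict) -> str:
    # React as soon as the event arrives instead of waiting for a nightly batch run.
    return f"order {event['order_id']}: charged {event['amount']:.2f}"

# Each event is consumed and processed individually, as it is produced.
results = [process(e) for e in event_stream()]
print(results[0])  # order 0: charged 20.00
```

In a real event-driven architecture, the generator is replaced by a consumer subscribed to a topic, and the processing logic runs continuously, 24/7.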

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Data streaming is no longer just a technique—it is a new software category. The 2023 Forrester Wave for Streaming Data Platforms confirms its role as a core component of scalable, real-time architectures. Technologies like Apache Kafka and Apache Flink have become industry standards. They power cloud, hybrid, and on-premise environments for real-time data movement and analytics.

Businesses increasingly adopt streaming-first architectures, focusing on:

  • Hybrid and multi-cloud streaming for real-time edge-to-cloud integration
  • AI-driven analytics powered by continuous optimization and inference using machine learning models
  • Streaming data contracts to ensure governance and reliability across the entire data pipeline
  • Converging operational and analytical workloads to replace inefficient batch processing and Lambda architecture with multiple data pipelines
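Streaming data contracts deserve a brief illustration. In practice they are usually expressed with schema formats like Avro or JSON Schema and enforced via a schema registry; the pure-Python sketch below (with a hypothetical `CONTRACT` definition) only illustrates the core idea of rejecting events that violate the agreed shape:

```python
# A data contract declares the shape every event on a topic must satisfy.
CONTRACT = {
    "order_id": int,
    "amount": float,
    "currency": str,
}

def validate(event: dict, contract: dict) -> list:
    """Return a list of contract violations (an empty list means the event is valid)."""
    errors = []
    for field, expected_type in contract.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

assert validate({"order_id": 1, "amount": 9.99, "currency": "EUR"}, CONTRACT) == []
assert validate({"order_id": "1"}, CONTRACT) == [
    "order_id: expected int",
    "missing field: amount",
    "missing field: currency",
]
```

Enforcing such checks at produce time keeps bad data out of the pipeline instead of letting every downstream consumer discover it independently.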

The Data Streaming Landscape

As data streaming becomes a core part of modern IT, businesses must choose the right approach: adopt a purpose-built data streaming platform or piece together multiple tools with limitations. Event-driven architectures demand scalability, low latency, cost efficiency, and strict SLAs to ensure real-time data processing meets business needs.

Some solutions may be “good enough” for specific use cases, but they often lack the performance, reliability, and flexibility required for large-scale, mission-critical applications.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

The Data Streaming Landscape highlights the differences—while some vendors provide basic capabilities, others offer a complete Data Streaming Platform (DSP) designed to handle complex, high-throughput workloads with enterprise-grade security, governance, and real-time analytics. Choosing the right platform is essential for staying competitive in an increasingly data-driven world.

The Challenge of Doing Everything

Many software vendors and cloud providers attempt to build a comprehensive technology stack, covering everything from data lakes and AI to real-time data streaming. While this offers customers flexibility, it often leads to overlapping services, inconsistent long-term investment, and complexity in adoption.

Here are a few examples, seen from the perspective of data streaming solutions.

Amazon AWS: Multiple Data Streaming Services, Multiple Choices

AWS has built the most extensive cloud ecosystem, offering services for nearly every aspect of modern IT, including data lakes, AI, analytics, and real-time data streaming. While this breadth provides flexibility, it also leads to overlapping services, evolving strategies, and complex decision-making for customers, often resulting in solution ambiguity.

Amazon provides several options for real-time data streaming and event processing, each with different capabilities:

  • Amazon SQS (Simple Queue Service): One of AWS’s oldest and most widely adopted messaging services. It’s reliable for basic decoupling and asynchronous workloads, but it lacks native support for real-time stream processing, ordering, replayability, and event-time semantics.
  • Amazon Kinesis Data Streams: A managed service for real-time data ingestion and simple event processing, but lacks the full event streaming capabilities of a complete data streaming platform.
  • Amazon MSK (Managed Streaming for Apache Kafka): A partially managed Kafka service that mainly focuses on Kafka infrastructure management. It leaves customers to handle critical operational support (MSK does NOT provide SLAs or support for Kafka itself) and misses capabilities such as stream processing, schema management, and governance.
  • AWS Glue Streaming ETL: A stream processing service built for data transformations but not designed for high-throughput, real-time event streaming.
  • Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics): AWS’s fully managed Apache Flink service for real-time event processing, competing directly with other Flink offerings.

Each of these services targets different real-time use cases, but they lack a unified, end-to-end data streaming platform. Customers must decide which combination of AWS services to use, increasing integration complexity, operational overhead, and costs.

Strategy Shift and Rebranding with Multiple Product Portfolios

AWS has introduced, rebranded, and developed its real-time streaming services over time:

  • Kinesis Data Analytics was originally AWS’s solution for stream processing but was later rebranded as Amazon Managed Service for Apache Flink, acknowledging Flink’s dominance in modern stream processing.
  • MSK Serverless was introduced to simplify Kafka adoption but also introduces various additional product limitations and cost challenges.
  • AWS Glue Streaming ETL overlaps with Flink’s capabilities, adding confusion about the best choice for real-time data transformations.

As AWS expands its cloud-native services, customers must navigate a complex mix of technologies—often requiring third-party solutions to fill gaps—while assessing whether AWS’s flexible but fragmented approach meets their real-time data streaming needs or if a specialized, fully integrated platform is a better fit.

Google Cloud: Multiple Approaches to Streaming Analytics

Google Cloud is known for its powerful analytics and AI/ML tools, but its strategy in real-time stream processing has been inconsistent:

Customers looking for stream processing in Google Cloud now have four competing services:

  • Google Managed Service for Apache Kafka (Google MSK): a managed Kafka offering that is still at a very early stage of the maturity curve and has many limitations
  • Google Dataflow (built on Apache Beam)
  • Google Pub/Sub (event messaging)
  • Apache Flink on Dataproc (a managed service)

While each of these services has its use cases, they introduce complexity for customers who must decide which option is best for their workloads.

BigQuery Flink was introduced to extend Google’s analytics capabilities into real-time processing but was discontinued before it ever left preview.

Microsoft Azure: Shifting Strategies in Data Streaming

Microsoft Azure has taken multiple approaches to real-time data streaming and analytics, with an evolving strategy that integrates various tools and services.

  • Azure Event Hubs has been a core event streaming service within Azure, designed for high-throughput data ingestion. It supports the Apache Kafka protocol (only through Kafka version 3.0, so its feature set lags considerably), making it a flexible choice for some real-time workloads. However, it primarily focuses on event ingestion rather than event storage, data processing, and integration, which are additional capabilities of a complete data streaming platform.
  • Azure Stream Analytics was introduced as a serverless stream processing solution, allowing customers to analyze data in motion. Despite its capabilities, its adoption has remained limited, particularly as enterprises seek more scalable, open-source alternatives like Apache Flink.
  • Microsoft Fabric is now positioned as an all-in-one data platform, integrating business intelligence, data engineering, real-time streaming, and AI. While this brings together multiple analytics tools, it also shifts the focus away from dedicated, specialized solutions like Stream Analytics.

While Microsoft Fabric aims to simplify enterprise data infrastructure, its broad scope means that customers must adapt to yet another new platform rather than continuing to rely on long-standing, specialized services. The combination of Azure Event Hubs, Stream Analytics, and Fabric presents multiple options for stream processing, but also introduces complexity, limitations and increased cost for a combined solution.

Microsoft’s approach highlights the challenge of balancing broad platform integration with long-term stability in real-time streaming technologies. Organizations using Azure must evaluate whether their streaming workloads require deep, specialized solutions or can fit within a broader, integrated analytics ecosystem.

I wrote an entire blog series to demystify what Microsoft Fabric really is.

Instaclustr: Too Many Technologies, Not Enough Depth

Instaclustr has positioned itself as a managed platform provider for a wide array of open-source technologies, including Apache Cassandra, Apache Kafka, Apache Spark, Apache ZooKeeper, OpenSearch, PostgreSQL, Redis, and more. While this broad portfolio offers customers choices, it reflects a horizontal expansion strategy that lacks deep specialization in any one domain.

For organizations seeking help with real-time data streaming, Instaclustr’s Kafka offering may appear to be a viable managed service. However, unlike purpose-built data streaming platforms, Instaclustr’s Kafka solution is just one of many services, with limited investment in stream processing, schema governance, or advanced event-driven architectures.

Because Instaclustr splits its engineering and support resources across so many technologies, customers often face challenges in:

  • Getting deep technical expertise for Kafka-specific issues
  • Relying on long-term roadmaps and support for evolving Kafka features
  • Building integrated event streaming pipelines that require more than basic Kafka infrastructure

This generalist model may be appealing for companies looking for low-cost, basic managed services—but it falls short when mission-critical workloads demand real-time reliability, zero data loss, SLAs, and advanced stream processing capabilities. Without a singular focus, platforms like Instaclustr risk becoming jacks-of-all-trades but masters of none—especially in the demanding world of real-time data streaming.

Cloudera: A Broad Portfolio Without a Clear Focus

Cloudera has adopted a distinct strategy by incorporating various open-source frameworks into its platform, including:

  • Apache Kafka (event streaming)
  • Apache Flink (stream processing)
  • Apache Iceberg (data lake table format)
  • Apache Hadoop (big data storage and batch processing)
  • Apache Hive (SQL querying)
  • Apache Spark (batch and near real-time processing and analytics)
  • Apache NiFi (data flow management)
  • Apache HBase (NoSQL database)
  • Apache Impala (real-time SQL engine)
  • Apache Pulsar (event streaming, via a partnership with StreamNative)

While this provides flexibility, it also introduces significant complexity:

  • Customers must determine which tools to use for specific workloads.
  • Integration between different components is not always seamless.
  • The broad scope makes it difficult to maintain deep expertise in each area.

Rather than focusing on one core area, Cloudera’s strategy appears to be adding whatever is trending in open source, which can create challenges in long-term support and roadmap clarity.

Splunk: Repeated Attempts at Data Streaming

Splunk, known for log analytics, has tried multiple times to enter the data streaming market:

  • Initially, Splunk built a proprietary streaming solution that never gained widespread adoption.
  • Later, Splunk acquired Streamlio to leverage Apache Pulsar as its streaming backbone. This Pulsar-based strategy ultimately failed, leaving Splunk without a clear real-time streaming offering.

Splunk’s challenges highlight a key lesson: successful data streaming requires long-term investment and specialization, not just acquisitions or technology integrations.

Why a Focused Approach Works Better for Data Streaming

Some vendors take a more specialized approach, focusing on one core capability and doing it better than anyone else. For data streaming, Confluent became the leader in this space by focusing on executing the vision of a complete data streaming platform.

Confluent: Focused on Data Streaming, Built for Everywhere

At Confluent, the focus is clear: real-time data streaming. Unlike many other vendors and the cloud providers that offer fragmented or overlapping services, Confluent specializes in one thing and ensures it works everywhere:

  • Cloud: Deploy across AWS, Azure, and Google Cloud with deep native integrations.
  • On-Premise: Enterprise-grade deployments with full control over infrastructure.
  • Edge Computing: Real-time streaming at the edge for IoT, manufacturing, and remote environments.
  • Hybrid Cloud: Seamless data streaming across edge, on-prem, and cloud environments.
  • Multi-Region: Built-in disaster recovery and globally distributed architectures.

More Than Just “The Kafka Company”

While Confluent is often recognized as “the Kafka company,” it has grown far beyond that. Today, Confluent is a complete data streaming platform, combining Apache Kafka for event streaming, Apache Flink for stream processing, and many additional components for data integration, governance and security to power critical workloads.

However, Confluent remains laser-focused on data streaming—it does NOT compete with BI, AI model training, search platforms, or databases. Instead, it integrates and partners with best-in-class solutions in these domains to ensure businesses can seamlessly connect, process, and analyze real-time data within their broader IT ecosystem.

The Right Data Streaming Platform for Every Use Case

Confluent is not just one product—it matches the specific needs, SLAs, and cost considerations of different streaming workloads:

  • Fully Managed Cloud (SaaS)
    • Dedicated and multi-tenant Enterprise Clusters: Low latency, strict SLAs for mission-critical workloads.
    • Freight Clusters: Optimized for high-volume, relaxed latency requirements.
  • Bring Your Own Cloud (BYOC)
    • WarpStream: Bring Your Own Cloud for flexibility and cost efficiency.
  • Self-Managed
    • Confluent Platform: Deploy anywhere—customer cloud VPC, on-premise, at the edge, or across multi-region environments.

Confluent is built for organizations that require more than just “some” data streaming—it is for businesses that need a scalable, reliable, and deeply integrated event-driven architecture. Whether operating in a cloud, hybrid, or on-premise environment, Confluent ensures real-time data can be moved, processed, and analyzed seamlessly across the enterprise.

By focusing only on data streaming, Confluent ensures seamless integration with best-in-class solutions across both operational and analytical workloads. Instead of competing across multiple domains, Confluent partners with industry leaders to provide a best-of-breed architecture that avoids the trade-offs of an all-in-one compromise.

Deep Integrations Across Key Ecosystems

A purpose-built data streaming platform plays well with cloud providers and other data platforms. A few examples:

  • Cloud Providers (AWS, Azure, Google Cloud): While all major cloud providers offer some data streaming capabilities, Confluent takes a different approach by deeply integrating into their ecosystems. Confluent’s managed services can be:
    • Consumed via cloud credits through the cloud provider marketplace
    • Integrated natively into cloud provider’s security and networking services
    • Fully-managed out-of-the-box connectivity to cloud provider services like object storage, lakehouses, and databases
  • MongoDB: A leader in NoSQL and operational workloads, MongoDB integrates with Confluent via Kafka-based change data capture (CDC), enabling real-time event streaming between transactional databases and event-driven applications.
  • Databricks: A powerhouse in AI and analytics, Databricks integrates bi-directionally with Confluent via Kafka and Apache Spark, or object storage and the open table format from Iceberg / Delta Lake via Tableflow. This enables businesses to stream data for AI model training in Databricks and perform real-time model inference directly within the streaming platform.
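The MongoDB integration mentioned above works by streaming change data capture (CDC) events through Kafka. CDC connectors typically emit change events in a Debezium-style envelope with `before`/`after` images and an operation code; the sketch below (simplified — real events carry additional schema and source metadata) shows how a consumer could apply such events to a local materialized view:

```python
import json

# A simplified change event in the Debezium-style envelope format
# (real CDC events carry additional schema and source metadata).
raw = json.dumps({
    "payload": {
        "op": "u",  # c = create, u = update, d = delete
        "before": {"_id": "42", "status": "pending"},
        "after":  {"_id": "42", "status": "shipped"},
    }
})

def apply_change(event_json: str, state: dict) -> dict:
    """Apply one CDC event to a local materialized view keyed by _id."""
    payload = json.loads(event_json)["payload"]
    if payload["op"] == "d":
        state.pop(payload["before"]["_id"], None)  # row was deleted upstream
    else:  # create, update, or snapshot read: upsert the new row image
        row = payload["after"]
        state[row["_id"]] = row
    return state

view = apply_change(raw, {})
print(view)  # {'42': {'_id': '42', 'status': 'shipped'}}
```

This pattern lets event-driven applications keep read models in sync with a transactional database in real time, without polling it.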

Rather than attempting to own the entire data stack, Confluent specializes in data streaming and integrates seamlessly with the best cloud, AI, and database solutions.

Beyond the Leader: Specialized Vendors Shaping Data Streaming

Confluent is not alone in recognizing the power of focus. A handful of other vendors have also chosen to specialize in data streaming—each with their own vision, strengths, and approaches.

WarpStream, recently acquired by Confluent, is a Kafka-compatible infrastructure solution designed for Bring Your Own Cloud (BYOC) environments. It re-architects Kafka by running the protocol directly on cloud object storage like Amazon S3, removing the need for traditional brokers or persistent compute. This model dramatically reduces operational complexity and cost—especially for high-ingest, elastic workloads. While WarpStream is now part of the Confluent portfolio, it remains a distinct offering focused on lightweight, cost-efficient Kafka infrastructure.

StreamNative is the commercial steward of Apache Pulsar, aiming to provide a unified messaging and streaming platform. Built for multi-tenancy and geo-replication, it offers some architectural differentiators, particularly in use cases where separation of compute and storage is a must. However, adoption remains niche, and the surrounding ecosystem still lacks maturity and standardization.

Redpanda positions itself as a Kafka-compatible alternative with a focus on performance, especially in low-latency and resource-constrained environments. Its C++ foundation and single-binary architecture make it appealing for edge and latency-sensitive workloads. Yet, Redpanda still needs to mature in areas like stream processing, integrations, and ecosystem support to serve as a true platform.

AutoMQ re-architects Apache Kafka for the cloud by separating compute and storage using object storage like S3. It aims to simplify operations and reduce costs for high-throughput workloads. Though fully Kafka-compatible, AutoMQ concentrates on infrastructure optimization and currently lacks broader platform capabilities like governance, processing, or hybrid deployment support.

Bufstream is experimenting with lightweight approaches to real-time data movement using modern developer tooling and APIs. While promising in niche developer-first scenarios, it has yet to demonstrate scalability, production maturity, or a robust ecosystem around complex stream processing and governance.

Ververica focuses on stream processing with Apache Flink. It offers Ververica Platform to manage Flink deployments at scale, especially on Kubernetes. While it brings deep expertise in Flink operations, it does not provide a full data streaming platform and must be paired with other components, like Kafka for ingestion and delivery.

Great Ideas Are Born From Market Pressure

Each of these companies brings interesting ideas to the space. But building and scaling a complete, enterprise-grade data streaming platform is no small feat. It requires not just infrastructure, but capabilities for processing, governance, security, global scale, and integrations across complex environments.

That’s where Confluent continues to lead—by combining deep technical expertise, a relentless focus on one problem space, and the ability to deliver a full platform experience across cloud, on-prem, and hybrid deployments.

In the long run, the data streaming market will reward not just technical innovation, but consistency, trust, and end-to-end excellence. For now, the message is clear: specialization matters—but execution matters even more. Let’s see where the others go.

How Customers Benefit from Specialization

A well-defined focus provides several advantages for customers, ensuring they get the right tool for each job without the complexity of navigating overlapping services.

  • Clarity in technology selection: No need to evaluate multiple competing services; purpose-built solutions ensure the right tool for each use case.
  • Deep technical investment: Continuous innovation focused on solving specific challenges rather than spreading resources thin.
  • Predictable long-term roadmap: Stability and reliability with no sudden service retirements or shifting priorities.
  • Better performance and reliability: Architectures optimized for the right workloads through the deep experience in the software category.
  • Seamless ecosystem integration: Works natively with leading cloud providers and other data platforms for a best-of-breed approach.
  • Deployment flexibility: Not bound to a single environment like one cloud provider; businesses can run workloads on-premise, in any cloud, at the edge, or across hybrid environments.

Rather than adopting a broad but shallow set of solutions, businesses can achieve stronger outcomes by choosing vendors that specialize in one core competency and deliver it everywhere.

Why Deep Expertise Matters: Supporting 24/7, Mission-Critical Data Streaming

For mission-critical workloads, where downtime, data loss, and compliance failures are not an option, deep expertise is not just an advantage; it is a necessity.

Data streaming is a high-performance, real-time infrastructure that requires continuous reliability, strict SLAs, and rapid response to critical issues. When something goes wrong at the core of an event-driven architecture—whether in Apache Kafka, Apache Flink, or the surrounding ecosystem—only specialized vendors with proven expertise can ensure immediate, effective solutions.

The Challenge with Generalist Cloud Services

Many cloud providers offer some level of data streaming, but their approach is different from a dedicated data streaming platform. Take Amazon MSK as an example:

  • Amazon MSK provides managed Kafka clusters, but does NOT offer Kafka support itself. If an issue arises deep within Kafka, customers are responsible for troubleshooting it—or must find external experts to resolve the problem.
  • The terms and conditions of Amazon MSK explicitly exclude Kafka support, meaning that, for mission-critical applications requiring uptime guarantees, compliance, and regulatory alignment, MSK is not a viable choice.
  • This lack of core Kafka support poses a serious risk for enterprises relying on event streaming for financial transactions, real-time analytics, AI inference, fraud detection, and other high-stakes applications.

For companies that cannot afford failure, a data streaming vendor with direct expertise in the underlying technology is essential.

Why Specialized Vendors Are Essential for Mission-Critical Workloads

A complete data streaming platform is much more than a hosted Kafka cluster or a managed Flink service. Specialized vendors like Confluent offer end-to-end operational expertise, covering:

  • 24/7 Critical Support: Direct access to Kafka and Flink experts, ensuring immediate troubleshooting for core-level issues.
  • Guaranteed SLAs: Strict uptime commitments, ensuring that mission-critical applications are always running.
  • No Data Loss Architecture: Built-in replication, failover, and durability to prevent business-critical data loss.
  • Security & Compliance: Encryption, access control, and governance features designed for regulated industries.
  • Professional Services & Advisory: Best practices, architecture reviews, and operational guidance tailored for real-time streaming at scale.
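To make the "no data loss" point tangible, here is a sketch of the typical Apache Kafka producer and topic settings behind such an architecture (shown as plain config dicts, not tied to any specific client library):

```python
# Typical Apache Kafka settings behind a "no data loss" configuration.
# Producer side: every write must be acknowledged by all in-sync replicas.
producer_config = {
    "acks": "all",                # wait for all in-sync replicas, not just the leader
    "enable.idempotence": True,   # retries cannot create duplicate records
    "retries": 2147483647,        # retry transient failures instead of dropping events
}

# Topic side: keep multiple copies and refuse writes if too few replicas are in sync.
topic_config = {
    "replication.factor": 3,
    "min.insync.replicas": 2,     # with acks=all, tolerates one broker failure
}

# A write is only acknowledged once at least min.insync.replicas brokers have
# persisted it, so a single broker crash never loses acknowledged data.
assert topic_config["replication.factor"] > topic_config["min.insync.replicas"]
```

Getting these settings right — and knowing how they interact with retention, failover, and client behavior — is exactly the kind of operational depth a specialized vendor provides.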

This level of deep, continuous investment in operational excellence separates a general-purpose cloud service from a true data streaming platform.

The Power of Specialization: Deep Expertise Beats Broad Offerings

Software vendors will continue expanding their offerings, integrating new technologies, and launching new services. However, focus remains a key differentiator in delivering best-in-class solutions, especially for operational systems with critical SLAs—where low latency, 24/7 uptime, no data loss, and real-time reliability are non-negotiable.

For companies investing in strategic data architectures, choosing a vendor with deep expertise in one core technology—rather than one that spreads across multiple domains—ensures stability, predictable performance, and long-term success.

In a rapidly evolving technology landscape, clarity, specialization, and seamless integration are the foundations of lasting innovation. Businesses that prioritize proven, mission-critical solutions will be better equipped to handle the demands of real-time, event-driven architectures at scale.

How do you see the world of software? Is it better to specialize or to become an all-rounder? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. And download my free book about data streaming use cases.


The Data Streaming Landscape 2025 https://www.kai-waehner.de/blog/2024/12/04/the-data-streaming-landscape-2025/ Wed, 04 Dec 2024 13:49:37 +0000 https://www.kai-waehner.de/?p=6909 Data streaming is a new software category. It has grown from niche adoption to becoming a fundamental part of modern data architecture, leveraging open source technologies like Apache Kafka and Flink. With real-time data processing transforming industries, the ecosystem of tools, platforms, and cloud services has evolved significantly. This blog post explores the data streaming landscape of 2025, analyzing key players, trends, and market dynamics shaping this space.

The post The Data Streaming Landscape 2025 appeared first on Kai Waehner.

Data streaming is a new software category. It has grown from niche adoption to becoming a fundamental part of modern data architecture. With real-time data processing transforming industries, the ecosystem of tools, platforms, and services has evolved significantly. This blog post explores the data streaming landscape of 2025, analyzing key players, trends, and market dynamics shaping this space.

The data streaming landscape of 2025 categorizes solutions by their adoption and completeness as fully-featured data streaming platforms, as well as their deployment models, which range from self-managed setups to BYOC (Bring Your Own Cloud) and fully managed cloud services like PaaS and SaaS. While Apache Kafka remains the backbone of this ecosystem, the landscape also includes stream processing engines like Apache Flink, competitive technologies such as Apache Pulsar, and Kafka-compatible alternatives such as Redpanda that are built on the Kafka protocol.

This blog also explores the latest market trends and provides an outlook for 2025 and beyond, highlighting potential new entrants and evolving use cases. By the end, you’ll gain a clear understanding of the data streaming platform landscape and its trajectory in the years to come.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free ebook about data streaming use cases and industry-specific success stories.

Data Streaming in 2025: The Rise of a New Software Category

Real-time data beats slow data. That’s true across almost all use cases in any industry. Event-driven applications powered by data streaming continuously process data from any data source. This approach increases business value by increasing revenue, reducing cost, reducing risk, or improving the customer experience. An event-driven architecture also keeps the overall system future-ready.

Even top researchers and advisory firms such as Forrester, Gartner, and IDG recognize data streaming as a new software category. In December 2023, Forrester released “The Forrester Wave™: Streaming Data Platforms, Q4 2023,” highlighting Microsoft, Google, and Confluent as leaders, followed by Oracle, Amazon, and Cloudera.

Data Streaming is NOT just another data integration tool. Plenty of software categories and related data platforms exist to process and analyze data. I explored how data streaming differs in a few dedicated blog series.

The Business Value of Data Streaming

A new software category opens use cases and adds business value across all industries:

Use Cases for Data Streaming with Apache Kafka by Business Value
Source: Lyndon Hedderly (Confluent)

Adding business value is crucial for any enterprise. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products.

Search my blog for your favorite industry to find plenty of case studies and architectures. Or read about use cases for Apache Kafka across industries to get started.

The Data Streaming Landscape of 2025

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category. Several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Various SaaS startups have emerged in this category in the last few years.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

It all began with the open-source framework Apache Kafka, and today, the Kafka protocol is widely adopted across various implementations, including proprietary ones. What truly matters now is leveraging the capabilities of a complete data streaming platform—one that is fully compatible with the Kafka protocol. This includes built-in features like connectors, stream processing, security, data governance, and the elimination of self-management, reducing risks and operational effort.

The Kafka Protocol is the De Facto Standard of Data Streaming

Most software vendors use Kafka (or its protocol) at the core of their data streaming platforms. Apache Kafka has become the de facto standard for data streaming.

Apache Kafka is the de facto standard API for data streaming in 2024

Additionally, “benchmarketing” (i.e., picking a sweet spot or niche scenario where you perform better than your competitor) is a favorite marketing technique to “prove” differentiators against the real Apache Kafka. Some vendors also present misleading cost-efficiency comparisons by excluding critical cloud costs such as data transfer or storage, giving an incomplete picture of the true expenses.

Apache Kafka vs. Data Streaming Platform

Many still use Kafka merely as a dumb ingestion pipeline, overlooking its potential to power sophisticated, real-time data streaming use cases. One reason is that Kafka alone lacks the full capabilities of a comprehensive data streaming platform.

A complete solution requires more than “just” Kafka. Apache Flink is becoming the de facto standard for stream processing. Data integration capabilities (connectors, clients, APIs), data governance, security, and critical 24/7 SLAs and support are important for many data streaming projects.
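The difference between a dumb ingestion pipe and stateful stream processing can be illustrated with a minimal sketch in plain Python (not Flink itself; the event times, keys, and window size are made up for illustration). A stateless pipe would just forward each event unchanged, while a stateful operator keeps per-key counts within tumbling time windows:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size_s=60):
    """Stateful stream processing sketch: count events per key in
    tumbling time windows. A dumb ingestion pipe would only forward
    each event; here we keep state (the open window's counts).
    Events are (timestamp_seconds, key) tuples in timestamp order."""
    state = defaultdict(int)   # per-key counts for the open window
    window_start = None
    for ts, key in events:
        if window_start is None:
            window_start = ts - (ts % window_size_s)
        # close the current window (and any empty ones) once an
        # event arrives past its end, emitting the aggregated state
        while ts >= window_start + window_size_s:
            if state:
                yield (window_start, dict(state))
                state = defaultdict(int)
            window_start += window_size_s
        state[key] += 1
    # flush the last open window at end of stream
    if state:
        yield (window_start, dict(state))
```

Engines like Kafka Streams and Apache Flink implement the same concept with fault-tolerant, distributed state and watermark handling for out-of-order events, which this toy version deliberately omits.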

The Data Streaming Landscape 2025 summarizes the current status of relevant products and cloud services, focusing on deployment models and the adoption/completeness of the data streaming platforms.

Data Streaming Vendors and Categories for the 2025 Landscape

The data streaming landscape changed this year. As most solutions evolve, I no longer distinguish between Kafka, non-Kafka, and stream processing as categories. Instead, I look at adoption and completeness to assess the maturity of a data streaming solution, from an open-source framework to a complete platform.

The deployment models also changed in the 2025 landscape. Instead of categorizing it into Self Managed, Partially Managed, and Fully Managed, I sort as follows: Self Managed, Bring Your Own Cloud (BYOC), and Cloud. The Cloud category is separated into PaaS (Platform as a Service) and SaaS (Software as a Service) to indicate that many Kafka cloud offerings are still NOT fully managed!

Please note: Intentionally, this data streaming landscape is not a complete list of frameworks, cloud services, or vendors. It is also not official research, and there is no statistical evidence. You might miss your favorite technology in this diagram; if so, I did not see it in my conversations with customers, prospects, partners, analysts, or the broader data streaming community.

Also, note that I focus on general data streaming infrastructure. Brilliant solutions exist for using and analyzing streaming data for specific scenarios, like time-series databases, machine learning engines, observability platforms, or purpose-built IoT solutions. These are usually complementary, often connected out of the box via a Kafka connector, or even built on top of a data streaming platform (invisible for the end user).

Adoption and Completeness of Data Streaming (X-Axis)

Data streaming is adopted more and more across all industries. The concept is not new. In “The Past, Present and Future of Stream Processing“, I explored how the data streaming journey started decades ago with research and the first purpose-built proprietary products for specific use cases like stock trading.

Open-source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Does anyone still remember (or even still use) Apache Storm? 🙂

Today, most enterprises are realizing the value of data streaming for both analytical and operational use cases across industries. The cloud has brought a transformative shift, enabling businesses to start streaming and processing data with just a click, using fully managed SaaS solutions and intuitive UIs. Complete data streaming platforms now offer many built-in features that users previously had to develop themselves, including connectors, encryption, access control, governance, data sharing, and more.

Capabilities of a Complete Data Streaming Platform

Data streaming vendors are on the way to building a complete Data Streaming Platform (DSP). Capabilities include:

  • Messaging (“Streaming”): Transfer messages in real-time and persist for durability, decoupling, and slow consumers (near real-time, batch, API, file).
  • Data Integration: Connect to any legacy and cloud-native sources and sinks.
  • Stream Processing: Correlate events with stateless and stateful transformation or business logic.
  • Data Governance:  Ensure security, observability, data sovereignty, and compliance.
  • Developer Tooling: Enable flexibility for different personas such as software engineers, data scientists, and business analysts by providing different APIs (such as Java, Python, SQL, REST/HTTP), graphical user interfaces, and dashboards.
  • Operations Tooling and SaaS: Ease infrastructure management on premise, or take over the entire operations burden in the public cloud with serverless offerings.
  • Uptime SLAs and Support: Provide the required guarantees and expertise for critical use cases.
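At the core of the messaging capability above is a durable, ordered log from which each consumer reads at its own pace. The following toy in-memory model (a deliberate simplification, not Kafka's actual implementation) shows how independent consumer offsets decouple fast real-time consumers from slow batch consumers:

```python
class Log:
    """Toy append-only log modeling the messaging capability:
    messages are persisted in order, and each consumer tracks its
    own offset, so slow or batch consumers never block fast ones."""

    def __init__(self):
        self.records = []   # durable, ordered message store
        self.offsets = {}   # consumer name -> next offset to read

    def produce(self, message):
        self.records.append(message)

    def consume(self, consumer, max_records=10):
        """Return the next batch for this consumer and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch
```

In Kafka, the same idea is partitioned and replicated across brokers, with offsets tracked per consumer group, which is what enables the decoupling of near real-time, batch, API, and file-based consumers.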

Evolution from Open Source Adoption to a Data Streaming Organization

Modern data streaming is not just about adopting a product; it’s about transforming the way organizations operate and derive value from their data. Hence, the adoption goes beyond product features:

  • From open source and self-operations to enterprise-grade products and SaaS.
  • From human scale to automated, serverless elasticity with consumption-based pricing.
  • From dumb ingestion pipes to complex data pipelines and business applications.
  • From analytical workloads to critical transactional (and analytical) use cases.
  • From a single data streaming cluster to a powerful edge, hybrid, and multi-cloud architecture, including integration, migration, aggregation, and disaster recovery scenarios.
  • From wild adoption across business units with uncontrolled growth using various frameworks, cloud services, and integration tools to a center of excellence (CoE) with a strategic approach with standards, best practices, and knowledge sharing in an internal community.
  • From effortful and complex human management to enterprise-wide data governance, automation, and self-service APIs.

Data Streaming Deployment Models: Self-Managed vs. BYOC vs. Cloud (Y-Axis)

Different data streaming categories exist regarding the deployment model:

  • Self-Managed: Operate nodes like Kafka Broker, Kafka Connect, and Schema Registry by yourself with your favorite scripts and tools. This can be on-premise or in the public cloud in your VPC. Reduce the operations burden via a cloud-native platform (usually Kubernetes) and related operator tools that automate operations tasks like rolling upgrades or rebalancing Kafka Partitions.
  • Bring Your Own Cloud (BYOC): Allow organizations to host Kafka within their own cloud VPC. BYOC combines some of the benefits of cloud flexibility with enhanced security and control, while it outsources most of Kafka’s management to specialized vendors. The data plane is still customer-managed, but in contrast to self-managed Kafka, the customer does not need to worry about the complexity under the hood (like rebalancing, rolling upgrades, backups) – that is what cloud-native object storage and other magic code of the BYOC control plane service take over.
  • Cloud (PaaS or SaaS): Both PaaS and SaaS solutions operate within the cloud provider’s VPC. Fully managed SaaS for data streaming takes overall operational responsibilities, including scaling, failover handling, upgrades, and performance tuning, allowing users to focus solely on integration and business logic. In contrast, partially managed PaaS reduces the operational burden by automating certain tasks like rolling upgrades and rebalancing, but still requires some level of user involvement in managing the infrastructure. Fully Managed SaaS typically provides critical SLAs for support and uptime while partially managed PaaS cannot provide such guarantees.

Most organizations prefer SaaS for data streaming when business and technical requirements allow, as it minimizes operational complexity and maximizes scalability. Other deployment models are chosen when specific constraints or needs require them.

The Evolution of BYOC Kafka Cloud Services

Cloud and On-Premise deployment options are typically well understood, but BYOC (Bring Your Own Cloud) often requires more explanation due to its unique operating model and varying implementations across vendors.

In last year’s data streaming landscape 2024, I wrote the following about BYOC for Kafka:

I do NOT believe in this approach as too many questions and challenges exist with BYOC regarding security, support, and SLAs in the case of P1 and P2 tickets and outages. Hence, I put this in the category of self-managed. That is what it is, even though the vendor touches your infrastructure. In the end, it is your risk because you have to and want to control your environment.

This statement made sense because BYOC vendors at that time required access to the customer VPC and offered a shared support model. While this is still true for some BYOC solutions, my mind changed with the innovation of BYOC by one emerging vendor: WarpStream.

WarpStream’s BYOC Operating Model with Stateless Agents in the Customer VPC

WarpStream published a new operating model for BYOC: The customer only deploys stateless agents in its VPC and provides an object storage bucket to store the data. The control plane and metadata store are fully managed by the vendor as SaaS and the vendor takes over all the complexity.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

With this innovation, BYOC is now a worthwhile third deployment option besides a self-managed and fully managed data streaming platform. It brings several benefits:

  • No access is needed by the BYOC cloud vendor to the customer VPC. The data plane (i.e., the “brokers” in the customer VPC) is stateless. The metadata/consensus is in the control plane (i.e., the cloud service in the WarpStream VPC).
  • The architecture solves sovereignty challenges and is a great fit for security and compliance requirements.
  • The cost of the BYOC offering is cheaper than self-managed Apache Kafka because it is built with cloud-native concepts and technologies in mind (e.g., zero disks and zero interzone networking fees, leveraging cloud object storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage).
  • The stateless architecture in the customer VPC makes autoscaling and elasticity very easy to implement/configure.

When to use BYOC?

WarpStream introduced an innovative share-nothing operating model that makes BYOC practical, secure, and cost-efficient. With that being said, I still recommend only looking at BYOC options for Apache Kafka in the public cloud if a fully managed and serverless data streaming platform does NOT work for you because of cost, security, or compliance reasons! When it comes to simplicity and ease of operation, nothing beats a fully managed cloud service.

And please keep in mind that NOT every BYOC cloud service provides these characteristics and benefits. Make sure to make a proper evaluation of your favorite solutions. For more details, look at my blog post: “Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud)“.

Changes in the Data Streaming Landscape from 2024 to 2025

My goal is NOT a growing landscape with tens or even hundreds of vendors and cloud services. Plenty of these pictures exist. Instead, I focus on a few technologies, vendors, and cloud offerings that I see in the field, with adoption by the broader open-source and cloud community.

I already discussed the important conceptual changes in the data streaming landscape:

  • Deployment Model: From self-managed, partially managed, and fully managed to self-managed, BYOC and cloud.
  • Streaming Categories: From different streaming categories to a single category for all data streaming platforms sorted by adoption and completeness.

Additionally, I modified the list of solutions compared to the data streaming landscape 2024 published one year ago.

Additions to the Data Streaming Landscape 2025

The following data streaming services were added:

  • Alibaba (Cloud): Confluent Data Streaming Service on Alibaba Cloud is an OEM partnership between Alibaba Cloud and Confluent to offer a fully managed SaaS in Mainland China. The service was announced at the end of 2021 and sees more and more traction in Asia. Alibaba is the contractor and first-level support for the end user.
  • Google Managed Service for Kafka (Cloud): Google announced this Kafka PaaS recently. The strategy looks very similar to Amazon’s MSK. Even the abbreviation is the same: MSK. I explored when (not) to choose Google’s Kafka cloud service after the announcement. The service is still in preview, but available to a huge customer base already.
  • Oracle Streaming with Apache Kafka (Cloud): A partially managed Apache Kafka PaaS on Oracle Cloud Infrastructure (OCI). The service is in early access, but available to a huge customer base already.
  • WarpStream (BYOC): WarpStream was acquired by Confluent. It is still included with its logo as Confluent continues to keep the brand and solution separated (at least for now).

Removals from the Data Streaming Landscape 2025

There are NO REMOVALS this year, BUT I was close to removing two technologies:

  • Apache Pulsar and StreamNative: I was close to removing Apache Pulsar as I see zero traction in the market. Around 2020, Pulsar had some traction but focused too much on Kafka FUD instead of building a vibrant community. While Kafka simplified its architecture (ZooKeeper removal), Pulsar still includes three distributed systems (ZooKeeper or alternatives like etcd, BookKeeper, and Pulsar Broker). It also pivots to the Kafka protocol trying to get some more traction again. But it seems to be too late.
  • ksqlDB (formerly called KSQL): The product is feature complete. While it is still supported by Confluent, it will not get any new features. ksqlDB is still a great piece of software for some (!) Kafka-native stream processing projects but might be removed in the future. Confluent co-founder and Kafka co-creator Jay Kreps commented on X (formerly Twitter): “Confluent went through a set of experiments in this area. What we learned is that for *platform* layers you want a clean separation. We learned this the hard way: our source available stream processing layer KSQL, lost to open-source Apache Flink. We pivoted to Flink.”

Vendor Overview for Data Streaming Platforms

All vendors of the landscape do some kind of data streaming. However, the offerings differ a lot in adoption, completeness, and vision. And many solutions are not available everywhere but only in one public cloud or only as self-managed. For detailed product information and experiences, the vendor websites and other blogs/conference talks are the best and most up-to-date resources. The following is just a summary to get an overview.

Before we do the deep dive, here again, the entire data streaming landscape for 2025:

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Self-Managed Data Streaming with Open Source and Proprietary Products

Self Managed Data Streaming Platforms including Kafka Flink Spark IBM Confluent Cloudera

The following list describes the open-source frameworks and proprietary products for self-managed data streaming deployments (in order of adoption and completeness):

  • Apache Pulsar: A competitor to Apache Kafka with a similar story and use cases, but a different architecture. Kafka is a single distributed cluster (after removing the ZooKeeper dependency in 2022), while Pulsar requires three (!) distributed systems: Pulsar brokers, ZooKeeper, and BookKeeper. Pulsar vs. Kafka explored the differences. And Kafka catches up on some missing features like Queues for Kafka.
  • StreamNative: The primary vendor behind Apache Pulsar. Not much market traction.
  • ksqlDB (usually called KSQL, even after Confluent’s rebranding): An abstraction layer on top of Kafka Streams to provide stream processing with streaming SQL. Hence, also Kafka-native. It comes with a Confluent Community License and is free to use. Sweet spot: Streaming ETL.
  • Redpanda: Implements the Kafka protocol with C++. Trying out different market strategies to define Redpanda as an alternative to a Kafka-native offering. Still in the early stage in the maturity curve. Adding tons of (immature) product features in parallel to find the right market fit in a growing Kafka market. Recently acquired Benthos to provide connectivity to data sources and sinks (similar to Kafka Connect).
  • Ververica: Well-known Flink company. Acquired by Chinese tech giant Alibaba in 2019. Not much traction in Europe and the US. Sweet spot: Flink in Mainland China.
  • Apache Flink: Becoming the de facto standard for stream processing. Open-source implementation. Provides advanced features including a powerful scalable compute engine, freedom of choice for developers between SQL, Java, and Python, APIs for Complex Event Processing (CEP), and unified APIs for stream and batch workloads.
  • Spark Streaming: The streaming part of the open-source big data processing framework Apache Spark. The enormous installed base of Spark clusters in enterprises broadens adoption thanks to solutions from Cloudera, Databricks, and the cloud service providers. Sweet spot: Analytics in (micro)batches with data stored at rest in the data lake/lakehouse.
  • Apache Kafka: The de facto standard for data streaming. Open-source implementation with a vast community. Almost all vendors rely on (parts of) this project. Often underestimated: Kafka includes data integration and stream processing capabilities with Kafka Connect and Kafka Streams, making even the open-source Apache download already more powerful than many other data streaming frameworks and products.
  • IBM / Red Hat AMQ Streams: Provides Kafka as self-managed Kafka on Kubernetes via OpenShift. Kafka is part of the integration portfolio that includes other open-source frameworks like Apache Camel. Sweet spot: Existing IBM customers.
  • Cloudera: Provides Kafka, Flink, and other open-source data and analytics frameworks as a self-managed offering. The main strategy is offering one product with a vast combination of many open-source frameworks that can be deployed on any infrastructure. Sweet spot: Analytics.
  • Confluent Platform: Focuses on a complete data streaming platform including Kafka and Flink, and various advanced data streaming capabilities for operations, integration, governance, and security. Sweet spot: Unifying operational and analytical workloads, and combination with the fully managed cloud service.

Data Streaming with Bring Your Own Cloud (BYOC)

Bring Your Own Cloud BYOC Data Streaming Platforms Redpanda Databricks WarpStream

BYOC is an emerging category and is mainly used for specific challenges such as strict data security and compliance requirements. The following vendors provide dedicated BYOC offerings for data streaming (in order of adoption and completeness):

  • WarpStream (Confluent): A new entrant into the data streaming market. The cloud service is a Kafka-compatible data streaming platform built directly on top of S3. Innovated the BYOC model to enable secure and cost-effective data streaming for workloads that don’t have strong latency requirements.
  • Redpanda: The first BYOC offering on the market for data streaming. The biggest concern is the shared responsibility model of this solution because the vendor requires access to the customer VPC for operations and support. This is against the key principles of BYOC regarding security and compliance and why organizations (have to) look for BYOC instead of SaaS solutions.
  • Databricks: Cloud-based data platform that provides a collaborative environment for data engineering, data science, and machine learning, built on top of Apache Spark. Data Streaming is enabled by Spark Streaming and focuses mainly on analytical workloads that are optimized from batch to near real-time.

Partially Managed Data Streaming Cloud Platforms (PaaS)

PaaS Cloud Streaming Platforms like Red Hat Amazon MSK Google Oracle Aiven

Here is an overview of relevant PaaS data streaming cloud services (in order of adoption and completeness):

  • Google Cloud Managed Service for Apache Kafka (MSK): Initially branded as Google Managed Kafka for BigQuery (likely for a better marketing push), the service enables data ingestion into lakehouses on GCP such as Google BigQuery.
  • Amazon Managed Service for Apache Flink (MSF): A partially managed service by AWS that allows customers to transform and analyze streaming data in real-time with Apache Flink. It still has some (costly) gaps around auto-scaling and is not truly serverless. It supports all Flink interfaces, i.e., SQL, Java, and Python, and non-Kafka connectors, too. Only available on AWS.
  • Oracle OCI Streaming with Apache Kafka: The service is still in early access, but available to a huge customer base already on Oracle’s cloud infrastructure.
  • Microsoft Azure HDInsight: A piece of Azure’s Hadoop infrastructure. Not intended for use cases beyond data ingestion for batch analytics.
  • Instaclustr: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies. Instaclustr also offers a (semi-)managed offering for on-premise infrastructure.
  • Amazon Kinesis: Data ingestion into AWS data stores. Mature product for a specific problem. Only available on AWS.
  • Aiven: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies.
  • IBM / Red Hat AMQ Streams: Provides Kafka as a partially managed cloud offering on OpenShift (through Red Hat). Sweet spot: Existing IBM customers.
  • Amazon Managed Service for Apache Kafka (MSK): AWS has hundreds of cloud services, and Kafka is part of that broad spectrum. MSK is only available in public AWS cloud regions; not on Outposts, Local Zones, Wavelength, etc. MSK is likely the largest partially managed Kafka service across all clouds. It evolved with new features like support for Kafka Connect and Tiered Storage, but it lacks connectivity outside the AWS ecosystem and a data governance narrative.

Fully Managed Data Streaming Cloud Services (SaaS)

Fully Managed SaaS Data Streaming Platform including Confluent Alibaba Microsoft Event Hubs

Here is an overview of relevant SaaS data streaming cloud services (in order of adoption and completeness):

  • Decodable: A relatively new cloud service for Apache Flink in the early stage. Huge potential if it is combined with existing Kafka infrastructures in enterprises. But also provides pre-built connectors for non-Kafka systems. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
  • StreamNative Cloud: The primary vendor behind Apache Pulsar. Offers self-managed and fully managed solutions. StreamNative Cloud for Kafka is still at a very early stage of maturity; it is unclear whether it will ever gain traction.
  • Ververica: Stream processing as a service powered by Apache Flink on all major cloud providers. Huge potential if it is combined with existing Kafka infrastructures in enterprises. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
  • Redpanda Cloud: Redpanda provides its data streaming as a serverless offering. Not much information is available on the website about this part of the vendor’s product portfolio.
  • Amazon MSK Serverless: Different functionalities and limitations than Amazon MSK. MSK Serverless still does not get much traction because of its limitations. Both MSK offerings exclude Kafka support in their SLAs (please read the terms and conditions).
  • Google Cloud DataFlow: Fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. Mature product for a specific problem. Only available on GCP.
  • Azure Event Hubs: A mature, fully managed cloud service. The service does one thing, and that is done very well: Data ingestion via the Kafka protocol. Hence, it is not a complete streaming platform but is more comparable to Amazon Kinesis or Google Cloud PubSub. The limited compatibility with the Kafka protocol and the high cost of the service for higher throughput are the two major blockers that come up regularly in conversations.
  • Confluent Cloud: A complete data streaming platform including Kafka and Flink as a fully managed offering. In addition to deep integration, the platform provides connectivity, security, data governance, self-service portal, disaster recovery tooling, and much more to be the most complete DSP on all major public clouds.
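Kafka-protocol compatibility is what makes moving between these services feasible: the producer and consumer application code stays the same, and typically only connection settings change. The sketch below builds a common Kafka client configuration with hypothetical placeholder endpoints and credentials (the property names follow widely used librdkafka-style client conventions; verify the exact values against your vendor's documentation):

```python
def kafka_client_config(bootstrap_servers, username, password):
    """Build a Kafka-protocol client configuration dict. Between
    Kafka-compatible services, only the endpoint and credentials
    change; the producer/consumer application code stays the same."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.username": username,
        "sasl.password": password,
    }

# Hypothetical endpoints -- same application, different Kafka-compatible backend:
kafka_cfg = kafka_client_config("my-cluster.example.com:9092", "user", "secret")
eventhubs_cfg = kafka_client_config(
    "my-namespace.servicebus.windows.net:9093",
    "$ConnectionString", "<event-hubs-connection-string>")
```

The flip side, as noted above for Azure Event Hubs, is that compatibility with the Kafka protocol can be partial, so each target service needs to be validated against the client features an application actually uses.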

Potential for the Data Streaming Landscape 2026

Data streaming is a journey. So is the development of event streaming platforms and cloud services. Several established software and cloud vendors might get more traction with their data streaming offerings. And some startups might grow significantly. The following shows a few technologies that might evolve and see growing adoption in 2025:

  • New startups around the Kafka protocol emerge. The list of new frameworks and cloud services is growing every quarter. A few names I saw in some social network posts (but not much beyond in the real world): AutoMQ, S2, Astradot, Bufstream, Responsive, tansu, Tektite, Upstash. While some focus on the messaging/streaming part, others focus on a particular piece such as building database capabilities.
  • Streaming databases like Materialize or RisingWave might become a new software category. My feeling: Very early stage of the hype cycle. We will see in 2025 if and where this technology gets more broadly adopted and what the use cases are. It is hard to say how these will compete with emerging real-time analytics databases like Apache Druid, Apache Pinot, ClickHouse, Timeplus, Tinybird, et al. I know there are differences, but the broader community and companies need to a) understand these differences and b) find business problems for them.
  • Stream Processing SaaS startups emerge: Quix and Bytewax provide stream processing with Python. Quix now also offers a hosted offering based on Kafka Streams; as does Responsive. DeltaStream provides Apache Flink as SaaS. And many more startups emerge these days. Let’s see which of these gets traction in the market with an innovative product and business model.
  • Traditional data management vendors like MongoDB or Snowflake try to get deeper into the data streaming business. I am still a fan of separation of concerns; so I think these should keep their sweet spot and (only) provide streaming ingestion and CDC as use cases, but not (try to) compete with data streaming vendors.

Fun fact: The business model of almost all emerging startups is fully managed cloud services, not selling licenses for on-premise deployments. Many are based on open-source or open-core, and others only provide a proprietary implementation.

Although they are not aiming to be full data streaming platforms (and thus are not part of the platform landscape), other complementary tools are gaining momentum in the data streaming ecosystem. For instance, Conduktor is developing a proxy for Kafka clusters, and Lenses, though relatively quiet since its acquisition by Celonis, has recently introduced updates to its user-friendly management and developer tools. These tools address gaps that some data streaming platforms leave unfilled.

Data Streaming: A Journey, Not a Sprint

Data streaming isn’t a sprint—it’s a journey! Adopting event-driven architectures with technologies like Apache Kafka or Apache Flink requires rethinking how applications are designed, developed, deployed, and monitored. Modern data strategies involve legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud environments.

The data streaming landscape in 2025 highlights the emergence of a new software category, though it’s still in its early stages. Building such a category takes time. In discussions with customers, partners, and the community, a common sentiment emerges: “We understand the value but are just starting with the first data streaming pipelines and have a long-term plan to implement advanced stream processing and build a strategic data streaming organization.”

The Forrester Wave: Streaming Data Platforms, Q4 2023, and other industry reports from Gartner and IDG signal that this category is progressing through the hype cycle.

Last but not least, check out my Top Data Streaming Trends for 2025 to understand how the data streaming landscape fits into emerging trends:

  1. Democratization of Kafka: Apache Kafka has transitioned from a specialized tool to a foundational platform in modern data infrastructure.
  2. Kafka Protocol as the Standard: Vendors adopt the Kafka protocol over the framework, enabling flexibility with compatibility and performance trade-offs.
  3. BYOC Deployment Model: Bring Your Own Cloud gains traction for balancing security, compliance, and managed services.
  4. Flink Becomes the Standard for Stream Processing: Apache Flink rises as the premier framework for stream processing, building integration pipelines and business applications.
  5. Data Streaming for Real-Time Predictive AI and GenAI: Real-time model inference drives predictive and generative AI applications.
  6. Data Streaming Organizations: Companies unify real-time data strategies to standardize tools, governance, and collaboration.

Which are your favorite open-source frameworks or cloud services for data streaming? What are your most relevant and exciting trends around Apache Kafka and Flink in 2025 to set data in motion? What does your enterprise landscape for data streaming look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. Make sure to download my free ebook about data streaming use cases and industry examples.

The post The Data Streaming Landscape 2025 appeared first on Kai Waehner.

Top Trends for Data Streaming with Apache Kafka and Flink in 2025 https://www.kai-waehner.de/blog/2024/12/02/top-trends-for-data-streaming-with-apache-kafka-and-flink-in-2025/ Mon, 02 Dec 2024 14:02:07 +0000 https://www.kai-waehner.de/?p=6923 Apache Kafka and Apache Flink are leading open-source frameworks for data streaming that serve as the foundation for cloud services, enabling organizations to unlock the potential of real-time data. Over recent years, trends have shifted from batch-based data processing to real-time analytics, scalable cloud-native architectures, and improved data governance powered by these technologies. Looking ahead to 2025, the data streaming ecosystem is set to undergo even greater changes. Here are the top trends shaping the future of data streaming for businesses.

The post Top Trends for Data Streaming with Apache Kafka and Flink in 2025 appeared first on Kai Waehner.

]]>
The evolution of data streaming has transformed modern business infrastructure, establishing real-time data processing as a critical asset across industries. At the forefront of this transformation, Apache Kafka and Apache Flink stand out as leading open-source frameworks that serve as the foundation for cloud services, enabling organizations to unlock the potential of real-time data. Over recent years, trends have shifted from batch-based data processing to real-time analytics, scalable cloud-native architectures, and improved data governance powered by these technologies. Looking ahead to 2025, the data streaming ecosystem is set to undergo even greater changes. Here are the top trends shaping the future of data streaming for businesses.

Data Streaming Trends for 2025 - Leading with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch.

The Top Data Streaming Trends

Some followers might notice that this became a series with articles about the top 5 data streaming trends for 2021, the top 5 for 2022, the top 5 for 2023, and the top 5 for 2024. Trends change over time, but the huge value of having a scalable real-time infrastructure as the central data hub stays. Data streaming with Apache Kafka is a journey and evolution to set data in motion.

I recently explored the past, present, and future of data streaming tools and strategies, looking back over the past decades. Data streaming is becoming more and more mature and standardized, yet remains innovative.

Let’s now look at the top trends coming up more regularly in conversations with customers, prospects, and the broader data streaming community across the globe:

  1. The Democratization of Kafka: Apache Kafka has transitioned from a specialized tool to a key pillar in modern data infrastructure.
  2. Kafka Protocol as the Standard: Vendors adopt the Kafka wire protocol, enabling flexibility with compatibility and performance trade-offs.
  3. BYOC Deployment Model: Bring Your Own Cloud gains traction for balancing security, compliance, and managed services.
  4. Flink Becomes the Standard for Stream Processing: Apache Flink rises as the premier framework for stream processing, building integration pipelines and business applications.
  5. Data Streaming for Real-Time Predictive AI and GenAI: Real-time model inference drives predictive and generative AI applications.
  6. Data Streaming Organizations: Companies unify real-time data strategies to standardize processes, tools, governance, and collaboration.

The following sections describe each trend in more detail. The trends are relevant for many scenarios; no matter if you use open-source frameworks like Apache Kafka and Flink, a commercial platform, or a fully managed cloud service like Confluent Cloud.

Trend 1: The Democratization of Kafka

In the last decade, Apache Kafka has become the standard for data streaming, evolving from a specialized tool to an essential utility in the modern tech stack. Over 150,000 organizations use Kafka today, making it the de facto choice for real-time data infrastructure. Yet, with a market crowded by offerings from AWS, Microsoft Azure, Google Cloud, IBM, Oracle, Confluent, and various startups, companies can no longer rely solely on Kafka for differentiation. The vast array of Kafka-compatible solutions means that businesses face more choices than ever, but also new challenges in selecting the solution that balances cost, performance, and features.

The Challenge: Finding the Right Fit in a Crowded Kafka Market

For end users, choosing the right Kafka solution is becoming increasingly complex. Basic Kafka offerings cover standard streaming needs but may lack advanced features, such as enhanced security, data governance, or integration and processing capabilities, that are essential for specific industries. In such a diverse market, businesses must navigate trade-offs, considering whether a low-cost option meets their needs or whether investing in a premium solution with added capabilities provides better long-term value.

The Solution: Prioritizing Features for Your Strategic Needs

As Kafka solutions evolve, users must look beyond price and consider features that offer real strategic value. For example, companies handling sensitive customer data might benefit from Kafka products with top-tier security features. Those focused on analytics may look for solutions with strong integrations into data platforms and low cost for high throughput. By carefully selecting a Kafka product that aligns with industry-specific requirements, businesses can leverage the full potential of Kafka while optimizing for cost and capabilities.

For instance, look at Confluent’s various cluster types for different requirements and use cases in the cloud:

Confluent Cloud Cluster Types for Different Requirements and Use Cases
Source: Confluent

As an example, Freight Clusters were introduced to provide an offering with up to 90 percent lower cost. The major trade-off is higher latency, but this is perfectly acceptable for high-volume log analytics at GB/sec scale.

The Business Value: Affordable and Customized Data Streaming

Kafka’s commoditization means more affordable, customizable options for businesses of all sizes. This competition reduces costs, making high-performance data streaming more accessible, even to smaller organizations. By choosing a tailored solution, businesses can enhance customer satisfaction, speed up decision-making, and innovate faster in a competitive landscape.

Trend 2: The Kafka Protocol, not Apache Kafka, is the New Standard for Data Streaming

With the rise of cloud-native architectures, many vendors have shifted to supporting the Kafka protocol rather than the open-source Kafka framework itself, allowing for greater flexibility and cloud optimization. This change enables businesses to choose Kafka-compatible tools that better align with specific needs, moving away from a one-size-fits-all approach.

Confluent introduced its KORA engine, i.e., Kafka re-architected to be cloud-native. A deep technical whitepaper goes into the details (this is not a marketing document but really for software engineers).

Confluent KORA - Apache Kafka Re-Architected to be Cloud Native
Source: Confluent

Other players followed Confluent and introduced their own cloud-native “data streaming engines”. For instance, StreamNative has URSA powered by Apache Pulsar, Redpanda talks about its R1 Engine implementing the Kafka protocol, and Ververica recently announced VERA for its Flink-based platform.

Some vendors rely only on the Kafka protocol with a proprietary engine from the beginning. For instance, Azure Event Hubs or WarpStream. Amazon MSK also goes in this direction by adding proprietary features like Tiered Storage or even introducing completely new product options such as Amazon MSK Express brokers.

The Challenge: Limited Compatibility Across Kafka Solutions

When vendors implement the Kafka protocol instead of the entire Kafka framework, it can lead to compatibility issues, especially if the solution doesn’t fully support the Kafka APIs. For end users, this can complicate integration, particularly for advanced features like Exactly-Once Semantics, the Transaction API, Compacted Topics, Kafka Connect, or Kafka Streams, which may not be supported or may not work as expected.

The Solution: Evaluating Kafka Protocol Solutions Critically

To fully leverage the flexibility of Kafka protocol-based solutions, a thorough evaluation is essential. Businesses should carefully assess the capabilities and compatibility of each option, ensuring it meets their specific needs. Key considerations include verifying the support of required features and APIs (such as the Transaction API, Kafka Streams, or Connect).

It is also crucial to evaluate the level of product support provided, including 24/7 availability, uptime SLAs, and compatibility with the latest versions of open-source Apache Kafka. This detailed evaluation ensures that the chosen solution integrates seamlessly into existing architectures and delivers the reliability and performance required for modern data streaming applications.
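
To make such an evaluation concrete, the required features can be captured as a simple checklist and compared against what each candidate service declares it supports. The following Python sketch is purely illustrative; the feature names and support matrices are hypothetical examples, not actual vendor capabilities.

```python
# Illustrative sketch: scoring Kafka-protocol services against required features.
# The feature names and per-service support sets are hypothetical examples.

REQUIRED_FEATURES = {"transactions", "kafka_streams", "kafka_connect", "compacted_topics"}

def evaluate(candidates, required):
    """Return the missing required features per candidate service."""
    return {name: required - supported for name, supported in candidates.items()}

candidates = {
    "full_framework": {"transactions", "kafka_streams", "kafka_connect",
                       "compacted_topics", "tiered_storage"},
    "protocol_only":  {"compacted_topics", "kafka_connect"},
}

gaps = evaluate(candidates, REQUIRED_FEATURES)
for name, missing in gaps.items():
    status = "OK" if not missing else f"missing: {sorted(missing)}"
    print(f"{name}: {status}")
```

A real evaluation would of course go further (verifying behavior with test workloads, not just feature lists), but making the required features explicit up front keeps the comparison honest.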

The Business Value: Expanded Options and Cost-Efficiency

Kafka protocol-based solutions offer greater flexibility, allowing businesses to select Kafka-compatible services optimized for their specific environments. This flexibility opens doors for innovation, enabling companies to experiment with new tools without vendor lock-in.

For instance, innovations such as a “direct write to S3 object store” architecture have emerged in WarpStream, Confluent Freight Clusters, and other data streaming offerings that build proprietary engines around the Kafka protocol. The result is a more cost-effective approach to data streaming, though it may come with trade-offs, such as increased latency. Check out this video about the evolution of Kafka Storage to learn more.

Trend 3: BYOC (Bring Your Own Cloud) as a New Deployment Model for Security and Compliance

As data security and compliance concerns grow, the Bring Your Own Cloud (BYOC) model is gaining traction as a new way to deploy Apache Kafka. BYOC allows businesses to host Kafka in their own Virtual Private Cloud (VPC) while the vendor manages the control plane to handle complex orchestration tasks like partitioning, replication, and failover.

This BYOC approach offers organizations enhanced control over their data while retaining the operational benefits of a managed service. BYOC provides a middle ground between self-managed and fully managed solutions, addressing specific regulatory and security needs without sacrificing scalability or flexibility.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

The Challenge: Balancing Security and Ease of Use

Ensuring data sovereignty and compliance is non-negotiable for organizations in highly regulated industries. However, traditional fully managed cloud solutions can pose risks due to vendor access to sensitive data and infrastructure. Many BYOC solutions claim to address these issues but fall short when it comes to minimizing external access to customer environments. Common challenges include:

  • Vendor Access to VPCs: Many BYOC offerings require vendors to have access to customer VPCs for deployment, cluster management, and troubleshooting. This introduces potential security vulnerabilities.
  • IAM Roles and Elevated Privileges: Cross-account Identity and Access Management (IAM) roles are often necessary for managing BYOC clusters, which can expose sensitive systems to unnecessary risks.
  • VPC Peering Complexity: Traditional BYOC solutions often rely on VPC peering, a complex and expensive setup that increases operational overhead and opens additional points of failure.

These limitations create significant challenges for security-conscious organizations, as they undermine the core promise of BYOC: control over the data environment.

The Solution: Gaining Control with a “Zero Access” BYOC Model

WarpStream redefines the BYOC model with a “zero access” architecture, addressing the challenges of traditional BYOC solutions. Unlike other BYOC offerings using the Kafka protocol, WarpStream ensures that no data leaves the customer’s environment, delivering a truly secure-by-default platform. Hence this section discusses specifically WarpStream, not BYOC Kafka offerings in general.

WarpStream BYOC Zero Access Kafka Architecture with Control and Data Plane
Source: WarpStream

Key features of WarpStream include:

  • Zero Access to Customer VPCs: WarpStream eliminates vendor access by deploying stateless agents within the customer’s environment, handling compute operations locally without requiring cross-account IAM roles or elevated privileges to reduce security risks.
  • Data/Metadata Separation: Raw data remains entirely within the customer’s network for full sovereignty, while only metadata is sent to WarpStream’s control plane for centralized management, ensuring data security and compliance.
  • Simplified Infrastructure: WarpStream avoids complex setups like VPC peering and cross-IAM roles, minimizing operational overhead while maintaining high performance.
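
The data/metadata separation above can be illustrated with a toy sketch: only lightweight metadata (topic, offset, size, checksum) would travel to a control plane, while the raw payload stays inside the customer environment. The field names and split shown here are illustrative assumptions, not WarpStream’s actual wire format.

```python
# Toy sketch of a data/metadata split: the record payload stays "local",
# while only lightweight metadata is handed to a control plane.
# Field names and structure are illustrative, not any vendor's actual format.
import hashlib

def split_record(topic, partition, offset, payload):
    metadata = {
        "topic": topic,
        "partition": partition,
        "offset": offset,
        "size": len(payload),
        "checksum": hashlib.sha256(payload).hexdigest(),
    }
    return metadata, payload  # metadata -> control plane, payload -> local object store

meta, local = split_record("payments", 0, 42, b'{"amount": 100}')
assert "payload" not in meta  # raw data never leaves the environment
print(meta["size"], meta["topic"])
```

The key design property is that the control plane can coordinate partitioning and offsets from metadata alone; it never needs to see the payload bytes.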

Comparison with Other BYOC Solutions using the Kafka protocol:

Unlike most other BYOC offerings (e.g., Redpanda), WarpStream doesn’t require direct VPC access or elevated permissions, avoiding risks like data exposure or remote troubleshooting vulnerabilities. Its “zero access” architecture ensures unparalleled security and compliance.

The Business Value: Secure, Compliant, and Scalable Data Streaming

WarpStream’s innovative approach to BYOC delivers exceptional business value by addressing security and compliance concerns while maintaining operational simplicity and scalability:

  • Uncompromised Security: The zero-access architecture ensures that raw data remains entirely within the customer’s environment, meeting the strictest security and compliance requirements for regulated industries like finance, healthcare, and government.
  • Operational Efficiency: By eliminating the need for VPC peering, cross-IAM roles, and remote vendor access, WarpStream simplifies BYOC deployments and reduces operational complexity.
  • Cost Optimization: WarpStream’s reliance on cloud-native technologies like object storage reduces infrastructure costs compared to traditional disk-based approaches. Stateless agents also enable efficient scaling without unnecessary overhead.
  • Data Sovereignty: The data/metadata split guarantees that data never leaves the customer’s environment, ensuring compliance with regulations such as GDPR and HIPAA.
  • Peace of Mind for Security Teams: With no vendor access to the VPC or object storage, WarpStream’s zero-access model eliminates concerns about external breaches or elevated privileges, making it easier to gain buy-in from security and infrastructure teams.

BYOC Strikes the Balance Between Control and Managed Services

BYOC offers businesses the ability to strike a balance between control and managed services, but not all BYOC solutions are created equal. WarpStream’s “zero access” architecture sets a new standard, addressing the critical challenges of security, compliance, and operational simplicity. By ensuring that raw data never leaves the customer’s environment and eliminating the need for vendor access to VPCs, WarpStream delivers a BYOC model that meets the highest standards of security and performance. For organizations seeking a secure, scalable, and compliant approach to data streaming, WarpStream represents the future of BYOC data streaming.

But just to be clear: If a data streaming project goes to the cloud, fully managed Kafka (and Flink) should always be the first option as it is much easier to manage and operate to focus on fast time-to-market and business innovation. Choose BYOC only if fully managed does not work for you because of security requirements.

Trend 4: Flink Becomes the Standard for Stream Processing

Apache Flink has emerged as the premier choice for organizations seeking a robust and versatile framework for continuous stream processing. Its ability to handle complex data pipelines with high throughput, low latency, and advanced stateful operations has solidified its position as the de facto standard for stream processing. Flink’s support for Java, Python, and SQL further enhances its appeal, enabling developers to build powerful data-driven applications using familiar tools.

Apache Flink Adoption Curve Compared to Kafka

As Flink adoption grows, it increasingly complements Apache Kafka as part of the modern data streaming ecosystem, while the Kafka Streams (Java-only) library remains relevant for lightweight, application-embedded use cases.

The Challenge: Handling Complex, High-Throughput Data Streams

Modern businesses increasingly rely on real-time data for both operational and analytical needs, spanning mission-critical applications like fraud detection, predictive maintenance, and personalized customer experiences, as well as streaming ETL for integrating and transforming data. These diverse use cases demand robust stream processing capabilities that can address challenges around latency, throughput, and state management.

Apache Flink’s versatility makes it uniquely positioned to meet the demands of both streaming ETL for data integration and building real-time business applications. Flink provides:

  • Low Latency: Near-instantaneous processing is crucial for enabling real-time decision-making in business applications, timely updates in analytical systems, and supporting transactional workloads where rapid processing and immediate consistency are essential for ensuring smooth operations and seamless user experiences.
  • High Throughput and Scalability: The ability to process millions of events per second, whether for aggregating operational metrics or moving massive volumes of data into data lakes or warehouses, without bottlenecks.
  • Stateful Processing: Support for maintaining and querying the state of data streams, essential for performing complex operations like aggregations, joins, and pattern detection in business applications, as well as data transformations and enrichment in ETL pipelines.
  • Multiple Programming Languages: Support for Java, Python, and SQL ensures accessibility for a wide range of developers, enabling efficient implementation across various use cases.
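
To illustrate what stateful keyed processing means conceptually, here is a toy Python sketch of a running per-key aggregate over an unbounded event stream. This is not the Flink API, just a minimal model of keyed state: each key maintains its own state, updated incrementally as events arrive.

```python
# Minimal illustration of stateful keyed stream processing: a running
# per-key average over an unbounded event stream, conceptually similar
# to keyed state in a stream processor (this is a toy, not the Flink API).
from collections import defaultdict

class RunningAverage:
    def __init__(self):
        self.state = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]

    def process(self, key, value):
        count_sum = self.state[key]
        count_sum[0] += 1
        count_sum[1] += value
        return count_sum[1] / count_sum[0]  # emit updated average per event

agg = RunningAverage()
for key, value in [("sensor-1", 10.0), ("sensor-2", 4.0), ("sensor-1", 20.0)]:
    print(key, agg.process(key, value))
```

What Flink adds on top of this simple idea is what makes it production-grade: the state is partitioned across a cluster, checkpointed for fault tolerance, and queryable, none of which this sketch attempts.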

The rise of cloud services has further propelled Flink’s adoption, with offerings from major providers like Confluent, Amazon, IBM, and emerging startups. These cloud-native solutions simplify Flink deployments, making it easier for organizations to operationalize real-time analytics.

While Apache Flink has emerged as the de facto standard for stream processing, other frameworks like Apache Spark and its streaming module, Structured Streaming, continue to compete in this space. However, Spark Streaming has notable limitations that make it less suitable for many of the complex, high-throughput workloads modern enterprises demand.

The Challenges with Spark Streaming

Apache Spark, originally designed as a batch processing framework, introduced Spark Streaming and later Structured Streaming to address real-time processing needs. However, its batch-oriented roots present inherent challenges:

  • Micro-Batch Architecture: Spark Structured Streaming relies on micro-batches, where data is divided into small time intervals for processing. This approach, while effective for certain workloads, introduces higher latency compared to Flink’s true streaming architecture. Applications requiring millisecond-level processing or transactional workloads may find Spark unsuitable.
  • Limited Stateful Processing: While Spark supports stateful operations, its reliance on micro-batches adds complexity and latency. This makes Spark Streaming less efficient for use cases that demand continuous state updates, such as fraud detection or complex event processing (CEP).
  • Fault Tolerance Complexity: Spark’s recovery model is rooted in its lineage-based approach to fault tolerance, which can be less efficient for long-running streaming applications. Flink, by contrast, uses checkpointing and savepoints to handle failures more gracefully, ensuring state consistency with minimal overhead.
  • Performance Overhead: Spark’s general-purpose design often results in higher resource consumption compared to Flink, which is purpose-built for stream processing. This can lead to increased infrastructure costs for high-throughput workloads.
  • Scalability Challenges for Stateful Workloads: While Spark scales effectively for batch jobs, its scalability for complex stateful stream processing is more limited, as distributed state management in micro-batches can become a bottleneck under heavy load.

By addressing these limitations, Apache Flink provides a more versatile and efficient solution than Apache Spark for organizations looking to handle complex, real-time data processing at scale.
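
The latency gap between micro-batching and true streaming can be quantified with a toy model: under micro-batching, an event waits until its enclosing batch interval closes before it is processed, adding on average half the interval of extra latency, while per-event processing adds none. The arrival times and interval below are illustrative numbers, not benchmark results.

```python
# Toy model of the micro-batch latency penalty: an event arriving within a
# batch interval waits until the interval closes before being processed.
def microbatch_latency(arrival_ms, interval_ms):
    """Extra wait (ms) until the enclosing micro-batch closes."""
    batch_end = ((arrival_ms // interval_ms) + 1) * interval_ms
    return batch_end - arrival_ms

arrivals = [5, 120, 340, 990]  # event arrival times in ms (illustrative)
interval = 500                 # 500 ms micro-batch interval
waits = [microbatch_latency(t, interval) for t in arrivals]
print(waits)                   # per-event extra latency under micro-batching
print(sum(waits) / len(waits)) # average added latency, roughly interval / 2
```

A true streaming engine processes each event on arrival, so the equivalent added latency in this model is zero; that difference is exactly what matters for millisecond-level use cases like fraud detection.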

Flink’s architecture is purpose-built for streaming, offering native support for stateful processing, low-latency event handling, and fault-tolerant operation, making it the preferred choice for modern real-time applications. But to be clear: Apache Spark, including Spark Streaming, has its place in data lakes and lakehouses to process analytical workloads.

Flink’s technical capabilities bring tangible business benefits, making it an essential tool for modern enterprises. By providing real-time insights, Flink enables businesses to respond to events as they occur, such as detecting and mitigating fraudulent transactions instantly, reducing losses, and enhancing customer trust.

The support of Flink for both transactional workloads (e.g., fraud detection or payment processing) and analytical workloads (e.g., real-time reporting or trend analysis) ensures versatility across a range of critical business functions. Scalability and resource optimization keep infrastructure costs manageable, even for demanding, high-throughput workloads, while features like checkpointing streamline failure recovery and upgrades, minimizing operational overhead.

Flink stands out with its dual focus on streaming ETL for data integration and building business applications powered by real-time analytics. Its rich APIs for Java, Python, and SQL make it easy for developers to implement complex workflows, accelerating time-to-market for new applications.

Trend 5: Data Streaming for Real-Time Predictive AI and GenAI

Data streaming has powered AI/ML infrastructure for many years because of its capabilities to scale to high volumes, process data in real-time, and integrate with transactional (payments, orders, ERP, etc.) and analytical (data warehouse, data lake, lakehouse) systems. My first article about Apache Kafka and Machine Learning was published in 2017: “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka”.

As AI continues to evolve, real-time model inference powered by data streaming is opening up new possibilities for predictive and generative AI applications. By integrating model inference with stream processors such as Apache Flink, businesses can perform on-demand predictions for fraud detection, customer personalization, and more.

The Challenge: Provide Context for AI Applications In Real-Time

Traditional batch-based AI inference is too slow for many applications, delaying responses and leading to missed opportunities or wrong business decisions. To fully harness AI in real-time, businesses need to embed model inference directly within streaming pipelines.

Generative AI (GenAI) demands new design patterns like Retrieval Augmented Generation (RAG) to ensure accuracy, relevance, and reliability in its outputs. Without data streaming, RAG struggles to provide large language models (LLMs) with the real-time, domain-specific context they need, leading to outdated or hallucinated responses. Context is essential to ensure that LLMs deliver accurate and trustworthy outputs by grounding them in up-to-date and precise information.
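
A minimal sketch of this grounding step, assuming a hypothetical event format and prompt template: the freshest streamed events are kept in a bounded buffer and injected into the prompt before the LLM is called, so the model always answers against current context rather than stale training data.

```python
# Hedged sketch of the RAG grounding pattern: before calling an LLM, the
# prompt is enriched with the freshest domain events from a stream.
# The event strings and prompt template are illustrative assumptions.
from collections import deque

class ContextBuffer:
    """Keeps the N most recent domain events as grounding context."""
    def __init__(self, max_events=3):
        self.events = deque(maxlen=max_events)

    def update(self, event):
        self.events.append(event)

    def build_prompt(self, question):
        context = "\n".join(self.events)
        return f"Context:\n{context}\n\nQuestion: {question}"

buf = ContextBuffer()
for event in ["order 123 shipped", "order 123 delayed at customs",
              "order 123 released"]:
    buf.update(event)

print(buf.build_prompt("Where is order 123?"))
```

In a real pipeline this buffer would be fed by a Kafka consumer or a Flink job, and retrieval would typically go through a vector database rather than a simple in-memory deque; the sketch only shows why fresh context changes the answer the LLM can give.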

GenAI Remote Model Inference with Stream Processing using Apache Kafka and Flink

Apache Flink enables real-time model inference by connecting data streams to external AI models through APIs. This setup allows companies to use centralized model servers for inference, providing flexibility and scalability while keeping data streams fast and responsive. By processing data in real-time, Flink ensures that generative AI models operate with the most current and relevant context, reducing errors and hallucinations.

Flink’s real-time processing capabilities also support advanced machine learning workflows. This enables use cases like predictive analytics, anomaly detection, and generative AI applications that require instantaneous decision-making. The ability to join live data streams with historical or external datasets enriches the context for model inference, enhancing both accuracy and relevance.

Additionally, Flink facilitates feature extraction and data preprocessing directly within the stream to ensure that the inputs to AI models are optimized for performance. This seamless integration with model servers and vector databases allows organizations to scale their AI systems effectively, leveraging real-time insights to drive innovation and deliver immediate business value.

The Business Value: Immediate, Actionable AI Insights

Real-time AI model inference with Flink enables businesses to provide personalized customer experiences, detect fraud as it happens, and perform predictive maintenance with minimal latency. This real-time responsiveness empowers companies to make AI-driven decisions in milliseconds, improving customer satisfaction and operational efficiency.

By integrating Flink with event-driven architectures like Apache Kafka, businesses can ensure that AI systems are always fed with up-to-date and trustworthy data, further enhancing the reliability of predictions.

The integration of Flink and data streaming offers a clear path to measurable business impact. By aligning real-time AI capabilities with organizational goals, businesses can drive innovation while reducing operational costs, such as automating customer support to lower reliance on service agents.

Furthermore, Flink’s ability to process and enrich data streams at scale supports strategic initiatives like hyper-personalized marketing or optimizing supply chains in real-time. These benefits directly translate into enhanced competitive positioning, faster time-to-market for AI-driven solutions, and the ability to make more confident, data-driven decisions at the speed of business.

Trend 6: Becoming a Data Streaming Organization

To harness the full potential of data streaming, companies are shifting toward structured, enterprise-wide data streaming strategies. Moving from a tactical, ad-hoc approach to a cohesive top-down strategy enables businesses to align data streaming with organizational goals, driving both efficiency and innovation.

The Challenge: Fragmented Data Streaming Efforts

Many companies face challenges due to disjointed streaming efforts, leading to data silos and inconsistencies that prevent them from reaping the full benefits of real-time data processing. At Confluent, we call this the enterprise adoption barrier:

Data Streaming Maturity Model - The Enterprise Adoption Barrier
Source: Confluent

This fragmentation results in inefficiencies, duplication of efforts, and a lack of standardized processes. Without a unified approach, organizations struggle with:

  • Data Silos: Limited data sharing across teams creates bottlenecks for broader use cases.
  • Inconsistent Standards: Different teams often use varying schemas, patterns, and practices, leading to integration challenges and data quality issues.
  • Governance Gaps: A lack of defined roles, responsibilities, and policies results in limited oversight, increasing the risk of data misuse and compliance violations.

These challenges prevent organizations from scaling their data streaming capabilities and realizing the full value of their real-time data investments.

The Solution: Building an Integrated Data Streaming Organization

By adopting a comprehensive data streaming strategy, businesses can create a unified data platform with standardized tools and practices. A dedicated streaming platform team, often called the Center of Excellence (CoE), ensures consistent operations. An internal developer platform provides governed, self-serve access to streaming resources.

Key elements of a data streaming organization include:

  • Unified Platform: Move from disparate tools and approaches to a single, standardized data streaming platform. This includes consistent policies for cluster management, multi-tenancy, and topic naming, ensuring a reliable foundation for data streaming initiatives.
  • Self-Service: Provide APIs, UIs, and other interfaces for teams to onboard, create, and manage data streaming resources. Self-service capabilities ensure governed access to topics, schemas, and streaming capabilities, empowering developers while maintaining compliance and security.
  • Data as a Product: Adopt a product-oriented mindset where data streams are treated as reusable assets. This includes formalizing data products with clear contracts, ownership, and metadata, making them discoverable and consumable across the organization.
  • Alignment: Define clear roles and responsibilities, from platform operators and developers to data product owners. Establishing an enterprise-wide data streaming function ensures coordination and alignment across teams.
  • Governance: Implement automated guardrails for compliance, quality, and access control. This ensures that data streaming efforts remain secure, trustworthy, and scalable.
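
As a concrete example of such a guardrail, a platform team might validate topic names against an organization-wide convention before allowing creation. The naming scheme below (domain.dataproduct.vN) is a hypothetical example of such a convention, not a standard.

```python
# Illustrative governance guardrail: validate topic names against an
# organization-wide naming convention before a topic may be created.
# The convention <domain>.<dataproduct>.v<N> is a hypothetical example.
import re

TOPIC_PATTERN = re.compile(r"^[a-z]+\.[a-z0-9-]+\.v[0-9]+$")

def validate_topic(name):
    """Return True if the topic name follows the naming convention."""
    return bool(TOPIC_PATTERN.fullmatch(name))

for topic in ["payments.transactions.v1", "TestTopic", "hr.payroll.v2"]:
    print(topic, validate_topic(topic))
```

In practice such a check would live in the self-service layer (an API or UI in front of the cluster), so every team gets the same automated enforcement instead of relying on manual review.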

The Business Value: Consistent, Scalable, and Agile Data Streaming

Becoming a Data Streaming Organization unlocks significant value by turning data streaming into a strategic asset. The benefits include:

  • Enhanced Agility: A unified platform reduces time-to-market for new data-driven products and services, allowing businesses to respond quickly to market trends and customer demands.
  • Operational Efficiency: Streamlined processes and self-service capabilities reduce the overhead of managing multiple tools and teams, improving productivity and cost-effectiveness.
  • Scalable Innovation: Standardized patterns and reusable data products enable the rapid development of new use cases, fostering a culture of innovation across the enterprise.
  • Improved Governance: Clear policies and automated controls ensure data quality, security, and compliance, building trust with customers and stakeholders.
  • Cross-Functional Collaboration: By breaking down silos, organizations can leverage data streams across teams, creating a network effect that accelerates value creation.

To successfully adopt a Data Streaming Organization model, companies must combine technical capabilities with cultural and structural change. This involves not just deploying tools but establishing shared goals, metrics, and education to bring teams together around the value of real-time data. As organizations embrace data streaming as a strategic function, they position themselves to thrive in a data-driven world.

Embracing the Future of Data Streaming

As data streaming continues to mature, it has become the backbone of modern digital enterprises. It enables real-time decision-making, operational efficiency, and transformative AI applications. Trends such as the commoditization of Kafka, the adoption of the Kafka protocol, BYOC deployment models, and the rise of Flink as the standard for stream processing demonstrate the rapid evolution and growing importance of this technology. These innovations not only streamline infrastructure but also empower organizations to harness real-time insights, foster agility, and remain competitive in the ever-changing digital landscape.

These advancements in data streaming present a unique opportunity to redefine data strategy. Leveraging data streaming as a central pillar of IT architecture allows businesses to break down silos, integrate machine learning into critical workflows, and deliver unparalleled customer experiences. The convergence of data streaming with generative AI, particularly through frameworks like Flink, underscores the importance of embracing a real-time-first approach to data-driven innovation.

Looking ahead, organizations that invest in scalable, secure, and strategic data streaming infrastructures will be positioned to lead in 2025 and beyond. By adopting these trends, enterprises can unlock the full potential of their data, drive business transformation, and solidify their place as leaders in the digital era. The journey to set data in motion is not just about technology—it’s about building the foundation for a future where real-time intelligence powers every decision and every experience.

What trends do you see for data streaming? Which ones are your favorites? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Top Trends for Data Streaming with Apache Kafka and Flink in 2025 appeared first on Kai Waehner.

]]>
When NOT to Use Apache Kafka? (Lightboard Video) https://www.kai-waehner.de/blog/2024/03/26/when-not-to-use-apache-kafka-lightboard-video/ Tue, 26 Mar 2024 06:45:11 +0000 https://www.kai-waehner.de/?p=6262 Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post contains a lightboard video that gives you a twenty-minute explanation of the DOs and DONTs.

The post When NOT to Use Apache Kafka? (Lightboard Video) appeared first on Kai Waehner.

]]>

When NOT to Use Apache Kafka?

Disclaimer: This blog post shares a lightboard video to watch an explanation about when NOT to use Apache Kafka. For a much more detailed and technical blog post with various use cases and case studies, check out this blog post from 2022 (which is still valid today – whenever you read it).

What is Apache Kafka, and what is it NOT?

Kafka is often misunderstood. For instance, I still hear way too often that Kafka is a message queue. Part of the reason is that some vendors only pitch it for a specific problem (such as data ingestion into a data lake or data warehouse) to sell their products. So, in short:

Kafka is…

  • a scalable real-time messaging platform to process millions of messages per second.
  • a data streaming platform for massive volumes of big data analytics and small volumes of transactional data processing.
  • a distributed storage layer that provides true decoupling for backpressure handling, support of various communication protocols, and replayability of events with guaranteed ordering.
  • a data integration framework (Kafka Connect) for streaming ETL.
  • a data processing framework (Kafka Streams) for continuous stateless or stateful stream processing.

This combination of characteristics in a single platform makes Kafka unique (and successful).
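The storage-based decoupling and replayability from the list above can be sketched in a few lines. This is a toy model of one Kafka partition with per-consumer offsets, not a real Kafka API; all names are illustrative:

```python
class Log:
    """An ordered, append-only sequence of events (like one Kafka partition)."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1  # offset of the stored event


class Consumer:
    """Each consumer tracks its own offset, so producers and consumers are
    truly decoupled: slow consumers just lag behind, nothing is lost."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_events=10):
        batch = self.log.events[self.offset:self.offset + max_events]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        self.offset = offset  # replay: rewind to any retained offset


log = Log()
for i in range(5):
    log.append(f"event-{i}")

fast = Consumer(log)
slow = Consumer(log)

assert fast.poll(10) == [f"event-{i}" for i in range(5)]
assert slow.poll(2) == ["event-0", "event-1"]  # backpressure: reads at its own pace

fast.seek(0)  # replayability with guaranteed ordering
assert fast.poll(10)[0] == "event-0"
```

The key point: the broker stores events durably and in order, while every consumer decides independently how fast to read and whether to rewind.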

Kafka is NOT…

  • a proxy for millions of clients (like mobile apps) – but Kafka-native proxies (like REST or MQTT) exist for some use cases.
  • an API Management platform – but these tools are usually complementary and used for the creation, life cycle management, or the monetization of Kafka APIs.
  • a database for complex queries and batch analytics workloads – but good enough for transactional queries and relatively simple aggregations (especially with ksqlDB).
  • an IoT platform with features such as device management – but direct Kafka-native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the appropriate approach for (some) use cases.
  • a technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different kind of software!

For these reasons, Kafka is complementary, not competitive, to these other technologies. Choose the right tool for the job and combine them!

Lightboard Video: When NOT to use Apache Kafka

The following video explores the key concepts of Apache Kafka. Afterwards, the DOs and DONTs of Kafka show how to complement data streaming with other technologies for analytics, APIs, IoT, and other scenarios.

Data Streaming Vendors and Cloud Services

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations.

Plenty of vendors offer Kafka platforms and cloud services. Many complementary open source stream processing frameworks like Apache Flink and related cloud offerings emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to get market share leveraging the Kafka protocol. Check out the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook on potential new entrants in 2025.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

Apache Kafka is a Data Streaming Platform: Combine it with other Platforms when needed!

Meanwhile, over 150,000 organizations use Apache Kafka. The Kafka protocol is the de facto standard for many open source frameworks, commercial products, and serverless cloud SaaS offerings.

However, Kafka is not an all-rounder for every use case. Many projects combine Kafka with other technologies, such as databases, data lakes, data warehouses, IoT platforms, and so on. Additionally, Apache Flink is becoming the de facto standard for stream processing (but Kafka Streams is not going away and is the better choice for specific use cases).

Where do you (not) use Apache Kafka? What other technologies do you combine Kafka with? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When NOT to Use Apache Kafka? (Lightboard Video) appeared first on Kai Waehner.

]]>
When NOT to choose Amazon MSK Serverless for Apache Kafka? https://www.kai-waehner.de/blog/2022/08/30/when-not-to-choose-amazon-msk-serverless-for-apache-kafka/ Tue, 30 Aug 2022 05:54:21 +0000 https://www.kai-waehner.de/?p=4777 Apache Kafka became the de facto standard for data streaming. Various cloud offerings emerged and improved in the last years. Amazon MSK Serverless is the latest Kafka product from AWS. This blog post looks at its capabilities to explore how it relates to "the normal" partially managed Amazon MSK, when the serverless version is a good choice, and when other fully-managed cloud services like Confluent Cloud are the better option.

The post When NOT to choose Amazon MSK Serverless for Apache Kafka? appeared first on Kai Waehner.

]]>

Is Amazon MSK Serverless for Apache Kafka a Self-Driving Car or just a Car Engine

Disclaimer: I work for Confluent. While AWS is a strong strategic partner, it also offers the competitive offering Amazon MSK. This post is not about comparing every feature but explaining the concepts behind the alternatives. Read articles and docs from the different vendors to make your own evaluation and decision. View this post as a list of criteria to not forget important aspects in your cloud service selection.

What is a fully-managed Kafka cloud offering?

Before we get started looking at Kafka cloud services like Amazon MSK and Confluent Cloud, it is important to understand what fully managed actually means:

Fully managed and serverless Apache Kafka

Make sure to evaluate the technical solution. Most Kafka cloud solutions market their offering as “fully managed”. However, almost all Kafka cloud offerings are only partially managed! In most cases, the customer must operate the Kafka infrastructure, fix bugs, and optimize scalability and performance.

Cloud-native Apache Kafka re-engineered for the cloud

Operating Apache Kafka as a fully-managed offering in the cloud requires several additional components to the core of open-source Kafka. A cloud-native Kafka SaaS has features like:

  • The latest stable version with non-disruptive rolling upgrades
  • Elastic scale (up and down!)
  • Self-balancing clusters that take over the complexity and risk of rebalancing partitions across Kafka brokers
  • Tiered storage for cost-efficient long-term storage and better scalability (as the cold storage does not require a rebalancing of partitions and other complex operations tasks)
  • Complete solution “on top of the infrastructure”, including connectors, stream processing, security, and data governance – all in a single fully-managed SaaS
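To see why self-balancing clusters matter, consider how much data has to move when a cluster scales out. The following is an illustrative sketch, not Kafka's actual partition assignment algorithm: partitions are placed round-robin on brokers, and we count how many must be migrated when a broker is added — exactly the risky, bandwidth-hungry operation a self-balancing cluster automates for you:

```python
def assign(partitions, brokers):
    """Map partition id -> broker id, round-robin (toy placement strategy)."""
    return {p: p % brokers for p in range(partitions)}


before = assign(12, 3)  # 12 partitions spread over 3 brokers
after = assign(12, 4)   # scale out to 4 brokers

# Every partition whose owner changes must be physically copied to a new broker.
moved = [p for p in before if before[p] != after[p]]
print(f"{len(moved)} of 12 partitions must be moved")  # -> 9 of 12
```

Even in this tiny example, most partitions change owners. On real clusters with terabytes per partition, doing this rebalancing manually (or with Cruise Control that you operate yourself) is significant operational work and risk.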

To learn more about building a cloud-native Kafka service, I highly recommend reading the following paper: “The Cloud-Native Chasm: Lessons Learned from Reinventing Apache Kafka as a Cloud-Native, Online Service”.

Comparison of Apache Kafka products and cloud services

Apache Kafka became the de facto standard for data streaming. The open-source community is vast. Various vendors added Kafka and related tooling to their offerings or provide a Kafka cloud service. I wrote a blog post in 2021: “Comparison of Open Source Apache Kafka vs. Vendors including Confluent, Cloudera, Red Hat, Amazon MSK”:

Comparison of Apache Kafka Products and Services including Confluent Amazon MSK Cloudera IBM Red Hat

The article uses a car analogy – from the motor engine to the self-driving car – to explore the different Kafka offerings available on the market. I also cover a few other vehicles, meaning (partly) Kafka-compatible technologies. The goal is not a feature-by-feature comparison (that would be outdated the day after the publication). Instead, the intention is to educate about the different deployment models, product strategies, and trade-offs from the options.

The above post is worth reading to understand how comparing different Kafka solutions makes sense. However, products develop and innovate… Tech comparisons get outdated quickly. In the meantime, AWS released a new product: Amazon MSK Serverless. This blog post explores what it is, when to use it, and how it differs from other Kafka products. In particular, it compares Amazon MSK (the partially managed service) with Confluent Cloud (a fully-managed competitor to Amazon MSK Serverless).

How does Amazon MSK Serverless fit into the Kafka portfolio?

Keeping the car analogy of my previous post, I wonder: Is it a self-driving car, a complete car you drive by yourself, or just a car engine to build your own car? Interestingly, you can argue for all three. 🙂 Let’s explore this in the following sections.

Is Amazon MSK Serverless a Self-Driving Car of Apache Kafka

Introducing Amazon MSK Serverless

Amazon MSK Serverless is a cluster type for Amazon MSK to run Apache Kafka without having to manage and scale cluster capacity. MSK Serverless automatically provisions and scales compute and storage resources. Thus, you can use Apache Kafka on demand and pay for the data you stream and retain.
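"Pay for the data you stream and retain" can be made concrete with a back-of-the-envelope calculation. The dimensions below (cluster-hours, partition-hours, storage, data in/out) mirror how serverless Kafka offerings are typically metered, but ALL rates are hypothetical placeholders for illustration, not AWS list prices — check the current MSK Serverless pricing page for real numbers:

```python
RATES = {  # hypothetical $ rates, for illustration only
    "cluster_hour": 0.75,
    "partition_hour": 0.0015,
    "storage_gb_month": 0.10,
    "data_in_gb": 0.10,
    "data_out_gb": 0.05,
}


def monthly_cost(partitions, storage_gb, in_gb, out_gb, hours=730):
    """Rough monthly bill for a serverless Kafka cluster (730 h ~ one month)."""
    return round(
        hours * RATES["cluster_hour"]
        + hours * partitions * RATES["partition_hour"]
        + storage_gb * RATES["storage_gb_month"]
        + in_gb * RATES["data_in_gb"]
        + out_gb * RATES["data_out_gb"],
        2,
    )


print(monthly_cost(partitions=50, storage_gb=500, in_gb=1000, out_gb=2000))  # -> 852.25
```

The takeaway is the shape of the model, not the numbers: with steady, predictable throughput, usage-based pricing can exceed a provisioned cluster, which is exactly why AWS itself does not recommend the serverless flavor for every workload (more on that below).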

Amazon MSK is one of the hundreds of cloud services that AWS provides. AWS is a one-stop shop for all cloud needs. That’s definitely a key strength of AWS (and similar to Azure and GCP).

Amazon MSK Serverless is built to solve the problems that come with Amazon MSK (the partially managed Kafka service that is marketed as a fully-managed solution even though it is not): A lot of hidden ops, infra, and downtime costs. This AWS podcast has a great episode that introduces Amazon MSK Serverless and when to use it as a replacement for Amazon MSK.

What Amazon does NOT tell you about MSK Serverless

AWS has great websites, documentation, and videos for its cloud services. This is no different for Amazon MSK. However, a few important details are not obvious… 🙂 Let’s explore a few key points to make sure everybody understands what Amazon MSK Serverless is and what it is not.

Amazon MSK Serverless is incomplete Kafka

If you follow my blogs, then this might be boring. Despite that, too many people think about Kafka as a message queue and data transportation pipeline. That’s what it is, but Kafka is much more:

  • Real-time messaging at any scale
  • Data integration with Kafka Connect
  • Data processing (aka stream processing) with Kafka Streams (or 3rd party Kafka-native components like KSQL)
  • True decoupling (the most underestimated feature of Kafka because of its built-in storage capabilities) and replayability of events with flexible retention times
  • Data governance with service contracts using Schema Registry (to be fair, this is not part of open source Kafka, but a separate component and accessible from GitHub or by vendors like Confluent or Red Hat – but it is used in almost all serious Kafka projects)
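The last bullet, data governance with service contracts, boils down to a simple idea: producers validate events against a shared schema before publishing, so downstream consumers can rely on the data shape. The following toy validator illustrates the principle only — the real Schema Registry works with Avro/Protobuf/JSON Schema and a REST API, and the field names here are made up:

```python
SCHEMA = {"order_id": int, "amount": float, "currency": str}  # hypothetical contract


def validate(event, schema=SCHEMA):
    """Return True if the event satisfies the service contract."""
    missing = [f for f in schema if f not in event]
    wrong_type = [f for f, t in schema.items()
                  if f in event and not isinstance(event[f], t)]
    return not missing and not wrong_type


assert validate({"order_id": 42, "amount": 9.99, "currency": "EUR"})
assert not validate({"order_id": "42", "amount": 9.99, "currency": "EUR"})  # wrong type
assert not validate({"order_id": 42})                                       # missing fields
```

In a real deployment, this check happens in the serializer against a registered, versioned schema, which is what makes independent producers and consumers evolve safely.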

As I won’t repeat myself, here are a few articles explaining why Kafka is more than a message queue like you find it in Amazon MSK Serverless:

TL;DR: AWS provides a cloud service for every problem. You can glue them together to build a solution. However, similar to a monolithic application that provides inflexibility in a single package, a mesh of too many independent glued services using different technologies is also very hard to operate and maintain. And the cost of so many services plus networking and data transfer will bring many surprises in such a setup.

You should ask yourself a few questions:

  • How do you implement data integration and business logic with Amazon MSK Serverless?
  • What are the consequences regarding cost, SLAs, end-to-end latency, and delivery guarantees when combining Amazon MSK Serverless with various other products like Amazon Kinesis Data Analytics, AWS Glue, AWS Data Pipeline, or a 3rd party integration tool?
  • What is your security and data governance strategy around streaming data? How do you build an event-based data hub that enforces compliant communication between data producers and independent downstream consumers?

Spoilt for Choice: Amazon MSK and Amazon MSK Serverless are different products

Amazon MSK is NOT fully-managed. It is partially managed. After AWS provisions the brokers, you need to deploy, operate, and monitor Kafka brokers and Kafka Connect connectors, and handle rebalancing with the open source tool Cruise Control. Check out AWS’ latest MSK sizing post: “Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost”. Seriously? A ten-page, very technical article explaining how to operate a “fully-managed cloud Kafka service”?

You might think that Amazon MSK Serverless is the successor of Amazon MSK to solve these problems. However, there are now two products to choose from: Amazon MSK and Amazon MSK Serverless.

Amazon does NOT recommend using Amazon MSK Serverless for all use cases! It is recommended if you don’t know the workloads or if they often change in volume.
Amazon recommends “the normal” Amazon MSK for predictable workloads, as it is more cost-effective (and because MSK Serverless is not workable for them due to its many tough limitations). MSK Connect is also not supported yet; it is coming at some point in the future.

It is totally okay to provide different products for different use cases. Confluent also has different offerings for different SLAs and functional requirements in its cloud offering. Multi-tenant basic clusters and dedicated clusters are available, but you never have to self-manage the cluster or fix bugs or performance issues yourself.

You should ask yourself a few questions:

  • Which projects require Amazon MSK and which require Amazon MSK Serverless?
  • How will the project scale as your business grows?
  • What’s the migration/ upgrade plan if your workload exceeds MSK Serverless partition/retention limits?
  • What is the total cost of ownership (TCO) for MSK plus all the other cloud services I need to combine it with?

Amazon MSK Serverless excludes Kafka support

Amazon MSK service level agreements say: “The Service Commitment DOES NOT APPLY to any unavailability, suspension or termination … caused by the underlying Apache Kafka or Apache Zookeeper engine software that leads to request failures …”

Amazon MSK Serverless is part of the Amazon MSK product and has the same limitation. Excluding Kafka support from the MSK offering is (or should be) a blocker for any serious data streaming project!

Not much more to add here… Do you really want to buy a product that excludes support for its core capability? Please also ask your manager if they agree and will take the risk.

You should ask yourself a few questions:

  • Who is responsible and takes the risk if you hit a Kafka issue in your project using Amazon MSK or Amazon MSK Serverless?
  • How do you react to security incidents related to the Apache Kafka open source project?
  • How do you fix performance or scalability issues (on both the client and server side)?

When NOT to use Amazon MSK Serverless?

Let’s go back to the car analogy. Is Amazon MSK Serverless a self-driving car?

Obviously, Amazon MSK Serverless is self-driving. That’s what a serverless product is. Similar to Amazon S3 for object storage or AWS Lambda for serverless functions.

However, Amazon MSK Serverless is NOT a complete car! It does not provide enterprise support for its functionality. And it does not provide more than just the core of data streaming.

Therefore, Amazon MSK Serverless is a great self-driving AWS product for some use cases. But you should evaluate the following facts before deciding for or against this cloud service.

24/7 Enterprise support for the product

AWS excludes Kafka support from its Amazon MSK product. Amazon MSK Serverless is part of Amazon MSK and uses the same SLAs.

I am amazed at how many enterprises use Amazon MSK without reading the SLAs. Most people are not aware that Kafka support is excluded from the product.

This makes Amazon MSK Serverless a car engine, not a complete car, right? Do you really want to build your own car and take over the burden and risk of failures while driving on the street?

If you need to deploy mission-critical workloads with 24/7 SLAs, you can stop reading and qualify out Amazon MSK (including Amazon MSK Serverless) until AWS adds serious SLAs to this product in the future.

Complete data streaming platform

AWS has a service for everything. You can glue them together. In our car analogy, that would mean many different cars and vehicles in your enterprise architecture. Most of us learned the hard way that distributed microservices are no free lunch.

The monolithic data lake (now pitched as a lakehouse) from vendors like Databricks and Snowflake is no better approach. Use the right technology for each problem or data product. Finding the right mix between focus and independence is crucial. Kafka’s role is the central or decentralized real-time data hub to transport events. This includes data integration and processing, as well as decoupling systems from each other.

A modern data flow requires a simple, reliable and governed way to integrate and process data. Leveraging Kafka’s ecosystem like Kafka Connect and Kafka Streams enables mission-critical end-to-end latency and SLAs in a cost-efficient infrastructure. Development, operations, and monitoring are much harder and more costly if you glue together several services to build a real-time data hub.

However, Kafka is not a silver bullet. Hence, you need to understand when NOT to use Kafka and how it relates to data warehouses, data lakes, and other applications.

After a long introduction to this aspect, long story short: If you use Amazon MSK Serverless, it is the data ingestion component in your enterprise architecture. There are no fully managed components other than Kafka and no native integrations with 1st party AWS cloud services like S3 or Redshift, or 3rd party cloud services like Snowflake, Databricks, or MongoDB. You must combine Amazon MSK Serverless with several other AWS services for event processing and storage. Additionally, connectivity needs to be implemented and operated by your project team using Kafka Connect connectors, other 1st or 3rd party ETL tools, or custom glue code.

Amazon MSK Serverless only supports AWS Identity and Access Management (IAM) authentication, which limits you to Java clients only. There is no way to use the open source clients for other programming languages. Python, C++, .NET, Go, JavaScript, etc. are not supported with Amazon MSK Serverless.

MSK Connect allows deploying Kafka Connect connectors (available open source, licensed from other vendors, or self-built) into this platform. Similar to Amazon MSK, this is not a fully-managed product. You deploy, operate, and monitor the connectors by yourself. Look at the fully managed connectors in Confluent Cloud to understand the difference. Also, note that AWS only supports the Connect workers, not the connectors themselves, even when they run on MSK Connect.

Event-driven architecture with true decoupling between the microservices

An event-driven architecture powered by data streaming is great as a single integration infrastructure. However, the story goes far beyond this. Modern enterprise architectures leverage principles like microservices, domain-driven design, and data mesh to build decentralized applications and data products.

A streaming data exchange enables such a decentralized architecture with real-time data sharing. A critical capability for such a strategic enterprise component is long-term data storage. It

  • decouples independent applications
  • handles backpressure for slow consumers (like batch systems or request-response web services)
  • enables replayability of historical events (e.g., for a Python consumer in the machine learning platform from data engineers).

Data Mesh with Apache Kafka

The storage capability of Kafka is a key differentiator against message queues like IBM MQ, Rabbit MQ, or AWS SQS. Retention time is an important feature to set the right storage options per Kafka topic. Confluent makes Kafka even better by providing Tiered Storage to separate storage from compute for a much more cost-efficient and scalable solution with infinite storage capabilities.

Amazon MSK Serverless has a limited retention time of 24 hours. This is good enough for many data ingestion use cases but not for building a real-time data mesh across business units or even across organizations. Another tough limitation of Amazon MSK Serverless is the cap of 120 partitions. Not really a limit that allows building a strategic platform around it.
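Why a short retention window hurts replayability is easy to show with a toy model (this is a simulation, not Kafka code): events older than the retention period are pruned, so a consumer that lags too far behind, or a new consumer that wants to replay history, simply cannot read them anymore.

```python
RETENTION_HOURS = 24  # the original MSK Serverless limit discussed above


def prune(log, now_h, retention_h=RETENTION_HOURS):
    """Keep only events newer than the retention window.

    log is a list of (timestamp_in_hours, event) tuples."""
    return [(ts, e) for ts, e in log if now_h - ts < retention_h]


log = [(0, "a"), (10, "b"), (30, "c"), (47, "d")]
readable = prune(log, now_h=48)

assert readable == [(30, "c"), (47, "d")]  # events "a" and "b" are gone forever
```

In real Kafka, this window is the per-topic `retention.ms` configuration; a data mesh or machine learning platform that replays weeks of history needs it set far beyond 24 hours (or tiered storage that makes long retention affordable).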

As Amazon MSK Serverless is a new product, expect the limitations to change and improve over time. Check the docs for updates. UPDATE Q1 2023: Amazon MSK Serverless added unlimited retention time and support for more partitions. That’s excellent news for this service. With this update, if retention time is your critical criterion, Amazon MSK Serverless has been a stronger option since 2023. However, check the storage costs and compare different cloud offerings for this.

But anyway, these limitations prove how hard it is to build a fully-managed Kafka offering (like Confluent Cloud) compared to a partially managed Kafka offering (like “the normal” Amazon MSK).

Hybrid AWS and multi-cloud Kafka deployments

The most obvious point: Amazon MSK Serverless is only a reasonable discussion if you run your apps in the public AWS cloud. For anything else, like multi-cloud with Azure or GCP, AWS edge offerings like AWS Outposts or Wavelength, hybrid environments, or edge deployments like a factory or retail store, Amazon MSK Serverless is no option.

If you need to deploy outside the public AWS cloud, check my comparison of Kafka offerings, including Confluent, IBM, Red Hat, and Cloudera.

I want to emphasize that no product or service is 100% cloud agnostic. For instance, building Confluent Cloud on AWS, Azure, and GCP includes unique challenges under the hood. Confluent Cloud is built on Kubernetes. Hence, the template and many automation mechanisms can be reused across cloud vendors. But storage, compute, pricing, support, and many other characteristics and features differ at each cloud service provider.

Having said this, you leverage a SaaS like Confluent Cloud with no knowledge of or access to the technical infrastructure, so you don’t see these issues under the hood. On the developer level, you produce and consume messages with the Kafka API and configure additional features like fully-managed connectors, data governance, or cluster linking. All the operations complexity is handled by the vendor, no matter which cloud you run on.

Coopetition: The winners are AWS and Confluent Cloud

The reason for this post was the evolution of the Amazon MSK product line. Hence, if you read this a year later, the product features and limitations might look completely different once again. Use blog posts like this to understand how to evaluate different solutions and SaaS offerings. Then do your own accurate research before making a product decision.

Amazon MSK Serverless is a great new AWS service that helps AWS customers with some types of projects. But it has tough limitations for other projects. Additionally, Amazon MSK (including Amazon MSK Serverless) excludes Kafka support! And it is not a complete data streaming platform. Be careful not to create a mess of glue code between tens of serverless cloud services and applications. Confluent Cloud is the much more sophisticated fully-managed Kafka cloud offering (on AWS and everywhere else). I am not saying this because I am a Confluent employee but because almost everybody agrees on this 🙂 And it is not really a surprise, as Confluent focuses solely on data streaming with 2,000 employees and employs many full-time committers to the Apache Kafka open source project. Amazon has zero, by the way 🙂

By the way: Did you know you can use your AWS credits to consume Confluent Cloud like any other native AWS service? This is because of the strong partnership between Confluent and AWS. Yes, there is coopetition. That’s what the world looks like today…

Confluent Cloud provides a complete cloud-native platform including 99.99% SLA, fully managed connectors and stream processing, and maybe most interesting to readers of this post, integration with AWS services (S3, Redshift, Lambda, Kinesis, etc.) plus AWS security and networking (VPC Peering, Private Link, Transit Gateway, etc.). Confluent and AWS work closely together on hybrid deployments, leveraging AWS edge services like AWS Wavelength for 5G scenarios.

Which Kafka cloud service do you use today? What are your pros and cons? Do you plan a migration – e.g., from Amazon MSK to Confluent Cloud or from open source Kafka to Amazon MSK Serverless? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When NOT to choose Amazon MSK Serverless for Apache Kafka? appeared first on Kai Waehner.

]]>
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies? https://www.kai-waehner.de/blog/2022/06/27/data-warehouse-vs-data-lake-vs-data-streaming-friends-enemies-frenemies/ Mon, 27 Jun 2022 12:35:22 +0000 https://www.kai-waehner.de/?p=4533 The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let's explore this dilemma in a blog series. This is part 1: Data Warehouse vs. Data Lake vs. Data Streaming - Friends, Enemies, Frenemies?

The post Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies? appeared first on Kai Waehner.

]]>
The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 1: Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?

Data Warehouse vs Data Lake vs Data Streaming Comparison

Blog Series: Data Warehouse vs. Data Lake vs Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using a data warehouse, data lake, and data streaming together:

  1. THIS POST: Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

I also created a short ten-minute video that explains the differences between data streaming and a lakehouse (and why they are complementary):

The value of data: Transactional vs. analytical workloads

The last decade offered many articles, blogs, and presentations about data becoming the new oil. Today, nobody questions that data-driven business processes change the world and enable innovation across industries.

Data-driven business processes require both real-time data processing and batch processing. Think about the following flow of events across applications, domains, and organizations:

A stream of events - the business process for transactions and analytics

An event is business information or technical information. Events happen all the time. A business process in the real world requires the correlation of various events.

How critical is an event?

The criticality of an event defines the outcome. Potential impacts can be increased revenue, reduced risk, reduced cost, or improved customer experience.

  • Business transaction: Ideally, zero downtime and zero data loss. Example: Payments need to be processed exactly once.
  • Critical analytics: Ideally, zero downtime. Data loss of a single sensor event might be okay. Alerting on the aggregation of events is more critical. Example: Continuous monitoring of IoT sensor data and a (predictive) machine failure alert.
  • Non-critical analytics: Downtime and data loss are not good but do not kill the whole business. It is an accident, but not a disaster. Example: Reporting and business intelligence to forecast demand.

When to process an event?

Real-time usually means end-to-end processing within milliseconds or seconds. If you don’t need real-time decisions, batch processing (i.e., after minutes, hours, days) or on-demand (i.e., request-reply) is sufficient.

  • Business transactions are often real-time: A transaction like a payment usually requires real-time processing (e.g., before the customer leaves the store; before you ship the item; before you leave the ride-hailing car).
  • Critical analytics is usually real-time: Critical analytics very often requires real-time processing (e.g., detecting fraud before it happens; predicting a machine failure before it breaks; upselling to customers before they leave the store).
  • Non-critical analytics is usually not real-time: Finding insights in historical data is usually done in a batch process using paradigms like complex SQL queries, map-reduce, or complex algorithms (e.g., reporting; model training with machine learning algorithms; forecasting).
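The classification above can be sketched as a simple dispatch function. This is an illustrative Python sketch only; the category names and processing modes are assumptions derived from the examples in this post, not part of any real product:

```python
from dataclasses import dataclass

@dataclass
class Event:
    payload: dict
    category: str  # "transaction", "critical-analytics", or "non-critical-analytics"

def processing_mode(event: Event) -> str:
    """Pick a processing path based on the criticality of the event."""
    if event.category == "transaction":
        # e.g., a payment: zero downtime, zero data loss
        return "real-time, exactly-once"
    if event.category == "critical-analytics":
        # e.g., a predictive machine failure alert: losing one sensor event is okay
        return "real-time, at-least-once"
    # e.g., reporting or model training: minutes, hours, or days are fine
    return "batch"

print(processing_mode(Event({"amount": 42}, "transaction")))  # real-time, exactly-once
```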

With these basics about processing events, let’s understand why storing all events in a single, central data lake is not the solution to all problems.

Flexibility through decentralization and best-of-breed

The traditional approach for a data warehouse or data lake is to ingest all data from all sources into a central storage system for centralized data ownership. The sky (and your budget) is the limit with current big data and cloud technologies.

However, architectural concepts like domain-driven design, microservices, and data mesh show that decentralized ownership is the right choice for modern enterprise architecture:

Decentralization of the Data Warehouse and Data Lake using Data Streaming

No worries. The data warehouse and the data lake are not dead but more relevant than ever before in a data-driven world. Both make sense for many use cases. Even in one of these domains, larger organizations don’t use a single data warehouse or data lake. Selecting the right tool for the job (in your domain or business unit) is the best way to solve the business problem.

There are good reasons people are pleased with Databricks for batch ETL, machine learning, and now even data warehousing, but still prefer a lightweight cloud SQL database like AWS RDS (fully managed PostgreSQL) for some use cases.

There are good reasons why happy Splunk users also ingest some data into Elasticsearch instead, and why Cribl is gaining more and more traction in this space, too.

There are good reasons some projects leverage Apache Kafka as the database. Storing data long-term in Kafka only makes sense for specific use cases (like compacted topics, key/value queries, and streaming analytics). Kafka does not replace other databases or data lakes.
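A compacted topic is what makes the "Kafka as key/value store" use case work: Kafka retains only the latest event per key after compaction runs. Here is a minimal, broker-free Python simulation of that retention rule (the log and key names are purely illustrative):

```python
def compact(log):
    """Simulate log compaction: keep only the most recent value for each key."""
    latest = {}
    for key, value in log:
        # later records for the same key overwrite earlier ones
        latest[key] = value
    return latest

log = [("user-1", "created"), ("user-2", "created"), ("user-1", "updated")]
print(compact(log))  # {'user-1': 'updated', 'user-2': 'created'}
```

After compaction, a consumer reading the topic from the beginning still sees the current state for every key, which is exactly the property a key/value query needs.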

Choose the right tool for the job with decentralized data ownership!

With this in mind, let’s explore the use cases and added value of a modern data warehouse (and how it relates to data lakes and the new buzzword, the lakehouse).

Data warehouse: Reporting and business intelligence with data at rest

A data warehouse (DWH) provides capabilities for reporting and data analysis. It is considered a core component of business intelligence.

Use cases for data at rest

No matter whether you use a product called a data warehouse, a data lake, or a lakehouse, the data is stored at rest for further processing:

  • Reporting and Business Intelligence: Fast and flexible availability of reports, statistics, and key figures, for example, to identify correlations between market and service offering
  • Data Engineering: Integration of data from differently structured and distributed data sets to enable the identification of hidden relationships between data
  • Big Data Analytics and AI / Machine Learning: Global view of the source data and thus overarching evaluations to find unknown insights to improve business processes and interrelationships.

Some readers might say: Only the first is a use case for a data warehouse, and the other two are for a data lake or lakehouse! It all depends on the definition.

Data warehouse architecture

DWHs are central repositories of integrated data from disparate sources. They store historical data in one storage system. The data is stored at rest, i.e., saved for later analysis and processing. Business users analyze the data to find insights.

Data Warehouse Architecture

The data is uploaded from operational systems, such as IoT data, ERP, CRM, and many other applications. Data cleansing and data quality assurance are crucial pieces in the DWH pipeline. Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) are the two main approaches to building a data warehouse system. Data marts help to focus on a single subject or line of business within the data warehouse ecosystem.
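The difference between ETL and ELT is only where the transformation happens relative to the load. A toy Python sketch under obvious assumptions (the `cleanse` step and the list-based "warehouse" are hypothetical stand-ins for real data quality tooling and storage):

```python
def cleanse(record):
    """A stand-in for data cleansing / data quality assurance."""
    return {k: v.strip() for k, v in record.items()}

def etl(source, warehouse):
    # Extract, Transform, Load: transform BEFORE loading into the warehouse
    warehouse.extend(cleanse(r) for r in source)

def elt(source, warehouse):
    # Extract, Load, Transform: load raw data first, transform inside the warehouse
    warehouse.extend(source)
    warehouse[:] = [cleanse(r) for r in warehouse]

raw = [{"name": " alice "}, {"name": "bob "}]
w1, w2 = [], []
etl(raw, w1)
elt(raw, w2)
assert w1 == w2 == [{"name": "alice"}, {"name": "bob"}]
```

Both paths end with the same cleansed data; ELT trades early cleanliness for the flexibility of keeping the raw data available in the warehouse.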

The relation of the data warehouse to data lake and lakehouse

The focus of a data warehouse is reporting and business intelligence using structured data. In contrast, the data lake is a synonym for storing and processing raw big data. A data lake was built with technologies like Hadoop, HDFS, and Hive in the past. Today, the data warehouse and the data lake have merged into single solutions: a cloud-native DWH supports big data, and a cloud-native data lake needs business intelligence with traditional tools.

Databricks: The evolution from the data lake to the data warehouse

That’s true for almost all vendors. For example, look at the history of one of the leading big data vendors: Databricks, known for being THE Apache Spark company. The company started as a commercial vendor behind Apache Spark, a big data batch processing platform. The platform was enhanced with (some) real-time workloads using micro-batching. A few milestones later, Databricks is an entirely different company today, focusing on cloud, data analytics, and data warehousing. Databricks’ strategy changed from:

  • open source to the cloud
  • self-managed software to fully-managed serverless offerings
  • focus on Apache Spark to AI / Machine Learning and later added data warehousing capabilities
  • a single product to a vast product portfolio around data analytics, including standardized data formats (“Delta Lake”), governance, ETL tooling (Delta Live Tables), and more.

Vendors like Databricks and AWS also coined a new buzzword for this merge of the data lake, data warehouse, business intelligence, and real-time capabilities: The Lakehouse.

The Lakehouse (sometimes called Data Lakehouse) is nothing new. It combines characteristics of separate platforms. I wrote an article about building a cloud-native serverless lakehouse on AWS using Kafka in conjunction with AWS analytics platforms.

Snowflake: The evolution from the data warehouse to the data lake

Snowflake came from the other direction. It was the first genuinely cloud-native data warehouse available on all major clouds. Today, Snowflake provides many more capabilities beyond the traditional business intelligence spectrum. For instance, data and software engineers have features to interact with Snowflake’s data lake through other technologies and APIs. The data engineer requires a Python interface to analyze historical data, while the software engineer prefers real-time data ingestion and analysis at any scale.

No matter if you build a data warehouse, data lake, or lakehouse: The crucial point is understanding the difference between data in motion and data at rest to find the right enterprise architecture and components for your solution. The following sections explore why a good data warehouse architecture requires both and how they complement each other very well.

Transactional real-time workloads should not run within the data warehouse or data lake! Separation of concerns is critical due to different uptime SLAs, regulatory and compliance laws, and latency requirements.

Data streaming: Supplementing the modern data warehouse with data in motion

Let’s clarify: Data streaming is NOT the same as data ingestion! You can use a data streaming technology like Apache Kafka for data ingestion into a data warehouse or data lake. Most companies do this. Fine and valuable.

BUT: A data streaming platform like Apache Kafka is so much more than just an ingestion layer. Hence, it differs significantly from ingestion engines like AWS Kinesis, Google Pub/Sub, and similar tools.

Data streaming is NOT the same as data ingestion

Data streaming provides messaging, persistence, integration, and processing capabilities. High scalability for millions of messages per second, high availability including backward compatibility and rolling upgrades for mission-critical workloads, and cloud-native features are some of the built-in features.

The de facto standard for data streaming is Apache Kafka. Therefore, I mainly use Kafka for data streaming architectures and use cases.

Transactional and analytics use cases for data streaming with Apache Kafka

The number of different use cases for data streaming is almost endless. Please remember that data streaming is NOT just a message queue for data ingestion! While data ingestion into a data lake was the first prominent use case, it represents less than 5% of actual Kafka deployments. Business applications, streaming ETL middleware, real-time analytics, and edge/hybrid scenarios are some of the other examples:

Use Cases for Data Streaming with Apache Kafka by Business Value

The persistence layer of Kafka enables decentralized microservice architectures for agile and truly decoupled applications.

Keep in mind that Apache Kafka supports transactional and analytics workloads. Both usually have very different uptime, latency, and data loss SLAs. Check out this post and slide deck to learn more about data streaming use cases across industries powered by Apache Kafka.

The next post of this blog series explores why data streaming with tools like Apache Kafka is the prevalent technology for data ingestion into data warehouses and data lakes.

Don’t (try to) use the data warehouse or data lake for data streaming

This blog post explored the difference between data at rest and data in motion:

  • A data warehouse is excellent for reporting and business intelligence.
  • A data lake is perfect for big data analytics and AI / Machine Learning.
  • Data streaming enables real-time use cases.
  • A decentralized, flexible enterprise architecture is required to build a modern data stack around microservices and data mesh.

None of the technologies is a silver bullet. Choose the right tool for the problem. Monolithic architectures do not solve today’s business problems. Storing all data only at rest does not help with the demand for real-time use cases.

The Kappa architecture is a modern approach for real-time and batch workloads to avoid a much more complex infrastructure using the Lambda architecture. Data streaming complements the data warehouse and data lake. The connectivity between these systems is available out-of-the-box if you choose the right vendors (that are often strategic partners, not competitors, as some people think).

For more details, browse to other posts of this blog series:

  1. THIS POST: Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

How do you combine data warehouse and data streaming today? Is Kafka just your ingestion layer into the data lake? Do you already leverage data streaming for additional real-time use cases? Or is Kafka already the strategic component in the enterprise architecture for decoupled microservices and a data mesh? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies? appeared first on Kai Waehner.

Comparison: JMS Message Queue vs. Apache Kafka https://www.kai-waehner.de/blog/2022/05/12/comparison-jms-api-message-broker-mq-vs-apache-kafka/ Thu, 12 May 2022 05:13:19 +0000 https://www.kai-waehner.de/?p=4430 Comparing JMS-based message queue (MQ) infrastructures and Apache Kafka-based data streaming is a widespread topic. Unfortunately, the battle is an apple-to-orange comparison that often includes misinformation and FUD from vendors. This blog post explores the differences, trade-offs, and architectures of JMS message brokers and Kafka deployments. Learn how to choose between JMS brokers like IBM MQ or RabbitMQ and open-source Kafka or serverless cloud services like Confluent Cloud.

JMS Message Queue vs Apache Kafka Comparison

Motivation: The battle of apples vs. oranges

I have to discuss the differences and trade-offs between JMS message brokers and Apache Kafka every week in customer meetings. What annoys me most is the common misunderstandings and (sometimes) intentional FUD in various blogs, articles, and presentations about this discussion.

I recently discussed this topic with Clement Escoffier from Red Hat in the “Coding over Cocktails” Podcast: JMS vs. Kafka: Technology Smackdown. It was a great conversation with more agreement than you might expect from such an episode, in which I played the “Kafka proponent” while Clement took over the role of the “JMS proponent”.

These aspects motivated me to write a blog series about “JMS, Message Queues, and Apache Kafka”:

I will link the other posts here as soon as they are available. Please subscribe to my newsletter to get real-time updates about new posts (no spam or ads).

Special thanks to my colleague and long-term messaging and data streaming expert Heinz Schaffner for technical feedback and review of this blog series. He has worked for TIBCO, Solace, and Confluent for 25 years.

10 comparison criteria: JMS vs. Apache Kafka

This blog post explores ten comparison criteria. The goal is to explain the differences between message queues and data streaming, clarify some misunderstandings about what an API or implementation is, and give some technical background to do your evaluation to find the right tool for the job.

The list of products and cloud services is long for JMS implementations and Kafka offerings. A few examples:

  • JMS implementations of the JMS API (open source and commercial offerings): Apache ActiveMQ, Apache Qpid (using AMQP), IBM MQ (formerly MQSeries, then WebSphere MQ), JBoss HornetQ, Oracle AQ, RabbitMQ, TIBCO EMS, Solace, etc.
  • Apache Kafka products, cloud services, and rewrites (beyond the valid option of using just open-source Kafka): Confluent, Cloudera, Amazon MSK, Red Hat, Redpanda, Azure Event Hubs, etc.

Here are the criteria for comparing JMS message brokers vs. Apache Kafka and its related products/cloud services:

  1. Message broker vs. data streaming platform
  2. API Specification vs. open-source protocol implementation
  3. Transactional vs. analytical workloads
  4. Push vs. pull message consumption
  5. Simple vs. powerful and complex API
  6. Storage for durability vs. true decoupling
  7. Server-side data-processing vs. decoupled continuous stream processing
  8. Complex operations vs. serverless cloud
  9. Java/JVM vs. any programming language
  10. Single deployment vs. multi-region (including hybrid and multi-cloud) replication

Let’s now explore the ten comparison criteria.

1. Message broker vs. data streaming platform

TL;DR: JMS message brokers provide messaging capabilities to produce and consume messages. Apache Kafka is a data streaming platform that combines messaging, storage, data integration, and stream processing capabilities.

The most important aspect first: The comparison of JMS and Apache Kafka is an apples-to-oranges comparison for several reasons. I would even go further and say that they are not both fruit at all, as they are so different from each other.

JMS API (and implementations like IBM MQ, RabbitMQ, et al)

JMS (Java Message Service) is a Java application programming interface (API) that provides generic messaging models. The API handles the producer-consumer problem, which can facilitate the sending and receiving of messages between software systems.

Therefore, the central capability of JMS message brokers (that implement the JMS API) is to send messages from a source application to another destination in real-time. That’s it. And if that’s what you need, then JMS is the right choice for you! But keep in mind that projects must use additional tools for data integration and advanced data processing tasks.

Apache Kafka (open source and vendors like Confluent, Cloudera, Red Hat, Amazon, et al)

Apache Kafka is an open-source protocol implementation for data streaming. It includes:

  • Apache Kafka is the core for distributed messaging and storage. High throughput, low latency, high availability, secure.
  • Kafka Connect is an integration framework for connecting external sources/destinations to Kafka.
  • Kafka Streams is a simple Java library that enables streaming application development within the Kafka framework.

This combination of capabilities enables the building of end-to-end data pipelines and applications. That’s much more than what you can do with a message queue.

2. JMS API specification vs. Apache Kafka open-source protocol implementation

TL;DR: JMS is a specification that vendors implement and extend in their opinionated way. Apache Kafka is the open-source implementation of the underlying specified Kafka protocol.

It is crucial to clarify the terms first before you evaluate JMS and Kafka:

  • Standard API: Specified by industry consortiums or other industry-neutral (often global) groups or organizations. Requires compliance tests for all features and complete certifications to become standard-compliant. Example: OPC-UA.
  • De facto standard API: Originates from an existing successful solution (an open-source framework, a commercial product, or a cloud service). Examples: Amazon S3 (proprietary from a single vendor). Apache Kafka (open source from the vibrant community).
  • API Specification: A specification document that defines how vendors can implement a related product. There are no complete compliance tests or certifications for the implementation of all features. The consequence is a “standard API” but no portability between implementations. Example: JMS. Specifically for JMS, note that to use the JMS compliance suite, a commercial vendor has to sign up to very onerous reporting requirements towards Oracle.

The alternative kinds of standards have trade-offs. If you want to learn more, check out how Apache Kafka became the de facto standard for data streaming in the last few years.

Portability and migrations became much more relevant in hybrid and multi-cloud environments than in past decades, when you had your workloads in a single data center.

JMS is a specification for message-oriented middleware

JMS is a specification currently maintained under the Java Community Process as JSR 343. The latest (not yet released) version JMS 3.0 is under early development as part of Jakarta EE and rebranded to Jakarta Messaging API. Today, JMS 2.0 is the specification used in prevalent message broker implementations. Nobody knows where JMS 3.0 will go. Hence, this post focuses on the JMS 2.0 specification to solve real-world problems today.

I often use the term “JMS message broker” in the following sections as JMS (i.e., the API) does not specify or implement many features you know in your favorite JMS implementation. Usually, when people talk about JMS, they mean JMS message broker implementations, not the JMS API specification.

JMS message brokers and the JMS portability myth

The JMS specification was developed to provide a common Java library for accessing different messaging vendors’ brokers. It was intended to act as a wrapper around each messaging vendor’s proprietary API, in the same way JDBC provided similar functionality for database APIs.

Unfortunately, this simple integration turned out not to be the case. The migration of the JMS code from one vendor’s broker to another is quite complex for several reasons:

  • Not all JMS features are mandatory (security, topic/queue labeling, clustering, routing, compression, etc.)
  • There is no JMS specification for transport
  • No specification to define how persistence is implemented
  • No specification to define how fault tolerance or high availability is implemented
  • Different interpretations of the JMS specification by different vendors result in potentially other behaviors for the same JMS functions
  • No specification for security
  • There is no specification for value-added features in the brokers (such as topic to queue bridging, inter-broker routing, access control lists, etc.)

Therefore, simple source code migration and interoperability between JMS vendors is a myth! This sounds crazy, doesn’t it?

Vendors provide a great deal of unique functionality within the broker (such as topic-to-queue mapping, broker routing, etc.). These features provide architectural functionality to the application, but they are part of the broker, not the application, and not part of the JMS specification.

Apache Kafka is an open-source protocol implementation for data streaming

Apache Kafka is an implementation for reliable and scalable data streaming in real-time. The project is open-source, available under the Apache 2.0 license, and driven by a vast community.

Apache Kafka is NOT a standard like OPC-UA or a specification like JMS. However, Kafka at least provides the source code reference implementation, protocol and API definitions, etc.

Kafka established itself as the de facto standard for data streaming. Today, over 100,000 organizations use Apache Kafka. The Kafka API became the de facto standard for event-driven architectures and event streaming. Use cases span all industries and infrastructures, including various kinds of transactional and analytical workloads at the edge, in hybrid setups, and across multiple clouds. I collected a few examples across verticals that use Apache Kafka to show the prevalence across markets.

Now, hold on. I used the term Kafka API in the above section. Let’s clarify this: As discussed, Apache Kafka is an implementation of a distributed data streaming platform including the server-side and client-side and various APIs for producing and consuming events, configuration, security, operations, etc. The Kafka API is relevant, too, as Kafka rewrites like Azure Event Hubs and Redpanda use it.

Portability of Apache Kafka – yet another myth?

If you use Apache Kafka as an open-source project, this is the complete Kafka implementation. Some vendors use the full Apache Kafka implementation and build a more advanced product around it.

Here, the migration is super straightforward, as Kafka is not just a specification that each vendor implements differently. Instead, it is the same code, libraries, and packages.

For instance, I have seen several successful migrations from Cloudera to Confluent deployments or from self-managed Apache Kafka open-source infrastructure to serverless Confluent Cloud.

The Kafka API – Kafka rewrites like Azure Event Hubs, Redpanda, Apache Pulsar

With the global success of Kafka, some vendors and cloud services did not build a product on top of the Apache Kafka implementation. Instead, they built their own implementation of the Kafka API. The underlying implementation is proprietary (like in Azure’s cloud service Event Hubs) or open-source (like Apache Pulsar’s Kafka bridge or Redpanda’s rewrite in C++).

Be careful and analyze if vendors integrate the whole Apache Kafka project or rewrote the complete API. Contrary to the battle-tested Apache Kafka project, a Kafka rewrite using the Kafka API is a completely new implementation!

Many vendors even exclude some components or APIs (like Kafka Connect for data integration or Kafka Streams for stream processing) completely or exclude critical features like exactly-once semantics or long-term storage in their support terms and conditions.

It is up to you to evaluate the different Kafka offerings and their limitations. Recently, I compared Kafka vendors such as Confluent, Cloudera, Red Hat, or Amazon MSK and related technologies like Azure Event Hubs, AWS Kinesis, Redpanda, or Apache Pulsar.

Just battle-test the requirements by yourself. If you find a Kafka-to-XYZ bridge with less than a hundred lines of code, or a .exe Windows Kafka server download from a middleware vendor, be skeptical! 🙂

All that glitters is not gold. Some frameworks or vendors sound too good to be true. Just saying that you support the Kafka API, provide a fully managed serverless Kafka offering, or scale much better is not trustworthy if you constantly have to spread fear, uncertainty, and doubt (FUD) about Kafka to claim you are much better. For instance, I was annoyed by Pulsar always trying to appear better than Kafka by creating a lot of FUD and myths in the open-source community. I responded in my Apache Pulsar vs. Kafka comparison two years ago. FUD is the wrong strategy for any vendor. It does not work. For that reason, Kafka’s adoption still grows like crazy while Pulsar grows much more slowly percentage-wise (even though its download numbers are on a much lower level anyway).

3. Transactional vs. analytical workloads

TL;DR: A JMS message broker provides transactional capabilities for low volumes of messages. Apache Kafka supports low and high volumes of messages for both transactional and analytical workloads.

JMS – Session and two-phase commit (XA) transactions

Most JMS message brokers have good support for transactional workloads.

A transacted session supports a single series of transactions. Each transaction groups a set of produced messages and a set of consumed messages into an atomic unit of work.

Two-phase commit transactions (XA transactions) work on a limited scale. They are used to integrate with other systems like Mainframe CICS / DB2 or the Oracle database. But they are hard to operate and cannot scale beyond a few transactions per second.

It is important to note that support for XA transactions is not mandatory with the JMS 2.0 specification. This differs from the session transaction.

Kafka – Exactly-once semantics and transaction API

Kafka is a distributed, fault-tolerant system that is resilient by nature (if you deploy and operate it correctly). It can guarantee zero downtime and zero data loss, just like your favorite database, mainframe, or other core platforms.

And even better: Kafka’s Transaction API, i.e., Exactly-Once Semantics (EOS), has been available since Kafka 0.11 (GA’ed many years ago). EOS makes building transactional workloads even easier as you don’t need to handle duplicates anymore.

Kafka supports atomic writes across multiple partitions through the transactions API. This allows a producer to send a batch of messages to multiple partitions. Either all messages in the batch are eventually visible to any consumer, or none are ever visible to consumers.

Kafka transactions work very differently than JMS transactions. But the goal is the same: Each consumer receives the produced event exactly once. Find more details in the blog post “Analytics vs. Transactions in Data Streaming with Apache Kafka“.
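The key visibility rule of Kafka transactions is that a consumer reading with isolation level "read_committed" sees either all messages of a transactional batch or none of them, even when the batch spans multiple partitions. The following broker-free Python simulation illustrates only that rule; the class and method names are illustrative, not the real Kafka client API:

```python
class TransactionalLog:
    """Toy model of atomic, multi-partition writes with commit markers."""

    def __init__(self):
        self.partitions = {}  # partition number -> list of (message, shared marker)

    def send_batch(self, records, commit=True):
        # All records of one transaction share one commit marker, so flipping
        # the marker makes the whole batch visible (or invisible) atomically.
        marker = {"committed": False}
        for partition, message in records:
            self.partitions.setdefault(partition, []).append((message, marker))
        if commit:
            marker["committed"] = True

    def read_committed(self, partition):
        return [msg for msg, marker in self.partitions.get(partition, [])
                if marker["committed"]]

log = TransactionalLog()
log.send_batch([(0, "debit"), (1, "credit")], commit=True)   # committed: visible
log.send_batch([(0, "oops"), (1, "oops")], commit=False)     # aborted: invisible
print(log.read_committed(0), log.read_committed(1))  # ['debit'] ['credit']
```

The aborted batch is physically in the log but never surfaces to a read-committed consumer, which is how exactly-once semantics avoids exposing partial writes.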

4. Push vs. pull message consumption

TL;DR: JMS message brokers push messages to consumer applications. Kafka consumers pull messages providing true decoupling and backpressure handling for independent consumer applications.

Pushing messages seems to be the obvious choice for a real-time messaging system like JMS-based message brokers. However, push-based messaging has various drawbacks regarding decoupling and scalability.

JMS expects the broker to provide back pressure and implement a “pre-fetch” capability, but this is not mandatory. If used, the broker controls the backpressure, which you cannot control.

With Kafka, the consumer controls the backpressure. Each Kafka consumer consumes events in real-time, batch, or only on demand – in the way the particular consumer supports and can handle the data stream. This is an enormous advantage for many inflexible and non-elastic environments.

So JMS does provide some kind of backpressure: the producer stops if the queue is full. In Kafka, by contrast, you control the backpressure on the consumer side. There is also no way to scale a producer with JMS (as there are no partitions in a JMS queue or topic).

JMS consumers can be scaled, but then you lose guaranteed ordering. Guaranteed ordering in JMS message brokers only works via a single producer, single consumer, and transaction.
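The pull model described above boils down to each consumer asking for at most N records per poll and tracking its own position in the log. A minimal Python simulation (no broker, illustrative names) shows how two consumers with different capacities read the same log independently:

```python
class PullConsumer:
    """Toy pull-based consumer: it controls its own pace and offset."""

    def __init__(self, log, max_records=2):
        self.log = log
        self.offset = 0               # each consumer tracks its own position
        self.max_records = max_records  # the consumer-controlled backpressure knob

    def poll(self):
        batch = self.log[self.offset:self.offset + self.max_records]
        self.offset += len(batch)
        return batch

log = ["e1", "e2", "e3", "e4", "e5"]
slow = PullConsumer(log, max_records=2)   # a slow consumer polls small batches
fast = PullConsumer(log, max_records=5)   # a fast consumer drains the log at once
print(slow.poll())  # ['e1', 'e2']
print(fast.poll())  # ['e1', 'e2', 'e3', 'e4', 'e5']
```

Because the broker never pushes, the slow consumer is never overwhelmed, and the two consumers stay truly decoupled: neither one's pace affects the other.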

5. Simple JMS API vs. powerful and complex Kafka API

TL;DR: The JMS API provides simple operations to produce and consume messages. Apache Kafka has a more granular API that brings additional power and complexity.

JMS vendors hide all the cool stuff in the implementation under the spec. You only get about 5% of it through the API (with no control, as the vendor builds the rest), and you have to work around the remaining gaps yourself. On the other hand, Kafka exposes everything, even though most developers only need 5% of it.

In summary, be aware that JMS message brokers are built to send messages from a data source to one or more data sinks. Kafka is a data streaming platform that provides many more capabilities, features, event patterns, and processing options; and a much larger scale. With that in mind, it is no surprise that the APIs are very different and have different complexity.

If your use case requires just sending a few messages per second from A to B, then JMS is the right choice and simple to use! If you need a streaming data hub at any scale, including data integration and data processing, only Kafka delivers that.

Asynchronous request-reply vs. data in motion

One of the most common wishes of JMS developers is to use a request-response function in Kafka. Note that this design pattern differs in messaging systems from an RPC (remote procedure call) as you know it from legacy tools like CORBA or web service standards like SOAP/WSDL or HTTP. Request-reply in message brokers is asynchronous communication that leverages a correlation ID.

Asynchronous messaging to get events from a producer (say, a mobile app) to a consumer (say, a database) is a very traditional workflow. No matter if you do fire-and-forget or request-reply, you put data at rest for further processing. JMS supports request-reply out-of-the-box. The API is very simple.

Data in motion with event streaming continuously processes data. The Kafka log is durable. The Kafka application maintains and queries the state in real-time or in batch. Data streaming is a paradigm shift for most developers and architects. The design patterns are very different. Don’t try to reimplement your JMS application within Kafka using the same pattern and API. That is likely to fail! That is an anti-pattern.

Request-reply is inefficient and can suffer a lot of latency depending on the use case. HTTP or, better, gRPC is suitable for some use cases. Request-reply is replaced by the CQRS (Command Query Responsibility Segregation) pattern with Kafka for streaming data. CQRS is not possible with the JMS API, since JMS provides no state capabilities and lacks event sourcing support.

A Kafka example for the request-response pattern

CQRS is the better design pattern for many Kafka use cases. Nevertheless, the request-reply pattern can be implemented with Kafka, too, just differently. Trying to mimic a JMS message broker (with temporary queues, etc.) will ultimately kill the Kafka cluster because Kafka's architecture works differently.
The Spring project shows how to do it better. The Spring for Apache Kafka (Spring Kafka) library contains a great example of the request-reply pattern built with Kafka.
Check out “org.springframework.kafka.requestreply.ReplyingKafkaTemplate“. It creates request/reply applications using the Kafka API easily. The example is interesting since it implements the asynchronous request/reply, which is more complicated to write if you are using, for example, the JMS API. Another nice DZone article talks about synchronous request/reply using Spring Kafka templates.
The Spring documentation for Kafka templates has a lot of details about the request/reply pattern for Kafka. So if you are using Spring, the request/reply pattern is pretty simple to implement with Kafka. If you are not using Spring, you can learn how to do request-reply with Kafka in your framework of choice.
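Outside of Spring, the core of the pattern is the same in every framework: the requester sends a message carrying a correlation ID (and, in real Kafka, a reply-topic header) and matches incoming replies against pending requests. The following Python sketch simulates this with in-memory queues standing in for Kafka topics; all names are illustrative and not part of any Kafka API:

```python
import queue
import threading
import uuid

# In-memory stand-ins for a request topic and a reply topic (illustrative only).
request_topic = queue.Queue()
reply_topic = queue.Queue()

def responder():
    """Consumes requests and produces replies carrying the same correlation ID."""
    while True:
        msg = request_topic.get()
        if msg is None:
            break
        reply_topic.put({"correlation_id": msg["correlation_id"],
                         "payload": msg["payload"].upper()})

def reply_listener(pending):
    """Routes each reply to the waiting requester via its correlation ID."""
    while True:
        msg = reply_topic.get()
        if msg is None:
            break
        entry = pending.get(msg["correlation_id"])
        if entry:
            entry["reply"] = msg["payload"]
            entry["event"].set()

def request_reply(payload, pending):
    """Sends a request and blocks until the matching reply arrives."""
    correlation_id = str(uuid.uuid4())
    event = threading.Event()
    pending[correlation_id] = {"event": event}
    request_topic.put({"correlation_id": correlation_id, "payload": payload})
    event.wait(timeout=5)  # asynchronous under the hood, synchronous to the caller
    return pending.pop(correlation_id)["reply"]

pending = {}
threading.Thread(target=responder, daemon=True).start()
threading.Thread(target=reply_listener, args=(pending,), daemon=True).start()

print(request_reply("hello", pending))  # HELLO
```

This is exactly what a ReplyingKafkaTemplate-style helper hides from you: correlation bookkeeping and the asynchronous listener.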

6. Storage for durability vs. true decoupling

TL;DR: JMS message brokers use a storage system to provide high availability. The storage system of Kafka is much more advanced and enables long-term storage, backpressure handling, and replayability of historical events.

Kafka storage is more than just the persistence feature you know from JMS

When I explain the Kafka storage system to experienced JMS developers, I almost always get the same response: “Our JMS message broker XYZ also has storage under the hood. I don’t see the benefit of using Kafka!”

JMS uses an ephemeral storage system, where messages are only persisted until they are processed. Long-term storage and replayability of messages are not a concept JMS was designed for.

The core Kafka principles of append-only logs, offsets, guaranteed ordering, retention time, compacted topics, and so on provide many additional benefits beyond the durability guarantees of a JMS. Backpressure handling, true decoupling between consumers, the replayability of historical events, and more are huge differentiators between JMS and Kafka.

Check the Kafka docs for a deep dive into the Kafka storage system. I won't even touch on how Tiered Storage for Kafka changes the game further by providing even better scalability and cost-efficient long-term storage within the Kafka log.
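To make the difference concrete, here is a minimal Python model of the core idea: an append-only log where each consumer group tracks its own offset, so consumers are truly decoupled and can replay history at any time. This is a conceptual sketch, not Kafka code:

```python
class Log:
    """A minimal append-only log modeling a single Kafka topic partition."""

    def __init__(self):
        self.records = []   # records are retained, not deleted when consumed
        self.offsets = {}   # consumer group -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the record's offset

    def poll(self, group):
        """Each consumer group reads at its own pace: true decoupling."""
        offset = self.offsets.get(group, 0)
        batch = self.records[offset:]
        self.offsets[group] = len(self.records)
        return batch

    def seek(self, group, offset):
        """Replay historical events by rewinding the group's offset."""
        self.offsets[group] = offset

log = Log()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

print(log.poll("billing"))    # ['order-1', 'order-2', 'order-3']
print(log.poll("analytics"))  # ['order-1', 'order-2', 'order-3'] - independent offset
log.seek("billing", 0)
print(log.poll("billing"))    # replayed: ['order-1', 'order-2', 'order-3']
```

A JMS queue deletes a message once it is acknowledged; in this model, nothing is deleted on read, which is what makes backpressure handling and replay trivial.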

7. Server-side data-processing with JMS vs. decoupled continuous stream processing with Kafka

TL;DR: JMS message brokers provide simple server-side event processing, like filtering or routing based on the message content. Kafka brokers are intentionally dumb; data processing is executed in decoupled applications/microservices.

Server-side JMS filtering and routing

Most JMS message brokers provide some features for server-side event processing. These features are handy for some workloads!

Just be careful that server-side processing usually comes with a cost. For instance:

  • JMS pre-filtering scalability issues: The broker has to evaluate every filter for every message on top of its other duties. This can kill the broker in a hidden fashion.
  • JMS selectors (= routing) performance issues: Selectors can cost 40-50% of the broker's performance.

Again, sometimes the drawbacks are acceptable. Then this is great functionality.

Kafka – Dumb pipes and smart endpoints

Kafka intentionally does not provide server-side processing. The brokers are dumb. The processing happens at the smart endpoints. This is a very well-known design pattern: Dumb pipes and smart endpoints.

The drawback is that you need separate applications/microservices/data products to implement the logic. This is not a big issue in serverless environments (like using a ksqlDB query running in Confluent Cloud for data processing). It gets more complex in self-managed environments.

However, the massive benefit of this architecture is the true decoupling between applications/technologies/programming languages, separation of concerns between business units for building business logic and operations of infrastructure, and the much better scalability and elasticity.
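The placement of the logic is the whole point: a filter that a JMS selector would run on the broker instead lives in a separate consumer application. Here is a deliberately simple sketch of such a smart endpoint (plain Python generators instead of Kafka Streams, purely to illustrate where filtering and routing happen):

```python
def smart_endpoint(events, predicate, transform):
    """Filtering and routing happen in the consumer, not on the broker.

    The broker only stores and delivers records; scaling the processing
    means scaling this application, not the broker.
    """
    for event in events:
        if predicate(event):
            yield transform(event)

# The broker delivers everything; the endpoint keeps only large orders.
events = [{"type": "order", "amount": 12},
          {"type": "order", "amount": 250},
          {"type": "heartbeat"}]

large_orders = list(smart_endpoint(
    events,
    predicate=lambda e: e.get("type") == "order" and e.get("amount", 0) > 100,
    transform=lambda e: {"routed_to": "high-value-orders", "amount": e["amount"]},
))
print(large_orders)  # [{'routed_to': 'high-value-orders', 'amount': 250}]
```

In a real deployment, this function body would be a Kafka Streams topology or a ksqlDB query consuming one topic and producing to another.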

Would I like to see a few server-side processing capabilities in Kafka, too? Yes, absolutely. Especially for small workloads, the performance and scalability impact should be acceptable! The risk, though, is that people would then misuse the features. The future will show if Kafka gets there or not.

8. Complex operations vs. serverless cloud

TL;DR: Self-managed operations of scalable JMS message brokers or Kafka clusters are complex. Serverless offerings (should) take over the operations burden.

Operating a cluster is complex – no matter if JMS or Kafka

A basic JMS message broker is relatively easy to operate (including active/passive setups). However, this limits scalability and availability. The JMS API was designed to talk to a single broker, or active/passive for high availability. This concept covers a single application domain.

More than that (= clustering) is very complex with JMS message brokers. More advanced message broker clusters from commercial vendors are more powerful but much harder to operate.

Kafka is a powerful, distributed system. Therefore, operating a Kafka cluster is not easy by nature. Cloud-native tools like an operator for Kubernetes take over some burdens like rolling upgrades or handling fail-over.

Both JMS message brokers and Kafka clusters become more challenging to operate the more scale and reliability your SLAs demand. The JMS API is not specified for a central data hub (using a cluster). Kafka is intentionally built for the strategic enterprise architecture, not just for a single business application.

Fully managed serverless cloud to the rescue

As the JMS API was designed to talk to a single broker, it is hard to build a serverless cloud offering that provides scalability. Hence, in JMS cloud services, the consumer has to set up the routing and role-based access control to the specific brokers. Such a cloud offering is not serverless but cloud-washing! There is no other option, as the JMS API was not designed for one big distributed cluster like Kafka.

In Kafka, the situation is different. As Kafka is a scalable distributed system, cloud providers can build cloud-native serverless offerings. Building such a fully managed infrastructure is still super hard. Hence, evaluate the product, not just the marketing slogans!

Every Kafka cloud service is marketed as “fully managed” or “serverless”, but most are NOT. Instead, most vendors just provision the infrastructure, let you operate the cluster, and make you take over the support risk. On the other hand, some fully managed Kafka offerings are super limited in functionality (like allowing only a very limited number of partitions).

Some cloud vendors even exclude Kafka support from their Kafka cloud offerings. Insane, but true. Check the terms and conditions as part of your evaluation.

9. Java/JVM vs. any programming language

TL;DR: JMS focuses on the Java ecosystem for JVM programming languages. Kafka is independent of programming languages.

As the name JMS (= Java Message Service) says, JMS was officially written only for Java. Some broker vendors support their own APIs and clients. These are proprietary to that vendor. Almost all serious JMS projects I have seen in the past use Java code.

The Apache Kafka project itself also only provides a Java client. But vendors and the community provide language bindings for almost every programming language, plus a REST API for HTTP communication for producing/consuming events to/from Kafka. For instance, check out the blog post “12 Programming Languages Walk into a Kafka Cluster” to see code examples in Java, Python, Go, .NET, Ruby, node.js, Groovy, etc.

The true decoupling of the Kafka backend enables very different client applications to speak with each other, no matter what programming languages one uses. This flexibility allows for building a proper domain-driven design (DDD) with a microservices architecture leveraging Kafka as the central nervous system.

10. Single JMS deployment vs. multi-region (including hybrid and multi-cloud) Kafka replication

TL;DR: The JMS API is a client specification for communication between the application and the broker. Kafka is a distributed system that enables various architectures for hybrid and multi-cloud use cases.

JMS is a client specification, while multi-data center replication is a broker function. I won’t go deep here and put it simply: JMS message brokers are not built for replication scenarios across regions, continents, or hybrid/multi-cloud environments.

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Various scenarios require multi-cluster Kafka solutions. Specific requirements and trade-offs need to be looked at.

Kafka technologies like MirrorMaker (open source) or Confluent Cluster Linking (commercial) enable use cases such as disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka deployments.

I covered hybrid cloud architectures in various other blog posts. “Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure” is a great example.

Slide deck and video recording

I created a slide deck and video recording if you prefer learning or sharing that kind of material instead of a blog post:

Fullscreen Mode

JMS and Kafka solve distinct problems!

The ten comparison criteria show that JMS and Kafka are very different things. While both overlap (e.g., messaging, real-time, mission-critical), they use different technical capabilities, features, and architectures to support different use cases.

In short, use a JMS broker for simple and low-volume messaging from A to B. Kafka is usually a real-time data hub between many data sources and data sinks. Many people call it the central real-time nervous system of the enterprise architecture.

The data integration and data processing capabilities of Kafka at any scale with true decoupling and event replayability are the major differences from JMS-based MQ systems.

However, especially in the serverless cloud, don’t fear Kafka being too powerful (and complex). Serverless Kafka projects often start very cheaply at a very low volume, with no operations burden. Then it can scale with your growing business without the need to re-architect the application.

Understand the technical differences between a JMS-based message broker and data streaming powered by Apache Kafka. Evaluate both options to find the right tool for the problem. Within messaging or data streaming, do further detailed evaluations. Every message broker is different, even though they are all JMS-compliant. In the same way, all Kafka products and cloud services differ regarding features, support, and cost.

Do you use JMS-compliant message brokers? What are the use cases and limitations? When did you or do you plan to use Apache Kafka instead? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Comparison: JMS Message Queue vs. Apache Kafka appeared first on Kai Waehner.

]]>
Is Apache Kafka an iPaaS or is Event Streaming its own Software Category? https://www.kai-waehner.de/blog/2021/11/03/apache-kafka-cloud-native-ipaas-versus-mq-etl-esb-middleware/ Wed, 03 Nov 2021 06:57:54 +0000 https://www.kai-waehner.de/?p=3917 This post explores why Apache Kafka is the new black for integration projects, how Kafka fits into the discussion around cloud-native iPaaS solutions, and why event streaming is a new software category. A concrete real-world example shows the difference between event streaming and traditional integration platforms respectively iPaaS.

The post Is Apache Kafka an iPaaS or is Event Streaming its own Software Category? appeared first on Kai Waehner.

]]>
Enterprise integration is more challenging than ever before. The IT evolution requires the integration of more and more technologies. Applications are deployed across the edge, hybrid, and multi-cloud architectures. Traditional middleware such as MQ, ETL, ESB does not scale well enough or only processes data in batch instead of real-time. This post explores why Apache Kafka is the new black for integration projects, how Kafka fits into the discussion around cloud-native iPaaS solutions, and why event streaming is a new software category. A concrete real-world example shows the difference between event streaming and traditional integration platforms respectively iPaaS.

Apache Kafka as iPaaS or Event Streaming as new Software Category

What is iPaaS (Enterprise Integration Platform as a Service)?

iPaaS (Enterprise Integration Platform as a Service) is a term coined by Gartner. Here is the official Gartner definition: “Integration Platform as a Service (iPaaS) is a suite of cloud services enabling development, execution, and governance of integration flows connecting any combination of on-premises and cloud-based processes, services, applications, and data within individual or across multiple organizations.” The acronym eiPaaS (Enterprise Integration Platform as a Service) is used in some reports as a replacement for iPaaS.

The Gartner Magic Quadrant for iPaaS shows various vendors:

Gartner Magic Quadrant for iPaaS 2021

Three points stand out for me:

  • Many very different vendors provide a broad spectrum of integration solutions.
  • Many vendors (have to) list various products to provide an iPaaS; this means different technologies, codebases, support teams, etc.
  • No Kafka offering (like Confluent, Cloudera, Amazon MSK) is in the magic quadrant.

The last bullet point makes me wonder whether Kafka-based solutions should be considered an iPaaS or not.

Is Apache Kafka an iPaaS?

I don’t know. It depends on the definition of the term “iPaaS”. Yes, Kafka solutions fit into the iPaaS category, but that is just a piece of the event streaming success story.

Kafka is an event streaming platform. Use cases differ from traditional middleware like MQ, ETL, ESB, or iPaaS. Check out real-world Kafka deployments across industries if you don’t know the use cases yet.

Kafka does not directly compete with ETL tools like Talend or Informatica, MQ frameworks like IBM MQ or RabbitMQ, API Management platforms like MuleSoft, and cloud-based iPaaS like Boomi or TIBCO. At least not if people understand the differences between Kafka and traditional middleware. For that reason, many people (including me) think that Event Streaming should have its own Magic Quadrant.

Having said this, all these very different vendors are in the iPaaS Magic Quadrant. So, should Kafka respectively its vendors be in here? I think so because I have seen hundreds of customers leverage the Kafka ecosystem as a cloud-native, scalable, event-driven integration platform, often in hybrid and multi-cloud architectures. And that’s an iPaaS.

What’s different with Kafka as an Integration Platform?

If you are new to this discussion, check out the article “Apache Kafka vs. MQ, ETL, ESB” or the related slides and video. Here is my summary on why Kafka is unique for integration scenarios and therefore adopted everywhere:

Why Kafka as iPaaS instead of Traditional Middleware like MQ ETL ESB

A unique selling point for event streaming is the ability to leverage a single platform. In contrast, other iPaaS solutions require different products (including codebases, support teams, integration between the additional tech, etc.).

Kafka as Cloud-native and Serverless iPaaS

Fundamental differences exist between modern iPaaS solutions and traditional legacy middleware; this includes the software architecture, the scalability and operations of the platform, and the data processing capabilities. On a high level, a “Kafka iPaaS” requires the following characteristics:

  • Cloud-native Infrastructure: Elasticity is vital for success in the modern IT world. The ability to scale up and down (= expanding and shrinking Kafka clusters) is mandatory. This capability enables starting with a small pilot project and scaling up or handling load spikes (like Christmas business in retail).
  • Automated Operations: Truly serverless SaaS should always be the preferred option if the software runs in a public cloud. Orchestration tools like Kubernetes and related tools (like a Kafka operator for Kubernetes) are the next best option in a self-managed data center or at the edge outside a data center.
  • Complete Platform: An integration platform requires real-time messaging and storage for backpressure handling and long-running processes. Data integration and continuous data processing are mandatory, too. Hence, a “Kafka iPaaS” is only possible if you have access to various pre-built Kafka-native connectors to open standards, legacy systems, and modern SaaS interfaces. Otherwise, Kafka needs to be combined with yet another middleware like an iPaaS or an ETL tool like Apache NiFi.
  • Single Solution: It sounds trivial, but most other middleware solutions use several codebases and products under the hood. Just look at stacks from traditional players such as IBM and Oracle or open-source-driven Cloudera. The complex software stack makes it much harder to provide end-to-end integration, 24/7 operations, and mission-critical support. Don’t get me wrong: Kafka-native solutions like Confluent Cloud also include different products with additional pricing (like a fully-managed connector or data governance add-ons), but all run on a single Kafka-native platform.

From this perspective, some Kafka solutions are modern, cloud-native, scalable iPaaS solutions. Having said this, would I be happy if you considered some Kafka solutions as an iPaaS on your technology radar? No, not really!

Event Streaming is its own Software Category!

While some Kafka solutions can be used as an iPaaS, this is only one of many usage scenarios for event streaming. However, as explained above, Kafka-based solutions differ greatly from the other iPaaS solutions in the Gartner Magic Quadrant. Hence, event streaming deserves its own software category.

If you still wonder what I mean, check out event streaming use cases across industries to understand the difference between Kafka and traditional iPaaS, MQ, ETL, ESB, API tools. Here is a relatively old but still fantastic diagram that summarizes the broad spectrum of use cases for event streaming:

Use Cases for Apache Kafka and Event Streaming

TL;DR: Kafka provides capabilities for various use cases, not just integration scenarios. Many new innovative business applications are built with Kafka. It is not just an integration platform but a unique suite of capabilities for end-to-end data processing in real-time at scale.

New concepts like Data Mesh also prove the evolution. The basic principles are not unique: domain-driven design, microservices, true decoupling of services, but now with much more focus on data as a product. The latter means data turns from a cost center into a profit center and enables innovative new services. Event streaming is a critical component of a data mesh-powered enterprise architecture, as real-time almost always beats slow data across use cases.

The Non-Existing Event Streaming Gartner Magic Quadrant or Forrester Wave

Unfortunately, a Gartner Magic Quadrant or Forrester Wave for Event Streaming does NOT exist today. While some event streaming solutions fit into some of these reports (like the Gartner iPaaS Magic Quadrant or the Forrester Wave for Streaming Analytics), it is still an apples-to-oranges comparison.

Event Streaming is its own category. Many software vendors built their entire business around this category. Confluent is the leader in this space – note that I am biased as a Confluent employee, but I guess there is no question around this statement 🙂 Many other companies are emerging around Kafka, around the Kafka protocol, or around competitive event streaming offerings such as Amazon Kinesis or Apache Pulsar.

The following Event Streaming Landscape 2021 summarizes the current status:

Event Streaming Landscape with Apache Kafka Pulsar Kinesis Event Hubs

I hope to see a Gartner Magic Quadrant for Event Streaming and a Forrester Wave for Event Streaming soon, too.

Open Source vs. Partially Managed vs. Fully-Managed Event Streaming

One more aspect to point out: You might have noticed that I said, “some event streaming solutions can be considered an iPaaS”. The word “some” is a crucial detail. Just providing an open-source framework is not enough.

iPaaS requires a complete offering, ideally as fully-managed services. Many vendors for event streaming use Kafka, Pulsar, or another framework but do NOT provide a complete offering with operations tooling, commercial 24/7 support, user interfaces, etc. The following resources should help you learn more about the event streaming landscape in 2021:

TL;DR: Evaluate the various offerings. A lot of capabilities are just marketing! Many “fully-managed services” are only partially managed instead of serverless, with very limited SLAs and support. Some other offerings provide plenty of cool features but are more of an alpha version, overselling rather than a mature, battle-tested solution. A counterexample is the complexity in T-Mobile’s report about upgrading Amazon MSK. This shows the difference between “promoting and selling a fully managed service” and the “not at all fully-managed reality”. A truly fully-managed offering does NOT require the end user to upgrade the infrastructure.

Kafka as Event Streaming iPaaS at Deutsche Bahn (German Railway)

Let’s now look at a practical example to understand why a traditional iPaaS cannot help in use cases that require event streaming, and why this combination of capabilities in a single technology creates a new software category.

This section explores a real-world use case: the journey of Deutsche Bahn (German Railway) providing a great customer experience to its customers. This example uses Event Streaming as an iPaaS (according to my definition of these terms).

Use Case: Improve Customer Experience with Real-time Notifications

The use case sounds very simple: Improve the customer experience by providing real-time information and notifications to customers across various channels like websites, smartphones, and displays at train stations.

Delays and cancellations happen in a complex rail network like Germany’s. Frequent travelers accept this downside. Nobody can do anything about lousy weather, personal tragedies on the tracks, or technical defects.

However, customers at least expect real-time information and notifications so that they can wait in a coffee shop or lounge instead of freezing at the station platform for minutes or even hours. The reality at Deutsche Bahn was a different sequence of notifications: 10min delay, 20min delay, 30min delay, train canceled – take the next train.

The goal of the project Reisendeninformation (= traveler information system) was to improve the day-to-day experience of millions of travelers across Germany by delivering up-to-date, accurate, and consistent travel information in any location.

Initial Project: A mess of different Integration Technologies

Again, the use case sounds simple. Let’s send real-time notifications to customers if a train is delayed or canceled. Every magic black iPaaS box can do this:

Deutsche Bahn Reisendeninformation Kafka

Is this true or just the marketing of all the integration vendors? Can each black box integrate all the different systems to correlate events in real-time?

Deutsche Bahn started with an open-source messaging platform to send real-time notifications. Unfortunately, the team quickly found out that not every piece of information was coming in in real-time. So, a caching and storage system was added to the project to handle the late-arriving information from some legacy batch or file-based systems. Now, the data from the legacy systems needed to be integrated. Hence, an integration framework was installed to connect to files, databases, and other applications. Now, the data needed to be processed, correlating real-time and non-real-time data from various systems. A stream processing engine can do this.

The pilot project included several different frameworks. A few skeptical questions came up:

  • How to scale this combination of very different technologies?
  • How to get end-to-end support across various platforms?
  • Is this cost-efficient?
  • What is the time-to-market for new features?

Deutsche Bahn re-evaluated their tech stack and found out that Apache Kafka provides all the required capabilities out-of-the-box within one platform.

The Migration to Cloud-native Kafka

The team at Deutsche Bahn re-architected their pilot project. Here is the new solution leveraging Kafka as the single point of truth between various systems, technologies, and communication paradigms:

A traditional iPaaS can implement this scenario, too. But only with several codebases, technologies, and clusters, even if you select a single software vendor! Some iPaaS might even do well in the beginning but struggle to scale up. Only event streaming allows you to start small and scale up without the need to re-architect the infrastructure.

Today, the project is in production. Check your DB Navigator mobile app to get real-time updates about all trains in Germany. Living in Germany, I appreciate this new service for a much better traveling experience.

Learn more about the project from the Deutsche Bahn team. They gave several public talks at different conferences and wrote on the Confluent Blog about their Kafka journey. Though, the journey did not end here 🙂 As described in their blog post, Deutsche Bahn is now evaluating the migration from a self-managed Confluent Platform deployment in the public cloud to the fully-managed, truly serverless Confluent Cloud offering to reduce TCO and improve time-to-market.

Complete Platform: Kafka is more than just Messaging

A project like the one described above is only possible with a complete platform. Many people still think about Kafka as an ingestion layer into a data lake or data warehouse, as this was one of the first prominent Kafka use cases. Data ingestion is still an excellent use case today. Many projects already use more than just the core of Kafka to implement it. Kafka Connect provides out-of-the-box connectivity between Kafka and the data store. If you are in the public cloud, you even get this integration in a fully-managed, serverless manner, whether you need to integrate with a 1st-party cloud service like Amazon S3, Google Cloud BigQuery, or Azure Cosmos DB, or with 3rd-party SaaS like MongoDB Atlas, Snowflake, or Databricks.
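Such a connector is configured declaratively, not coded. As an illustration, here is roughly what a Confluent S3 sink connector configuration looks like; the connector name, topic, bucket, and region are placeholders, and the exact option names can vary by connector version, so check the connector documentation before use:

```json
{
  "name": "s3-sink-example",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "orders",
    "s3.bucket.name": "my-data-lake-bucket",
    "s3.region": "eu-central-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000",
    "tasks.max": "1"
  }
}
```

The point is the operational model: ingestion into the data store is a piece of configuration on the same platform, not a separate ETL product with its own codebase and support contract.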

Continuous Kafka-native stream processing is the next level of a complete platform. For instance, Deutsche Bahn leverages Kafka Streams a lot for its real-time data correlations at scale. Other companies use ksqlDB as a fully-managed function in Confluent Cloud. The enormous benefit is that you don’t need yet another platform or service for streaming analytics. A complete platform makes the architecture more cost-efficient, and end-to-end integration is easier from an SLA and support perspective.
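The kind of correlation described above can be pictured as a continuous join between a stream of real-time events and a table materialized from slower-arriving reference data. The following Python sketch illustrates the stream-table join semantics conceptually; the train IDs, fields, and function names are invented for illustration and are not Deutsche Bahn's actual data model:

```python
# A "table" materialized from a changelog: the latest value per key.
train_reference = {}

def update_table(key, value):
    """Late-arriving batch or file-based data updates the table whenever it lands."""
    train_reference[key] = value

def enrich(delay_events):
    """Continuously joins each real-time event with the current table state."""
    for event in delay_events:
        ref = train_reference.get(event["train_id"], {})
        yield {**event, "route": ref.get("route", "unknown")}

# Reference data may arrive before or after the first events.
update_table("ICE-100", {"route": "Munich -> Berlin"})

delays = [{"train_id": "ICE-100", "delay_min": 10},
          {"train_id": "RE-7", "delay_min": 5}]

notifications = list(enrich(delays))
print(notifications[0]["route"])  # Munich -> Berlin
print(notifications[1]["route"])  # unknown
```

In Kafka Streams, this corresponds to a KStream-KTable join: the table is continuously updated from one topic while each event on the stream is enriched against the table's current state.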

A complete platform requires many additional services “on top”, like visualization of data flows, data governance, role-based access control, audit logs, self-service capabilities, user interfaces, visual coding/code generation tools, etc. Visual coding is the point where traditional middleware and iPaaS tools are stronger today than event streaming offerings.

3rd Party integration via Open API and non-Kafka Tools

So far, you learned why event streaming is its own software category and how Deutsche Bahn is an excellent example of it. However, event streaming is NOT the silver bullet for every problem! When exploring if Kafka and MQ/ETL/ESB are friends, enemies, or frenemies, I already pointed this out. For instance, MQ or an ESB can complement event streaming in an integration project, depending on your project requirements.

Let’s go back to Deutsche Bahn. As mentioned, their real-time traveler information platform is live, with Kafka as the single point of truth. Recently, Deutsche Bahn announced a partnership with Google and 3rd Party Integration with Google Maps:

Deutsche Bahn Google Maps Open API Integration

Real-time Schedule Updates to 3rd Party Google Maps API

The integration provides real-time train schedule updates to Google Maps users:

Google Maps Deutsche Bahn Real-time Integration

The integration allows Deutsche Bahn to reach new people and expand the business. Users can buy train tickets with one click from the Google Maps page.

I don’t know what technology or product this 3rd party integration uses. The heart of Deutsche Bahn’s real-time infrastructure enables new innovative business models and collaboration with partners.

Likely, this integration between Deutsche Bahn and Google Maps does not directly use the Kafka protocol (even though this is done sometimes, for instance, see Here Technologies Open API for their mapping service).

Event Streaming is complementary to other services. In this example, the project team might have used an API Management platform to provide internal APIs to external consumers, including access control, billing, and reporting. The article “Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies?” explores the relationship between event streaming and API Management.

Event Streaming Everywhere – Cloud, Data Center, Edge

Real-time beats slow data everywhere. This is a new software category because we don’t just send events into another database via a messaging system. Instead, we use and correlate data from different data sources in real-time. That’s the real added value and game changer in innovative projects.

Hence, event streaming must be possible in every location. While cloud-first is a viable strategy for many IT projects, edge and hybrid scenarios are and will continue to be very relevant.

Think about a project related to the Deutsche Bahn example above (but completely fictive): a hybrid architecture with real-time applications in the cloud and edge computing within the trains:

Hybrid Architecture with Kafka at the edge and in the cloud

I covered this in other articles, including “Edge Use Cases for Apache Kafka Across Industries“. TL;DR: Leverage the open architecture of event streaming for real-time data processing everywhere, including multi-cloud, data centers, and edge deployments (i.e., outside a data center). The enterprise architecture does not need various technologies and products to implement real-time data processing and integration with separate iPaaS, ETL tools, ESBs, MQ systems.

However, once again, it is crucial to understand how event streaming fits into the enterprise architecture. For instance, Kafka is often combined with IoT technologies such as MQTT for the last mile integration with IoT devices in these edge scenarios.

Slide Deck and Video for “Apache Kafka vs. Cloud-native iPaaS Middleware”

Here are the related slides for this topic:

And the video recording of this presentation:

Kafka is a cloud-native iPaaS, and much more!

Kafka is the new black for integration projects across industries because of its unique combination of capabilities. Some Kafka solutions are part of the iPaaS category, with trade-offs like any other integration platform.

However, event streaming is its own software category. Hence, iPaaS is just one usage of Kafka or similar event streaming platforms. Real-time data beats slow data. For that reason, event streaming is the backbone of many projects to process data in motion (but also to integrate with other systems that store data at rest for reporting, model training, and other use cases).

How do you leverage event streaming and Kafka as an integration platform? What projects did you already work on or are in the planning? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Is Apache Kafka an iPaaS or is Event Streaming its own Software Category? appeared first on Kai Waehner.

]]>
Serverless Kafka in a Cloud-native Data Lake Architecture https://www.kai-waehner.de/blog/2021/06/25/serverless-kafka-data-lake-cloud-native-hybrid-lake-house-architecture/ Fri, 25 Jun 2021 09:10:57 +0000 https://www.kai-waehner.de/?p=3481 Apache Kafka became the de facto standard for processing data in motion. Kafka is open, flexible, and scalable. Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use a serverless Kafka SaaS offering to focus on business logic. However, hybrid scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden. This blog post explores how to leverage cloud-native and serverless Kafka offerings in a hybrid cloud architecture. We start from the perspective of data at rest with a data lake and explore its relation to data in motion with Kafka.

The post Serverless Kafka in a Cloud-native Data Lake Architecture appeared first on Kai Waehner.


Serverless Kafka for Data in Motion as Rescue for Data at Rest in the Data Lake

Data at Rest – Still the Right Approach?

Data at rest means storing data in a database, data warehouse, or data lake. In many use cases, this means the data is processed too late – even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) require this approach… If you do it right!

The Wrong Approach of Cloudera’s Data Lake

A data lake was introduced into most enterprises by Cloudera and Hortonworks (plus several partners like IBM) years ago. Most companies had a big data vision (but they had no idea how to gain business value out of it). The data lake consists of 20+ different open source frameworks.

New frameworks are added when they emerge so that the data lake is up-to-date. The only problem: Still no business value. Plus vendors that have no good business model. Just selling support does not work, especially if two very similar vendors compete with each other. The consequence was the Cloudera/Hortonworks merger. And a few years later, the move to private equity.

In 2021, Cloudera still supports many different frameworks, including data lake technologies as well as event streaming platforms such as Storm, Kafka, Spark Streaming, and Flink. I am surprised that one relatively small company can do this. Well, honestly, I am not surprised. TL;DR: They can’t! They know each framework a little bit (and only the dying Hadoop ecosystem very well). This business model does not work. Also, in 2021, Cloudera still does not have a real SaaS product. This is no surprise either. It is not easy to build a true SaaS offering around 20+ frameworks.

Hence, my opinion is confirmed: Better to do one thing right if you are a relatively small company than to try to do everything.

The Lake House Strategy from AWS

That’s why data lakes are built today with other vendors: The major cloud providers (AWS, GCP, Azure, Alibaba), MongoDB, Databricks, Snowflake. All of them have their specific use cases and trade-offs. However, all of them have in common that they have a cloud-first strategy and a serverless SaaS offering for their data lake.

Let’s take a deeper look at AWS to understand what a modern, cloud-native strategy with a good business model looks like in 2021.

AWS is the market leader for public cloud infrastructure. Additionally, AWS invents new infrastructure categories regularly. For example, EC2 instances started the cloud era and enabled agile and elastic compute power. S3 became the de facto standard for object storage. Today, AWS has hundreds of innovative SaaS services.

AWS’ data lake strategy is based on the new buzzword Lake House:

Lake House architecture

As you can see, the key message is that one solution cannot solve all problems. However, even more important, all of these problems are solved with cloud-native, serverless AWS solutions:

Lake House architecture on AWS

This is how a cloud-native data lake offering in the public cloud has to look. Obviously, other hyperscalers like GCP and Azure head in the same direction with their serverless offerings.

Unfortunately, the public cloud is not the right choice for every problem due to latency, security, and cost reasons.

Hybrid and Multi-Cloud become the Norm

In recent years, many innovative solutions have tackled another market: edge and on-premise infrastructure. Some examples: AWS Local Zones, AWS Outposts, and AWS Wavelength. I really like AWS’ innovative approach of creating new infrastructure and software categories. Most cloud providers have very similar offerings, but AWS often rolls them out first, and others more or less copy them. Hence, I focus on AWS in this post.

Having said this, each cloud provider has specific strengths. GCP is well known for its leadership in open source-powered services around Kubernetes, TensorFlow, and others. IBM and Oracle are best in providing services and infrastructure for their own products.

I see demand for multiple cloud providers everywhere. Most customers I talk to have a multi-cloud strategy using AWS and other vendors like Azure, GCP, IBM, Oracle, Alibaba, etc. Good reasons exist to use different cloud providers, including cost, data locality, disaster recovery across vendors, vendor independence, historical reasons, dedicated cloud-specific services, etc.

Fortunately, the serverless Kafka SaaS Confluent Cloud is available on all major clouds. Hence, similar examples are available for using the fully-managed Kafka ecosystem with Azure and GCP. With that, let’s finally get to the Kafka part of this post… 🙂

From “Data at Rest” to “Data in Motion”

This was a long introduction before we actually come back to serverless Kafka. But only with background knowledge is it possible to understand the rise of data in motion and the need for cloud-native and serverless services.

Let’s start with the key messages to point out:

  • Real-time beats slow data in most use cases across industries.
  • For event streaming, the same cloud-native approach is required as for the modern data lake.
  • Event streaming and data lake / lake house technologies are complementary, not competitive.

The rise of event-driven architectures and data in motion powered by Apache Kafka enables enterprises to build real-time infrastructure and applications.

Apache Kafka – The De Facto Standard for Data in Motion

I will not explore the reasons and use cases for the success of Kafka in this post. Instead, check out my overview about Kafka use cases across industries for more details. Or read some of my vertical-specific blog posts.

In short, most added value comes from processing data in motion while it is relevant instead of storing data at rest and processing it later (or too late). The following diagram from Forrester’s Mike Gualtieri shows this very well:

Forrester - Business Value of Real Time Data

The Kafka API is the De Facto Standard API for Data in Motion like Amazon S3 for Object Storage:

Apache Kafka is the Open Source De Facto Standard API for Event Streaming and Data in Motion

While I understand that vendors such as Snowflake and MongoDB would like to get into the data in motion business, I doubt this makes much sense. As discussed earlier in the Cloudera section, it is best to focus on one thing and do that very well. That’s obviously why Confluent has strong partnerships not just with the cloud providers but also with Snowflake and MongoDB.

Apache Kafka is the battle-tested and scalable open-source framework for processing data in motion. However, it is more like a car engine.

A Complete, Serverless Kafka Platform

As I talk about cloud, serverless, AWS, etc., you might ask yourself: “Why even think about Kafka on AWS at all if you can simply use Amazon MSK?” – that’s the obligatory, reasonable question!

The short answer: Amazon MSK is a PaaS, NOT a fully managed and NOT a serverless Kafka SaaS offering.

Here is a simple counterquestion: Do you prefer to buy

  1. a battle-tested car engine (without wheels, brakes, lights, etc.)
  2. a complete car (including mature and automated security, safety, and maintenance)
  3. a self-driving car (including safe automated driving without the need to steer the car, refueling, changing brakes, product recalls, etc.)

Comparison of Apache Kafka Products and Services

In the Kafka world, you only get a self-driving car from Confluent. That’s not a sales or marketing pitch, but the truth. All other cloud offerings provide you with a self-managed product where you need to choose the brokers, fix bugs, do performance tuning, etc., by yourself. This is also true for Amazon MSK. Hence, I recommend evaluating the different offerings to understand whether “fully managed” or “serverless” is just a marketing term or reality.

Check out “Comparison of Open Source Apache Kafka vs. Vendors including Confluent, Cloudera, Red Hat, Amazon MSK” for more details.

No matter if you want to build a data lake / lake house architecture, integrate with other 3rd party applications, or build new custom business applications: Serverless is the way to go in the cloud!

Serverless, fully-managed Kafka

A fully managed serverless offering is the best choice if you are in the public cloud. No need to worry about operations efforts. Instead, focus on business problems using a pay-as-you-go model with consumption-based pricing and mission-critical SLAs and support.

Serverless Fully Managed Apache Kafka

A truly fully managed and serverless offering does not give you access to the server infrastructure. Or do you get access to your AWS S3 object storage or Snowflake server configs? No, because then you would have to worry about operations and could potentially impact or even destroy the cluster.

Self-managed Cloud-native Kafka

Not every Kafka cluster runs in the public cloud. Therefore, some Kafka clusters require partial management by their own operations team. I have seen plenty of enterprises struggling with Kafka operations themselves, especially if the use case is not just data ingestion into a data lake but mission-critical transactional or analytical workloads.

A cloud-native Kafka supports the operations team with automation. This reduces risks and efforts. For example, self-balancing clusters take over the rebalancing of partitions. Automated rolling upgrades allow you to upgrade to every new release instead of running an expensive and risky migration project (often with the consequence of never migrating to new versions). Separation of compute and storage (with Tiered Storage) enables large but still cost-efficient Kafka clusters with terabytes or even petabytes of data. And so on.

Oh, by the way: A cloud-native Kafka cluster does not have to run on Kubernetes. Ansible or plain container/bare-metal deployments are other common options to deploy Kafka in your own data center or at the edge. But Kubernetes definitely provides the best cloud-native experience regarding automation with elastic scale. Hence, various Kafka operators (based on CRDs) were developed in the past years, for instance, Confluent for Kubernetes or Red Hat’s Strimzi.

Kafka is more than just Messaging and Data Ingestion

Last but not least, let’s be clear: Kafka is more than just messaging and data ingestion. I see most Kafka projects today also leverage Kafka Connect for data integration and/or Kafka Streams/ksqlDB for continuous data processing. Thus, with Kafka, one single (distributed and scalable) infrastructure enables messaging, storage, integration, and processing of data:

Real-time decision making for claim processing and fraud detection in insurance with Apache Kafka

A fully-managed Kafka platform does not just operate Kafka but the whole ecosystem. For instance, fully-managed connectors enable serverless data integration with native AWS services such as S3, Redshift or Lambda, and non-AWS systems such as MongoDB Atlas, Salesforce or Snowflake. In addition, fully managed streaming analytics with ksqlDB enables continuous data processing at scale.
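To make the idea of a fully-managed connector concrete, here is a minimal sketch of what a sink connector configuration can look like. The property names and values below are illustrative assumptions in the general shape of Kafka Connect configs, not the verbatim API of any specific connector:

```python
# Illustrative sketch of a fully-managed S3 sink connector configuration.
# The field names are assumptions for illustration, not a verbatim API.
s3_sink_config = {
    "name": "orders-to-s3",            # hypothetical connector name
    "connector.class": "S3_SINK",      # placeholder class identifier
    "topics": "orders",                # Kafka topic(s) to drain into the lake
    "s3.bucket.name": "my-data-lake",  # hypothetical target bucket
    "data.format": "AVRO",             # serialization format in the lake
    "time.interval": "HOURLY",         # how often objects are rolled
}

def validate(config):
    """Return a sorted list of required keys missing from a connector config."""
    required = {"name", "connector.class", "topics"}
    return sorted(required - config.keys())

print(validate(s3_sink_config))  # prints []
```

The point is that with a fully-managed connector, this declarative config is all you provide; provisioning, scaling, and operating the connector workers is the platform’s job.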

A complete Kafka platform provides the whole ecosystem, including security (role-based access control, encryption, audit logs), data governance (schema registry, data quality, data catalog, data lineage), and many other features like global resilience, flexible DevOps automation, metrics, monitoring, etc.

Example 1: Event Streaming + Data Lake / Lake House

The following example shows how to use a complete platform to do real-time analytics with various Confluent components plus integration with AWS lake house services:

 

Real-time analytics with Confluent Cloud and AWS

Ingest & Process

Capture event streams with a consistent data structure using Schema Registry, develop real-time ETL pipelines with a lightweight SQL syntax using ksqlDB & unify real-time streams with batch processing using Kafka Connect Connectors.

Store & Analyze

Stream data with pre-built Confluent connectors into your AWS data lake or data warehouse to execute queries on vast amounts of streaming data for real-time and batch analytics.

This example shows very well how data lake or lake house services and event streaming complement each other. All services are SaaS. Even the integration (powered by Kafka Connect) is serverless.
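To illustrate the “ingest & process” step, here is a toy Python sketch of what a continuous ETL pipeline does conceptually: a ksqlDB persistent query behaves like a filter and projection applied to every event as it arrives, not to a batch after the fact. The event fields are made up for illustration:

```python
# Conceptual sketch of continuous ETL: filter invalid events and project
# the fields the data lake needs, event by event, as the data arrives.
def etl(events):
    for event in events:
        if event.get("amount", 0) <= 0:  # drop malformed payments
            continue
        yield {
            "user": event["user"],
            "amount": round(event["amount"], 2),
            "currency": event.get("currency", "USD"),  # assumed default
        }

raw = [
    {"user": "alice", "amount": 19.991, "currency": "EUR"},
    {"user": "bob", "amount": -5},  # invalid, filtered out
    {"user": "carol", "amount": 7.5},
]
cleaned = list(etl(raw))
print(cleaned)
```

In the real architecture, the cleaned stream would then be drained into S3 or Redshift by a connector instead of collected into a Python list.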

Example 2: Serverless Application and Microservice Integration

The following example shows how to use a complete platform to integrate existing and build new applications and serverless microservices with various Confluent and AWS services:

Serverless Application and Microservice Integration with Kafka Confluent Cloud and AWS

Serverless integration

Connect existing applications and data stores in a repeatable way without having to manage and operate anything. Apache Kafka and Schema Registry ensure that app compatibility is maintained. ksqlDB allows the development of real-time apps with SQL syntax. Kafka Connect provides effortless integrations with Lambda and data stores.

AWS serverless platform

Stop provisioning, maintaining, or administering servers for backend components such as compute, databases and storage so that you can focus on increasing agility and innovation for your developer teams.

Kafka Everywhere – Cloud, On-Premise, Edge

The public cloud is the future of the data center. However, two main reasons exist NOT to run everything in a public cloud infrastructure:

  • Brownfield architectures: Many enterprises have plenty of applications and infrastructure in data centers. Hybrid cloud architectures are the only option. Think about mainframes as one example.
  • Edge use cases: Some scenarios do not make sense in the public cloud due to cost, latency, security or legal reasons. Think about smart factories as one example.

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Several scenarios require multi-cluster solutions, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka. Learn more about this in the blog post “Architecture patterns for distributed, hybrid, edge, and global Apache Kafka deployments“.

Various AWS infrastructures enable the deployment of Kafka outside the public cloud. Confluent Platform is certified on AWS Outposts and therefore runs on various AWS hardware offerings.

Example 3: Hybrid Integration with Kafka-native Cluster Linking

Here is an example for brownfield modernization:

Accelerate modernization from on-prem to AWS with Kafka and Confluent Cluster Linking

 

Connect

Pre-built connectors continuously bring valuable data from existing services on-prem, including enterprise data warehouse, databases, and mainframes. In addition, bi-directional communication is also possible where needed.

Bridge

Hybrid cloud streaming enables consistent, reliable replication in real-time to build a modern event-driven architecture for new applications and the integration with 1st and 3rd party SaaS interfaces.

Modernize

Public cloud infrastructure increases agility in getting applications to market and reduces TCO by freeing up resources to focus on value-generating activities instead of managing servers.

Example 4: Low-Latency Kafka with Cloud-native 5G Infrastructure on AWS Wavelength

Low-latency data streaming requires infrastructure that runs close to the edge: machines, devices, sensors, smartphones, and other interfaces. AWS Wavelength was built for these scenarios. The enterprise does not have to install its own IT infrastructure at the edge.

The following architecture shows an example built by Confluent, AWS, and Verizon:

Low Latency 5G Use Cases with AWS Wavelength based on AWS Outposts and Confluent

Learn more about low latency use cases for data in motion in the blog post “Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure“.

Live Demo – Hybrid Cloud Replication

Here is a live demo I recorded to show the streaming replication between an on-premise Kafka cluster and Confluent Cloud, including stream processing with ksqlDB and data integration with Kafka Connect (using the fully-managed AWS S3 connector):

Slides from Joint AWS + Confluent Webinar on Serverless Kafka

I recently did a webinar together with AWS to present the added value of combining Confluent and AWS services. This has become a standard pattern in recent years for many transactional and analytical workloads.

Here are the slides:

Reverse ETL and its Relation to Data Lakes and Kafka

To conclude this post, I want to explore one more term you might have heard about. The buzzword is still in its early stage, but more and more vendors pitch a new trend: Reverse ETL. In short, this means storing data in your favorite long-term storage (database / data warehouse / data lake / lake house) and then getting the data out of there again to connect it to other business systems.

In the Kafka world, this is the same as Change Data Capture (CDC). Hence, Reverse ETL is not a new thing. Confluent provides CDC connectors for many relevant systems, including Oracle, MongoDB, and Salesforce.
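To illustrate the overlap, here is a hedged sketch of what a CDC connector conceptually emits: a row-level database change wrapped in a before/after envelope that downstream consumers turn into Kafka-style records. The field names mimic a common CDC pattern but are illustrative, not any specific connector’s wire format:

```python
# Illustrative CDC change event (envelope shape is an assumption,
# loosely following the common before/after pattern).
change_event = {
    "source": {"table": "customers"},
    "op": "u",  # c=create, u=update, d=delete
    "before": {"id": 42, "email": "old@example.com"},
    "after":  {"id": 42, "email": "new@example.com"},
}

def to_record(event):
    """Derive a Kafka-style (key, value) pair from a change event."""
    row = event["after"] if event["op"] != "d" else event["before"]
    return str(row["id"]), row

key, value = to_record(change_event)
print(key, value["email"])
```

Every downstream system can consume such change events from the streaming platform in real-time, which is exactly why "Reverse ETL" looks like a rebranding of an established pattern.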

As I mentioned earlier, data storage vendors try to get into the data in motion business. I am not sure where this will go. Personally, I am convinced that the event streaming platform is the right place in the enterprise architecture for processing data in motion. This way every application can consume the data in real-time – decoupled from another database or data lake. But the future will show. I thought it was a good ending to this blog post to clarify this new buzzword and its relation to Kafka. I wrote more about this topic in its own article: “Apache Kafka and Reverse ETL“.

Serverless and Cloud-native Kafka with AWS and Confluent

Cloud-first strategies are the norm today. Whether the use case is a new greenfield project, a brownfield legacy integration architecture, or a modern edge scenario with hybrid replication. Kafka became the de facto standard for processing data in motion. However, Kafka is just a piece of the puzzle. Most enterprises prefer a complete, cloud-native service.

AWS and Confluent are a proven combination for various use cases across industries to deploy and operate Kafka environments everywhere, including truly serverless Kafka in the public cloud and cloud-native Kafka outside the public cloud. Of course, while this post focuses on the relationship between Confluent and AWS, Confluent has similarly strong partnerships with GCP and Azure to provide data in motion on these hyperscalers, too.

How do you run Kafka in the cloud today? Do you operate it by yourself? Do you use a partially managed service? Or do you use a truly serverless offering? Are your projects greenfield, or do you use hybrid architectures? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage https://www.kai-waehner.de/blog/2021/05/09/kafka-api-de-facto-standard-event-streaming-like-amazon-s3-object-storage/ Sun, 09 May 2021 14:32:49 +0000 https://www.kai-waehner.de/?p=3380 Real-time beats slow data in most use cases across industries. The rise of event-driven architectures and data in motion powered by Apache Kafka enables enterprises to build real-time infrastructure and applications. This blog post explores why the Kafka API became the de facto standard API for event streaming like Amazon S3 for object storage, and the tradeoffs of these standards and corresponding frameworks, products, and cloud services.

The post Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage appeared first on Kai Waehner.


De Facto Standard API - Amazon S3 for Object Storage and Apache Kafka for Event Streaming

Event-Driven Architecture: This Time It’s Not A Fad

The Forbes article “Event-Driven Architecture: This Time It’s Not A Fad” from April 2021 explained why enterprises are not just talking about event-driven real-time applications but are finally building them. Here are some arguments:

  • REST limitations can limit your business strategy
  • Data needs to be fluid and real-time
  • Microservices and serverless need event-driven architectures

Real-time Data in Motion beats Slow Data

Use cases for event-driven architectures exist across industries. Some examples:

  • Transportation: Real-time sensor diagnostics, driver-rider match, ETA updates
  • Banking: Fraud detection, trading, risk systems, mobile applications/customer experience
  • Retail: Real-time inventory, real-time POS reporting, personalization
  • Entertainment: Real-time recommendations, a personalized news feed, in-app purchases
  • The list goes on across verticals…

Real-time data in motion beats data at rest in databases or data lakes in most scenarios. There are a few exceptions that require batch processing:

  • Reporting (traditional business intelligence)
  • Batch analytics (processing high volumes of data in a bundle, for instance, Hadoop and Spark’s map-reduce, shuffling, and other data processing only make sense in batch mode)
  • Model training as part of a machine learning infrastructure (while model scoring and monitoring often requires real-time predictions, the model training is batch in almost all currently available ML algorithms)

Beyond these exceptions, almost everything is better in real-time than batch.

Be aware that real-time data processing is more than just sending data from A to B in real-time (aka messaging or pub/sub). Real-time data processing requires integration and processing capabilities. If you send data into a database or data lake in real-time but have to wait until it is processed there in batch, it does not solve the problem.
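A toy example makes the distinction concrete. Pure pub/sub only moves a payment event from A to B; stream processing derives an answer (here, a naive fraud flag) while the event is still in motion. The threshold logic below is invented purely for illustration:

```python
# Minimal sketch: deriving a result per event, while the data is in motion,
# instead of landing everything in storage and querying it later in batch.
from collections import deque

class RollingFraudCheck:
    """Flag a payment if it exceeds 3x the rolling average (toy logic)."""

    def __init__(self, window=3):
        self.recent = deque(maxlen=window)  # bounded state, like a stream window

    def on_event(self, amount):
        avg = sum(self.recent) / len(self.recent) if self.recent else amount
        flagged = amount > 3 * avg
        self.recent.append(amount)
        return flagged

check = RollingFraudCheck()
results = [check.on_event(a) for a in [10, 12, 11, 100]]
print(results)  # the spike is flagged the moment it arrives
```

A batch pipeline would compute the same flag hours later, which is too late for blocking a fraudulent transaction; that is the gap between data in motion and data at rest.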

With the ideas around real-time in mind, let’s explore what a de facto standard API is.

What is a (De Facto) Standard API?

The answer is longer than you might expect and needs to be separated into three sections:

  • API
  • Standard API
  • De facto standard API

What is an API?

An application programming interface (API) is an interface that defines interactions between multiple software applications or mixed hardware-software intermediaries. It defines the kinds of calls or requests that can be made, how to make them, the data formats that should be used, the conventions to follow, etc. It can also provide extension mechanisms so that users can extend existing functionality in various ways and to varying degrees.

An API can be entirely custom, specific to a component, or designed based on an industry-standard to ensure interoperability. Through information hiding, APIs enable modular programming, allowing users to use the interface independently of the implementation.
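A minimal sketch of this information-hiding idea: callers program against the interface, so the implementation behind it can be swapped without breaking them. This is exactly the property that later lets S3-compatible stores plug in behind one API. The classes below are illustrative, not a real SDK:

```python
# Callers depend only on the interface; the implementation is hidden
# and therefore replaceable (in-memory, on-disk, or a remote service).
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    @abstractmethod
    def put(self, key, data):
        """Store bytes under a key."""

    @abstractmethod
    def get(self, key):
        """Return the bytes stored under a key."""

class InMemoryStore(ObjectStore):
    """One possible implementation; callers never see the dict inside."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

store = InMemoryStore()  # could be swapped for any other ObjectStore
store.put("logs/2021-06-25.json", b'{"ok": true}')
print(store.get("logs/2021-06-25.json"))
```

Any other class implementing the same two methods could replace `InMemoryStore` without touching the calling code, which is the whole value of a stable API.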

What is a Standard API?

Industry consortiums or other industry-neutral (often global) groups or organizations specify standard APIs. A few characteristics show the trade-offs:

  • Vendor-agnostic interfaces
  • Slow evolution and long specification process
  • Most vendors add proprietary features because a) the standardization process is too slow, or, more often, b) they want to differentiate their commercial offering
  • Acceptance and success depend on the complexity and added value (this sounds obvious but is often the key blocker for success)

Examples for Standard APIs

Here are some examples of standard APIs. I also add my thoughts if I think they are successful or not (but I fully understand that there are good arguments against my opinion).

Generic Standards
  • SQL: Domain-specific language used in programming and designed for managing data held in a relational database management system. Successful as almost every database somehow supports SQL or tries to build a similar syntax. A good example is ksqlDB, the Kafka-native streaming SQL engine. ksqlDB (like most other streaming SQL engines) is not ANSI SQL, but still understood easily by people that know SQL.
  • J2EE / Java EE / Jakarta EE: Successful as most vendors adopted at least parts of it for Java frameworks. While early versions were very heavyweight and complex, the current APIs and implementations are much more lightweight and user-friendly. JMS is a great example where vendors added proprietary add-ons to add features and differentiate. No vendor-lockin is only true in theory!
  • HTTP: Successful as application layer protocol for distributed, collaborative, hypermedia information systems. While not 100% correct, people typically interpret HTTP as REST Web Services. HTTP is often misused for things it is not built for.
  • SOAP / WSDL: Partly successful in providing XML-based web service standard specifications. Some vendors built good tooling around it. However, this is typically only true for the basic standards such as SOAP and WSDL, not so much for all the other complex add-ons (often called WS-* hell).
Standards for a Specific Problem or Industry
  • OPC-UA for Industrial IoT (IIoT): Partly successful machine-to-machine communication protocol developed for industrial automation. Adopted by almost every vendor in the industrial space. The drawback (similar to HTTP) is that it is often misused. For instance, MQTT is a much better and more lightweight choice in some scenarios. OPC-UA is a great example where the core is successful, but the industry-specific add-ons are not prevalent and not supported by tools. Also, OPC-UA is too heavyweight for many of the use cases it is used in.
  • PMML for Machine Learning: Not successful as an XML-based predictive model interchange format. The idea is great: Train an analytic model once and then deploy it across platforms and programming languages. In practice, it did not work. Too many limitations and unnecessary complexity for a project. Most real-world machine learning deployments I have seen in the wild avoid it and deploy models to production with a standard wrapper. ONNX and other successors are not more prevalent yet either.

In summary, some standard APIs are successful and adopted well; many others are not. Contrary to these standards specified by consortiums, there is another category emerging: De Facto Standard APIs.

What is a De Facto Standard API?

De facto standard APIs originate from an existing successful solution (that can be an open-source framework, a commercial product, or a cloud service). These de facto standard APIs emerge in two ways:

  • Driven by a single vendor (often proprietary), for example: Amazon S3 for object storage.
  • Driven by a huge community around a successful open-source project, for example: Apache Kafka for event streaming.

No matter how a de facto standard API originated, they typically have a few characteristics in common:

  • Creation of a new category of software, something that did not exist before
  • Adoption by other frameworks, products, or cloud services because the API became the de facto standard
  • No complex, formal, long-running standard processes; hence innovation is possible in a relatively flexible and agile way
  • Practical processes and rules are in place to ensure good quality and consensus (either controlled by the owner company for a proprietary standard API or across the open source community)

Let’s now explore two de facto standard APIs: Amazon S3 and Apache Kafka. Both are very successful but very different regarding being a standard. Hence, the trade-offs are very different.

Amazon S3: De Facto Standard API for Object Storage

Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface in the public AWS cloud. It uses the same scalable storage infrastructure that Amazon.com uses to run its global e-commerce network. Amazon S3 can be employed to store any type of object, which allows for uses like storage for internet applications, backup and recovery, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage. Additionally, S3 on Outposts provides on-premises object storage for on-premises applications that require high-throughput local processing.

“Amazon CTO on Past, Present, Future of S3” is a great read about the evolution of this fully-managed cloud service. While the public API was kept stable, the internal backend architecture under the hood changed significantly several times. Plus, new features were developed on top of the API, for instance, AWS Athena for analytics and interactive queries using standard SQL. I really like how Werner Vogels describes his understanding of a good cloud service:

Vogels doesn’t want S3 users to even think for a moment about spindles or magnetic hardware. He doesn’t want them to care about understanding what’s happening in those data centers at all. It’s all about the services, the interfaces, and the flexibility of access, preferably with the strongest consistency and lowest latency when it really matters.

So, we are talking about a very successful proprietary cloud service by AWS. Hence, what’s the point?

Most Object Storage Vendors Support the Amazon S3 API

Many enterprises use the Amazon S3 API. Hence, it became the de facto standard. If other storage vendors want to sell object storage, supporting the S3 interface is often crucial to get through the evaluations and RFPs. If you don’t support the S3 API, it is much harder for companies to adopt the storage and implement the integration (as most companies already use Amazon S3 and have built tools, scripts, testing around this API).

For this reason, many applications have been built to support the Amazon S3 API natively. This includes applications that write data to Amazon S3 and Amazon S3-compatible object stores.

S3 compatible solutions include client backup, file browser, server backup, cloud storage, cloud storage gateway, sync&share, hybrid storage, on-premises storage, and more.

Many vendors sell S3-compatible products: Oracle, EMC, Microsoft, NetApp, Western Digital, MinIO, Pure Storage, and many more. Check out the Amazon S3 site from Wikipedia for a more detailed and complete list.

So why has the S3 API become so ubiquitous?

The creation of a new software category is a dream for every vendor! Let’s understand how and why Amazon was successful in establishing S3 for object storage. The following is a quote from Chris Evans’ great article from 2016: “Has S3 become the de facto API standard?”

So why has the S3 API become so ubiquitous?  I suspect there are a number of reasons.  These include:

  • First to market – When S3 was launched in 2006, most enterprises were familiar with object storage as “content addressable storage” through EMC’s Centera platform.  Other than that, applications were niche and not widely adopted except for specific industries like High Performance Computing where those users were used to coding to and for the hardware.  S3 quickly became a platform everyone could use with very little investment.  That made it easy to consume and experiment with.  By comparison, even today the leaders in object storage (as ranked by the major analysts) still don’t make it easy (or possible) to download and evaluate their products, even though most are software only implementations.
  • Documentation – following on from the previous point, S3 has always been well documented, with examples on how to run API commands.  There’s a document history listing changes over the past 6-7 years that shows exactly how the API has evolved.
  • A Single Agenda – the S3 API was designed to fit a single agenda – that of storing and retrieving objects from S3.  As such, Amazon didn’t have to design by committee and could implement the features they required and evolve from there.  Contrast that with the CDMI (Cloud Data Management Interface) from SNIA.  The SNIA website is difficult to navigate, the standard itself is only on the 4th published iteration in six years, while the documentation runs to 264 pages! (Note that the S3 API runs into more pages, but is infinitely more consumable, with simple examples from page 11 onwards).

Cons of a Proprietary De Facto Standard like Amazon S3

Many people might say: “Better a proprietary standard than no standard.” I partly agree with this. The ability to learn one API and use it across multi-cloud and on-premises systems and vendors is great. However, Amazon S3 has several disadvantages as it is NOT an open standard:

  • Other vendors (have to) build their implementation on a best guess about the behavior of the API. There is no official standard specification they can rely on.
  • Customers cannot be sure what they buy. At least, they should not expect the same behavior of 3rd party S3 implementations that they get from their experiences using Amazon S3 on AWS.
  • Amazon can change APIs and features as it likes. Other vendors need to “reverse engineer the API” and adjust their products.
  • Amazon could sue competitors for using S3 API branding – even though this is not likely to happen as the benefits are probably bigger (I am not a lawyer; hence this statement might be wrong and is just my personal opinion)

Let’s now look at an open-source de facto standard: Kafka.

Kafka API: De Facto Standard API for Event Streaming

Apache Kafka is mainstream today! The Kafka API became the de facto standard for event-driven architectures and event streaming. Two proof points:

The Kafka API (aka Kafka Protocol)

Kafka became the de facto event streaming API, similar to how the S3 API became the de facto standard for object storage. Actually, the situation is even better for the Kafka API: while the S3 API is a proprietary protocol from AWS, the Kafka API and protocol are open source under the Apache 2.0 license.

The Kafka protocol covers the wire protocol implemented in Kafka. It defines the available requests, their binary format, and the proper way to make use of them to implement a client.

One of my favorite characteristics of the Kafka protocol is backward compatibility. Kafka has a “bidirectional” client compatibility policy: new clients can talk to old servers, and old clients can talk to new servers. This allows users to upgrade either clients or servers without experiencing any downtime or data loss. It makes Kafka ideal for microservice architectures and domain-driven design (DDD). Kafka really decouples applications from each other, in contrast to web service/REST-based architectures.
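The wire protocol is simple enough to illustrate with nothing but the standard library. The sketch below hand-assembles an ApiVersionsRequest (api_key 18, version 0) — the request a Kafka client sends first to negotiate which API versions the broker supports, which is the mechanism behind the bidirectional compatibility policy. The framing follows the Kafka protocol guide: a 4-byte length prefix, then the request header, then the (here empty) body. The correlation id and client id are arbitrary example values:

```python
import struct

def api_versions_request(correlation_id: int, client_id: str) -> bytes:
    """Assemble a Kafka ApiVersionsRequest (api_key 18, version 0) frame.

    Wire format: 4-byte big-endian length prefix, then the request header
    (api_key: int16, api_version: int16, correlation_id: int32,
    client_id: length-prefixed string), then the body (empty for v0).
    """
    cid = client_id.encode("utf-8")
    header = struct.pack(">hhih", 18, 0, correlation_id, len(cid)) + cid
    return struct.pack(">i", len(header)) + header

frame = api_versions_request(correlation_id=1, client_id="demo-client")
size = struct.unpack(">i", frame[:4])[0]
print(size, len(frame))  # the length prefix counts everything after itself
```

Sending this frame over a TCP socket to any broker that speaks the Kafka protocol returns the broker's supported version range per API; the client then picks request versions both sides understand. That negotiation is what lets old and new clients and servers interoperate.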

Pros of an Open Source De Facto Standard like the Kafka API

The huge benefit of an open-source de facto standard API is that it is open and usually follows a collaborative standardized process to make changes to the API. This brings various benefits to the community and software vendors.

The following facts about the Kafka API make many developers and enterprises happy:

  • Changes occur in a visible process enforced by a committee. For Apache Kafka, the Apache Software Foundation (ASF) is the relevant organization. Apache projects are managed using a collaborative, consensus-based process with members from various countries and enterprises. Check out how it works if you don’t know it yet.
  • Frameworks and vendors can implement against the open protocol and validate the implementation. That is significantly different from proprietary de facto standards like Amazon S3. Having said this, not every product that claims to use the Kafka API is 100% compatible; some are limited in feature set or behave differently.
  • Developers can test the underlying behavior against the same API. Hence, unit and performance tests for different implementations can use the same code.
  • The Apache 2.0 license makes sure that the user does not have to worry about infringing any patents by using the software.

Frameworks, Products, and Cloud Services using the Kafka API

Many frameworks and vendors adopted the Kafka API. Let’s take a look at a few very different alternatives available today that use the Kafka API:

  • Open-source Apache Kafka from the Apache website
  • Self-managed Kafka-based vendor solutions for on-premises or cloud deployments from Confluent, Cloudera, Red Hat
  • Partially managed Kafka-based cloud offerings from Amazon MSK, Red Hat, Azure HDInsight, Aiven, CloudKarafka, and Instaclustr.
  • Fully managed Kafka cloud offerings such as Confluent Cloud – actually, there is no other serverless, fully compatible Kafka SaaS offering on the market today (even though many marketing departments try to sell it like this)
  • Partly protocol-compatible, self-managed solutions such as Apache Pulsar (with a simple, very limited Kafka wrapper class) or Redpanda for embedded / WebAssembly (WASM) use cases
  • Partly protocol-compatible, fully managed offerings like Azure EventHubs

Just be aware that the devil is in the details. Many offerings only implement a fraction of the Kafka API. Additionally, many offerings only support the core messaging concept, but exclude key features such as Kafka Connect for data integration, Kafka Streams for stream processing, or exactly-once semantics (EOS) for building transactional systems.

The Kafka API Dominates the Event Streaming Landscape

If you look at the current event streaming landscape, you see that more and more frameworks and products adopt the Kafka API. Even though the following is not a complete list (and other non-Kafka offerings exist), it is impressive:

[Figure: Event streaming landscape with Apache Kafka-related products and cloud services, including Confluent, Cloudera, Red Hat, IBM, Amazon MSK, Redpanda, Pulsar, and Azure Event Hubs]

If you want to learn more about the different Kafka offerings on the market, check out my Kafka vendor comparison. It is crucial to understand what Kafka offering is right for you. Do you want to focus on business logic and consume the Kafka infrastructure as a service? Or do you want to implement security, integration, monitoring, etc., by yourself?

The Kafka API is here to stay…

The Kafka API became the de facto standard API for event streaming. The usage of an open protocol creates huge benefits for corresponding frameworks, products, and cloud services leveraging the Kafka API.

Vendors can implement against the open standard and validate their implementation. End users can choose the best solution for their business problem. Migration between different Kafka services is also possible relatively easily – as long as each vendor is compliant with the Kafka protocol and implements it completely and correctly.

Are you using the Kafka API today? Open source Kafka (“car engine”), a commercial self-managed offering (“complete car”), or the serverless Confluent Cloud (“self-driving car”) to focus on business problems? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage appeared first on Kai Waehner.
