GCP Archives - Kai Waehner https://www.kai-waehner.de/blog/tag/gcp/ Technology Evangelist - Big Data Analytics - Middleware - Apache Kafka Thu, 06 Mar 2025 07:15:37 +0000 en-US hourly 1 https://wordpress.org/?v=6.7.2 https://www.kai-waehner.de/wp-content/uploads/2020/01/cropped-favicon-32x32.png GCP Archives - Kai Waehner https://www.kai-waehner.de/blog/tag/gcp/ 32 32 Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Services for Data Streaming with Kafka and Flink https://www.kai-waehner.de/blog/2025/01/18/fully-managed-saas-vs-partially-managed-paas-cloud-services-for-data-streaming-with-kafka-and-flink/ Sat, 18 Jan 2025 11:33:44 +0000 https://www.kai-waehner.de/?p=7224 The cloud revolution has reshaped how businesses deploy and manage data streaming with solutions like Apache Kafka and Flink. Distinctions between SaaS and PaaS models significantly impact scalability, cost, and operational complexity. Bring Your Own Cloud (BYOC) expands the options, giving businesses greater flexibility in cloud deployment. Misconceptions around terms like “serverless” highlight the need for deeper analysis to avoid marketing pitfalls. This blog explores deployment options, enabling informed decisions tailored to your data streaming needs.

The post Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Services for Data Streaming with Kafka and Flink appeared first on Kai Waehner.

]]>
The cloud revolution has transformed how businesses deploy, scale, and manage data streaming solutions. While Software-as-a-Service (SaaS) and Platform-as-a-Service (PaaS) cloud models are often used interchangeably in marketing, their distinctions have significant implications for operational efficiency, cost, and scalability. In the context of data streaming around Apache Kafka and Flink, understanding these differences and recognizing common misconceptions—such as the overuse of the term “serverless”—can help you make an informed decision. Additionally, the emergence of Bring Your Own Cloud (BYOC) offers yet another option, providing organizations with enhanced control and flexibility in their cloud environments.

SaaS vs PaaS Cloud Service for Data Streaming with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landcape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

If you’re still grappling with the fundamentals of stream processing, this article is a must-read: “Stateless vs. Stateful Stream Processing with Kafka Streams and Apache Flink“.

What is SaaS in Data Streaming?

SaaS data streaming solutions are fully managed services where the provider handles all aspects of deployment, maintenance, scaling, and updates. SaaS offerings are designed for ease of use, providing a serverless experience where developers focus solely on building applications rather than managing infrastructure.

Characteristics of SaaS in Data Streaming

  1. Serverless Architecture: Resources scale automatically based on demand. True SaaS solutions eliminate the need to provision or manage servers.
  2. Low Operational Overhead: The provider manages hardware, software, and runtime configurations, including upgrades and security patches.
  3. Pay-As-You-Go Pricing: Consumption-based pricing aligns costs directly with usage, reducing waste during low-demand periods.
  4. Rapid Deployment: SaaS enables users to start processing streams within minutes, accelerating time-to-value.

Examples of SaaS in Data Streaming:

  • Confluent Cloud: A fully managed Kafka platform offering serverless scaling, multi-tenancy, and a broad feature set for both stateless and stateful processing.
  • Amazon Kinesis Data Analytics: A managed service for real-time analytics with automatic scaling.

What is PaaS in Data Streaming?

PaaS offerings sit between fully managed SaaS and infrastructure-as-a-service (IaaS). These solutions provide a platform to deploy and manage applications but still require significant user involvement for infrastructure management.

Characteristics of PaaS in Data Streaming

  1. Partial Management: The provider offers tools and frameworks, but users must manage servers, clusters, and scaling policies.
  2. Manual Configuration: Deployment involves provisioning VMs or containers, tuning parameters, and monitoring resource usage.
  3. Complex Scaling: Scaling is not always automatic; users may need to adjust resource allocation based on workload changes.
  4. Higher Overhead: PaaS requires more expertise and operational involvement, making it less accessible to teams without dedicated DevOps resources.

PaaS offerings in data streaming, while simplifying some infrastructure tasks, still require significant user involvement compared to fully serverless SaaS solutions. Below are three common examples, along with their benefits and pain points compared to serverless SaaS:

  • Apache Flink (Self-Managed on Kubernetes Cloud Service like EKS, AKS or GKE)
    • Benefits: Full control over deployment and infrastructure customization.
    • Pain Points: High operational overhead for managing Kubernetes clusters, manual scaling, and complex resource tuning.
  • Amazon Managed Service for Apache Flink (Amazon MSF)
    • Benefits: Simplifies infrastructure management and integrates with some other AWS services.
    • Pain Points: Users still handle job configuration, scaling optimization, and monitoring, making it less user-friendly than serverless SaaS solutions.
  • Amazon MSK (Managed Streaming for Apache Kafka)
    • Benefits: Eases Kafka cluster maintenance and integrates with the AWS ecosystem.
    • Pain Points: Requires users to design and manage producers/consumers, manually configure scaling, and handle monitoring responsibilities. MSK also excludes support for Kafka if you have any operational issues with the Kafka piece of the infrastructure.

SaaS vs. PaaS: Key Differences

SaaS and PaaS differ in the level of management and user responsibility, with SaaS offering fully managed services for simplicity and PaaS requiring more user involvement for customization and control.

FeatureSaaSPaaS
InfrastructureFully managed by the providerPartially managed; user controls clusters
ScalingAutomatic and server lessManual or semi-automatic scaling
Deployment SpeedImmediate, ready to useSlower; requires configuration
Operational ComplexityMinimalModerate to high
Cost ModelConsumption-based, no idle costsMay incur idle resource costs

The big benefit of PaaS over SaaS is greater flexibility and control, allowing users to customize the platform, integrate with specific infrastructure, and optimize configurations to meet unique business or technical requirements. This level of control is often essential for organizations with strict compliance, security, or data sovereignty requirements.

SaaS is NOT Always Better than PaaS!

Be careful: The limitations and pain points of PaaS do NOT mean that SaaS is always better.

A concrete example: Amazon MSK Serverless simplifies Apache Kafka operations with automated scaling and infrastructure management but comes with significant limitations, including the lack of fully-managed connectors, advanced data governance tools, and native multi-language client support.

Amazon MSK also excludes Kafka engine support, leading to potential operational risks and cost unpredictability, especially when integrating with additional AWS services for a complete data streaming pipeline. I explored these challenges in more detail in my article “When NOT to choose Amazon MSK Serverless for Apache Kafka?“.

Bring Your Own Cloud (BYOC) as Alternative to PaaS

BYOC (Bring Your Own Cloud) offers a middle ground between fully managed SaaS and self-managed PaaS solutions, allowing organizations to host applications in their own VPCs.

BYOC provides enhanced control, security, and compliance while reducing operational complexity. This makes BYOC a strong alternative to PaaS for companies with strict regulatory or cost requirements.

As an example, here are the options of Confluent for deploying the data streaming platform: Serverless Confluent Cloud, Self-managed Confluent Platform (some consider this a PaaS if you leverage Confluent’s Kubernetes Operator and other automation / DevOps tooling) and WarpStream as BYOC offering:

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

While BYOC complements SaaS and PaaS, it can be a better choice when fully managed solutions don’t align with specific business needs. I wrote a detailed article about this topic: “Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud)“.

“Serverless” Claims: Don’t Trust the Marketing

Many cloud data streaming solutions are marketed as “serverless,” but this term is often misused. A truly serverless solution should:

  1. Abstract Infrastructure: Users should never need to worry about provisioning, upgrading, or cluster sizing.
  2. Scale Transparently: Resources should scale up or down automatically based on workload.
  3. Eliminate Idle Costs: There should be no cost for unused capacity.

However, many products marketed as serverless still require some degree of infrastructure management or provisioning, making them closer to PaaS. For example:

  • A so-called “serverless” PaaS solution may still require setting initial cluster sizes or monitoring node health.
  • Some products charge for pre-provisioned capacity, regardless of actual usage.

Do Your Own Research

When evaluating data streaming solutions, dive into the technical documentation and ask pointed questions:

  • Does the solution truly abstract infrastructure management?
  • Are scaling policies automatic, or do they require manual configuration?
  • Is there a minimum cost even during idle periods?

By scrutinizing these factors, you can avoid falling for misleading “serverless” claims and choose a solution that genuinely meets your needs.

Choosing the Right Model for Your Data Streaming Business: SaaS, PaaS, or BYOC

When adopting a data streaming platform, selecting the right model is crucial for aligning technology with your business strategy:

  • Use SaaS (Software as a Service) if you prioritize ease of use, rapid deployment, and operational simplicity. SaaS is ideal for teams looking to focus entirely on application development without worrying about infrastructure.
  • Use PaaS (Platform as a Service) if you require deep customization, control over resource allocation, or have unique workloads that SaaS offerings cannot address.
  • Use BYOC (Bring Your Own Cloud) if your organization demands full control over its data but sees benefits in fully managed services. BYOC enables you to run the data plane within your cloud VPC, ensuring compliance, security, and architectural flexibility while leveraging SaaS functionality for the control plane .

In the rapidly evolving world of data streaming around Apache Kafka and Flink, SaaS data streaming platforms like Confluent Cloud provide the best of both worlds: the advanced features of tools like Apache Kafka and Flink, combined with the simplicity of a fully managed serverless experience. Whether you’re handling stateless stream processing or complex stateful analytics, SaaS ensures you’re scaling efficiently without operational headaches.

What deployment option do you use today for Kafka and Flink? Any changes planned in the future? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Services for Data Streaming with Kafka and Flink appeared first on Kai Waehner.

]]>
The Data Streaming Landscape 2025 https://www.kai-waehner.de/blog/2024/12/04/the-data-streaming-landscape-2025/ Wed, 04 Dec 2024 13:49:37 +0000 https://www.kai-waehner.de/?p=6909 Data streaming is a new software category. It has grown from niche adoption to becoming a fundamental part of modern data architecture, leveraging open source technologies like Apache Kafka and Flink. With real-time data processing transforming industries, the ecosystem of tools, platforms, and cloud services has evolved significantly. This blog post explores the data streaming landscape of 2025, analyzing key players, trends, and market dynamics shaping this space.

The post The Data Streaming Landscape 2025 appeared first on Kai Waehner.

]]>
Data streaming is a new software category. It has grown from niche adoption to becoming a fundamental part of modern data architecture. With real-time data processing transforming industries, the ecosystem of tools, platforms, and services has evolved significantly. This blog post explores the data streaming landscape of 2025, analyzing key players, trends, and market dynamics shaping this space.

The data streaming landscape of 2025 categorizes solutions by their adoption and completeness as fully-featured data streaming platforms, as well as their deployment models, which range from self-managed setups to BYOC (Bring Your Own Cloud) and fully managed cloud services like PaaS and SaaS. While Apache Kafka remains the backbone of this ecosystem, the landscape also includes stream processing engines like Apache Flink and competitive technologies such as Pulsar and Redpanda that are built on the Kafka protocol.

This blog also explores the latest market trends and provides an outlook for 2025 and beyond, highlighting potential new entrants and evolving use cases. By the end, you’ll gain a clear understanding of the data streaming platform landscape and its trajectory in the years to come.

The Data Streaming Landcape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And make sure to download by free ebook about data streaming use cases and industry-specific success stories.

Data Streaming in 2025: The Rise of a New Software Category

Real-time data beats slow data. That’s true across almost all use cases in any industry. Event-driven applications powered by data streaming continuously process data from any data source. This approach increases the business value as the overall goal by increasing revenue, reducing cost, reducing risk, or improving the customer experience. And the event-driven architecture ensures a future-ready architecture.

Even top researchers and advisory firms such as Forrester, Gartner, and IDG recognize data streaming as a new software category. In December 2023, Forrester released “The Forrester Wave™: Streaming Data Platforms, Q4 2023,” highlighting Microsoft, Google, and Confluent as leaders, followed by Oracle, Amazon, and Cloudera.

Data Streaming is NOT just another data integration tool. Plenty of software categories and related data platforms exist to process and analyze data. I explored in a few dedicated series how data streaming. differs:

The Business Value of Data Streaming

A new software category opens use cases and adds business value across all industries:

Use Cases for Data Streaming with Apache Kafka by Business Value
Source: Lyndon Hedderly (Confluent)

Adding business value is crucial for any enterprise. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products.

Search my blog for your favorite industry to find plenty of case studies and architectures. Or read about use cases for Apache Kafka across industries to get started.

The Data Streaming Landscape of 2025

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category. Several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Various SaaS startups have emerged in this category in the last few years.

The Data Streaming Landcape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

It all began with the open-source framework Apache Kafka, and today, the Kafka protocol is widely adopted across various implementations, including proprietary ones. What truly matters now is leveraging the capabilities of a complete data streaming platform—one that is fully compatible with the Kafka protocol. This includes built-in features like connectors, stream processing, security, data governance, and the elimination of self-management, reducing risks and operational effort.

The Kafka Protocol is the De Facto Standard of Data Streaming

Most software vendors use Kafka (or its protocol) at the core of their data streaming platformsApache Kafka has become the de facto standard for data streaming.

Apache Kafka is the de facto standard API for data streaming in 2024

Additionally, “benchmarketing” (i.e., picking a sweet spot or niche scenario where you perform better than your competitor) is the favorite marketing technique to “prove” differentiators to the real Apache Kafka. Some vendors also present misleading cost-efficiency comparisons by excluding critical cloud costs such as data transfer or storage, giving an incomplete picture of the true expenses.

Apache Kafka vs. Data Streaming Platform

Many still use Kafka merely as a dumb ingestion pipeline, overlooking its potential to power sophisticated, real-time data streaming use cases. One reason is that Kafka alone lacks the full capabilities of a comprehensive data streaming platform.

A complete solution requires more than “just” Kafka. Apache Flink is becoming the de facto standard for stream processing. Data integration capabilities (connectors, clients, APIs), data governance, security, and critical 24/7 SLAs and support are important for many data streaming projects.

The Data Streaming Landscape 2025 summarizes the current status of relevant products and cloud services, focusing on deployment models and the adoption/completeness of the data streaming platforms.

Data Streaming Vendors and Categories for the 2025 Landscape

The data streaming landscape changed this year. As most solutions evolve, I do not distinguish anymore between Kafka, non-Kafka, and stream processing as categories. Instead, I look at the adoption and completeness to assess the maturity of a data streaming solution from an open-source framework to a complete platform.

The deployment models also changed in the 2025 landscape. Instead of categorizing it into Self Managed, Partially Managed, and Fully Managed, I sort as follows: Self Managed, Bring Your Own Cloud (BYOC), and Cloud. The Cloud category is separated into PaaS (Platform as a Service) and SaaS (Software as a Service) to indicate that many Kafka cloud offerings are still NOT fully managed!

Please note: Intentionally, this data streaming landscape is not a complete list of frameworks, cloud services, or vendors. It is also not an official research. There is no statistical evidence. You might miss your favorite technology in this diagram. Then I did not see it in my conversations with customers, prospects, partners, analysts, or the broader data streaming community.

Also, note that I focus on general data streaming infrastructure. Brilliant solutions exist for using and analyzing streaming data for specific scenarios, like time-series databases, machine learning engines, observability platforms, or purpose-built IoT solutions. These are usually complementary, often connected out of the box via a Kafka connector, or even built on top of a data streaming platform (invisible for the end user).

Adoption and Completeness of Data Streaming (X-Axis)

Data streaming is adopted more and more across all industries. The concept is not new. In “The Past, Present and Future of Stream Processing“, I explored how the data streaming journey started decades ago with research and the first purpose-built proprietary products for specific use cases like stock trading.

Open source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Is anyone still remembering (or even still using) Apache Storm? 🙂

Today, most enterprises are realizing the value of data streaming for both analytical and operational use cases across industries. The cloud has brought a transformative shift, enabling businesses to start streaming and processing data with just a click, using fully managed SaaS solutions and intuitive UIs. Complete data streaming platforms now offer many built-in features that users previously had to develop themselves, including connectors, encryption, access control, governance, data sharing, and more.

Capabilities of a Complete Data Streaming Platform

Data streaming vendors are on the way to building a complete Data Streaming Platform (DSP). Capabilities include:

  • Messaging (“Streaming”): Transfer messages in real-time and persist for durability, decoupling, and slow consumers (near real-time, batch, API, file).
  • Data Integration: Connect to any legacy and cloud-native sources and sinks.
  • Stream Processing: Correlate events with stateless and stateful transformation or business logic.
  • Data Governance:  Ensure security, observability, data sovereignty, and compliance.
  • Developer Tooling: Enable flexibility for different personas such as software engineers, data scientists, and business analysts by providing different APIs (such as Java, Python, SQL, REST/HTTP), graphical user interfaces, and dashboards.
  • Operations Tooling and SaaS: Ease infrastructure management on premise respectively take over the entire operations burden in the public cloud with serverless offerings.
  • Uptime SLAs and Support: Provide the required guarantees and expertise for critical use cases.

Evolution from Open Source Adoption to a Data Streaming Organization

Modern data streaming is not just about adopting a product; it’s about transforming the way organizations operate and derive value from their data. Hence, the adoption goes beyond product features:

  • From open source and self-operations to enterprise-grade products and SaaS.
  • From human scale to automated, serverless elasticity with consumption-based pricing.
  • From dumb ingestion pipes to complex data pipelines and business applications.
  • From analytical workloads to critical transactional (and analytical) use cases.
  • From a single data streaming cluster to a powerful edge, hybrid, and multi-cloud architecture, including integration, migration, aggregation, and disaster recovery scenarios.
  • From wild adoption across business units with uncontrolled growth using various frameworks, cloud services, and integration tools to a center of excellence (CoE) with a strategic approach with standards, best practices, and knowledge sharing in an internal community.
  • From effortful and complex human management to enterprise-wide data governance, automation, and self-service APIs.

Data Streaming Deployment Models: Self-Managed vs. BYOC vs. Cloud (Y-Axis)

Different data streaming categories exist regarding the deployment model:

  • Self-Managed: Operate nodes like Kafka Broker, Kafka Connect, and Schema Registry by yourself with your favorite scripts and tools. This can be on-premise or in the public cloud in your VPC. Reduce the operations burden via a cloud-native platform (usually Kubernetes) and related operator tools that automate operations tasks like rolling upgrades or rebalancing Kafka Partitions.
  • Bring Your Own Cloud (BYOC): Allow organizations to host Kafka within their own cloud VPC. BYOC combines some of the benefits of cloud flexibility with enhanced security and control, while it outsources most of Kafka’s management to specialized vendors. The data plane is still customer-managed, but in contrast to self-managed Kafka, the customer does not need to worry about the complexity under the hood (like rebalancing, rolling upgrades, backups) – that is what cloud-native object storage and other magic code of the BYOC control plane service take over.
  • Cloud (PaaS or SaaS): Both PaaS and SaaS solutions operate within the cloud provider’s VPC. Fully managed SaaS for data streaming takes overall operational responsibilities, including scaling, failover handling, upgrades, and performance tuning, allowing users to focus solely on integration and business logic. In contrast, partially managed PaaS reduces the operational burden by automating certain tasks like rolling upgrades and rebalancing, but still requires some level of user involvement in managing the infrastructure. Fully Managed SaaS typically provides critical SLAs for support and uptime while partially managed PaaS cannot provide such guarantees.

Most organizations prefer SaaS for data streaming when business and technical requirements allow, as it minimizes operational complexity and maximizes scalability. Other deployment models are chosen when specific constraints or needs require them.

The Evolution of BYOC Kafka Cloud Services

Cloud and On-Premise deployment options are typically well understood, but BYOC (Bring Your Own Cloud) often requires more explanation due to its unique operating model and varying implementations across vendors.

In last year’s data streaming landscape 2024, I wrote the following about BYOC for Kafka:

I do NOT believe in this approach as too many questions and challenges exist with BYOC regarding security, support, and SLAs in the case of P1 and P2 tickets and outages. Hence, I put this in the category of self-managed. That is what it is, even though the vendor touches your infrastructure. In the end, it is your risk because you have to and want to control your environment.

This statement made sense because BYOC vendors at that time required access to the customer VPC and offered a shared support model. While this is still true for some BYOC solutions, my mind changed with the innovation of BYOC by one emerging vendor: WarpStream.

WarpStream’s BYOC Operating Model with Stateless Agents in the Customer VPC

WarpStream published a new operating model for BYOC: The customer only deploys stateless agents in its VPC and provides an object storage bucket to store the data. The control plane and metadata store are fully managed by the vendor as SaaS and the vendor takes over all the complexity.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

With this innovation, BYOC is now a worthwhile third deployment option besides a self-managed and fully managed data streaming platform. It brings several benefits:

  • No access is needed by the BYOC cloud vendor to the customer VPC. The data plane (i.e., the “brokers” in the customer VPC) is stateless. The metadata/consensus is in the control plane (i.e., the cloud service in the WarpStream VPC).
  • The architecture solves sovereignty challenges and is a great fit for security and compliance requirements.
  • The cost of the BYOC offering is cheaper than self-managed Apache Kafka because it is built with cloud-native concepts and technologies in mind (e.g., zero disks and zero interzone networking fees, leveraging cloud object storage such as Amazon S3, Azure Blog Storage, or Google Cloud Storage).
  • The stateless architecture in the customer VPC makes autoscaling and elasticity very easy to implement/configure.

When to use BYOC?

WarpStream introduced an innovative share-nothing operating model that makes BYOC practical, secure, and cost-efficient. With that being said, I still recommend only looking at BYOC options for Apache Kafka in the public cloud if a fully managed and serverless data streaming platform does NOT work for you because of cost, security, or compliance reasons! When it comes to simplicity and ease of operation, nothing beats a fully managed cloud service.

And please keep in mind that NOT every BYOC cloud service provides these characteristics and benefits. Make sure to make a proper evaluation of your favorite solutions. For more details, look at my blog post: “Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud)“.

Changes in the Data Streaming Landscape from 2024 to 2025

My goal is NOT a growing landscape with tens or even hundreds of vendors and cloud services. Plenty of these pictures exist. Instead, I focus on a few technologies, vendors, and cloud offerings that I see in the field, with adoption by the broader open-source and cloud community.

I already discussed the important conceptual changes in the data streaming landscape:

  • Deployment Model: From self-managed, partially managed, and fully managed to self-managed, BYOC and cloud.
  • Streaming Categories: From different streaming categories to a single category for all data streaming platforms sorted by adoption and completeness.

Additionally, every year I modified the list of solutions compared to the data streaming landscape 2024 published one year ago.

Additions to the Data Streaming Landscape 2025

The following data streaming services were added:

  • Alibaba (Cloud): Confluent Data Streaming Service on Alibaba Cloud is an OEM partnership between Alibaba Cloud and Confluent to offer a fully managed SaaS in Mainland China. The service was announced end of 2021 and sees more and more traction in Asia. Alibaba is the contractor and first-level support for the end user.
  • Google Managed Service for Kafka (Cloud): Google announced this Kafka PaaS recently. The strategy looks very similar to Amazon’s MSK. Even the shortcut is the same: MSK. I explored when (not) to choose Google’s Kafka cloud service after the announcement. The service is still in preview, but available to a huge customer base already.
  • Oracle Streaming with Apache Kafka (Cloud): A partially managed Apache Kafka PaaS on Oracle Cloud Infrastructure (OCI). The service is in early access, but available to a huge customer base already.
  • WarpStream (BYOC): WarpStream was acquired by Confluent. It is still included with its logo as Confluent continues to keep the brand and solution separated (at least for now).

Removals from the Data Streaming Landscape 2025

There are NO REMOVALS this year, BUT I was close to removing two technologies:

  • Apache Pulsar and StreamNative: I was close to removing Apache Pulsar as I see zero traction in the market. Around 2020, Pulsar had some traction but focused too much on Kafka FUD instead of building a vibrant community. While Kafka simplified its architecture (ZooKeeper removal), Pulsar still includes three distributed systems (ZooKeeper or alternatives like etcd, BookKeeper, and Pulsar Broker). It also pivots to the Kafka protocol trying to get some more traction again. But it seems to be too late.
  • ksqlDB (formerly called KSQL): The product is feature complete. While it is still supported by Confluent, it will not get any new features. ksqlDB is still a great piece of software for some (!) Kafka-native stream processing projects but might be removed in the future. Confluent co-founder and Kafka co-creator Jay Kreps commented on X (former Twitter): “Confluent went through a set of experiments in this area. What we learned is that for *platform* layers you want a clean separation. We learned this the hard way: our source available stream processing layer KSQL, lost to open-source Apache Flink. We pivoted to Flink.

Vendor Overview for Data Streaming Platforms

All vendors of the landscape do some kind of data streaming. However, the offerings differ a lot in adoption, completeness, and vision. And many solutions are not available everywhere but only in one public cloud or only as self-managed. For detailed product information and experiences, the vendor websites and other blogs/conference talks are the best and most up-to-date resources. The following is just a summary to get an overview.

Before we do the deep dive, here again, the entire data streaming landscape for 2025:

The Data Streaming Landcape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Self-Managed Data Streaming with Open Source and Proprietary Products

Self Managed Data Streaming Platforms including Kafka Flink Spark IBM Confluent Cloudera

The following list describes the open-source frameworks and proprietary products for self-managed data streaming deployments (in order of adoption and completeness):

  • Apache Pulsar: A competitor to Apache Kafka. Similar story and use cases, but different architecture. Kafka is a single distributed cluster – after removing the ZooKeeper dependency in 2022. Pulsar is three (!) distributed clusters: Pulsar brokers, ZooKeeper, and BookKeeper. Pulsar vs. Kafka explored the differences. And Kafka catches up to some missing features like Queues for Kafka.
  • StreamNative: The primary vendor behind Apache Pulsar. Not much market traction.
  • ksqlDB (usually called KSQL, even after Confluent’s rebranding): An abstraction layer on top of Kafka Streams to provide stream processing with streaming SQL. Hence, also Kafka-native. It comes with a Confluent Community License and is free to use. Sweet spot: Streaming ETL.
  • Redpanda: Implements the Kafka protocol with C++. Trying out different market strategies to define Redpanda as an alternative to a Kafka-native offering. Still in the early stage in the maturity curve. Adding tons of (immature) product features in parallel to find the right market fit in a growing Kafka market. Recently acquired Benthos to provide connectivity to data sources and sinks (similar to Kafka Connect).
  • Ververica: Well-known Flink company. Acquired by Chinese tech giant Alibaba in 2019. Not much traction in Europe and the US. Sweet spot: Flink in Mainland China.
  • Apache Flink: Becoming the de facto standard for stream processing. Open-source implementation. Provides advanced features including a powerful scalable compute engine, freedom of choice for developers between SQL, Java, and Python, APIs for Complex Event Processing (CEP), and unified APIs for stream and batch workloads.
  • Spark Streaming: The streaming part of the open-source big data processing framework Apache Spark. The enormous installed base of Spark clusters in enterprises broadens adoption thanks to solutions from Cloudera, Databricks, and the cloud service providers. Sweet spot: Analytics in (micro)batches with data stored at rest in the data lake/lakehouse.
  • Apache Kafka: The de facto standard for data streaming. Open-source implementation with a vast community. Almost all vendors rely on (parts of) this project. Often underestimated: Kafka includes data integration and stream processing capabilities with Kafka Connect and Kafka Streams, making even the open-source Apache download already more powerful than many other data streaming frameworks and products.
  • IBM / Red Hat AMQ Streams: Provides Kafka as self-managed Kafka on Kubernetes via OpenShift. Kafka is part of the integration portfolio that includes other open-source frameworks like Apache Camel. Sweet spot: Existing IBM customers.
  • Cloudera: Provides Kafka, Flink, and other open-source data and analytics frameworks as a self-managed offering. The main strategy is offering one product with a vast combination of many open-source frameworks that can be deployed on any infrastructure. Sweet spot: Analytics.
  • Confluent Platform: Focuses on a complete data streaming platform including Kafka and Flink, and various advanced data streaming capabilities for operations, integration, governance, and security. Sweet spot: Unifying operational and analytical workloads, and combination with the fully managed cloud service.

Data Streaming with Bring Your Own Cloud (BYOC)

Bring Your Own Cloud BYOC Data Streaming Platforms Redpanda Databricks WarpStream

BYOC is an emerging category and is mainly used for specific challenges such as strict data security and compliance requirements. The following vendors provide dedicated BYOC offerings for data streaming (in order of adoption and completeness)

  • WarpStream (Confluent): A new entrant into the data streaming market. The cloud service is a Kafka-compatible data streaming platform built directly on top of S3. Innovated the BYOC model to enable secure and cost-effective data streaming for workloads that don’t have strong latency requirements.
  • Redpanda: The first BYOC offering on the market for data streaming. The biggest concern is the shared responsibility model of this solution because the vendor requires access to the customer VPC for operations and support. This is against the key principles of BYOC regarding security and compliance and why organizations (have to) look for BYOC instead of SaaS solutions.
  • Databricks: Cloud-based data platform that provides a collaborative environment for data engineering, data science, and machine learning, built on top of Apache Spark. Data Streaming is enabled by Spark Streaming and focuses mainly on analytical workloads that are optimized from batch to near real-time.

Partially Managed Data Streaming Cloud Platforms (PaaS)

PaaS Cloud Streaming Platforms like Red Hat Amazon MSK Google Oracle Aiven

Here is an overview of relevant PaaS data streaming cloud services (in order of adoption and completeness):

  • Google Cloud Managed Service for Apache Kafka (MSK): Initially branded as Google Managed Kafka for BigQuery (likely for a better marketing push), the service enables data ingestion into lakehouses on GCP such as Google BigQuery.
  • Amazon Managed Service for Apache Flink (MSF): A partially managed service by AWS that allows customers to transform and analyze streaming data in real-time with Apache Flink. It still provides some (costly) gaps for auto-scaling and is not truly serverless. Supports all Flink interfaces, i.e., SQL, Java, and Python. And non-Kafka connectors, too. Only available on AWS.
  • Oracle OCI Streaming with Apache Kafka: The service is still in early access, but available to a huge customer base already on Oracle’s cloud infrastructure.
  • Microsoft Azure HDInsight. A piece of Azure’s Hadoop infrastructure. Not intended for other use cases beyond data ingestion for batch analytics.
  • Instaclustr: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies. Instaclustr also offers a (semi-)managed offering for on-premise infrastructure.
  • Amazon Kinesis: Data ingestion into AWS data stores. Mature product for a specific problem. Only available on AWS.
  • Aiven: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies.
  • IBM / Red Hat AMQ Streams: Provides Kafka as a partially managed cloud offering on OpenShift (through Red Hat). Sweet spot: Existing IBM customers.
  • Amazon Managed Service for Apache Kafka (MSK): AWS has hundreds of cloud services, and Kafka is part of that broad spectrum. MSK is only available in public AWS cloud regions; not on Outposts, Local Zones, Wavelength, etc. MSK is likely the largest partially managed Kafka service across all clouds. It evolved with new features like support for Kafka Connect and Tiered Storage. But lacks connectivity outside the AWS ecosystem and a data governance narrative.

Fully Managed Data Streaming Cloud Services (SaaS)

Fully Managed SaaS Data Streaming Platform including Confluent Alibaba Microsoft Event Hubs

Here is an overview of relevant SaaS data streaming cloud services (in order of adoption and completeness):

  • Decodable: A relatively new cloud service for Apache Flink in the early stage. Huge potential if it is combined with existing Kafka infrastructures in enterprises. But also provides pre-built connectors for non-Kafka systems. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
  • StreamNative Cloud: The primary vendor behind Apache Pulsar. Offers self-managed and fully managed solutions. StreamNative Cloud for Kafka is still in a very early stage of maturity, not sure if it will ever strengthen.
  • Ververica: Stream processing as a service powered by Apache Flink on all major cloud providers. Huge potential if it is combined with existing Kafka infrastructures in enterprises. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
  • Redpanda Cloud: Redpanda provides its data streaming as a serverless offering. Not much information is available on the website about this part of the vendor’s product portfolio.
  • Amazon MSK Serverless: Different functionalities and limitations than Amazon MSK. MSK Serverless still does not get much traction because of its limitations. Both MSK offerings exclude Kafka support in their SLAs (please read the terms and conditions).
  • Google Cloud DataFlow: Fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. Mature product for a specific problem. Only available on GCP.
  • Azure Event Hubs: A mature, fully managed cloud service. The service does one thing, and that is done very well: Data ingestion via the Kafka protocol. Hence, it is not a complete streaming platform but is more comparable to Amazon Kinesis or Google Cloud PubSub. The limited compatibility with the Kafka protocol and the high cost of the service for higher throughput are the two major blockers that come up regularly in conversations.
  • Confluent Cloud: A complete data streaming platform including Kafka and Flink as a fully managed offering. In addition to deep integration, the platform provides connectivity, security, data governance, self-service portal, disaster recovery tooling, and much more to be the most complete DSP on all major public clouds.

Potential for the Data Streaming Landscape 2026

Data streaming is a journey. So is the development of event streaming platforms and cloud services. Several established software and cloud vendors might get more traction with their data streaming offerings. And some startups might grow significantly. The following shows a few technologies that might evolve and see growing adoption in 2025:

  • New startups around the Kafka protocol emerge. The list of new frameworks and cloud services is growing every quarter. A few names I saw in some social network posts (but not much beyond in the real world): AutoMQ, S2, Astradot, Bufstream, Responsive, tansu, Tektite, Upstash. While some focus on the messaging/streaming part, others focus on a particular piece such as building database capabilities.
  • Streaming databases like Materialize or RisingWave might become a new software category. My feeling: Very early stage of the hype cycle. We will see in 2025 if and where this technology gets more broadly adopted and what the use cases are. It is hard to answer how these will compete with emerging real-time analytics databases like Apache Druid, Apache Pinot, ClickHouse, Timeplus, Tinybird, et al. I know there are differences, but the broader community and companies need to a) understand these differences and b) find business problems for it.
  • Stream Processing SaaS startups emerge: Quix and Bytewax provide stream processing with Python. Quix now also offers a hosted offering based on Kafka Streams; as does Responsive. DeltaStream provides Apache Flink as SaaS. And many more startups emerge these days. Let’s see which of these gets traction in the market with an innovative product and business model.
  • Traditional data management vendors like MongoDB or Snowflake try to get deeper into the data streaming business. I am still a fan of separation of concerns; so I think these should keep their sweet spot and (only) provide streaming ingestion and CDC as use cases, but not (try to) compete with data streaming vendors.

Fun fact: The business model of almost all emerging startups is fully managed cloud services, not selling licenses for on-premise deployments. Many are based on open-source or open-core, and others only provide a proprietary implementation.

Although they are not aiming to be full data streaming platforms (and thus are not part of the platform landscape), other complementary tools are gaining momentum in the data streaming ecosystem. For instance, Conduktor is developing a proxy for Kafka clusters, and Lenses, though relatively quiet since its acquisition by Celonis, has recently introduced updates to its user-friendly management and developer tools. These tools address gaps that some data streaming platforms leave unfilled.

Data Streaming: A Journey, Not a Sprint

Data streaming isn’t a sprint—it’s a journey! Adopting event-driven architectures with technologies like Apache Kafka or Apache Flink requires rethinking how applications are designed, developed, deployed, and monitored. Modern data strategies involve legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud environments.

The data streaming landscape in 2025 highlights the emergence of a new software category, though it’s still in its early stages. Building such a category takes time. In discussions with customers, partners, and the community, a common sentiment emerges: “We understand the value but are just starting with the first data streaming pipelines and have a long-term plan to implement advanced stream processing and build a strategic data streaming organization.”

The Forrester Wave: Streaming Data Platforms, Q4 2023, and other industry reports from Gartner and IDG signal that this category is progressing through the hype cycle.

Last but not least, check out my Top Data Streaming Trends for 2025 to understand how the data streaming landscape fits into emerging trends:

  1. Democratization of Kafka: Apache Kafka has transitioned from a specialized tool to a foundational platform in modern data infrastructure.
  2. Kafka Protocol as the Standard: Vendors adopt the Kafka protocol over the framework, enabling flexibility with compatibility and performance trade-offs.
  3. BYOC Deployment Model: Bring Your Own Cloud gains traction for balancing security, compliance, and managed services.
  4. Flink Becomes the Standard for Stream Processing: Apache Flink rises as the premier framework for stream processing, building integration pipelines and business applications.
  5. Data Streaming for Real-Time Predictive AI and GenAI: Real-time model inference drives predictive and generative AI applications.
  6. Data Streaming Organizations: Companies unify real-time data strategies to standardize tools, governance, and collaboration.

Which are your favorite open-source frameworks or cloud services for data streaming? What are your most relevant and exciting trends around Apache Kafka and Flink in 2024 to set data in motion? What does your enterprise landscape for data streaming look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. Make sure to download by free ebook about data streaming use cases and industry examples.

The post The Data Streaming Landscape 2025 appeared first on Kai Waehner.

]]>
Multi-Cloud Replication in Real-Time with Apache Kafka and Cluster Linking https://www.kai-waehner.de/blog/2024/08/14/multi-cloud-replication-in-real-time-with-apache-kafka-and-cluster-linking-between-aws-azure/ Wed, 14 Aug 2024 06:07:28 +0000 https://www.kai-waehner.de/?p=6487 Multiple Apache Kafka clusters are the norm; not an exception anymore. Hybrid integration and multi-cloud replication for migration or disaster recovery are common use cases. This blog post explores a real-world success story from financial services around the transition of a large traditional bank from on-premise data centers into the public cloud for multi-cloud data sharing between AWS and Azure.

The post Multi-Cloud Replication in Real-Time with Apache Kafka and Cluster Linking appeared first on Kai Waehner.

]]>
Multiple Apache Kafka clusters are the norm; not an exception anymore. Hybrid integration and multi-cloud replication for migration or disaster recovery are common use cases. This blog post explores a real-world success story from financial services around the transition of a large traditional bank from on-premise data centers into the public cloud for multi-cloud data sharing between AWS and Azure.

Multi-Cloud Replication in Real-Time with Apache Kafka and Cluster Linking

What is Multi-Cloud and How Does Apache Kafka Help?

Multi-cloud refers to the use of multiple cloud computing services from different providers in a single heterogeneous IT environment. This approach enhances flexibility, performance, and reliability while avoiding vendor lock-in.

Here are the key benefits of multi-cloud:

  • Avoidance of Vendor Lock-In: By utilizing multiple cloud providers, organizations can avoid dependency on a single vendor, reducing the risk associated with vendor-specific outages and price changes.
  • Optimization of Performance and Cost: Different cloud providers offer varying strengths, pricing models, and geographic availability. Multi-cloud strategies enable organizations to choose the best provider for each workload to optimize performance and cost.
  • Enhanced Redundancy and Resilience: Multi-cloud setups can provide higher availability and disaster recovery capabilities by distributing workloads across multiple cloud environments, thus reducing the impact of localized outages.
  • Regulatory and Compliance Benefits: Some industries and regions have specific regulatory requirements that may be easier to meet using a multi-cloud approach, ensuring data residency and compliance.

The real-time capabilities of Apache Kafka are a perfect match for multi-cloud architectures. Information is replicated and synchronized directly after the creation of an event. Apache Kafka’s combination of high throughput, low latency, durability provides strong data consistency guarantees across multiple cloud providers like AWS, Azure, GCP and Alibaba; no matter if the data sources or data sinks are real time, batch or API-driven request

One Apache Kafka Cluster Does NOT Fit All Use Cases

Organizations require multiple Kafka cluster strategies for various use cases: Hybrid integration, aggregation, migration and disaster recovery. I explored the architecture options and trade-offs in a dedicated blog post: “Apache Kafka Cluster Type Deployment Strategies“.

One Apache Kafka Cluster Type Does NOT Fit All Use Cases

Multi-cloud is a special case with even higher challenges regarding security, cost, and latency. Nevertheless, all larger organizations have multi-cloud infrastructure and integration needs. Let’s explore the multi-cloud journey of fidelity investments and how data streaming with Apache Kafka helps.

Fidelity’s Hybrid Cloud Data Streaming Journey

Fidelity Investments is a leading financial services company that provides a wide range of investment management, retirement planning, brokerage, and wealth management services to individuals and institutions. Founded in 1946, Fidelity has grown to become one of the largest asset managers in the world, with trillions of dollars in assets under management. The company is known for its comprehensive research tools, innovative technology platforms, and commitment to customer service, helping clients achieve their financial goals.

Fidelity Investments presented at Kafka Summit events about their data streaming journey transitioning from on-premise to hybrid cloud infrastructure.

Fidelity Cloud Journey in Banking FinServ
Source: Fidelity Investments

Fidelity’s Event Streaming Platform

Fidelity Investments built a streaming platform that hosts business application events for event driven architectures and integration with other business applications through events and streams.

Fidelity Event Streaming Platform Architecture with Apache Kafka and Confluent
Source: Fidelity Investments

A few impressive numbers about from Fidelity’s event streaming platform infrastructure (presented at Kafka Summit London in 2023):

  • 4 years in public cloud
  • 16k+ producer and consumer applications
  • 6B+ events per day
  • 72+ self-service APIs
  • 300+ observability metrics

One of the first critical use cases was the integration and offloading from IBM z Systems mainframe via IBM MQ, Kafka Connect (deployed on the mainframe) and Confluent Platform.

For architectures and best practices around mainframe modernization, check out my article “Mainframe Integration, Offloading and Replacement with Apache Kafka“.

Fidelity’s Cloud Journey: From Point-to-Point to Decoupling Applications with Kafka and Cluster Linking between AWS and Azure

The Kafka Summit talk “Multi-Cloud Data Sharing: Make the Data Move for you Across CSPs using Cluster Linking” explored Fidelity Investment’s transition to the cloud.

Fidelity Investments designed its multi-cloud event streaming platform to enable applications residing in different cloud service providers to seamlessly share data between them.

BEFORE: Point-to-Point Multi-Cloud Replication

Many architects call this the spaghetti integration architecture. All applications do a point-to-point connection to each other application:

Fidelity Point to Point Integration Across Multi Cloud AWS Azure
Source: Fidelity Investments

This setup is costly, error-prone and hard to maintain or innovate.

AFTER: Apache Kafka and Cluster Linking for Real-Time Data Sharing and Decoupled Applications

One of Apache Kafka’s unique values is truly decoupling between applications. The event-based durable commit log guarantees data consistency but also allows choosing the right technology or API in each business unit.

Replication between the Kafka clusters running in different cloud infrastructures and regions on AWS and Azure is implemented with Confluent Cluster Linking.

Fidelity Data Sharing with Apache Kafka and Confluent Cluster Linking
Source: Fidelity Investments

Confluent Cluster Linking plays a crucial role in this design for real-time replication between Kafka clusters. It uses the Kafka protocol for replication to provide all the benefits of Kafka and no needed infrastructure and operations overhead with tools like MirrorMaker. Using the Kafka protocol for multi-cloud replication also affects, i.e., reduces the network cost significantly because it avoids many translations and compression tasks required by MirrorMaker.

Fidelity’s Multi-Cloud Requirements: Data Ownership, Data Contracts and Self-Service API

Besides reliable real-time replication across multi-cloud environments, Fidelity Investments’ other important aspects include for the multi-cloud Kafka enterprise architecture include:

  • Data ownership in a multi-cloud environment
  • Schema Registry to provide common data contracts (often called data products in a data mesh architecture) across Kafka clusters in different cloud providers
  • Self-service management API plane allowing teams to manage their multi-cloud topic replications with as little as a single configuration change.

Transition to Cloud with Data Streaming Across Industries

Multi-cloud use cases include data integration, migration, aggregation and disaster recovery scenarios. Here are a few real-world examples from the financial services, healthcare and telecom sector:

Even if you do not plan multi-cloud infrastructure because focusing on a single service provider across regions, you can be sure: The next merger and acquisition (M&A) comes for sure… 🙂 Multi-cloud scenarios are not an exception, but the norm in larger organizations.

Do you already deploy across multiple cloud providers? What are the use cases? How do you efficiently and reliably integrate these environments? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Multi-Cloud Replication in Real-Time with Apache Kafka and Cluster Linking appeared first on Kai Waehner.

]]>
Apache Iceberg – The Open Table Format for Lakehouse AND Data Streaming https://www.kai-waehner.de/blog/2024/07/13/apache-iceberg-the-open-table-format-for-lakehouse-and-data-streaming/ Sat, 13 Jul 2024 12:01:01 +0000 https://www.kai-waehner.de/?p=6143 An open table format framework like Apache Iceberg is essential in the enterprise architecture to ensure reliable data management and sharing, seamless schema evolution, efficient handling of large-scale datasets and cost-efficient storage. This blog post explores market trends, adoption of table format frameworks like Iceberg, Hudi, Paimon, Delta Lake and XTable, and the product strategy of leading vendors of data platforms such as Snowflake, Databricks (Apache Spark), Confluent (Apache Kafka / Flink), Amazon Athena and Google BigQuery.

The post Apache Iceberg – The Open Table Format for Lakehouse AND Data Streaming appeared first on Kai Waehner.

]]>
Every data-driven organization has operational and analytical workloads. A best of breed approach emerges with various data platforms, including data streaming, data lake, data warehouse and lakehouse solutions and cloud services. An open table format framework like Apache Iceberg is essential in the enterprise architecture to ensure reliable data management and sharing, seamless schema evolution, efficient handling of large-scale datasets and cost-efficient storage while providing strong support for ACID transactions and time travel queries. This blog post explores market trends, adoption of table format frameworks like Iceberg, Hudi, Paimon, Delta Lake, XTable, and the product strategy of leading vendors of data platforms such as Snowflake, Databricks (Apache Spark), Confluent (Apache Kafka / Flink), Amazon Athena and Google BigQuery.

Apache Iceberg Standard Table Format for Data Lake Lakehouse Streaming wtih Kafka Flink Databricks Snowflake AWS GCP Azure

What is an Open Table Format for a Data Platform?

An open table format helps in maintaining data integrity, optimizing query performance, and ensuring a clear understanding of the data stored within the platform.

The open table format for data platforms typically includes a well-defined structure with specific components that ensure data is organized, accessible, and easily queryable. A typical table format contains a table name, column names, data types, primary and foreign keys, indexes, and constraints.

This is not a new concept. Your favourite decades-old database, like Oracle, IBM DB2 (even on the mainframe) or PostgreSQL, uses the same principles. However, the requirements and challenges changed a bit for cloud data warehouses, data lakes, lake houses regarding scalability, performance and query capabilities.

Benefits of a “Lakehouse Table Format” like Apache Iceberg

Every part of an organization becomes data-driven. The consequence is extensive data sets, data sharing with data products across business units, and new requirements for processing data in near real-time.

Apache Iceberg provides many benefits for the enterprise architecture:

  • Single Storage: Data is stored once (coming from various data sources) reduces cost and complexity
  • Interoperability: Access without integration efforts from any analytical engine
  • All Data: Unify operational and analytical workloads (transactional systems, big data logs/IoT/clickstream, mobile APIs, 3rd party B2B interfaces, etc.)
  • Vendor Independence: Work with any favorite analytics engine (no matter if it is near real-time, batch or API-based)

Apache Hudi and Delta Lake provide the same characteristics. Though, Delta Lake is mainly driven by Databricks as a single vendor.

Table Format AND Catalog Interface

It is important to understand that discussions about Apache Iceberg or similar table format frameworks include two concepts: Table Format AND Catalog Interface! As an end user of the technology, you need both!

The Apache Iceberg project implements the format but only provides a specification (but not implementation) for the catalog:

  • The table format defines how data is organized, stored, and managed within a table.
  • The catalog interface manages the metadata for tables and provides an abstraction layer for accessing tables in a data lake.

The Apache Iceberg documentation explores the concepts in much more detail, based on this diagram:

Apache Iceberg Metadata Architecture
Source: Apache Iceberg

Organizations use various implementations for Iceberg’s catalog interface. Each integrates with different metadata stores and services. Key implementations include:

  • Hadoop Catalog: Uses the Hadoop Distributed File System (HDFS) or other compatible file systems to store metadata. Suitable for environments already using Hadoop.
  • Hive Catalog: Integrates with Apache Hive Metastore to manage table metadata. Ideal for users leveraging Hive for their metadata management.
  • AWS Glue Catalog: Uses AWS Glue Data Catalog for metadata storage. Designed for users operating within the AWS ecosystem.
  • REST Catalog: Provides a RESTful interface for catalog operations via HTTP. Enables integration with custom or third-party metadata services.
  • Nessie Catalog: Uses Project Nessie, which provides a Git-like experience for managing data.

The momentum and growing adoption of Apache Iceberg motivates many data platform vendors to implement its own Iceberg catalog. I discuss a few strategies in the below section about data platform and cloud vendor strategies, including Snowflake’s Polaris, Databricks’ Unity, and Confluent’s Tableflow.

First-Class Iceberg Support vs. Iceberg Connector

Please note that supporting Apache Iceberg (or Hudi/Delta Lake) means much more than just providing a connector and integration with the table format via API. Vendors and cloud services differentiate by advanced features like automatic mapping between data formats, critical SLAs, travel back in time, intuitive user interfaces, and so on.

Let’s look at an example: Integration between Apache Kafka and Iceberg. Various Kafka Connect connectors were already implemented. However, here are the benefits of using a first-class integration with Iceberg (e.g., Confluent’s Tableflow) compared to just using a Kafka Connect connector:

  • No connector config
  • No consumption through connector
  • Built-in maintenance (compaction, garbage collection, snapshot management)
  • Automatic schema evolution
  • External catalog service synchronization
  • Simpler operations (in a fully-managed SaaS solution, it is serverless, with no need for any scale or operations by the end-user)

Similar benefits apply to other data platforms and potential first-class integration compared to providing simple connectors.

Open Table Format for a Data Lake / Lakehouse using Apache Iceberg, Apache Hudi, Delta Lake

The general goal of table format frameworks such as Apache Iceberg, Apache Hudi, and Delta Lake is to enhance the functionality and reliability of data lakes by addressing common challenges associated with managing large-scale data. These frameworks help to:

  1. Improve Data Management
    • Facilitate easier handling of data ingestion, storage, and retrieval in data lakes.
    • Enable efficient data organization and storage, supporting better performance and scalability.
  2. Ensure Data Consistency
    • Provide mechanisms for ACID transactions, ensuring that data remains consistent and reliable even during concurrent read and write operations.
    • Support snapshot isolation, allowing users to view a consistent state of data at any point in time.
  3. Support Schema Evolution
    • Allow for changes in data schema (such as adding, renaming, or removing columns) without disrupting existing data or requiring complex migrations.
  4. Optimize Query Performance
    • Implement advanced indexing and partitioning strategies to improve the speed and efficiency of data queries.
    • Enable efficient metadata management to handle large datasets and complex queries effectively.
  5. Enhance Data Governance
    • Provide tools for better tracking and managing data lineage, versioning, and auditing, which are crucial for maintaining data quality and compliance.

By addressing these goals, table format frameworks like Apache Iceberg, Apache Hudi, and Delta Lake help organizations build more robust, scalable, and reliable data lakes and lakehouses. Data engineers, data scientists and business analysts leverage analytics, AI/ML or reporting/visualization tools on top of the table format to manage and analyze large volumes of data.

Comparison of Apache Iceberg, Hudi, Paimon, Delta Lake?

I won’t do a comparison of the different table format frameworks Apache Iceberg, Apache Hudi, Apache Paimon and Delta Lake here. Many experts wrote about this already. Each table format framework has unique strengths and benefits. But updates are required every month because of the fast evolution and innovation, adding new improvements and capabilities within these frameworks.

Here is a summary of what I see in various blog posts about the three alternatives:

  • Apache Iceberg: Excels in schema and partition evolution, efficient metadata management, and broad compatibility with various data processing engines.
  • Apache Hudi: Best suited for real-time data ingestion and upserts, with strong change data capture capabilities and data versioning.
  • Apache Paimon: A lake format that enables building a real-time lakehouse architecture with Flink and Spark for both streaming and batch operations.
  • Delta Lake: Provides robust ACID transactions, schema enforcement, and time travel features, making it ideal for maintaining data quality and integrity.

A key decision point might be that Delta Lake is not driven by a broad community like Iceberg and Hudi, but mainly by Databricks as a single vendor behind it.

Apache XTable as Interoperable Cross-Table Framework supporting Iceberg, Hudi and Delta Lake

Users have lots of choices. XTable is yet another incubating table framework under Apache open source license to seamlessly interoperate cross-table between Apache Hudi, Delta Lake, and Apache Iceberg.

Apache XTable:

  • provides cross-table omnidirectional interoperability between lakehouse table formats.
  • is NOT a new or separate format, Apache XTable provides abstractions and tools for the translation of lakehouse table format metadata.
  • is formerly known as OneTable.

Maybe Apache XTable is the answer to provide options for specific data platforms and cloud vendors but still provide simple integration and interoperability.

But be careful: A wrapper on top of different technologies is not a silver bullet. We saw this years ago when Apache Beam emerged. Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data ingestion and data processing workflows. It supports a variety of stream processing engines, such as Flink, Spark, Samza. The primary driver behind Apache Beam is Google to allow the migration workflows into Google Cloud Dataflow. However, the limitations are huge, as such a wrapper needs to find the least common denominator supporting features. And most frameworks’ key benefit is the twenty percent that do not fit into such a wrapper. For these reasons, for instance, Kafka Streams does intentionally not support Apache Beam because it would have required too many design limitations.

Market Adoption of Table Format Frameworks

FIrst of all, we are still in the early stages. We are still at the innovation trigger in terms of the Gartner Hype Cycle, coming to the peak of inflated expectations. Most organizations are still evaluating, but not adopting these table formats in production across the organization yet.

Flashback: Container Wars – Kubernetes vs. Mesosphere vs. Cloud Foundry

The debate round Apache Iceberg reminds me of the container wars a few years ago. The term “Container Wars” refers to the competition and rivalry among different containerization technologies and platforms in the realm of software development and IT infrastructure.

The three competing technologies were Kubernetes, Mesosphere and Cloud Foundry. Here is where it went:

Google Trends for Kubernetes Mesosphere and Cloud Foundry

Cloud Foundry and Mesosphere were early. Kubernetes still won the battle. Why? I never understood all the technical details and differences. In the end, if the three frameworks are pretty similar, it is all about community adoption, the right timing of feature releases, good marketing, luck, and a few other factors. But it is good for the software industry to have one leading open source framework to build solutions and business models on, instead of three competing ones.

Present: Table Format Wars – Apache Iceberg vs. Hudi vs. Delta Lake

Obviously, Google Trends is no statistical evidence or sophisticated research. But I used it a lot in the past as an intuitive, simple, free tool to analyze market trends. Therefore, I also use this tool to see if Google searches overlap with my personal experience of the market adoption of Apache Iceberg, Hudi and Delta Lake (Apache XTable is too small yet to be added):

Google Trends and Market Adoption of Apache Iceberg Hudi Delta Lake

We obviously see a similar pattern as the container wars showed a few years ago. I have no idea where this is going. And if one technology wins, or if the frameworks differentiate enough to prove that there is no silver bullet. The future will show us.

My personal opinion? I think Apache Iceberg will win the race. Why? I cannot argue with any technical reasons. I just see many customers across all industries talk about it more and more. And more and more vendors start supporting it. But we will see. I actually do NOT care who wins. However, similarly to the container wars, I think it is good to have a single standard and vendors differentiating with features around it, like it is with Kubernetes.

But with this in mind, let’s explore the current strategy of the leading data platforms and cloud providers regarding table format support in their platforms and cloud services.

Data Platform and Cloud Vendor Strategies for Apache Iceberg

I won’t do any speculation in this section. The evolution of the table format frameworks moves quickly. And vendor strategies change quickly. Please refer to the vendor’s websites for the latest information. But here is a status quo about the data platform and cloud vendor strategies regarding the support and integration of Apache Iceberg.

  • Snowflake:
    • Supports Apache Iceberg for quite some time already
    • Adding better integrations and new features regularly
    • Internal and external storage options (with trade-offs) like Snowflake’s storage or Amazon S3
    • Announced Polaris, an open source catalog implementation for Iceberg, with commitment to support community-driven, vendor-agnostic bi-directional integration
  • Databricks:
    • Focuses on Delta Lake as the table format and (now open sourced) Unity as catalog
    • Acquired Tabular, the leading company behind Apache Iceberg
    • Unclear future strategy of supporting open Iceberg interface (in both directions) or only to feed data into its lakehouse platform and technologies like Delta Lake and Unity Catalog
  • Confluent:
    • Embeds Apache Iceberg as a first-class citizen into its data streaming platform (the product is called Tableflow)
    • Converts a Kafka Topic and related schema metadata (i.e., data contract) into an Iceberg table
    • Bi-directional integration between operational and analytical workloads
    • Analytics with embedded serverless Flink and its unified batch and streaming API or data sharing with 3rd party analytics engines like Snowflake, Databricks or Amazon Athena
  • More Data Platforms and Open Source Analytics Engines:
    • The list of technologies and cloud services supporting Iceberg grows every month
    • A few examples: Apache Spark, Apache Flink, ClickHouse, Dremio, Starburst using Trino (formerly PrestoSQL), Cloudera using Impala, Imply using Apache Druid, Fivetran
  • Cloud Service Providers (AWS, Azure, GCP, Alibaba):
    • Different strategies and integrations, but all cloud providers increase Iceberg support across their services these days, for instance:
    • Object Storage: Amazon S3, Azure Data Lake Storage (ALDS), Google Cloud Storage (GCS)
    • Catalogs: Cloud-specific like AWS Glue Catalog or vendor agnostic like Project Nessie or Hive Catalog
    • Analytics: Amazon Athena, Azure Synapse Analytics, Microsoft Fabric, Google BigQuery

The Shift Left Architecture moves data processing closer to the data source, leveraging real-time data streaming technologies like Apache Kafka and Flink to process data in motion directly after it is ingested. This approach reduces latency and improves data consistency and data quality.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Unlike ETL and ELT, which involve batch processing with the data stored at rest, the Shift Left Architecture enables real-time data capture and transformation. It aligns with the Zero ETL concept by making data immediately usable. But in contrast to Zero ETL, shifting data processing to the left side of the enterprise architecture avoids a complex, hard-to-maintain spaghetti architecture with many point-to-point connections.

The Shift Left Architecture also reduces the need for Reverse ETL by ensuring data is actionable in real-time for both operational and analytical systems. Overall, this architecture enhances data freshness, reduces costs, and speeds up the time-to-market for data-driven applications. Learn more about this concept in my blog post about “The Shift Left Architecture“.

Shift Left Architecture with Apacke Kafka Flink and Iceberg

Apache Iceberg as Open Table Format and Catalog for Seamless Data Sharing across Analytics Engines

An open table format and catalog introduces enormous benefits into the enterprise architecture: interoperability, freedom of choice of the analytics engines, faster time-to-market and reduced cost.

Apache Iceberg seems to become the de facto standard across vendors and cloud providers. However, it is still at an early stage and competing and wrapper technologies like Apache Hudi, Apache Paimon, Delta Lake and Apache XTable are trying to get momentum, too.

Iceberg and other open table formats are not just a huge win for the single storage and integration with multiple analytics / data / AI/ML platforms such as Snowflake, Databricks, Google BigQuery, et al., but also for the unification of operational and analytical workloads using data streaming with technologies such as Apache Kafka and Flink. The Shift Left Architecture is a significant benefit to reduce efforts, improve data quality and consistency, and enable real-time instead of batch applications and insights.

Finally, if you still wonder what the differences are between data streaming and lakehouses (and how they complement each other), check out this ten minute video:

What is your table format strategy? Which technologies and cloud services do you connect? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Iceberg – The Open Table Format for Lakehouse AND Data Streaming appeared first on Kai Waehner.

]]>
When (Not) to Choose Google Managed Service for Apache Kafka? https://www.kai-waehner.de/blog/2024/04/10/when-not-to-choose-google-apache-kafka-for-bigquery/ Wed, 10 Apr 2024 20:47:00 +0000 https://www.kai-waehner.de/?p=6302 Google announced its Apache Kafka for BigQuery cloud service at its conference Google Cloud Next 2024 in Las Vegas. Welcome to the data streaming club joining Amazon, Microsoft, IBM, Oracle, Confluent, and others. This blog post explores this new managed Kafka offering for GCP, reviews the current status of the data streaming landscape, and shares some criteria to evaluate when Kafka in general and Google Apache Kafka in particular should (not) be used.

The post When (Not) to Choose Google Managed Service for Apache Kafka? appeared first on Kai Waehner.

]]>
Google announced its Managed Service for Apache Kafka cloud service at its conference Google Cloud Next 2024 in Las Vegas. Welcome to the data streaming club joining Amazon, Microsoft, IBM, Oracle, Confluent, and others. This blog post explores this new managed Kafka offering for GCP, reviews the current status of the data streaming landscape, and shares some criteria to evaluate when Kafka in general and Google Apache Kafka in particular should (not) be used.

Google Apache Kafka for BigQuery GCP Cloud Service

Welcome Google Apache Kafka to the Data Streaming Club

Better late than never… Google announced a brand new Apache Kafka cloud service for GCP at Google Cloud Next 2024. All other leading cloud providers already have one, including AWS, Azure, Oracle, IBM, and Alibaba. Various other software vendors provide Kafka services, including Confluent, Aiven, Redpanda, WarpStream, and many more. Most leverage the open source Kafka project as its core component, others re-implement the Kafka protocol.

Apache Kafka and Apache Flink dominate the open source data streaming ecosystem. Vendors and cloud solutions provide cloud-native offerings. Some developers, data engineers and business people still struggle with a paradigm shift: Continuous data processing enables better data quality, reduced cost, and faster time to market with innovative new applications. Kafka and Flink are a match made in heaven for data streaming.

Use Cases for data streaming exist across all industries. Google Managed Service for Apache Kafka is potentially a good fit for some of them, but not for others.

Google Managed Service for Apache Kafka (MSK) – What is it?

Data Streaming is a NEW Software Category

Data streaming represents a new software category that revolutionizes the way businesses harness and process data in real time. Unlike traditional batch processing methods, data streaming enables continuous ingestion, analysis, and processing of data as it flows through systems. I explored this topic in the past when many people wanted to put Apache Kafka and its vendors into the integration platform category.

The Data Streaming Landscape 2024

Many software companies have emerged in the data streaming category in the last few years. And several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Most software vendors use Kafka for their data streaming platforms. However, there is more than solutions powered by open source Kafka. Some vendors only use the Kafka protocol (e.g., Azure Event Hubs) or utterly different APIs (like Amazon Kinesis).

The following Data Streaming Landscape 2024 summarizes the current status of relevant products and cloud services for data streaming around Kafka and additional stream processing engines.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

Forrester Wave for Streaming Data and IDG MarketScape for Stream Processing

Apache Kafka became the de facto standard for data streaming, similar to how Amazon S3 became the de facto standard for object storage.

In December 2023, the research company Forrester published “The Forrester Wave™: Streaming Data Platforms, Q4 2023“. Get free access to the report here. The leaders are Microsoft, Google and Confluent, followed by Oracle, Amazon, Cloudera, and a few others.

In April 2024, IDC named Confluent a leader in the IDC MarketScape for Worldwide Analytic Stream Processing 2024.

It would not be a surprise if we see a Gartner Magic Quadrant for Data Streaming soon, too. Gartner reports mention Kafka and related vendors more and more year by year.

Qualifying out a technology is often the easier option. Why evaluate a service if it does not solve the requirements? Let’s explore when NOT to use Kafka at all, and specifically when the Google Apache Kafka service is probably NOT the right choice for you.

When NOT to use Apache Kafka?

Apache Kafka has overlaps with technologies like a message broker (like IBM MQ, TIBCO or RabbitMQ), other streaming analytics platforms, and it actually is a database, too. But Apache Kafka is not an allrounder to solve every problem.

When NOT to Use Apache Kafka?

Apache Kafka is NOT:

  • A replacement for your favorite database, data warehouse or data lake. Instead it complements and integrates with these platforms.
  • An analytics platform for AI/ML model training, even though model scoring is often done within the streaming platform for critical or low-latency use cases.
  • A proxy for thousands of clients in bad networks.
  • An API Management solution, even though you can connect REST/HTTP producers and consumers against Kafka.
  • An IoT gateway, even though direct integration with IoT protocols like MQTT or OPC-UA is possible.
  • Hard real-time for safety-critical embedded workloads.

Read the thorough analysis “When NOT to use Apache Kafka?” for more details. Or watch this YouTube video:

When to Choose ANOTHER Kafka instead of Google’s?

If Apache Kafka is the right choice for your project, you still have plenty of options.

Comparison of Apache Kafka Data Streaming Offerings

Here are a few criteria that let you easily disqualify out Google Cloud Managed Service for Apache Kafka (MSK):

  • Non-GCP: If your use case requires on-premise, multi-cloud, hybrid cloud or edge deployments, then you need another offering.
  • Critical SLAs: If you need 24/7 critical support and consulting expertise, a dedicated Kafka vendor like Confluent is the better choice. Kafka is not just for analytics, but shines for transactional workloads, too. Google’s Managed Apache Kafka service is not GA yet. This will probably happen in the second half of 2024. Hence, don’t even consider it for critical applications before GA.
  • Serverless: A managed service is not always a truly managed service. The future will show where Google goes with Kafka. But right now, Google Apache Kafka is not serverless like e.g., Confluent Cloud. You pay for capacity pricing and cluster capacity management is required. Amazon even created a second service Amazon MSK Serverless to handle this issue with its traditional MSK offering. Learn why Amazon MSK is NOT always the right choice and adopt the learnings to Google.
  • Complete Data Streaming Platform: A data streaming platform requires more than just messaging: data integration with first and third party systems, stream processing for continuous data correlation, flexible (long-term) retention with Tiered Storage, data governance, and more. The future will show us where Google’s Kafka service goes. A detailed article compares Kafka offerings as car engine, complete car and self-driving car level 5. Google is a car, but not (yet) a Porsche (complete luxury car) and not yet a Google Waymo (self-driving car level 5). Google Apache Kafka even misses basic features for data streaming best practices, like defining data contracts in schemas for building data products in good data quality.

The Evolution of Data Streaming is NOT Stopping…

If you did not qualify out Kafka in general or Google Apache Kafka in particular yet, that’s great. Start evaluating Google Cloud’s Managed Service for Apache Kafka (MSK) service and compare it against self-managed open source Kafka and other semi-managed or fully-managed Kafka cloud services on GCP.

As we look ahead, the future possibilities for data streaming are boundless, promising more agile, intelligent, and real-time insights into the ever-increasing streams of data. I recently looked at the past, present and future of stream processing.

I often get the question if I am worried about the emerging competition as I work for Confluent where we “only do data streaming”?

No, I am not! Actually, the new Google Apache Kafka cloud service is great news for the industry! Data Streaming established itself as a new software category. Research analysts like Forrester and IDG already created dedicated waves and comparisons. What could be better than working with the people that invented Kafka and the company that created this software category across all industries and continents? And competition is always good for innovation, too.

Real-time data beats slow data. That’s true in almost every use case. At Confluent, we are now ~3000 people working only on one thing: Data Streaming. I think we should celebrate this Google announcement and look forward to more mass adoption of data streaming around the world.

And as a strategic Google partner, customers can

  1. leverage GCP credits to consume Confluent Cloud
  2. leverage GCPs security and private networking infrastructure
  3. integrate via fully managed connectors into various GCP services like Google Big Query or Google Cloud Storage and third party cloud solutions like MongoDB, Snowflake or Databricks.

Are you excited about the new Google Cloud Managed Service for Apache Kafka (MSK) service? Or do you use still plan to use open source Kafka or another cloud service like Confluent Cloud? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When (Not) to Choose Google Managed Service for Apache Kafka? appeared first on Kai Waehner.

]]>
Why Tiered Storage for Apache Kafka is a BIG THING… https://www.kai-waehner.de/blog/2023/12/05/why-tiered-storage-for-apache-kafka-is-a-big-thing/ Tue, 05 Dec 2023 05:38:36 +0000 https://www.kai-waehner.de/?p=5787 Apache Kafka added Tiered Storage to separate compute and storage. The capability enables more scalable, reliable and cost-efficient enterprise architectures. This blog post explores the architecture, use cases, benefits, and a case study for storing Petabytes of data in the Kafka commit log. The end discusses why Tiered Storage does NOT replace other databases and how Apache Iceberg might change future Kafka architectures even more.

The post Why Tiered Storage for Apache Kafka is a BIG THING… appeared first on Kai Waehner.

]]>
Apache Kafka added Tiered Storage to separate compute and storage. The capability enables more scalable, reliable and cost-efficient enterprise architectures. This blog post explores the architecture, use cases, benefits, and a case study for storing Petabytes of data in the Kafka commit log. The end discusses why Tiered Storage does NOT replace other databases and how Apache Iceberg might change future Kafka architectures even more.

Tiered Storage for Apache Kafka - Use Cases Architecture Benefits

 

If you prefer watching a ten minute video, check out this summary about the “Evolution of Storage for Apache Kafka covering Tiered Storage, Direct Write to Object Storage and the relation to Open Table Formats such as Apache Iceberg”:

Now, let’s explore why Tiered Storage for Apache Kafka is a BIG THING:

Compute vs. Storage vs. Tiered Storage

Let’s define the terms compute, storage, and tiered storage to have the same understanding when exploring this in the context of the data streaming platform Apache Kafka.

Compute and Storage

Two fundamental components of a computing system are compute and storage. They serve different purposes in information processing.

Compute refers to the processing power and capability of a computer system to perform tasks, execute instructions, and carry out computations. The compute component includes the CPU (Central Processing Unit) and GPU (Graphics Processing Unit).

Storage refers to the components and systems that store and retrieving data over the long term. It is where data is persistently maintained for later use. Storage includes devices such as hard disk drives (HDDs), solid-state drives (SSDs), and other types of non-volatile memory, such as databases that keep data even when the power is turned off.

Tiered Storage

Tiered storage refers to a storage architecture that uses different classes or tiers of storage (e.g., Object Storage on S3) to efficiently manage and store data based on its access patterns, performance requirements, and cost considerations.

The goal of tiered storage is to optimize the use of storage resources, balancing performance and cost, by placing data on the most suitable storage media based on its characteristics and the organization’s policies.

Data placement and movement between these tiers can be automated based on policies and algorithms that analyze usage patterns, access frequency, and other factors. This ensures that the most critical and frequently accessed data lives in high-performance storage, while less critical or infrequently accessed data is moved to lower-cost, lower-performance storage.

Long-Term Storage in Apache Kafka

Apache Kafka is an open-source distributed streaming platform that is used for building real-time data pipelines and streaming applications. Kafka is the established de facto standard for data streaming. The event streaming platform handles large volumes of data, providing a scalable and fault-tolerant architecture.

Applications and data stores use Kafka for ingesting, storing, and processing real-time data streams, making it a fundamental component in building event-driven architectures and systems that require the processing of continuous data flows. Additionally, many use cases leverage Kafka not just for real-time data but to ensure data consistency across real-time, batch, and request-response APIs.

Use Cases for Apache Kafka as Storage System

While most people think about Kafka as a message broker, real-time analytics platform, or big data ingestion system, the distributed commit log with ordering guarantees and timestamps enables plenty of use cases for accessing data long after its creation or replaying historical data.

Use Cases for Replaying Historical Events with Apache Kafka

Here are a few examples for use cases that leverage long-term storage of data in Kafka:

  • New consumer: Deploy a new application / database / data warehouse, data lake and synchronize the state of the business objects.
  • Offloading: Reducing cost significantly by NOT consuming again and again from expensive or non-scalable systems (e.g. mainframe and MIPS)
  • Error-handling: Re-process historical data after fixing an issue in the business logic.
  • Compliance / regulatory processing: Replay historical data to analyze an incident.
  • Query and analyze existing events: Consume data from a notebook for data engineering, analytics, or reporting.
  • Schema changes in analytics platform: Re-process data after updating data contracts.
  • Model training: Batch ingestion into an AI framework to apply a machine learning algorithm
  • Disaster recovery: Operational data stores replay data again from the persistent commit log in the case of a failure.

Objections for Storing Data Long-Term in Kafka

Storing data long term in Kafka has a few drawbacks. The following arguments are valid concerns:

  • Cost: Storing large volumes of data on attached disks is much more expensive than external storage systems like an object store.
  • Scalability: Operating Kafka brokers with lots of data (say many gigabytes, or even terabytes, and more) is challenging, especially in the case of failures when you need to rebalance partitions.
  • Risk: Downtime or data inconsistencies happen if operations struggle with large volumes or when hardware needs to be migrated.

Therefore, you should NOT store big data sets in Kafka without Tiered Storage! With this in mind, let’s explore how Tiered Storage for Kafka solves these problems.

Introducing Tiered Storage for Apache Kafka

Apache Kafka’s backend is a distributed system running Kafka brokers. Each Kafka broker has processing and storage capabilities.

The applications are producers and consumers of events. Many interfaces communicate with Kafka brokers:

  • An application written in Java, Python, C++, Go, or any other programming language
  • A Kafka Connect source or sink connector connecting to IBM MQ, Spark, Snowflake, or any other data store or SaaS application
  • A stream processor built with Kafka-native Kafka Streams, KSQL, or external infrastructures like Apache Flink
  • Any other endpoint, like a HTTP interface or an out-of-the-box integration of another middleware or data platform

What is Tiered Storage for Kafka?

Tiered storage for Apache Kafka refers to the capability of configuring different storage tiers to optimize the storage infrastructure based on the access patterns and requirements of the data stored in Kafka brokers.

A Kafka cluster stores data in Kafka Topics. These topics can have different characteristics in terms of importance, access frequency, and retention policies.

The concept is like the general idea of tiered storage in storage systems, but it’s adapted to the specific needs of Kafka. Tiered Storage is one critical making the Kafka architecture cloud-native.

Kafka Architecture without Tiered Storage

Kafka applications communicate with logical Kafka Topics to produce messages to or consume messages from partitions:

Apache Kafka Architecture without Tiered Storage

The storage is a disk attached to the broker. This can be HDD or SDD disks on-premise or e.g. EBS volumes on AWS cloud.

Kafka Architecture with Tiered Storage

Tiered Storage for Kafka does NOT change how applications communicate with Kafka brokers. Tiered Storage is an implementation detail:

Apache Kafka Architecture with Tiered Storage

Besides the disks attached to the broker, Kafka offloads data to an external storage. Most times, this is an object storage like Amazon S3, Azure Blog Storage, Google Cloud Storage, or MinIO for Kubernetes.

Serverless cloud offerings handle the offloading for the operator. Self-managed solutions allow operators to configure hot and cold storage durations for each Kafka Topic.

Benefits of Tiered Storage for Apache Kafka

Let’s review the above-discussed objections to storing big data sets long-term in Kafka and how Tiered Storage helps:

  • Reduced cost: Most data is offloaded to an external storage. This reduces the storage cost significantly.
  • Improved scalability: Only data on the disks attached to the Kafka brokers must be rebalanced. As most data is offloaded, rebalancing only takes seconds or minutes; even if the external storage saves petabytes.
  • Reduced risk: Better scalability and separation of compute and storage makes operations much easier and significantly reduces the risk of downtime or data inconsistency.

The Implementation of Tiered Storage in Apache Kafka

Tiered Storage for Apache Kafka is available. However, be aware that different implementations exist with different features, maturity, and support levels.

And open source Apache Kafka only provides the interface for tiered storage. You must choose an open source implementation, build your own integration into an external storage system, or leverage a commercial product or cloud service that embeds tiered storage into its offering.

Keep in mind that the interface alone is not helpful. The implementation needs to be battle-tested and guarantee data consistency across the hot storage on the broker and cold storage in the external storage; even in the case of failure, network issues, etc.

Kafka consumers do not see the implementation details of Kafka’s Tiered Storage. They just consumed as if there was no tiered storage implementation (and still expect the same behavior). There are no API or code changes needed in Kafka client applications. Hence, you can easily migrate an existing deployment to a Kafka cluster leveraging Tiered Storage.

Many people ask about the performance impact of tiered storage for Kafka. The short answer: There is no performance impact for most scenarios. Real-time consumers consume from the memory / page cache as before. And replaying historical data from the event log does not differ much from the local disk or the remote object-store.

AK 3.6 Release Makes Tiered Storage Available

When writing this blog post (December 2023), KIP-405: Kafka Tiered Storage is available as early access in Apache Kafka 3.6. This release introduces Tiered Storage to Kafka. This release is only for non-production environments (see the early access notes for more information).

GA of this feature is just a foreseeable matter of time. The bulk of KIP-405 was part of early access in release 3.6. But there are a few additional features that are slated for 3.7. And GA likely comes after that in 3.8+.

KIP-405 Provides a Pluggable Storage API for Tiering

KIP-405 separates computation and storage in the Kafka broker for pluggable storage tiering natively in Kafka Tiered Storage, bringing a seamless storage extension to remote objects with minimal operational changes.

Apache Kafka’s LocalTieredStorage default implementation is a local file-based RemoteStorageManager. LocalTieredStorage facilitates the simulation of remote storage behavior in a controlled and isolated environment during testing. This is not meant for production use cases! Enterprises need to write their own implementation, embed an open-source alternative, or trust a software vendor respectively cloud service.

How Confluent, Uber, and Others use Tiered Storage

KIP-405 is only available in preview with Kafka 3.6. But some proprietary implementations already exist for years in production. This also helped to define the KIP with lessons learned from running Kafka in production with tiered storage under the hood.

Implementation details of tiered storage for Kafka vary, and there may be different approaches or tools available to achieve this, depending on the specific Kafka distribution or storage infrastructure being used. Organizations might also use external systems or cloud storage solutions to implement tiered storage strategies for Kafka.

Confluent pioneered tiered storage for Kafka and has provided the capability for several years already. It is available for the self-managed Confluent Platform and the fully managed Confluent Cloud in AWS, Azure, and GCP. Confluent chose the S3 interface to implement storage support for the cloud providers (AWS, Azure, GCP) and several on-premise solutions like PureStorage Flash Blade, Nutanix Objects, Netapp Object Storage, Dell EMC ECS, Hitachi Content Platform Object Storage, or MinIO for Kubernetes.

Uber, who had the lead in implementing the KIP-405 in open source Apache Kafka, runs its tiered storage against HDFS. Confluent and AWS contributed to refactoring, best practices and performance / integration testing. Satish Duggana, tech lead for Data and Streaming Infrastructure at Uber, presented the details of their implementations and deployment in a talk at Current 2023.

Other vendors like AWS with MSK and Aiven are adopting KIP-405 and provide their own tiered storage implementations these days.

Case Study: KOR Financial stores 160 Petabytes in Kafka for Regulatory Reporting

KOR is a cloud-native family of global trade repositories and regulatory reporting services that has adopted Confluent Cloud and a data streaming architecture to improve compliance processes.

Regulatory reporting is obviously a perfect use case for Tiered Storage in Kafka to replay historical data. As the Kafka log provides guaranteed ordering and timestamps, there is no need for another database or data lake besides Kafka.

Daan Gerits, Chief Data Officer, KOR Financial, explains at Diginomica: “At KOR Financial, we have a very specific problem that we are trying to solve, which is collecting trading information for regulators. And we decided to do it in a totally different way to the way that most people are doing it. Where others would be using data storage or big data technologies, we decided to go all in on Kafka. We are building our system to store 160 petabytes in Confluent Cloud and then work on top of that. We don’t have any other database. So it’s a long retention use case.”

Kafka is NOT a Database (Replacement)

Apache Kafka is a database. It provides ACID guarantees. Hundreds of companies for deploy Kafka for mission-critical deployments including transactional workloads. However, most times, Kafka is NOT competitive to other databases.

Can Apache Kafka Replace a Database like Oracle Hadoop NoSQL MongoDB Elastic MySQL et al

Kafka is an event streaming platform for messaging, storage, processing and integration at scale in real-time with zero downtime and zero data loss. Almost all deployments connect Kafka to database sources and sinks for data integration, decoupling and data consistency, where the heart of the cloud-native enterprise architecture is real-time, scalable and reliable.

Apache Kafka is complementary to database, data warehouse, data lake and Lakehouse architectures. I wrote a blog series about use cases and architectures for data streaming other storage platforms.

The Future: Apache Iceberg for Kafka?

The adoption of Tiered Storage for Apache Kafka is just getting started. Many teams will store (some) data longer in Kafka to offload data from expensive systems or replay historical data without needing another database.

However, most analytics platforms do NOT use the Kafka protocol to consume and query data. The trend across most data platforms goes towards Apache Iceberg as a standardized abstraction layer for storing and querying (non-real-time) data in an objects store or other storage.

Apache Iceberg is an open-source table format and processing framework for big data. It aims to provide the best of both worlds: the performance of a traditional table format with the flexibility of a schema-on-read approach. Iceberg addresses solves in managing large-scale and evolving data sets in distributed storage environments.

Apache Iceberg supports popular data processing frameworks, such as Apache Spark, Apache Flink, Apache Hive, Presto, and more. With Kafka’s Tiered Storage and especially the S3 support by some vendors, I can see how this can be an entire game changer for storing and processing events in real-time with the Kafka protocol or with other analytics engines and databases in near-real-time or batch.

The future will show us. For now, let’s be excited about how Tiered Storage for Kafka is the next big thing around data streaming.

Tiered Storage makes Kafka more Scalable, Cost-Efficient and Reliable

Tiered Storage for Apache Kafka makes event-driven architectures more scalable, cost-efficient and reliable. It enables new use cases that require another database or data lake in the past.

However, Kafka’s goal is still NOT to replace other data and analytics platforms. Design patterns like microservices and data mesh enable a true decoupling of applications and data stores. Kafka provides this decoupling. With tiered storage in mind for various use cases such as offloading, new consumers, or error-handling, you can consider new approaches for your cloud-native enterprise architecture.

Are you excited about Tiered Storage for Apache Kafka? How will you use it? Or do you already use an existing implementation, like Confluent Cloud? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Why Tiered Storage for Apache Kafka is a BIG THING… appeared first on Kai Waehner.

]]>
The State of Data Streaming for Digital Natives in 2023 https://www.kai-waehner.de/blog/2023/07/18/the-state-of-data-streaming-for-digital-natives-in-2023/ Tue, 18 Jul 2023 06:19:44 +0000 https://www.kai-waehner.de/?p=5520 This blog post explores the state of data streaming in 2023 for digital natives born in the cloud. Data streaming allows integrating and correlating data in real-time at any scale to improve the most innovative applications leveraging Apache Kafka. I explore how data streaming helps as a business enabler, including customer stories from New Relic, Wix, Expedia, Apna, Grab, and more. A complete slide deck and on-demand video recording are included.

The post The State of Data Streaming for Digital Natives in 2023 appeared first on Kai Waehner.

]]>
This blog post explores the state of data streaming in 2023 for digital natives born in the cloud. The evolution of digital services and new business models requires real-time end-to-end visibility, fancy mobile apps, and integration with pioneering technologies like fully managed cloud services for fast time-to-market, 5G for low latency, or augmented reality for innovation. Data streaming allows integrating and correlating data in real-time at any scale to improve the most innovative applications leveraging Apache Kafka.

I look at trends for digital natives to explore how data streaming helps as a business enabler, including customer stories from New Relic, Wix, Expedia, Apna, Grab, and more. A complete slide deck and on-demand video recording are included.

The State of Data Streaming for Digital Natives in 2023

Digital natives are data-driven tech companies born in the cloud. The SaaS solutions are built on cloud-native infrastructure that provides elastic and flexible operations and scale. AI and Machine Learning improve business processes while the data flows through the backend systems.

The data-driven enterprise in 2023

McKinsey & Company published an excellent article on seven characteristics that will define the data-driven enterprise:

1) Data embedded in every decision, interaction, and process
2) Data is processed and delivered in real-time
3) Flexible data stores enable integrated, ready-to-use data
4) Data operating model treats data like a product
5) The chief data officer’s role is expanded to generate value
6) Data-ecosystem memberships are the norm
7) Data management is prioritized and automated for privacy, security, and resiliency

This quote from McKinsey & Company precisely maps to the value of data streaming for using data at the right time in the right context. The below success stories are all data-driven, leveraging these characteristics.

Digital natives born in the cloud

A digital native enterprise can have many meanings. IDC has a great definition:

IDC defines Digital native businesses (DNBs) as companies built based on modern, cloud-native technologies, leveraging data and AI across all aspects of their operations, from logistics to business models to customer engagement. All core value or revenue-generating processes are dependent on digital technology.”

Companies are born in the cloud, leverage fully managed services, and are consequently innovative with fast time to market.

AI and machine learning (beyond the buzz)

“ChatGPT, while cool, is just the beginning; enterprise uses for generative AI are far more sophisticated.” says Gartner. I can’t agree more. But even more interesting: Machine Learning (the part of AI that is enterprise ready) is already used in many companies.

While everybody talks about Generative AI (GenAI) these days, I prefer talking about real-world success stories that leverage analytic models for many years already to detect fraud, upsell to customers, or predict machine failures. GenAI is “just” another more advanced model that you can embed into your IT infrastructure and business processes the same way.

Data streaming at digital native tech companies

Adopting trends across industries is only possible if enterprises can provide and correlate information correctly in the proper context. Real-time, which means using the information in milliseconds, seconds, or minutes, is almost always better than processing data later (whatever later means):

Real-time Data beats Slow Data in Retail

Digital natives combine all the power of data streaming: Real-time messaging at any scale with storage for true decoupling, data integration, and data correlation capabilities. Apache Kafka is the de facto standard for data streaming.

Data streaming with the Apache Kafka ecosystem and cloud services are used throughout the supply chain of any industry. Here are just a few examples:

Digital Natives Born in the Cloud with Data Streaming

Search my blog for various articles related to this topic in your industry: Search Kai’s blog.

Elastic scale with cloud-native infrastructure

One of the most significant benefits of cloud-native SaaS offerings is elastic scalability out of the box. Tech companies can start new projects with a small footprint and pay-as-you-go. If the project is successful or if industry peaks come (like Black Friday or Christmas season in retail), the cloud-native infrastructure scales up; and back down after the peak:

Elastic Scale with Cloud-native Data Streaming

There is no need to change the architecture from a proof of concept to an extreme scale. Confluent’s fully managed SaaS for Apache Kafka is an excellent example. Learn how to scale Apache Kafka to 10+ GB per second in Confluent Cloud without the need to re-architect your applications.

Data streaming + AI / machine learning = real-time intelligence

The combination of data streaming with Kafka and machine learning with TensorFlow or other ML frameworks is nothing new. I explored how to “Build and Deploy Scalable Machine Learning in Production with Apache Kafka” in 2017, i.e., six years ago.

Since then, I have written many further articles and supported various enterprises deploying data streaming and machine learning. Here is an example of such an architecture:

Apache Kafka and Machine Learning - Kappa Architecture

Data mesh for decoupling, flexibility and focus on data products

Digital natives don’t (have to) rely on monolithic, proprietary, and inflexible legacy infrastructure. Instead, tech companies start from scratch with modern architecture. Domain-driven design and microservices are combined in a data mesh, where business units focus on solving business problems with data products:

Cluster Linking for data replication with the Kafka protocol

Digital natives leverage trends for enterprise architectures to improve cost, flexibility, security, and latency. Four essential topics I see these days at tech companies are:

  • Decentralization with a data mesh
  • Kappa architecture replacing Lambda
  • Global data streaming
  • AI/Machine Learning with data streaming

Let’s look deeper into some enterprise architectures that leverage data streaming.

Decentralization with a data mesh

There is no single technology or product for a data mesh!  But the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable.

Data streaming with Apache Kafka is the perfect foundation for a data mesh: Dumb pipes and smart endpoints truly decouple independent applications. This domain-driven design allows teams to focus on data products:

Domain Driven Design DDD with Kafka for Industrial IoT MQTT and OPC UA

Contrary to a data lake or date warehouse, the data streaming platform is real-time, scalable and reliable. A unique advantage for building a decentralized data mesh.

Kappa architecture replacing Lambda

The Kappa architecture is an event-based software architecture that can handle all data at all scale in real-time for transactional AND analytical workloads.

The central premise behind the Kappa architecture is that you can perform real-time and batch processing with a single technology stack. The heart of the infrastructure is streaming architecture.

Kappa Architecture with one Pipeline for Real Time and Batch

Unlike the Lambda Architecture, in this approach, you only re-process when your processing code changes and need to recompute your results.

I wrote a detailed article with several real-world case studies exploring why “Kappa Architecture is Mainstream Replacing Lambda“.

Global data streaming

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception.

Hybrid and Global Apache Kafka and Event Streaming Use Case

Several scenarios require multi-cluster Kafka deployments with specific requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments, and global Kafka.

Learn all the details in my article “Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments“.

Natural language processing (NLP) with data streaming for real-time Generative AI (GenAI)

Natural Language Processing (NLP) helps many projects in the real world for service desk automation, customer conversation with a chatbot, content moderation in social networks, and many other use cases. Generative AI (GenAI) is “just” the latest generation of these analytic models. Many enterprises have combined NLP with data streaming for many years for real-time business processes.

Apache Kafka became the predominant orchestration layer in these machine learning platforms for integrating various data sources, processing at scale, and real-time model inference.

Here is an architecture that shows how teams easily add Generative AI and other machine learning models (like large language models, LLM) to their existing data streaming architecture:

Chatbot with Apache Kafka and Machine Learning for Generative AI

Time to market is critical. AI does not require a completely new enterprise architecture. The true decoupling allows the addition of new applications/technologies and embedding them into the existing business processes.

An excellent example is Expedia: The online travel company added a chatbot to the existing call center scenario to reduce costs, increase response times, and make customers happier. Learn more about Expedia and other case studies in my blog post “Apache Kafka for Conversational AI, NLP, and Generative AI“.

New customer stories of digital natives using data streaming

So much innovation is happening with data streaming. Digital natives lead the race. Automation and digitalization change how tech companies create entirely new business models.

Most digital natives use a cloud-first approach to improve time-to-market, increase flexibility, and focus on business logic instead of operating IT infrastructure. And elastic scalability gets even more critical when you start small but think big and global from the beginning.

Here are a few customer stories from worldwide telecom companies:

  • New Relic: Observability platform ingesting up to 7 billion data points per minute for real-time and historical analysis.
  • Wix: Web development services with online drag & drop tools built with a global data mesh.
  • Apna: India’s largest hiring platform powered by AI to match client needs with applications.
  • Expedia: Online travel platform leveraging data streaming for a conversational chatbot service incorporating complex technologies such as fulfillment, natural-language understanding, and real-time analytics.
  • Alex Bank: A 100% digital and cloud-native bank using real-time data to enable a new digital banking experience.
  • Grab: Asian mobility service that built a cybersecurity platform for monitoring 130M+ devices and generating 20M+ Al-driven risk verdicts daily.

Resources to learn more

This blog post is just the starting point. Learn more about data streaming and digital natives in the following on-demand webinar recording, the related slide deck, and further resources, including pretty cool lightboard videos about use cases.

On-demand video recording

The video recording explores the telecom industry’s trends and architectures for data streaming. The primary focus is the data streaming case studies. Check out our on-demand recording:

The State of Data Streaming for Digital Natives (Video)

Slides

If you prefer learning from slides, check out the deck used for the above recording:

Fullscreen Mode

Data streaming case studies and lightboard videos of digital natives

The state of data streaming for digital natives in 2023 is fascinating. New use cases and case studies come up every month. This includes better data governance across the entire organization, real-time data collection and processing data from network infrastructure and mobile apps, data sharing and B2B partnerships with new business models, and many more scenarios.

We recorded lightboard videos showing the value of data streaming simply and effectively. These five-minute videos explore the business value of data streaming, related architectures, and customer stories. Stay tuned; I will update the links in the next few weeks and publish a separate blog post for each story and lightboard video.

And this is just the beginning. Every month, we will talk about the status of data streaming in a different industry. Manufacturing was the first. Financial services second, then retail, telcos, digital natives, gaming, and so on…

Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post The State of Data Streaming for Digital Natives in 2023 appeared first on Kai Waehner.

]]>
The Data Streaming Landscape 2023 https://www.kai-waehner.de/blog/2022/12/21/data-streaming-landscape-2023/ Wed, 21 Dec 2022 07:29:58 +0000 https://www.kai-waehner.de/?p=4953 Data streaming is a new software category to process data in motion. Apache Kafka is the de facto standard used by over 100,000 organizations. Plenty of vendors offer Kafka platforms and cloud services. Many complementary stream processing engines like Apache Flink and SaaS offerings have emerged. And competitive technologies like Pulsar and Redpanda try to get market share. This blog post explores the data streaming landscape of 2023 to summarize existing solutions and market trends.

The post The Data Streaming Landscape 2023 appeared first on Kai Waehner.

]]>
Data streaming is a new software category to process data in motion. Apache Kafka is the de facto standard used by over 100,000 organizations. Plenty of vendors offer Kafka platforms and cloud services. Many complementary stream processing engines like Apache Flink and SaaS offerings have emerged. And competitive technologies like Pulsar and Redpanda try to get market share. This blog post explores the data streaming landscape of 2023 to summarize existing solutions and market trends.

Data Streaming Landscape 2023 with Apache Kafka Flink and much more

Data streaming is a new software category

Data-driven applications are the new black. This approach increases the business value as the overall goal by increasing revenue, reducing cost, reducing risk, or improving the customer experience.

Plenty of software categories and related data platforms exist to process and analyze data:

  • Database: Store and execute transactional workloads.
  • Data Warehouse: Processing structured historical data to create recurring reports and unique insights.
  • Data Lake: Processing structured and semi- or unstructured big data sets with batch processing to create recurring reports and unique insights.
  • Lakehouse: A mix of data warehouse and data lake to process all data on one platform.
  • Data Streaming: Continuously process data in motion and provide data consistency across communication paradigms instead of storing and analyzing data at rest.

Of course, these data platforms often overlap a bit. I did a complete blog series exploring the use cases and how they complement each other.

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Data streaming use cases by business value

Use cases for data streaming exist across all industries:

Use Cases for Data Streaming with Apache Kafka by Business Value

Adding business value is crucial for any enterprise. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products. Search my blog for your favorite industry to find plenty of case studies and architectures. Or read about use cases for Apache Kafka across industries to get started.

The data streaming landscape of 2023

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category.

The data streaming landscape shows that most vendors use Kafka or implement its protocol because it has become the de facto standard.

New software companies have emerged in this category in the last few years. And several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem.

Apache Kafka is the de facto standard for data streaming, like Amazon S3 is the de facto standard for S3 object storage. Most software vendors use Kafka for their data streaming platforms. However, there is more than Kafka. Some vendors only use the Kafka protocol (Azure Event Hubs) or utterly different APIs (like Amazon Kinesis).

The following Data Streaming Landscape 2023 summarizes the current status of relevant products and cloud services:

Data Streaming Landscape 2023 around Apache Kafka and Cloud

Please note: This is not a complete list of frameworks, cloud services, or vendors. It is not an official research landscape. If your favorite technology is not in this diagram, then I did not see it in my conversations with customers, prospects, partners, analysts, or the broader data streaming community. We will probably see many more logos in this diagram in a year or two, as this is still the beginning of the data streaming era.

Also, note that I focus on general data streaming infrastructure. Brilliant solutions exist for using and analyzing streaming data for specific scenarios, like time series databases, machine learning engines, or observability platforms. These are complementary and often connected out of the box to a streaming cluster.

Evaluation criteria for data streaming platforms

I often recommend using the following four aspects to look at different frameworks, platforms, and cloud services to evaluate a technology for your business project or enterprise architecture strategy:

  • Cloud-native: Is the solution elastic to scale up and down? Is it fully managed / serverless, or just a bunch of server instances hosted in the cloud? Can you automate the development, operations, and testing process using DevOps, GitOps, test-driven development, and similar principles?
  • Complete: Does the solution offer all required capabilities? Data streaming requires more than just messaging or data ingestion. Hence, does it provide connectors, data processing, governance, security, self-service, and so on?
  • Everywhere: Where can you use the solution? Cloud-only? Are all required cloud service providers supported? Is there an option to deploy in a data center or even at the edge (i.e., outside a data center)? How can you share data between regions, clouds or data centers? What use cases are supported (e.g., aggregation, disaster recovery, hybrid integration, etc.)?
  • Supported: Is the solution mature and battle-tested? Are public case studies available for your use case or industry? Does the vendor fully support the product? What are the SLAs? Are specific features excluded from commercial enterprise support? It is a shame that this aspect needs to be evaluated. Still, some vendors offer data streaming cloud services and exclude support in the terms and conditions (that many people don’t read in cloud services, unfortunately).

Let’s take a deeper look into the different categories and start with the leading technology: Native Apache Kafka…

Apache Kafka is the de facto standard for data streaming

Starting with the leader and de facto standard Apache Kafka and related vendors and SaaS offerings. Apache Kafka became the de facto standard for data streaming like Amazon S3 is the de facto standard for object storage:

De Facto Standard API - Amazon S3 for Object Storage and Apache Kafka for Event Streaming

Read the detailed blog post to learn more about the differences between an open-source standard like Kafka and a proprietary protocol like S3.

When you explore the data streaming world, there is no way not to look at the Apache Kafka ecosystem.

Apache Kafka adoption and growth

The growth of the Apache Kafka community in the last few years is impressive. Here are some statistics that Jay Kreps presented at the data streaming conference “Current – The Next Generation of Kafka Summit” in Austin, Texas, in October 2022:

  • >100,000 organizations using Apache Kafka
  • >41,000 Kafka meetup attendees
  • >32,000 Stack Overflow questions
  • >12,000 Jiras for Apache Kafka
  • >31,000 Open job listings request Kafka skills

And look at the increased number of active monthly unique users downloading the Kafka Java client library with Maven:

Sonatype Maven Kafka Client Downloads
Source: Sonatype

Fun fact: The leading conference for Kafka was rebranded from “Kafka Summit” to “Current 2022 – The Next Generation of Kafka Summit”. Why? Because data streaming is more than Kafka. Many complementary and competitive technologies were present, including vendors, booths, demos, and customer case studies. That’s a remarkable evolution of data streaming for the community and enterprises across the globe!

Apache Kafka Vendors: self-managed vs. cloud offerings

New software companies focus on data streaming. And traditional players like IBM and Amazon jumped on the bandwagon in the past few years. On a top level – to keep it simple – three kinds of offerings exist for Apache Kafka:

Comparison of Apache Kafka Data Streaming Offerings

I made a detailed comparison of on-premise Kafka vendors and cloud services using this car analogy. Only Amazon MSK Serverless (i.e., the fully managed service, not the partially Managed MSK) was not available when writing this comparison. Hence, read Confluent Cloud versus Amazon MSK Serverless.

Kafka-native Data Streaming Products and SaaS

Here are a few notes on each vendor as a summary.

  • Apache Kafka: The de facto standard for data streaming. Open source with a vast community. All the vendors in this list rely on (parts of) this project.
  • Confluent: Provides data streaming everywhere with Confluent Platform (self-managed) and Confluent Cloud (fully managed and available across cloud providers).
  • Cloudera: Provides Kafka as a self-managed offering. Focuses on combining many data technologies like Kafka, Hadoop, Spark, Flink, NiFi, and many more.
  • Red Hat: Provides Kafka as a partially managed cloud offering and self-managed Kafka on Kubernetes via OpenShift. Kafka is part of the integration portfolio that includes other open-source frameworks like Apache Camel.
  • TIBCO: Offers Kafka for Linux and Windows. Strange product (as Kafka experts know Kafka does not work well on Windows) and minimal documentation.
  • AWS: Provides two separate products with Amazon MSK (partially managed) and Amazon MSK Serverless (fully managed). Kafka support is excluded in the MSK offerings. AWS has hundreds of cloud services, and Kafka is part of that broad spectrum. Only available on AWS clouds.
  • Instaclustr and Aiven: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies. Instaclustr also offers a (semi-)managed offering for on-premise infrastructure.
  • Microsoft Azure HDInsight. A piece of Azure’s Hadoop infrastructure. Not intended for other use cases. Only available on Azure clouds.
  • Lenses and Conduktor: Tools for managing and monitoring Kafka clusters. Complementary to the other vendors.

This is no comparison. Just a list with a few notes. Make your own evaluation of your favorite vendors. Check what you need: Cloud-native? Complete? Everywhere? Supported?

Kafka-compatible open-source frameworks and SaaS

A few vendors don’t rely on open-source Apache Kafka but built their own implementations for different reasons. The Kafka protocol compatibility is limited (though marketing will not tell you). This can create risk in operating existing Kafka workloads against the cluster and differs in operations and execution (which can be good or bad).

Kafka-compatible Open-Source Frameworks and Cloud Services

Here are a few notes on each vendor as a summary:

  • Apache Pulsar: A competitor to Apache Kafka. Similar story and use cases, but different architecture (Kafka is one distributed cluster – after removing the ZooKeeper dependency in 2022), Pulsar is three distributed clusters (Pulsar brokers, ZooKeeper, BookKeeper). I wrote about Pulsar vs. Kafka two years ago, and I think the status is still the same (and it is too late now to get more market traction).
  • StreamNative: The primary vendor behind Apache Pulsar. Offers self-managed and fully managed solutions. StreamNative Cloud for Kafka is in beta and not production ready.
  • DataStax: A Pulsar offering integrated into the database-focused product portfolio. Not sure if the streaming product is just marketing or not. If you want to try out the Astra Streaming cloud service powered by Pulsar, it refers you to the multi-cloud DBaaS built on Apache Cassandra.
  • Redpanda: A new entrant into the data streaming market offering self-managed and fully managed products. Interesting approach to implementing the Kafka protocol with C++. It might take some market share if they can find the proper use cases and differentiators. Today, I don’t see Redpanda as an alternative to a Kafka-native offering because of its early stage in the maturity curve and no added value for solving business problems versus the added risk compared to Apache Kafka.
  • Azure Event Hubs: A mature, fully managed cloud service. The service does one thing, and that is done very well: Data ingestion via the Kafka protocol (with limited compatibility). Hence, it is not a complete streaming platform, but is more comparable to Amazon Kinesis or Google Cloud PubSub. Only available on Azure cloud.

Be careful about statements of vendors that reimplement the Kafka protocol. Most of these vendors oversell the Kafka protocol compatibility. Additionally, “benchmarketing” (i.e., picking a sweet spot or niche scenario where you perform better than your competitor) is the favorite marketing technique to “prove” differentiators to the real Apache Kafka.

Data streaming is more than Apache Kafka…

While Apache Kafka is the de facto standard for data streaming, many complementary and competitive technologies exist.

Data Streaming SaaS like Apache Flink Spark Databricks Amazon Kinesis and Google PubSub

Even more technologies emerge these days because of the growth of this software category across the globe and all industries. That’s excellent news. Data streaming is here to stay and grow.

The situation is challenging to explore as part of the data streaming landscape, as some products are complementary and competitive to the Apache Kafka ecosystem.

Some data streaming technologies are competitive to Kafka

In some situations, you must evaluate whether Apache Kafka or another technology is the right choice. Here are a few open-source and cloud competitors:

  • Amazon Kinesis: Data ingestion into AWS data stores. Mature product for a specific problem. Only available on AWS.
  • Google Cloud PubSub: Data ingestion into GCP data stores. Mature product for a specific problem. Only available on GCP.
  • Pravega and Hazelcast Jet: Open-source frameworks for stream processing. I added these to show that there are more than Kafka and Flink in the open-source world. Though, I see little market traction.

Amazon Kinesis and Google Cloud PubSub are excellent cloud services if you “just” want to ingest data into a specific cloud storage. If there are no other use cases, these tools might be the right choice (if pricing at scale and other limitations work for you).

Apache Kafka is a much more flexible and strategic data streaming platform. Many projects still start with data ingestion and build the first pipeline. But providing access to the same stream of events to any other data sink or for powerful stream processing with tools like Kafka Streams or Apache Flink is a significant advantage.

Some data streaming technologies are complementary to Kafka

Each stream processing framework or cloud service has trade-offs. There is no single size that fits all use cases. Here are a few mature and emerging technologies that complement Apache Kafka:

  • Apache Flink: Together with Kafka Streams (part of Apache Kafka), the leading open-source stream processing framework. Advanced features include ANSI SQL support and APIs for stream and batch workloads.
  • Decodable and Immerok: Two brand new cloud services. Very early stage. I still added them, as I think it is an excellent strategic move to build a data streaming cloud service on top of Apache Flink. Huge potential if it is combined with existing Kafka infrastructures in enterprises.
  • Spark Streaming: The streaming part of Apache Spark. I am still not 100 percent convinced. Kafka Streams and Apache Flink are the better choices for stream processing. However, the enormous installed base of Spark clusters in enterprises broadens adoption.
  • Databricks: The leading vendor behind Apache Spark. Getting or at least trying to get much more into the business of real-time data. I like the platform, but I am not convinced by the lakehouse story around “doing everything within one big data lake”. Check out my blog series “Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?“.

Most of these technologies complement Apache Kafka. But stream processing frameworks like Flink or cloud services like Databricks do NOT need Kafka as an ingestion layer. There are other options…

Flink, Spark, et al. can consume data from other streaming platforms or directly from data stores. However, be careful with the latter: If you use Flink or Spark Streaming for stream processing, that’s fine. But if the first thing to do is read the data from an S3 object store, well, that is data at rest. Don’t do stream processing with data at rest.

Or in other words, don’t store data in a database or data lake just to reverse it later. Almost all Spark Streaming examples and case studies I saw last year at conferences and customer meetings looked like this. That is an anti-pattern for stream processing!

To be clear: It is okay to ingest data from S3 or another data store to a stream processing application built with Kafka Streams, Flink, et al. This data can be used in the stateful backend for your tasks like enrichment purposes. A stream processing application is not just about real-time data feeds. It also correlates these real-time feeds with (already ingested) historical data. This is a common approach for metadata or business data that is updated less frequently (like from an SAP ERP system).

Why are Kafka Streams and KSQL missing in the data streaming landscape?

I intentionally did not put Kafka Streams and KSQL into the data streaming landscape. Both are Kafka-native stream processing technologies.

Kafka Streams, like Kafka Connect, are part of open-source Apache Kafka. Hence, the Java library is included if you download Kafka from the Apache website. It is already included in the data streaming landscape with the Kafka logo. You should always ask yourself if you need another framework besides Kafka Streams for stream processing. The significant benefit: One technology, one vendor, one infrastructure.

Many vendors exclude or do not focus on Kafka Streams and Kafka Connect and only offer incomplete Kafka; they want to sell their own integration and processing products instead.

KSQL is an abstraction layer on top of Kafka Streams to provide stream processing with streaming SQL. A great tool, also Kafka-native. It comes with a Confluent Community License and is free to use. Hence, like Kafka Streams, I see it as part of Kafka and did not explicitly put it into the data streaming landscape as a separate product. But you need to evaluate it against Flink, Decodable, and others, for your use case, of course.

The data streaming era is just beginning…

The data streaming landscape 2023 shows how a new software category is emerging. We are still in a very early stage. In most conversations with customers, partners, and the community, I hear statements like:

“We see the value, but we are not there yet – we now start with building first data streaming pipelines and have a roadmap for the next years to add more advanced stream processing”.

Data streaming is a long journey, as it is a paradigm shift. We hopefully see a Gartner Magic Quadrant for Event Streaming and a Forrester Wave for Data Streaming in the foreseeable, too. A new category takes time to create. But did you already notice how much more the analysts of Gartner, Forrester, and others already write about data streaming and the various vendors? I also wrote a dedicated blog explaining why data streaming is its own software category.

Looking at the competitive data streaming market, one of my favorite real-world examples for choosing the right stream processing technologies comes from DoorDash: Why companies migrate from Amazon SQS and Kinesis to Apache Kafka and Flink. The article explores the trade-offs between cloud-specific solutions like Kinesis or PubSub and an open ecosystem around open-source technologies like Kafka and Flink.

Last but not least, check out my Top 5 Data Streaming Trends for 2023 to understand how the data streaming landscape fits into emerging trends like data mesh, data sharing, and data governance.

What are your most relevant and exciting trends for data streaming and Apache Kafka in 2023 to set data in motion? What does your enterprise landscape for data streaming look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Data Streaming Landscape 2023 appeared first on Kai Waehner.

]]>
The Heart of the Data Mesh Beats Real-Time with Apache Kafka https://www.kai-waehner.de/blog/2022/07/28/the-heart-of-the-data-mesh-beats-real-time-with-apache-kafka/ Thu, 28 Jul 2022 06:08:38 +0000 https://www.kai-waehner.de/?p=4706 If there were a buzzword of the hour, it would undoubtedly be "data mesh"! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. Learn how the de facto standard for data streaming, Apache Kafka, plays a crucial role in building a data mesh.

The post The Heart of the Data Mesh Beats Real-Time with Apache Kafka appeared first on Kai Waehner.

]]>
If there were a buzzword of the hour, it would undoubtedly be “data mesh“! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. Learn how the de facto standard for data streaming, Apache Kafka, plays a crucial role in building a data mesh.

The Heart of the Data Mesh Beats Real Time with Apache Kafka

There is no single technology or product for a data mesh!

This post explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and – complemented by many other data platforms like a data warehouse, data lake, and lakehouse – solve real business problems.

There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The key outcome of a data mesh architecture is the ability to build data products; with the right tool for the job. A good data mesh combines data streaming technology like Apache Kafka or Confluent Cloud with cloud-native data warehouse and data lake architectures from Snowflake, Databricks, Google BigQuery, et al.

What is a data mesh?

I won’t write yet another article describing the concepts of a data mesh. Zhamak Dehghani coined the term in 2019. The following data mesh architecture from 30,000-foot view explains the basic idea well:

Data Mesh for Domain Driven Design Microservices Data Mart Data Streaming

I summarize data mesh as the following three bullet points:

  • An architecture paradigm with several historical influences (domain-driven design, microservices, data marts, data streaming)
  • Not specific to a single technology or product; no single vendor can implement a data mesh alone
  • Handling data as a product is a fundamental change, enabling a more flexible architecture and independent solving of separate business problems
  • Decentralized services, not just analytics but also transactional workloads

Why handle data as a product?

Talking about innovative technology is insufficient to introduce a new architectural paradigm. Consequently, measuring the business value of the enterprise architecture is critical, too.

McKinsey finds that “when companies instead manage data like a consumer product—be it digital or physical—they can realize near-term value from their data investments and pave the way for quickly getting more value tomorrow. Creating reusable data products and patterns for piecing together data technologies enables companies to derive value from data today and tomorrow”:

McKinsey - Why Handle Data as a Product

For McKinsey, the benefits of this approach can be significant:

  • New business use cases can be delivered as much as 90 percent faster.
  • The total cost of ownership, including technology, development, and maintenance, can decline by 30 percent.
  • The risk and data-governance burden can be reduced.

What is data streaming with Apache Kafka and its relation to data mesh?

A data mesh enables flexibility through decentralization and best-of-breed data products. The heart of data sharing requires reliable real-time data at any scale between data producers and data consumers. Additionally, true decoupling between the decentralized data products is key to the success of the data mesh paradigm. Each domain must have access to shared data but also the ability to choose the right tool (i.e., technology, API, product, or SaaS) to solve its business problems.

That’s where data streaming fits into the data mesh story:

Flexibility through Decentralization and Best-of-Breed with Data Streaming

The de facto standard for data streaming is Apache Kafka. A cloud-native data streaming infrastructure that can link clusters with each other out-of-the-box enables building a modern data mesh. No Data Mesh will use just one technology or vendor. Learn from inspiring posts from your favorite data products vendors like AWS, Snowflake, Databricks, Confluent, and many more to successfully define and build your custom Data Mesh. Data Mesh is a journey, not a big bang. A data warehouse or data lake (or in modern days, a lakehouse) cannot be the only infrastructure for data mesh and data products.

I covered how to leverage the capabilities of Apache Kafka and its ecosystem like Kafka Connect, ksqlDB, Cluster Linking, etc. to build the heart of a data mesh in a separate blog post: Streaming Data Exchange with Kafka and a Data Mesh in Motion.

Example: Real-time data fabric in hybrid cloud

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

Hybrid Cloud Streaming Data Mesh powered by Apache Kafka and Cluster Linking

This example shows all the characteristics discussed in the above sections for a Data Mesh:

  • Decentralized real-time infrastructure across domains and infrastructures
  • True decoupling between domains within and between the clouds
  • Several communication paradigms, including data streaming, RPC, and batch
  • Data integration with legacy and cloud-native technologies
  • Continuous stream processing where it adds value, and batch processing in some analytics sinks

Presentation: Building a decentralized data mesh with data streaming at its heart

The following slide deck walks you through the motivation, principles, and architectures of building a real-time data mesh powered by Apache Kafka using the Kappa architecture, hybrid cloud, and stream data sharing:

The data mesh provides flexibility and freedom of technology choice for each data product

The heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. As the de facto standard for data streaming, Apache Kafka plays a crucial role in a cloud-native data mesh architecture. Nevertheless, data mesh is not bound to a specific technology. The beauty of the decentralized architecture is the freedom of technology choice for each business unit when building its data products.

Data sharing between domains within and across organizations is another aspect where data streaming helps in a data mesh. Real-time data beats slow data. That is not just true for most business problems across industries but also for replicating data between data centers, clouds, regions, or organizations. A streaming data exchange enables data sharing in real-time to build a data mash in motion.

Did you already start building your Data Mesh? What does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Is the heart of your data mesh real-time in motion or some lakehouse at rest? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Heart of the Data Mesh Beats Real-Time with Apache Kafka appeared first on Kai Waehner.

]]>
Streaming Data Exchange with Kafka and a Data Mesh in Motion https://www.kai-waehner.de/blog/2021/11/14/streaming-data-exchange-data-mesh-apache-kafka-in-motion/ Sun, 14 Nov 2021 13:25:45 +0000 https://www.kai-waehner.de/?p=3412 Data Mesh is a new architecture paradigm that gets a lot of buzzes these days. This blog post looks into this principle deeper to explore why no single technology is the perfect fit to build a  Data Mesh. Examples show why an open and scalable decentralized real-time platform like Apache Kafka is often the heart of the Data Mesh infrastructure, complemented by many other data platforms to solve business problems.

The post Streaming Data Exchange with Kafka and a Data Mesh in Motion appeared first on Kai Waehner.

]]>
Data Mesh is a new architecture paradigm that gets a lot of buzzes these days. Every data and platform vendor describes how to build the best Data Mesh with their platform. The Data Mesh story includes cloud providers like AWS, data analytics vendors like Databricks and Snowflake, and Event Streaming solutions like Confluent. This blog post looks into this principle deeper to explore why no single technology is the perfect fit to build a Data Mesh. Examples show why an open and scalable decentralized real-time platform like Apache Kafka is often the heart of the Data Mesh infrastructure, complemented by many other data platforms, to solve business problems.

Streaming Data Exchange with Apache Kafka and Data Mesh in Motion

Data at Rest vs. Data in Motion

Before we get into the Data Mesh discussion, it is crucial to clarify the difference and relevance of Data at Rest and Data in Motion:

  • Data at Rest: Data is ingested and stored in a storage system (database, data warehouse, data lake). Business logic and queries execute against the storage. Everyday use cases include reporting with business intelligence tools, model training in machine learning,  and complex batch analytics like shuffling or map and reduce. As the data is at rest, the processing is too late for real-time use cases.
  • Data in Motion: Data is processed and correlated continuously while new events are fed into the platform. Business logic and queries execute in real-time. Common real-time use cases include inventory management, order processing, fraud detection, predictive maintenance, and many other use cases.

Real-time Data Beats Slow Data

Real-time beats slow data in almost all use cases across industries. Hence, ask yourself or your business team how they want or need to consume and process the data in the next project. Data at Rest and Data in Motion have trade-offs. Therefore, both concepts are complementary. For this reason, modern cloud infrastructures leverage both in their architecture. Serverless Event Streaming with Kafka combined with the AWS Lakehouse is a great resource to learn more.

However, while connecting a batch system to a real-time nervous system is possible, the other way round – connecting a real-time consumer to batch storage – is not possible. The Kappa vs. Lambda Architecture discussion gives more insights into this.

Kafka is a database. So, it is also possible to use it for data at rest. For instance, the replayability of historical events in guaranteed ordering is essential and helpful for many use cases. However, long-term storage in Kafka has several limitations, like limited query capabilities. Hence, for many use cases, event streaming and other storage systems are complementary, not competitive.

Data Mesh – An Architecture Paradigm

Data mesh is an implementation pattern (not unlike microservices or domain-driven design) but applied to data. Thoughtworks coined the term. You will find tons of resources on the web. Zhamak Dehghani gave a great introduction about “How to build the Data Mesh Foundation and its Relation to Event Streaming” at the Kafka Summit Europe 2021.

Domain-driven + Microservices + Event Streaming

Data Mesh is not an entirely new paradigm. It has several historical influences:

Data Mesh Architecture with Micorservices Domain-driven Design Data Marts and Event Streaming

The architectural paradigm unlocks analytical data at scale, rapidly unlocking access to an ever-growing number of distributed domain data sets for a proliferation of consumption scenarios such as machine learning, analytics, or data-intensive applications across the organization. A data mesh addresses the common failure modes of the traditional centralized data lake or data platform architecture.

Data Mesh is a Logical View, not Physical!

Data mesh shifts to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create a self-serve data infrastructure, treating data as a product, and implementing open standardization to enable an ecosystem of interoperable distributed data products.

Here is an example of a Data Mesh:

Data Mesh with Apache Kafka

TL;DR: Data Mesh combines existing paradigms, including Domain-driven Design, Data Marts, Microservices, and Event Streaming.

Data as the Product

However, the differentiating aspect focuses on product thinking (“Microservice for Data”) with data as a first-class product. Data products are a perfect fit for Event Streaming with Data in Motion to build innovative new real-time use cases.

Data Product - The Domain Driven Microservice for Data

A Data Mesh with Event Streaming

Why is Event Streaming a good fit for data mesh?

Streams are real-time, so you can propagate data throughout the mesh immediately, as soon as new information is available. Streams are also persisted and replayable, so they let you capture both real-time AND historical data with one infrastructure. And because they are immutable, they make for a great source of record, which is helpful for governance.

Data in Motion is crucial for most innovative use cases. And as discussed before, real-time data beats slow data in almost all scenarios. Hence, it makes sense that the heart of a Data Mesh architecture is an Event Streaming platform. It provides true decoupling, scalable real-time data processing, and highly reliable operations across the edge, data center, and multi-cloud.

Kafka Streaming API – The De Facto Standard for Data in Motion

The Kafka API is the de facto standard for Event Streaming. I won’t explore this discussion again and again. Here are a few references before we move to the “Kafka + Data Mesh” content…

A Kafka-powered Data Mesh

I highly recommend watching Ben Stopford’s and Michael Noll’s talk about “Apache Kafka and the Data Mesh“. Several of the screenshots in this post are from that presentation, too. Kudos to my two colleagues! The talk explores the key concepts of a Data Mesh and how they are related to Event Streaming:

  • Domain-driven Decentralization
  • Data as a Self-serve Product
  • First-class Data Platform
  • Federated Governance

Let’s now explore how Event Streaming with Kafka fits into the Data Mesh architecture and how other solutions like a database or data lake complement it.

Data Exchange for Input and Output within a Data Mesh using Kafka

Data product, a “microservice for the data world”:

  • A node on the data mesh, situated within a domain.
  • Produces and possibly consumes high-quality data within the mesh.
  • Encapsulates all the elements required for its function, namely data plus code plus infrastructure.

A Data Mesh is not just one Technology!

The heart of a Data Mesh infrastructure must be real-time, decoupled, reliable, and scalable. Kafka is a modern cloud-native enterprise integration platform (also often called iPaaS today). Therefore, Kafka provides all the capabilities for the foundation of a Data Mesh.

However, not all components can or should be Kafka-based. Choose the right tool for a problem. Let’s explore in the following subsections how Kafka-native technologies and other solutions are used in a Data Mesh together.

Stream Processing within the Data Product with Kafka Streams and ksqlDB

An event-based data product aggregates and correlates information from one or more data sources in real-time. Stateless and stateful stream processing is implemented with Kafka-native tools such as Kafka Streams or ksqlDB:

Event Streaming within the Data Product with Stream Processing Kafka Streams and ksqlDB

Variety of Protocols and Communication Paradigms within the Data Product – HTTP, gRPC, MQTT, and more

Obviously, not every application uses just Event Streaming as a technology and communication paradigm. The above picture shows how one consumer application could also be a request/response technology like HTTP or gRPC to do a pull query. In contrast, another application continuously consumes the streaming push query with a native Kafka consumer in any programming language, such as Java, Scala, C, C++, Python, Go, etc.

The data product often includes complementary technologies. For instance, if you built a connected car infrastructure, you likely use MQTT for the last-mile integration, ingest the data into Kafka, and further processing with Event Streaming. The “Kafka + MQTT Blog Series” is an excellent example from the IoT space to learn about building a data product with complementary technologies.

Variety of Solutions within the Data Product – Event Streaming, Data Warehouse, Data Lake, and more

The beauty of microservice architectures is that every application can choose the right technologies. An application might or might not include databases, analytics tools, or other complementary components. The input and output data ports of the data product should be independent of the chosen solutions:

Data Stores within the Data Product with Snowflake MongoDB Oracle et al

Kafka Connect is the right Kafka-native technology to connect other technologies and communication paradigms with the Event Streaming platform. Evaluate if you need another integration middleware (like an ETL or ESB) or if the Kafka infrastructure is the better enterprise integration platform (iPaaS) for your data product within the data mesh.

A Global Streaming Data Exchange

The Data Mesh concept is relevant for global deployments, not just within a single project or region. Multiple Kafka clusters are the norm, not an exception. I wrote about customers using Event Streaming with Kafka in global architectures a long time ago.

Various architectures exist to deploy Kafka across data centers and multiple clouds. Some use cases require low latency and deploy some Kafka instances at the edge or in a 5G zone. Other use cases replicate data between regions, countries, or continents across the globe for disaster recovery, aggregation, or analytics use cases.

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

Hybrid Cloud Streaming Data Mesh powered by Apache Kafka and Cluster Linking

This example shows all the characteristics discussed in the above sections for a Data Mesh:

  • Decentralized real-time infrastructure across domains and infrastructures
  • True decoupling between domains within and between the clouds
  • Several communication paradigms, including data streaming, RPC, and batch
  • Data integration with legacy and cloud-native technologies
  • Continuous stream processing where it adds value, and batch processing in some analytics sinks

Example: A Streaming Data Exchange across Domains in the Automotive Industry

The following example from the automotive industry shows how independent stakeholders (= domains in different enterprises) use a cross-company streaming data exchange:

Streaming Data Exchange with Data Mesh in Motion using Apache Kafka and Cluster Linking

Innovation does not stop at the own border. Streaming replication is relevant for all use cases where real-time is better than slow data (valid for most scenarios). A few examples:

  • End-to-end supply chain optimization from suppliers to the OEM to the intermediary to the aftersales
  • Track and trace across countries
  • Integration of 3rd party add-on services to the own digital product
  • Open APIs for embedding and combining external services to build a new product

I could go on and on with the list. Many data products need to be accessible by 3rd party in real-time at scale. Some API gateway or API management tool comes into play in such a situation.

A real-world example of a streaming data exchange powered by Kafka is the mobility service Here Technologies. They expose the Kafka API to directly consume streaming data from their mapping services (as an alternative option to their HTTP API):

Here Technologies Moblility Service Apache Kafka Open API

However, even if all collaborating partners use Kafka under the hood in their architecture, exposing the Kafka API directly to the outside world does not always make sense. Some technical capabilities (e.g., access control or connectivity to thousands of devices) and missing business functions (e.g., for monetization or reporting) of the Kafka ecosystem bring an API layer on top of the Event Streaming infrastructure into play in many real-world deployments.

Open API for 3rd Party Integration and Streaming API Management

API Gateways and API Management tools exist in many varieties, including open-source frameworks, commercial products, and SaaS cloud offerings. Features include technical routing, access control, monetization, and reporting.

However, most people still implement the Open API concept with RPC in mind. I guess 95+% still use HTTP(S) to make APIs accessible to other stakeholders (e.g., other business units or external parties). RPC makes little sense in a streaming Data Mesh architecture if the data needs to be processed at scale in real-time.

There is still an impedance mismatch between Event Streaming and API Management. But it gets better these days. Specifications like AsyncAPI, calling itself the “industry standard for defining asynchronous APIs”, and similar approaches bring Open API to the data streaming world. My post “Kafka versus API Management with tools like MuleSoft, Kong, or Apigee” is still pretty much accurate if you want to dive deeper into this discussion. IBM API Connect was one of the first vendors that integrated Kafka via Async API.

A great example of the evolution from RPC to streaming APIs is the machine learning space. “Streaming Machine Learning with Kafka-native Model Deployment” explores how model servers such as Seldon enhance their product with a native Kafka API besides HTTP and gRPC request-response communication:

Kafka-native Machine Learning Model Server Seldon

Journey to the Streaming Data Mesh with Kafka

The paradigm shift is enormous. Data Mesh is not a free lunch. The same was and still is true for microservice architectures, domain-driven design, Event Streaming, and other modern design principles.

In analogy to Confluent’s maturity model for Event Streaming, our team has described the journey for deploying a streaming Data Mesh:

Data Mesh Journey with Event Streaming and Apache Kafka

The efforts likely take a few years in most scenarios. The shift is not just about technologies, but, as necessary are adjustments to organizations and business processes. I guess most companies are still in a very early stage. Let me know where you are on this journey!

Streaming Data Exchange as Foundation for a Data Mesh

A Data Mesh is an implementation pattern, not a specific technology. However, most modern enterprise architectures require a decentralized streaming data infrastructure to build valuable and innovative data products in independent, truly decoupled domains. Hence, Kafka, being the de facto standard for Event Streaming,  comes into play in many Data Mesh architectures.

Many Data Mesh architectures span across many domains in various regions or even continents. The deployments run at the edge, on-prem, and multi-cloud. The integration connects to many solutions, technologies with different communication paradigms.

A cloud-native Event Streaming infrastructure with the capability to link clusters with each other out-of-the-box enables building a modern Data Mesh. No Data Mesh will use just one technology or vendor. Learn from the inspiring posts from your favorite data products vendors like AWS, Snowflake, Databricks, Confluent, and many more to define and build your custom Data Mesh successfully. Data Mesh is a journey, not a big bang.

Did you already start building your Data Mesh? How does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Streaming Data Exchange with Kafka and a Data Mesh in Motion appeared first on Kai Waehner.

]]>