The Data Streaming Landscape 2025

Data streaming is a new software category. It has grown from niche adoption to becoming a fundamental part of modern data architecture. With real-time data processing transforming industries, the ecosystem of tools, platforms, and services has evolved significantly. This blog post explores the data streaming landscape of 2025, analyzing key players, trends, and market dynamics shaping this space.

The data streaming landscape of 2025 categorizes solutions by their adoption and completeness as fully-featured data streaming platforms, as well as their deployment models, which range from self-managed setups to BYOC (Bring Your Own Cloud) and fully managed cloud services like PaaS and SaaS. While Apache Kafka remains the backbone of this ecosystem, the landscape also includes stream processing engines like Apache Flink and competitive technologies such as Pulsar and Redpanda that leverage the Kafka protocol.

This blog also explores the latest market trends and provides an outlook for 2025 and beyond, highlighting potential new entrants and evolving use cases. By the end, you’ll gain a clear understanding of the data streaming platform landscape and its trajectory in the years to come.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free ebook about data streaming use cases and industry-specific success stories.

Data Streaming in 2025: The Rise of a New Software Category

Real-time data beats slow data. That’s true across almost all use cases in any industry. Event-driven applications powered by data streaming continuously process data from any data source. This approach increases business value by increasing revenue, reducing cost, reducing risk, or improving the customer experience. And an event-driven architecture keeps the overall system future-ready.

Even top researchers and advisory firms such as Forrester, Gartner, and IDG recognize data streaming as a new software category. In December 2023, Forrester released “The Forrester Wave™: Streaming Data Platforms, Q4 2023,” highlighting Microsoft, Google, and Confluent as leaders, followed by Oracle, Amazon, and Cloudera.

Data Streaming is NOT just another data integration tool. Plenty of software categories and related data platforms exist to process and analyze data. I explored in a few dedicated series how data streaming differs:

The Business Value of Data Streaming

A new software category opens use cases and adds business value across all industries:

Use Cases for Data Streaming with Apache Kafka by Business Value
Source: Lyndon Hedderly (Confluent)

Adding business value is crucial for any enterprise. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products.

Search my blog for your favorite industry to find plenty of case studies and architectures. Or read about use cases for Apache Kafka across industries to get started.

The Data Streaming Landscape of 2025

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category. Several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Various SaaS startups have emerged in this category in the last few years.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

It all began with the open-source framework Apache Kafka, and today, the Kafka protocol is widely adopted across various implementations, including proprietary ones. What truly matters now is leveraging the capabilities of a complete data streaming platform—one that is fully compatible with the Kafka protocol. This includes built-in features like connectors, stream processing, security, data governance, and the elimination of self-management, reducing risks and operational effort.

The Kafka Protocol is the De Facto Standard of Data Streaming

Most software vendors use Kafka (or its protocol) at the core of their data streaming platforms. Apache Kafka has become the de facto standard for data streaming.

Apache Kafka is the de facto standard API for data streaming in 2024

Additionally, “benchmarketing” (i.e., picking a sweet spot or niche scenario where you perform better than your competitor) is the favorite marketing technique to “prove” differentiators to the real Apache Kafka. Some vendors also present misleading cost-efficiency comparisons by excluding critical cloud costs such as data transfer or storage, giving an incomplete picture of the true expenses.
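
To make the protocol-compatibility argument concrete, here is a minimal sketch using the open-source confluent-kafka Python client. The broker address, topic name, and security settings are placeholders; the point is that the same standard client code talks to open-source Apache Kafka as well as to Kafka-compatible services, typically with nothing more than a change of connection and authentication configuration.

```python
from confluent_kafka import Producer

# Placeholder connection settings. For a Kafka-compatible cloud service,
# usually only the endpoint and the SASL/TLS credentials change:
conf = {
    "bootstrap.servers": "broker-1:9092",
    # "security.protocol": "SASL_SSL",
    # "sasl.mechanisms": "PLAIN",
    # "sasl.username": "<api-key>",
    # "sasl.password": "<api-secret>",
}

producer = Producer(conf)

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}")

producer.produce("orders", key="order-4711", value='{"status": "created"}',
                 callback=delivery_report)
producer.flush()  # Block until all queued messages are delivered.
```

Anything that speaks the Kafka wire protocol, whether a self-managed cluster, a PaaS, or a fully managed SaaS, can be the target of this snippet; that is exactly why the protocol, not a specific distribution, has become the common denominator.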

Apache Kafka vs. Data Streaming Platform

Many still use Kafka merely as a dumb ingestion pipeline, overlooking its potential to power sophisticated, real-time data streaming use cases. One reason is that Kafka alone lacks the full capabilities of a comprehensive data streaming platform.

A complete solution requires more than “just” Kafka. Apache Flink is becoming the de facto standard for stream processing. Data integration capabilities (connectors, clients, APIs), data governance, security, and critical 24/7 SLAs and support are important for many data streaming projects.

The Data Streaming Landscape 2025 summarizes the current status of relevant products and cloud services, focusing on deployment models and the adoption/completeness of the data streaming platforms.

Data Streaming Vendors and Categories for the 2025 Landscape

The data streaming landscape changed this year. As most solutions evolve, I no longer distinguish between Kafka, non-Kafka, and stream processing as categories. Instead, I look at adoption and completeness to assess the maturity of a data streaming solution, from an open-source framework to a complete platform.

The deployment models also changed in the 2025 landscape. Instead of categorizing it into Self Managed, Partially Managed, and Fully Managed, I sort as follows: Self Managed, Bring Your Own Cloud (BYOC), and Cloud. The Cloud category is separated into PaaS (Platform as a Service) and SaaS (Software as a Service) to indicate that many Kafka cloud offerings are still NOT fully managed!

Please note: Intentionally, this data streaming landscape is not a complete list of frameworks, cloud services, or vendors. It is also not official research, and there is no statistical evidence behind it. If your favorite technology is missing from this diagram, then I did not see it in my conversations with customers, prospects, partners, analysts, or the broader data streaming community.

Also, note that I focus on general data streaming infrastructure. Brilliant solutions exist for using and analyzing streaming data for specific scenarios, like time-series databases, machine learning engines, observability platforms, or purpose-built IoT solutions. These are usually complementary, often connected out of the box via a Kafka connector, or even built on top of a data streaming platform (invisible for the end user).

Adoption and Completeness of Data Streaming (X-Axis)

Data streaming is adopted more and more across all industries. The concept is not new. In “The Past, Present and Future of Stream Processing“, I explored how the data streaming journey started decades ago with research and the first purpose-built proprietary products for specific use cases like stock trading.

Open source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Does anyone still remember (or even still use) Apache Storm? 🙂

Today, most enterprises are realizing the value of data streaming for both analytical and operational use cases across industries. The cloud has brought a transformative shift, enabling businesses to start streaming and processing data with just a click, using fully managed SaaS solutions and intuitive UIs. Complete data streaming platforms now offer many built-in features that users previously had to develop themselves, including connectors, encryption, access control, governance, data sharing, and more.

Capabilities of a Complete Data Streaming Platform

Data streaming vendors are on the way to building a complete Data Streaming Platform (DSP). Capabilities include:

  • Messaging (“Streaming”): Transfer messages in real-time and persist for durability, decoupling, and slow consumers (near real-time, batch, API, file).
  • Data Integration: Connect to any legacy and cloud-native sources and sinks (see the Kafka Connect sketch after this list).
  • Stream Processing: Correlate events with stateless and stateful transformation or business logic.
  • Data Governance:  Ensure security, observability, data sovereignty, and compliance.
  • Developer Tooling: Enable flexibility for different personas such as software engineers, data scientists, and business analysts by providing different APIs (such as Java, Python, SQL, REST/HTTP), graphical user interfaces, and dashboards.
  • Operations Tooling and SaaS: Ease infrastructure management on premises, or take over the entire operations burden in the public cloud with serverless offerings.
  • Uptime SLAs and Support: Provide the required guarantees and expertise for critical use cases.
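
To make the data integration capability above more tangible, here is a minimal sketch that registers a source connector through the Kafka Connect REST API. It uses the FileStreamSource connector that ships with Apache Kafka; the Connect worker URL, file path, and topic name are placeholders chosen for illustration.

```python
import json
import requests  # third-party HTTP client

CONNECT_URL = "http://localhost:8083"  # placeholder Kafka Connect worker endpoint

# FileStreamSource ships with Apache Kafka and tails a file into a topic.
connector = {
    "name": "demo-file-source",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/orders.txt",  # placeholder source file
        "topic": "orders-raw",      # placeholder target topic
    },
}

# POST /connectors registers the connector on the Connect cluster.
resp = requests.post(f"{CONNECT_URL}/connectors", json=connector, timeout=10)
resp.raise_for_status()

# GET /connectors/<name>/status reports the connector and task states.
status = requests.get(f"{CONNECT_URL}/connectors/demo-file-source/status", timeout=10)
print(json.dumps(status.json(), indent=2))
```

A complete data streaming platform bundles hundreds of such connectors (databases, SaaS applications, object stores) so that teams do not have to build and operate this integration layer themselves.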

Evolution from Open Source Adoption to a Data Streaming Organization

Modern data streaming is not just about adopting a product; it’s about transforming the way organizations operate and derive value from their data. Hence, the adoption goes beyond product features:

  • From open source and self-operations to enterprise-grade products and SaaS.
  • From human scale to automated, serverless elasticity with consumption-based pricing.
  • From dumb ingestion pipes to complex data pipelines and business applications.
  • From analytical workloads to critical transactional (and analytical) use cases.
  • From a single data streaming cluster to a powerful edge, hybrid, and multi-cloud architecture, including integration, migration, aggregation, and disaster recovery scenarios.
  • From wild adoption across business units with uncontrolled growth using various frameworks, cloud services, and integration tools to a center of excellence (CoE) with a strategic approach, standards, best practices, and knowledge sharing in an internal community.
  • From effortful and complex human management to enterprise-wide data governance, automation, and self-service APIs.

Data Streaming Deployment Models: Self-Managed vs. BYOC vs. Cloud (Y-Axis)

Different data streaming categories exist regarding the deployment model:

  • Self-Managed: Operate nodes like Kafka Broker, Kafka Connect, and Schema Registry by yourself with your favorite scripts and tools. This can be on-premise or in the public cloud in your VPC. Reduce the operations burden via a cloud-native platform (usually Kubernetes) and related operator tools that automate operations tasks like rolling upgrades or rebalancing Kafka Partitions.
  • Bring Your Own Cloud (BYOC): Allow organizations to host Kafka within their own cloud VPC. BYOC combines some of the benefits of cloud flexibility with enhanced security and control, while it outsources most of Kafka’s management to specialized vendors. The data plane is still customer-managed, but in contrast to self-managed Kafka, the customer does not need to worry about the complexity under the hood (like rebalancing, rolling upgrades, backups) – that is what cloud-native object storage and other magic code of the BYOC control plane service take over.
  • Cloud (PaaS or SaaS): Both PaaS and SaaS solutions operate within the cloud provider’s VPC. Fully managed SaaS for data streaming takes over all operational responsibilities, including scaling, failover handling, upgrades, and performance tuning, allowing users to focus solely on integration and business logic. In contrast, partially managed PaaS reduces the operational burden by automating certain tasks like rolling upgrades and rebalancing, but still requires some level of user involvement in managing the infrastructure. Fully managed SaaS typically provides critical SLAs for support and uptime, while partially managed PaaS cannot provide such guarantees.

Most organizations prefer SaaS for data streaming when business and technical requirements allow, as it minimizes operational complexity and maximizes scalability. Other deployment models are chosen when specific constraints or needs require them.

The Evolution of BYOC Kafka Cloud Services

Cloud and On-Premise deployment options are typically well understood, but BYOC (Bring Your Own Cloud) often requires more explanation due to its unique operating model and varying implementations across vendors.

In last year’s data streaming landscape 2024, I wrote the following about BYOC for Kafka:

I do NOT believe in this approach as too many questions and challenges exist with BYOC regarding security, support, and SLAs in the case of P1 and P2 tickets and outages. Hence, I put this in the category of self-managed. That is what it is, even though the vendor touches your infrastructure. In the end, it is your risk because you have to and want to control your environment.

This statement made sense because BYOC vendors at that time required access to the customer VPC and offered a shared support model. While this is still true for some BYOC solutions, my mind changed with the innovation of BYOC by one emerging vendor: WarpStream.

WarpStream’s BYOC Operating Model with Stateless Agents in the Customer VPC

WarpStream published a new operating model for BYOC: The customer only deploys stateless agents in its VPC and provides an object storage bucket to store the data. The control plane and metadata store are fully managed by the vendor as SaaS and the vendor takes over all the complexity.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

With this innovation, BYOC is now a worthwhile third deployment option besides a self-managed and fully managed data streaming platform. It brings several benefits:

  • No access is needed by the BYOC cloud vendor to the customer VPC. The data plane (i.e., the “brokers” in the customer VPC) is stateless. The metadata/consensus is in the control plane (i.e., the cloud service in the WarpStream VPC).
  • The architecture solves sovereignty challenges and is a great fit for security and compliance requirements.
  • The cost of the BYOC offering is lower than self-managed Apache Kafka because it is built with cloud-native concepts and technologies in mind (e.g., zero disks and zero interzone networking fees, leveraging cloud object storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage).
  • The stateless architecture in the customer VPC makes autoscaling and elasticity very easy to implement/configure.

When to use BYOC?

WarpStream introduced an innovative share-nothing operating model that makes BYOC practical, secure, and cost-efficient. With that being said, I still recommend only looking at BYOC options for Apache Kafka in the public cloud if a fully managed and serverless data streaming platform does NOT work for you because of cost, security, or compliance reasons! When it comes to simplicity and ease of operation, nothing beats a fully managed cloud service.

And please keep in mind that NOT every BYOC cloud service provides these characteristics and benefits. Make sure to evaluate your favorite solutions properly. For more details, look at my blog post: “Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud)”.

Changes in the Data Streaming Landscape from 2024 to 2025

My goal is NOT a growing landscape with tens or even hundreds of vendors and cloud services. Plenty of these pictures exist. Instead, I focus on a few technologies, vendors, and cloud offerings that I see in the field, with adoption by the broader open-source and cloud community.

I already discussed the important conceptual changes in the data streaming landscape:

  • Deployment Model: From self-managed, partially managed, and fully managed to self-managed, BYOC and cloud.
  • Streaming Categories: From different streaming categories to a single category for all data streaming platforms sorted by adoption and completeness.

Additionally, as I do every year, I modified the list of solutions compared to the data streaming landscape 2024 published one year ago.

Additions to the Data Streaming Landscape 2025

The following data streaming services were added:

  • Alibaba (Cloud): Confluent Data Streaming Service on Alibaba Cloud is an OEM partnership between Alibaba Cloud and Confluent to offer a fully managed SaaS in Mainland China. The service was announced at the end of 2021 and sees more and more traction in Asia. Alibaba is the contractor and first-level support for the end user.
  • Google Managed Service for Kafka (Cloud): Google announced this Kafka PaaS recently. The strategy looks very similar to Amazon’s MSK. Even the abbreviation is the same: MSK. I explored when (not) to choose Google’s Kafka cloud service after the announcement. The service is still in preview, but available to a huge customer base already.
  • Oracle Streaming with Apache Kafka (Cloud): A partially managed Apache Kafka PaaS on Oracle Cloud Infrastructure (OCI). The service is in early access, but available to a huge customer base already.
  • WarpStream (BYOC): WarpStream was acquired by Confluent. It is still included with its logo as Confluent continues to keep the brand and solution separated (at least for now).

Removals from the Data Streaming Landscape 2025

There are NO REMOVALS this year, BUT I was close to removing two technologies:

  • Apache Pulsar and StreamNative: I was close to removing Apache Pulsar as I see zero traction in the market. Around 2020, Pulsar had some traction but focused too much on Kafka FUD instead of building a vibrant community. While Kafka simplified its architecture (ZooKeeper removal), Pulsar still includes three distributed systems (ZooKeeper or alternatives like etcd, BookKeeper, and Pulsar Broker). It also pivots to the Kafka protocol trying to get some more traction again. But it seems to be too late.
  • ksqlDB (formerly called KSQL): The product is feature complete. While it is still supported by Confluent, it will not get any new features. ksqlDB is still a great piece of software for some (!) Kafka-native stream processing projects but might be removed in the future. Confluent co-founder and Kafka co-creator Jay Kreps commented on X (formerly Twitter): “Confluent went through a set of experiments in this area. What we learned is that for *platform* layers you want a clean separation. We learned this the hard way: our source available stream processing layer KSQL, lost to open-source Apache Flink. We pivoted to Flink.”

Vendor Overview for Data Streaming Platforms

All vendors of the landscape do some kind of data streaming. However, the offerings differ a lot in adoption, completeness, and vision. And many solutions are not available everywhere but only in one public cloud or only as self-managed. For detailed product information and experiences, the vendor websites and other blogs/conference talks are the best and most up-to-date resources. The following is just a summary to get an overview.

Before we do the deep dive, here again, the entire data streaming landscape for 2025:

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Self-Managed Data Streaming with Open Source and Proprietary Products

Self Managed Data Streaming Platforms including Kafka Flink Spark IBM Confluent Cloudera

The following list describes the open-source frameworks and proprietary products for self-managed data streaming deployments (in order of adoption and completeness):

  • Apache Pulsar: A competitor to Apache Kafka. Similar story and use cases, but different architecture. Kafka is a single distributed cluster – after removing the ZooKeeper dependency in 2022. Pulsar is three (!) distributed clusters: Pulsar brokers, ZooKeeper, and BookKeeper. Pulsar vs. Kafka explored the differences. And Kafka is catching up on missing features like Queues for Kafka.
  • StreamNative: The primary vendor behind Apache Pulsar. Not much market traction.
  • ksqlDB (usually called KSQL, even after Confluent’s rebranding): An abstraction layer on top of Kafka Streams to provide stream processing with streaming SQL. Hence, also Kafka-native. It comes with a Confluent Community License and is free to use. Sweet spot: Streaming ETL.
  • Redpanda: Implements the Kafka protocol with C++. Trying out different market strategies to position Redpanda as an alternative to Kafka-native offerings. Still in the early stage of the maturity curve. Adding tons of (immature) product features in parallel to find the right market fit in a growing Kafka market. Recently acquired Benthos to provide connectivity to data sources and sinks (similar to Kafka Connect).
  • Ververica: Well-known Flink company. Acquired by Chinese tech giant Alibaba in 2019. Not much traction in Europe and the US. Sweet spot: Flink in Mainland China.
  • Apache Flink: Becoming the de facto standard for stream processing. Open-source implementation. Provides advanced features including a powerful scalable compute engine, freedom of choice for developers between SQL, Java, and Python, APIs for Complex Event Processing (CEP), and unified APIs for stream and batch workloads.
  • Spark Streaming: The streaming part of the open-source big data processing framework Apache Spark. The enormous installed base of Spark clusters in enterprises broadens adoption thanks to solutions from Cloudera, Databricks, and the cloud service providers. Sweet spot: Analytics in (micro)batches with data stored at rest in the data lake/lakehouse.
  • Apache Kafka: The de facto standard for data streaming. Open-source implementation with a vast community. Almost all vendors rely on (parts of) this project. Often underestimated: Kafka includes data integration and stream processing capabilities with Kafka Connect and Kafka Streams, making even the open-source Apache download already more powerful than many other data streaming frameworks and products.
  • IBM / Red Hat AMQ Streams: Provides Kafka as self-managed Kafka on Kubernetes via OpenShift. Kafka is part of the integration portfolio that includes other open-source frameworks like Apache Camel. Sweet spot: Existing IBM customers.
  • Cloudera: Provides Kafka, Flink, and other open-source data and analytics frameworks as a self-managed offering. The main strategy is offering one product with a vast combination of many open-source frameworks that can be deployed on any infrastructure. Sweet spot: Analytics.
  • Confluent Platform: Focuses on a complete data streaming platform including Kafka and Flink, and various advanced data streaming capabilities for operations, integration, governance, and security. Sweet spot: Unifying operational and analytical workloads, and combination with the fully managed cloud service.

Data Streaming with Bring Your Own Cloud (BYOC)

Bring Your Own Cloud BYOC Data Streaming Platforms Redpanda Databricks WarpStream

BYOC is an emerging category and is mainly used for specific challenges such as strict data security and compliance requirements. The following vendors provide dedicated BYOC offerings for data streaming (in order of adoption and completeness):

  • WarpStream (Confluent): A new entrant into the data streaming market. The cloud service is a Kafka-compatible data streaming platform built directly on top of S3. Innovated the BYOC model to enable secure and cost-effective data streaming for workloads that don’t have strong latency requirements.
  • Redpanda: The first BYOC offering on the market for data streaming. The biggest concern is the shared responsibility model of this solution because the vendor requires access to the customer VPC for operations and support. This goes against the key principles of BYOC regarding security and compliance, which is why organizations (have to) look for BYOC instead of SaaS solutions.
  • Databricks: Cloud-based data platform that provides a collaborative environment for data engineering, data science, and machine learning, built on top of Apache Spark. Data Streaming is enabled by Spark Streaming and focuses mainly on analytical workloads that are optimized from batch to near real-time.

Partially Managed Data Streaming Cloud Platforms (PaaS)

PaaS Cloud Streaming Platforms like Red Hat Amazon MSK Google Oracle Aiven

Here is an overview of relevant PaaS data streaming cloud services (in order of adoption and completeness):

  • Google Cloud Managed Service for Apache Kafka (MSK): Initially branded as Google Managed Kafka for BigQuery (likely for a better marketing push), the service enables data ingestion into lakehouses on GCP such as Google BigQuery.
  • Amazon Managed Service for Apache Flink (MSF): A partially managed service by AWS that allows customers to transform and analyze streaming data in real-time with Apache Flink. It still provides some (costly) gaps for auto-scaling and is not truly serverless. Supports all Flink interfaces, i.e., SQL, Java, and Python. And non-Kafka connectors, too. Only available on AWS.
  • Oracle OCI Streaming with Apache Kafka: The service is still in early access, but available to a huge customer base already on Oracle’s cloud infrastructure.
  • Microsoft Azure HDInsight: A piece of Azure’s Hadoop infrastructure. Not intended for other use cases beyond data ingestion for batch analytics.
  • Instaclustr: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies. Instaclustr also offers a (semi-)managed offering for on-premise infrastructure.
  • Amazon Kinesis: Data ingestion into AWS data stores. Mature product for a specific problem. Only available on AWS.
  • Aiven: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies.
  • IBM / Red Hat AMQ Streams: Provides Kafka as a partially managed cloud offering on OpenShift (through Red Hat). Sweet spot: Existing IBM customers.
  • Amazon Managed Service for Apache Kafka (MSK): AWS has hundreds of cloud services, and Kafka is part of that broad spectrum. MSK is only available in public AWS cloud regions; not on Outposts, Local Zones, Wavelength, etc. MSK is likely the largest partially managed Kafka service across all clouds. It evolved with new features like support for Kafka Connect and Tiered Storage. But lacks connectivity outside the AWS ecosystem and a data governance narrative.

Fully Managed Data Streaming Cloud Services (SaaS)

Fully Managed SaaS Data Streaming Platform including Confluent Alibaba Microsoft Event Hubs

Here is an overview of relevant SaaS data streaming cloud services (in order of adoption and completeness):

  • Decodable: A relatively new cloud service for Apache Flink in the early stage. Huge potential if it is combined with existing Kafka infrastructures in enterprises. But also provides pre-built connectors for non-Kafka systems. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
  • StreamNative Cloud: The primary vendor behind Apache Pulsar. Offers self-managed and fully managed solutions. StreamNative Cloud for Kafka is still in a very early stage of maturity, and it is unclear whether it will ever gain significant traction.
  • Ververica: Stream processing as a service powered by Apache Flink on all major cloud providers. Huge potential if it is combined with existing Kafka infrastructures in enterprises. Main Opportunity: Combination with another cloud vendor that only does Kafka or other messaging/streaming without stream processing capabilities.
  • Redpanda Cloud: Redpanda provides its data streaming as a serverless offering. Not much information is available on the website about this part of the vendor’s product portfolio.
  • Amazon MSK Serverless: Different functionalities and limitations than Amazon MSK. MSK Serverless still does not get much traction because of its limitations. Both MSK offerings exclude Kafka support in their SLAs (please read the terms and conditions).
  • Google Cloud Dataflow: Fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. Mature product for a specific problem. Only available on GCP.
  • Azure Event Hubs: A mature, fully managed cloud service. The service does one thing, and that is done very well: Data ingestion via the Kafka protocol. Hence, it is not a complete streaming platform but is more comparable to Amazon Kinesis or Google Cloud Pub/Sub. The limited compatibility with the Kafka protocol and the high cost of the service for higher throughput are the two major blockers that come up regularly in conversations.
  • Confluent Cloud: A complete data streaming platform including Kafka and Flink as a fully managed offering. In addition to deep integration, the platform provides connectivity, security, data governance, self-service portal, disaster recovery tooling, and much more to be the most complete DSP on all major public clouds.

Potential for the Data Streaming Landscape 2026

Data streaming is a journey. So is the development of event streaming platforms and cloud services. Several established software and cloud vendors might get more traction with their data streaming offerings. And some startups might grow significantly. The following shows a few technologies that might evolve and see growing adoption in 2025:

  • New startups around the Kafka protocol emerge. The list of new frameworks and cloud services is growing every quarter. A few names I saw in some social network posts (but not much beyond that in the real world): AutoMQ, S2, Astradot, Bufstream, Responsive, tansu, Tektite, Upstash. While some focus on the messaging/streaming part, others focus on a particular piece such as building database capabilities.
  • Streaming databases like Materialize or RisingWave might become a new software category. My feeling: Very early stage of the hype cycle. We will see in 2025 if and where this technology gets more broadly adopted and what the use cases are. It is hard to answer how these will compete with emerging real-time analytics databases like Apache Druid, Apache Pinot, ClickHouse, Timeplus, Tinybird, et al. I know there are differences, but the broader community and companies need to a) understand these differences and b) find business problems for them.
  • Stream Processing SaaS startups emerge: Quix and Bytewax provide stream processing with Python. Quix now also offers a hosted offering based on Kafka Streams; as does Responsive. DeltaStream provides Apache Flink as SaaS. And many more startups emerge these days. Let’s see which of these gets traction in the market with an innovative product and business model.
  • Traditional data management vendors like MongoDB or Snowflake try to get deeper into the data streaming business. I am still a fan of separation of concerns; so I think these should keep their sweet spot and (only) provide streaming ingestion and CDC as use cases, but not (try to) compete with data streaming vendors.

Fun fact: The business model of almost all emerging startups is fully managed cloud services, not selling licenses for on-premise deployments. Many are based on open-source or open-core, and others only provide a proprietary implementation.

Although they are not aiming to be full data streaming platforms (and thus are not part of the platform landscape), other complementary tools are gaining momentum in the data streaming ecosystem. For instance, Conduktor is developing a proxy for Kafka clusters, and Lenses, though relatively quiet since its acquisition by Celonis, has recently introduced updates to its user-friendly management and developer tools. These tools address gaps that some data streaming platforms leave unfilled.

Data Streaming: A Journey, Not a Sprint

Data streaming isn’t a sprint—it’s a journey! Adopting event-driven architectures with technologies like Apache Kafka or Apache Flink requires rethinking how applications are designed, developed, deployed, and monitored. Modern data strategies involve legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud environments.

The data streaming landscape in 2025 highlights the emergence of a new software category, though it’s still in its early stages. Building such a category takes time. In discussions with customers, partners, and the community, a common sentiment emerges: “We understand the value but are just starting with the first data streaming pipelines and have a long-term plan to implement advanced stream processing and build a strategic data streaming organization.”

The Forrester Wave: Streaming Data Platforms, Q4 2023, and other industry reports from Gartner and IDG signal that this category is progressing through the hype cycle.

Last but not least, check out my Top Data Streaming Trends for 2025 to understand how the data streaming landscape fits into emerging trends:

  1. Democratization of Kafka: Apache Kafka has transitioned from a specialized tool to a foundational platform in modern data infrastructure.
  2. Kafka Protocol as the Standard: Vendors adopt the Kafka protocol over the framework, enabling flexibility with compatibility and performance trade-offs.
  3. BYOC Deployment Model: Bring Your Own Cloud gains traction for balancing security, compliance, and managed services.
  4. Flink Becomes the Standard for Stream Processing: Apache Flink rises as the premier framework for stream processing, building integration pipelines and business applications.
  5. Data Streaming for Real-Time Predictive AI and GenAI: Real-time model inference drives predictive and generative AI applications.
  6. Data Streaming Organizations: Companies unify real-time data strategies to standardize tools, governance, and collaboration.

Which are your favorite open-source frameworks or cloud services for data streaming? What are your most relevant and exciting trends around Apache Kafka and Flink in 2025 to set data in motion? What does your enterprise landscape for data streaming look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. Make sure to download my free ebook about data streaming use cases and industry examples.

Top Trends for Data Streaming with Apache Kafka and Flink in 2025

The evolution of data streaming has transformed modern business infrastructure, establishing real-time data processing as a critical asset across industries. At the forefront of this transformation, Apache Kafka and Apache Flink stand out as leading open-source frameworks that serve as the foundation for cloud services, enabling organizations to unlock the potential of real-time data. Over recent years, trends have shifted from batch-based data processing to real-time analytics, scalable cloud-native architectures, and improved data governance powered by these technologies. Looking ahead to 2025, the data streaming ecosystem is set to undergo even greater changes. Here are the top trends shaping the future of data streaming for businesses.

Data Streaming Trends for 2025 - Leading with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch.

The Top Data Streaming Trends

Some followers might notice that this became a series with articles about the top 5 data streaming trends for 2021, the top 5 for 2022, the top 5 for 2023, and the top 5 for 2024. Trends change over time, but the huge value of having a scalable real-time infrastructure as the central data hub stays. Data streaming with Apache Kafka is a journey and evolution to set data in motion.

I recently explored the past, present, and future of data streaming tools and strategies from the past decades. Data streaming is becoming more and more mature and standardized, but also innovative.

Let’s now look at the top trends coming up more regularly in conversations with customers, prospects, and the broader data streaming community across the globe:

  1. The Democratization of Kafka: Apache Kafka has transitioned from a specialized tool to a key pillar in modern data infrastructure.
  2. Kafka Protocol as the Standard: Vendors adopt the Kafka wire protocol, enabling flexibility with compatibility and performance trade-offs.
  3. BYOC Deployment Model: Bring Your Own Cloud gains traction for balancing security, compliance, and managed services.
  4. Flink Becomes the Standard for Stream Processing: Apache Flink rises as the premier framework for stream processing, building integration pipelines and business applications.
  5. Data Streaming for Real-Time Predictive AI and GenAI: Real-time model inference drives predictive and generative AI applications.
  6. Data Streaming Organizations: Companies unify real-time data strategies to standardize processes, tools, governance, and collaboration.

The following sections describe each trend in more detail. The trends are relevant for many scenarios; no matter if you use open-source frameworks like Apache Kafka and Flink, a commercial platform, or a fully managed cloud service like Confluent Cloud.

Trend 1: The Democratization of Kafka

In the last decade, Apache Kafka has become the standard for data streaming, evolving from a specialized tool to an essential utility in the modern tech stack. With over 150,000 organizations using Kafka today, it has become the de facto choice for data streaming. Yet, with a market crowded by offerings from AWS, Microsoft Azure, Google GCP, IBM, Oracle, Confluent, and various startups, companies can no longer rely solely on Kafka for differentiation. The vast array of Kafka-compatible solutions means that businesses face more choices than ever, but also new challenges in selecting the solution that balances cost, performance, and features.

The Challenge: Finding the Right Fit in a Crowded Kafka Market

For end users, choosing the right Kafka solution is becoming increasingly complex. Basic Kafka offerings cover standard streaming needs but may lack advanced features, such as enhanced security, data governance, or integration and processing capabilities, that are essential for specific industries. In such a diverse market, businesses must navigate trade-offs, considering whether a low-cost option meets their needs or whether investing in a premium solution with added capabilities provides better long-term value.

The Solution: Prioritizing Features for Your Strategic Needs

As Kafka solutions evolve, users must look beyond price and consider features that offer real strategic value. For example, companies handling sensitive customer data might benefit from Kafka products with top-tier security features. Those focused on analytics may look for solutions with strong integrations into data platforms and low cost for high throughput. By carefully selecting a Kafka product that aligns with industry-specific requirements, businesses can leverage the full potential of Kafka while optimizing for cost and capabilities.

For instance, look at Confluent’s various cluster types for different requirements and use cases in the cloud:

Confluent Cloud Cluster Types for Different Requirements and Use Cases
Source: Confluent

As an example, Confluent introduced Freight Clusters to provide an offering with up to 90 percent lower cost. The major trade-off is higher latency, but this is perfect for high-volume log analytics at GB/sec scale.

The Business Value: Affordable and Customized Data Streaming

Kafka’s commoditization means more affordable, customizable options for businesses of all sizes. This competition reduces costs, making high-performance data streaming more accessible, even to smaller organizations. By choosing a tailored solution, businesses can enhance customer satisfaction, speed up decision-making, and innovate faster in a competitive landscape.

Trend 2: The Kafka Protocol, not Apache Kafka, is the New Standard for Data Streaming

With the rise of cloud-native architectures, many vendors have shifted to supporting the Kafka protocol rather than the open-source Kafka framework itself, allowing for greater flexibility and cloud optimization. This change enables businesses to choose Kafka-compatible tools that better align with specific needs, moving away from a one-size-fits-all approach.

Confluent introduced its KORA engine, i.e., Kafka re-architected to be cloud-native. A deep technical whitepaper goes into the details (this is not a marketing document but really for software engineers).

Confluent KORA - Apache Kafka Re-Architected to be Cloud Native
Source: Confluent

Other players followed Confluent and introduced their own cloud-native “data streaming engines”. For instance, StreamNative has URSA powered by Apache Pulsar, Redpanda talks about its R1 Engine implementing the Kafka protocol, and Ververica recently announced VERA for its Flink-based platform.

Some vendors rely only on the Kafka protocol with a proprietary engine from the beginning. For instance, Azure Event Hubs or WarpStream. Amazon MSK also goes in this direction by adding proprietary features like Tiered Storage or even introducing completely new product options such as Amazon MSK Express brokers.

The Challenge: Limited Compatibility Across Kafka Solutions

When vendors implement the Kafka protocol instead of the entire Kafka framework, it can lead to compatibility issues, especially if the solution doesn’t fully support Kafka APIs. For end users, this can complicate integration, particularly for advanced features like Exactly-Once Semantics, the Transaction API, Compacted Topics, Kafka Connect, or Kafka Streams, which may not be supported or working as expected.

The Solution: Evaluating Kafka Protocol Solutions Critically

To fully leverage the flexibility of Kafka protocol-based solutions, a thorough evaluation is essential. Businesses should carefully assess the capabilities and compatibility of each option, ensuring it meets their specific needs. Key considerations include verifying the support of required features and APIs (such as the Transaction API, Kafka Streams, or Connect).

It is also crucial to evaluate the level of product support provided, including 24/7 availability, uptime SLAs, and compatibility with the latest versions of open-source Apache Kafka. This detailed evaluation ensures that the chosen solution integrates seamlessly into existing architectures and delivers the reliability and performance required for modern data streaming applications.
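
One practical way to start such an evaluation is to probe the advanced Kafka APIs you depend on against the candidate endpoint. The sketch below, using the confluent-kafka Python client, checks whether a Kafka-compatible service accepts a transactional producer; the bootstrap address and topic are placeholders, and a real evaluation would cover far more (Kafka Connect, Kafka Streams, compacted topics, quotas, and so on).

```python
from confluent_kafka import KafkaException, Producer

conf = {
    "bootstrap.servers": "kafka-compatible-endpoint:9092",  # placeholder
    "transactional.id": "compat-check-1",
    "enable.idempotence": True,
}

producer = Producer(conf)

try:
    # init_transactions() registers the transactional ID with the cluster.
    # Services that implement only a subset of the Kafka protocol may
    # reject this step, return an unsupported-version error, or time out.
    producer.init_transactions()
    producer.begin_transaction()
    producer.produce("compat-check", value=b"probe")  # placeholder topic
    producer.commit_transaction()
    print("Transactional produce succeeded: the Transaction API appears to be supported.")
except KafkaException as err:
    print(f"Transactional produce failed: {err}")
```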

The Business Value: Expanded Options and Cost-Efficiency

Kafka protocol-based solutions offer greater flexibility, allowing businesses to select Kafka-compatible services optimized for their specific environments. This flexibility opens doors for innovation, enabling companies to experiment with new tools without vendor lock-in.

For instance, consider innovations such as a “direct write to S3 object store” architecture, as seen in WarpStream, Confluent Freight Clusters, and other data streaming startups that also build proprietary engines around the Kafka protocol. The result is a more cost-effective approach to data streaming, though it may come with trade-offs, such as increased latency. Check out this video about the evolution of Kafka Storage to learn more.

Trend 3: BYOC (Bring Your Own Cloud) as a New Deployment Model for Security and Compliance

As data security and compliance concerns grow, the Bring Your Own Cloud (BYOC) model is gaining traction as a new way to deploy Apache Kafka. BYOC allows businesses to host Kafka in their own Virtual Private Cloud (VPC) while the vendor manages the control plane to handle complex orchestration tasks like partitioning, replication, and failover.

This BYOC approach offers organizations enhanced control over their data while retaining the operational benefits of a managed service. BYOC provides a middle ground between self-managed and fully managed solutions, addressing specific regulatory and security needs without sacrificing scalability or flexibility.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

The Challenge: Balancing Security and Ease of Use

Ensuring data sovereignty and compliance is non-negotiable for organizations in highly regulated industries. However, traditional fully managed cloud solutions can pose risks due to vendor access to sensitive data and infrastructure. Many BYOC solutions claim to address these issues but fall short when it comes to minimizing external access to customer environments. Common challenges include:

  • Vendor Access to VPCs: Many BYOC offerings require vendors to have access to customer VPCs for deployment, cluster management, and troubleshooting. This introduces potential security vulnerabilities.
  • IAM Roles and Elevated Privileges: Cross-account Identity and Access Management (IAM) roles are often necessary for managing BYOC clusters, which can expose sensitive systems to unnecessary risks.
  • VPC Peering Complexity: Traditional BYOC solutions often rely on VPC peering, a complex and expensive setup that increases operational overhead and opens additional points of failure.

These limitations create significant challenges for security-conscious organizations, as they undermine the core promise of BYOC: control over the data environment.

The Solution: Gaining Control with a “Zero Access” BYOC Model

WarpStream redefines the BYOC model with a “zero access” architecture, addressing the challenges of traditional BYOC solutions. Unlike other BYOC offerings using the Kafka protocol, WarpStream ensures that no data leaves the customer’s environment, delivering a truly secure-by-default platform. Hence this section discusses specifically WarpStream, not BYOC Kafka offerings in general.

WarpStream BYOC Zero Access Kafka Architecture with Control and Data Plane
Source: WarpStream

Key features of WarpStream include:

  • Zero Access to Customer VPCs: WarpStream eliminates vendor access by deploying stateless agents within the customer’s environment, handling compute operations locally without requiring cross-account IAM roles or elevated privileges to reduce security risks.
  • Data/Metadata Separation: Raw data remains entirely within the customer’s network for full sovereignty, while only metadata is sent to WarpStream’s control plane for centralized management, ensuring data security and compliance.
  • Simplified Infrastructure: WarpStream avoids complex setups like VPC peering and cross-IAM roles, minimizing operational overhead while maintaining high performance.

Comparison with Other BYOC Solutions using the Kafka protocol:

Unlike most other BYOC offerings (e.g., Redpanda), WarpStream doesn’t require direct VPC access or elevated permissions, avoiding risks like data exposure or remote troubleshooting vulnerabilities. Its “zero access” architecture ensures unparalleled security and compliance.

The Business Value: Secure, Compliant, and Scalable Data Streaming

WarpStream’s innovative approach to BYOC delivers exceptional business value by addressing security and compliance concerns while maintaining operational simplicity and scalability:

  • Uncompromised Security: The zero-access architecture ensures that raw data remains entirely within the customer’s environment, meeting the strictest security and compliance requirements for regulated industries like finance, healthcare, and government.
  • Operational Efficiency: By eliminating the need for VPC peering, cross-IAM roles, and remote vendor access, WarpStream simplifies BYOC deployments and reduces operational complexity.
  • Cost Optimization: WarpStream’s reliance on cloud-native technologies like object storage reduces infrastructure costs compared to traditional disk-based approaches. Stateless agents also enable efficient scaling without unnecessary overhead.
  • Data Sovereignty: The data/metadata split guarantees that data never leaves the customer’s environment, ensuring compliance with regulations such as GDPR and HIPAA.
  • Peace of Mind for Security Teams: With no vendor access to the VPC or object storage, WarpStream’s zero-access model eliminates concerns about external breaches or elevated privileges, making it easier to gain buy-in from security and infrastructure teams.

BYOC Strikes the Balance Between Control and Managed Services

BYOC offers businesses the ability to strike a balance between control and managed services, but not all BYOC solutions are created equal. WarpStream’s “zero access” architecture sets a new standard, addressing the critical challenges of security, compliance, and operational simplicity. By ensuring that raw data never leaves the customer’s environment and eliminating the need for vendor access to VPCs, WarpStream delivers a BYOC model that meets the highest standards of security and performance. For organizations seeking a secure, scalable, and compliant approach to data streaming, WarpStream represents the future of BYOC data streaming.

But just to be clear: If a data streaming project goes to the cloud, fully managed Kafka (and Flink) should always be the first option as it is much easier to manage and operate to focus on fast time-to-market and business innovation. Choose BYOC only if fully managed does not work for you because of security requirements.

Trend 4: Flink Becomes the Standard for Stream Processing

Apache Flink has emerged as the premier choice for organizations seeking a robust and versatile framework for continuous stream processing. Its ability to handle complex data pipelines with high throughput, low latency, and advanced stateful operations has solidified its position as the de facto standard for stream processing. Flink’s support for Java, Python, and SQL further enhances its appeal, enabling developers to build powerful data-driven applications using familiar tools.

Apache Flink Adoption Curve Compared to Kafka

As Flink adoption grows, it increasingly complements Apache Kafka as part of the modern data streaming ecosystem, while the Kafka Streams (Java-only) library remains relevant for lightweight, application-embedded use cases.

The Challenge: Handling Complex, High-Throughput Data Streams

Modern businesses increasingly rely on real-time data for both operational and analytical needs, spanning mission-critical applications like fraud detection, predictive maintenance, and personalized customer experiences, as well as Streaming ETL for integrating and transforming data. These diverse use cases demand robust stream processing capabilities.

Apache Flink’s versatility makes it uniquely positioned to meet the demands of both streaming ETL for data integration and building real-time business applications. Flink provides:

  • Low Latency: Near-instantaneous processing is crucial for enabling real-time decision-making in business applications, timely updates in analytical systems, and supporting transactional workloads where rapid processing and immediate consistency are essential for ensuring smooth operations and seamless user experiences.
  • High Throughput and Scalability: The ability to process millions of events per second, whether for aggregating operational metrics or moving massive volumes of data into data lakes or warehouses, without bottlenecks.
  • Stateful Processing: Support for maintaining and querying the state of data streams, essential for performing complex operations like aggregations, joins, and pattern detection in business applications, as well as data transformations and enrichment in ETL pipelines.
  • Multiple Programming Languages: Support for Java, Python, and SQL ensures accessibility for a wide range of developers, enabling efficient implementation across various use cases.

The rise of cloud services has further propelled Flink’s adoption, with offerings from major providers like Confluent, Amazon, IBM, and emerging startups. These cloud-native solutions simplify Flink deployments, making it easier for organizations to operationalize real-time analytics.
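
As a small illustration of streaming ETL with Flink’s SQL and Python APIs, the following PyFlink sketch reads orders from a Kafka topic, maintains a stateful windowed aggregate per customer, and writes the results to another topic. Table names, fields, topics, and the broker address are placeholders, and it assumes the Flink Kafka SQL connector and JSON format are available to the runtime.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming-mode Table API environment.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: orders arriving on a Kafka topic (placeholder schema and endpoint).
t_env.execute_sql("""
    CREATE TABLE orders (
        customer_id STRING,
        amount      DOUBLE,
        ts          TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'broker-1:9092',
        'properties.group.id' = 'flink-etl-demo',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Sink: aggregated revenue per customer and minute, written back to Kafka.
t_env.execute_sql("""
    CREATE TABLE customer_revenue (
        customer_id STRING,
        window_end  TIMESTAMP(3),
        revenue     DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'customer-revenue',
        'properties.bootstrap.servers' = 'broker-1:9092',
        'format' = 'json'
    )
""")

# Stateful, continuously running windowed aggregation (append-only output).
t_env.execute_sql("""
    INSERT INTO customer_revenue
    SELECT customer_id, window_end, SUM(amount) AS revenue
    FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
    GROUP BY customer_id, window_start, window_end
""").wait()  # Blocks while the streaming job keeps running.
```

The same logic could be written with the Java DataStream API or submitted to a fully managed Flink service; the SQL stays the same, which is a big part of Flink’s appeal for both data engineers and analysts.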

While Apache Flink has emerged as the de facto standard for stream processing, other frameworks like Apache Spark and its streaming module, Structured Streaming, continue to compete in this space. However, Spark Streaming has notable limitations that make it less suitable for many of the complex, high-throughput workloads modern enterprises demand.

The Challenges with Spark Streaming

Apache Spark, originally designed as a batch processing framework, introduced Spark Streaming and later Structured Streaming to address real-time processing needs. However, its batch-oriented roots present inherent challenges:

  • Micro-Batch Architecture: Spark Structured Streaming relies on micro-batches, where data is divided into small time intervals for processing. This approach, while effective for certain workloads, introduces higher latency compared to Flink’s true streaming architecture. Applications requiring millisecond-level processing or transactional workloads may find Spark unsuitable.
  • Limited Stateful Processing: While Spark supports stateful operations, its reliance on micro-batches adds complexity and latency. This makes Spark Streaming less efficient for use cases that demand continuous state updates, such as fraud detection or complex event processing (CEP).
  • Fault Tolerance Complexity: Spark’s recovery model is rooted in its lineage-based approach to fault tolerance, which can be less efficient for long-running streaming applications. Flink, by contrast, uses checkpointing and savepoints to handle failures more gracefully, ensuring state consistency with minimal overhead.
  • Performance Overhead: Spark’s general-purpose design often results in higher resource consumption compared to Flink, which is purpose-built for stream processing. This can lead to increased infrastructure costs for high-throughput workloads.
  • Scalability Challenges for Stateful Workloads: While Spark scales effectively for batch jobs, its scalability for complex stateful stream processing is more limited, as distributed state management in micro-batches can become a bottleneck under heavy load.

By addressing these limitations, Apache Flink provides a more versatile and efficient solution than Apache Spark for organizations looking to handle complex, real-time data processing at scale.

Flink’s architecture is purpose-built for streaming, offering native support for stateful processing, low-latency event handling, and fault-tolerant operation, making it the preferred choice for modern real-time applications. But to be clear: Apache Spark, including Spark Streaming, has its place in data lakes and lakehouses to process analytical workloads.

Flink’s technical capabilities bring tangible business benefits, making it an essential tool for modern enterprises. By providing real-time insights, Flink enables businesses to respond to events as they occur, such as detecting and mitigating fraudulent transactions instantly, reducing losses, and enhancing customer trust.

Flink’s support for both transactional workloads (e.g., fraud detection or payment processing) and analytical workloads (e.g., real-time reporting or trend analysis) ensures versatility across a range of critical business functions. Scalability and resource optimization keep infrastructure costs manageable, even for demanding, high-throughput workloads, while features like checkpointing streamline failure recovery and upgrades, minimizing operational overhead.

Flink stands out with its dual focus on streaming ETL for data integration and building business applications powered by real-time analytics. Its rich APIs for Java, Python, and SQL make it easy for developers to implement complex workflows, accelerating time-to-market for new applications.

Data streaming has powered AI/ML infrastructure for many years because of its ability to scale to high volumes, process data in real-time, and integrate with transactional (payments, orders, ERP, etc.) and analytical (data warehouse, data lake, lakehouse) systems. My first article about Apache Kafka and Machine Learning was published in 2017: “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka“.

As AI continues to evolve, real-time model inference powered by data streaming is opening up new possibilities for predictive and generative AI applications. By integrating model inference with stream processors such as Apache Flink, businesses can perform on-demand predictions for fraud detection, customer personalization, and more.

The Challenge: Provide Context for AI Applications In Real-Time

Traditional batch-based AI inference is too slow for many applications, delaying responses and leading to missed opportunities or wrong business decisions. To fully harness AI in real-time, businesses need to embed model inference directly within streaming pipelines.

Generative AI (GenAI) demands new design patterns like Retrieval Augmented Generation (RAG) to ensure accuracy, relevance, and reliability in its outputs. Without data streaming, RAG struggles to provide large language models (LLMs) with the real-time, domain-specific context they need, leading to outdated or hallucinated responses. Context is essential to ensure that LLMs deliver accurate and trustworthy outputs by grounding them in up-to-date and precise information.
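
As a hedged illustration (not part of the original post), the following Python sketch keeps a RAG vector index fresh by consuming domain events from Kafka as they happen. The topic, broker, embed() function, and VectorStore class are hypothetical placeholders for your own embedding model and vector database:

  # Minimal sketch: stream fresh domain events into a vector index for RAG.
  import json
  from confluent_kafka import Consumer

  def embed(text):                     # placeholder: call your embedding model here
      return [0.0] * 768

  class VectorStore:                   # placeholder: swap in your vector database client
      def upsert(self, doc_id, vector, metadata):
          print(f"upserted {doc_id}")

  vector_store = VectorStore()

  consumer = Consumer({
      "bootstrap.servers": "localhost:9092",   # assumption
      "group.id": "rag-context-indexer",
      "auto.offset.reset": "earliest",
  })
  consumer.subscribe(["customer-events"])      # topic name is an assumption

  while True:
      msg = consumer.poll(1.0)
      if msg is None or msg.error():
          continue
      event = json.loads(msg.value())
      # Index the event immediately so the next LLM prompt can retrieve up-to-date context.
      vector_store.upsert(event["id"], embed(event["text"]), event)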

GenAI Remote Model Inference with Stream Processing using Apache Kafka and Flink

Apache Flink enables real-time model inference by connecting data streams to external AI models through APIs. This setup allows companies to use centralized model servers for inference, providing flexibility and scalability while keeping data streams fast and responsive. By processing data in real-time, Flink ensures that generative AI models operate with the most current and relevant context, reducing errors and hallucinations.

Flink’s real-time processing capabilities also support advanced machine learning workflows. This enables use cases like predictive analytics, anomaly detection, and generative AI applications that require instantaneous decision-making. The ability to join live data streams with historical or external datasets enriches the context for model inference, enhancing both accuracy and relevance.

Additionally, Flink facilitates feature extraction and data preprocessing directly within the stream to ensure that the inputs to AI models are optimized for performance. This seamless integration with model servers and vector databases allows organizations to scale their AI systems effectively, leveraging real-time insights to drive innovation and deliver immediate business value.
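
A minimal PyFlink sketch (not part of the original post) of this remote model inference pattern is shown below. The model endpoint URL and payload format are hypothetical, and a production job would read from Kafka and use asynchronous I/O instead of a blocking HTTP call:

  # Minimal sketch: score each event in a Flink pipeline via a remote model server.
  import json
  import urllib.request
  from pyflink.common import Types
  from pyflink.datastream import StreamExecutionEnvironment

  def score(event):
      # Synchronous HTTP call to a hypothetical model-serving endpoint.
      req = urllib.request.Request(
          "http://model-server:8080/v1/predict",          # assumption
          data=json.dumps(event).encode("utf-8"),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          return {**event, "score": json.loads(resp.read())["score"]}

  env = StreamExecutionEnvironment.get_execution_environment()
  # Stand-in for a Kafka source to keep the sketch self-contained.
  events = env.from_collection(
      [{"tx_id": "tx-1001", "amount": 42.0}],
      type_info=Types.PICKLED_BYTE_ARRAY(),
  )
  events.map(score).print()
  env.execute("remote-model-inference-sketch")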

The Business Value: Immediate, Actionable AI Insights

Real-time AI model inference with Flink enables businesses to provide personalized customer experiences, detect fraud as it happens, and perform predictive maintenance with minimal latency. This real-time responsiveness empowers companies to make AI-driven decisions in milliseconds, improving customer satisfaction and operational efficiency.

By integrating Flink with event-driven architectures like Apache Kafka, businesses can ensure that AI systems are always fed with up-to-date and trustworthy data, further enhancing the reliability of predictions.

The integration of Flink and data streaming offers a clear path to measurable business impact. By aligning real-time AI capabilities with organizational goals, businesses can drive innovation while reducing operational costs, such as automating customer support to lower reliance on service agents.

Furthermore, Flink’s ability to process and enrich data streams at scale supports strategic initiatives like hyper-personalized marketing or optimizing supply chains in real-time. These benefits directly translate into enhanced competitive positioning, faster time-to-market for AI-driven solutions, and the ability to make more confident, data-driven decisions at the speed of business.

Trend 6: Becoming a Data Streaming Organization

To harness the full potential of data streaming, companies are shifting toward structured, enterprise-wide data streaming strategies. Moving from a tactical, ad-hoc approach to a cohesive top-down strategy enables businesses to align data streaming with organizational goals, driving both efficiency and innovation.

The Challenge: Fragmented Data Streaming Efforts

Many companies face challenges due to disjointed streaming efforts, leading to data silos and inconsistencies that prevent them from reaping the full benefits of real-time data processing. At Confluent, we call this the enterprise adoption barrier:

Data Streaming Maturity Model - The Enterprise Adoption Barrier
Source: Confluent

This fragmentation results in inefficiencies, duplication of efforts, and a lack of standardized processes. Without a unified approach, organizations struggle with:

  • Data Silos: Limited data sharing across teams creates bottlenecks for broader use cases.
  • Inconsistent Standards: Different teams often use varying schemas, patterns, and practices, leading to integration challenges and data quality issues.
  • Governance Gaps: A lack of defined roles, responsibilities, and policies results in limited oversight, increasing the risk of data misuse and compliance violations.

These challenges prevent organizations from scaling their data streaming capabilities and realizing the full value of their real-time data investments.

The Solution: Building an Integrated Data Streaming Organization

By adopting a comprehensive data streaming strategy, businesses can create a unified data platform with standardized tools and practices. A dedicated streaming platform team, often called the Center of Excellence (CoE), ensures consistent operations. An internal developer platform provides governed, self-serve access to streaming resources.

Key elements of a data streaming organization include:

  • Unified Platform: Move from disparate tools and approaches to a single, standardized data streaming platform. This includes consistent policies for cluster management, multi-tenancy, and topic naming, ensuring a reliable foundation for data streaming initiatives.
  • Self-Service: Provide APIs, UIs, and other interfaces for teams to onboard, create, and manage data streaming resources. Self-service capabilities ensure governed access to topics, schemas, and streaming capabilities, empowering developers while maintaining compliance and security.
  • Data as a Product: Adopt a product-oriented mindset where data streams are treated as reusable assets. This includes formalizing data products with clear contracts, ownership, and metadata, making them discoverable and consumable across the organization.
  • Alignment: Define clear roles and responsibilities, from platform operators and developers to data product owners. Establishing an enterprise-wide data streaming function ensures coordination and alignment across teams.
  • Governance: Implement automated guardrails for compliance, quality, and access control. This ensures that data streaming efforts remain secure, trustworthy, and scalable.
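
As a small, hedged example of what a data contract from the list above can look like in practice (not part of the original post), the following Python sketch registers an Avro schema for an order data product in Schema Registry. The registry URL, subject name, and fields are illustrative assumptions:

  # Minimal sketch: publish a data contract (Avro schema) for a data product.
  from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

  order_contract = """
  {
    "type": "record",
    "name": "Order",
    "namespace": "com.example.dataproducts",
    "fields": [
      {"name": "order_id", "type": "string"},
      {"name": "customer_id", "type": "string"},
      {"name": "amount", "type": "double"},
      {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
    ]
  }
  """

  client = SchemaRegistryClient({"url": "http://localhost:8081"})   # assumption
  schema_id = client.register_schema("orders-value", Schema(order_contract, schema_type="AVRO"))
  print(f"Registered data contract for subject 'orders-value' with id {schema_id}")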

The Business Value: Consistent, Scalable, and Agile Data Streaming

Becoming a Data Streaming Organization unlocks significant value by turning data streaming into a strategic asset. The benefits include:

  • Enhanced Agility: A unified platform reduces time-to-market for new data-driven products and services, allowing businesses to respond quickly to market trends and customer demands.
  • Operational Efficiency: Streamlined processes and self-service capabilities reduce the overhead of managing multiple tools and teams, improving productivity and cost-effectiveness.
  • Scalable Innovation: Standardized patterns and reusable data products enable the rapid development of new use cases, fostering a culture of innovation across the enterprise.
  • Improved Governance: Clear policies and automated controls ensure data quality, security, and compliance, building trust with customers and stakeholders.
  • Cross-Functional Collaboration: By breaking down silos, organizations can leverage data streams across teams, creating a network effect that accelerates value creation.

To successfully adopt a Data Streaming Organization model, companies must combine technical capabilities with cultural and structural change. This involves not just deploying tools but establishing shared goals, metrics, and education to bring teams together around the value of real-time data. As organizations embrace data streaming as a strategic function, they position themselves to thrive in a data-driven world.

Embracing the Future of Data Streaming

As data streaming continues to mature, it has become the backbone of modern digital enterprises. It enables real-time decision-making, operational efficiency, and transformative AI applications. Trends such as the commoditization of Kafka, the adoption of the Kafka protocol, BYOC deployment models, and the rise of Flink as the standard for stream processing demonstrate the rapid evolution and growing importance of this technology. These innovations not only streamline infrastructure but also empower organizations to harness real-time insights, foster agility, and remain competitive in the ever-changing digital landscape.

These advancements in data streaming present a unique opportunity to redefine data strategy. Leveraging data streaming as a central pillar of IT architecture allows businesses to break down silos, integrate machine learning into critical workflows, and deliver unparalleled customer experiences. The convergence of data streaming with generative AI, particularly through frameworks like Flink, underscores the importance of embracing a real-time-first approach to data-driven innovation.

Looking ahead, organizations that invest in scalable, secure, and strategic data streaming infrastructures will be positioned to lead in 2025 and beyond. By adopting these trends, enterprises can unlock the full potential of their data, drive business transformation, and solidify their place as leaders in the digital era. The journey to set data in motion is not just about technology—it’s about building the foundation for a future where real-time intelligence powers every decision and every experience.

What trends do you see for data streaming? Which ones are your favorites? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Top Trends for Data Streaming with Apache Kafka and Flink in 2025 appeared first on Kai Waehner.

]]>
Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud) https://www.kai-waehner.de/blog/2024/09/12/deployment-options-for-apache-kafka-self-managed-fully-managed-serverless-and-byoc-bring-your-own-cloud/ Thu, 12 Sep 2024 13:43:31 +0000 https://www.kai-waehner.de/?p=6808 BYOC (Bring Your Own Cloud) is an emerging deployment model for organizations looking to maintain greater control over their cloud environments. Unlike traditional SaaS models, BYOC allows businesses to host applications within their own VPCs to provide enhanced data privacy, security, and compliance. This approach leverages existing cloud infrastructure. It offers more flexibility for custom configurations, particularly for companies with stringent security needs. In the data streaming sector around Apache Kafka, BYOC is changing how platforms are deployed. Organizations get more control and adaptability for various use cases. But it is clearly NOT the right choice for everyone!

The post Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud) appeared first on Kai Waehner.

]]>
BYOC (Bring Your Own Cloud) is an emerging deployment model for organizations looking to maintain greater control over their cloud environments. Unlike traditional SaaS models, BYOC allows businesses to host applications within their own VPCs to provide enhanced data privacy, security, and compliance. This approach leverages existing cloud infrastructure. It offers more flexibility for custom configurations, particularly for companies with stringent security needs. In the data streaming sector around Apache Kafka, BYOC is changing how platforms are deployed. Organizations get more control and adaptability for various use cases. But it is clearly NOT the right choice for everyone!

Apache Kafka Deployment Options - Serverless vs Self-Managed vs BYOC Bring Your Own Cloud

BYOC (Bring Your Own Cloud) – A New Deployment Model for Cloud Infrastructure

BYOC (Bring Your Own Cloud) is a deployment model where organizations choose their preferred cloud infrastructure to host applications or services, rather than using a serverless / fully managed cloud solution selected by a software vendor, typically known as Software as a Service (SaaS). This model gives businesses the flexibility to leverage their existing cloud services (like AWS, Google Cloud, Microsoft Azure, or Alibaba) while integrating third-party applications that are compatible with multiple cloud environments.

BYOC helps companies maintain control over their cloud infrastructure, optimize costs, and ensure compliance with security standards. BYOC is typically implemented within an organization’s own cloud VPC. Unlike SaaS models, BYOC offers enhanced privacy and compliance by maintaining control over network architecture and data management.

However, BYOC also has some serious drawbacks! The main challenge is scaling a fleet of co-managed clusters running in customer environments with all the reliability expectations of a cloud service. Confluent has shied away from offering a BYOC deployment model for Apache Kafka based on Confluent Platform because doing BYOC at scale requires a different architecture. WarpStream has built this architecture, with a BYOC-native platform that was designed from the ground up to avoid the pitfalls of traditional BYOC. 

The Data Streaming Landscape

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category. The data streaming landscape shows that most vendors use Kafka or implement its protocol because Apache Kafka has become the de facto standard for data streaming.

New software companies have emerged in this category in the last few years. And several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Most software vendors use Kafka for their data streaming platforms. However, there is more than Kafka. Some vendors only use the Kafka protocol (Azure Event Hubs) or utterly different APIs (like Amazon Kinesis).

The following Data Streaming Landscape 2024 summarizes the current status of relevant products and cloud services.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

The Data Streaming Landscape evolves. Last year, I added WarpStream as a new entrant into the market. WarpStream uses the Kafka protocol and provides a BYOC offering for Kafka in the cloud. In my next update of the data streaming landscape, I need to do yet another update: WarpStream is now part of Confluent. There are also many other new entrants. Stay tuned for a new “Data Streaming Landscape 2025” in a few weeks (subscribe to my newsletter to stay up-to-date with all the things data streaming).

Confluent Acquisition of WarpStream

Confluent had two product offerings:

  • Confluent Platform: A self-managed data streaming platform powered by Kafka, Flink, and much more that you can deploy everywhere (on-premise data center, public cloud VPC, edge like factory or retail store, and even stretched across multiple regions or clouds).
  • Confluent Cloud: A fully managed data streaming platform powered by Kafka, Flink, and much more that you can leverage as a serverless offering in all major public cloud providers (Amazon AWS, Microsoft Azure, Google Cloud Platform).

Why did Confluent acquire WarpStream? Because many customers requested a third deployment option: BYOC for Apache Kafka.

As Jay Kreps described in the acquisition announcement: “Why add another flavor of streaming? After all, we’ve long offered two major form factors–Confluent Cloud, a fully managed serverless offering, and Confluent Platform, a self-managed software offering–why complicate things? Well, our goal is to make data streaming the central nervous system of every company, and to do that we need to make it something that is a great fit for a vast array of use cases and companies.”

Read more details about the acquisition of WarpStream by Confluent in Jay’s blog post: Confluent + WarpStream = Large-Scale Streaming in your Cloud. In summary, WarpStream is not dead. The WarpStream team clarified the status quo and roadmap of this BYOC product for Kafka in its blog post: “WarpStream is Dead, Long Live WarpStream“.

Let’s dig deeper into the three deployment options and their trade-offs.

Deployment Options for Apache Kafka

Apache Kafka can be deployed in three primary ways: self-managed, fully managed/serverless, and BYOC (Bring Your Own Cloud).

  • In self-managed deployments, organizations handle the entire infrastructure, including setup, maintenance, and scaling. This provides full control but requires significant operational effort.
  • Fully managed or serverless Kafka is offered by providers like Confluent Cloud or Azure Event Hubs. The service is hosted and managed by a third-party, reducing operational overhead but with limited control over the underlying infrastructure.
  • BYOC deployments allow organizations to host Kafka within their own cloud VPC. BYOC combines some of the benefits of cloud flexibility with enhanced security and control, while it outsources most of Kafka’s management to specialized vendors.

Confluent’s Kafka Products: Self-Managed Platform vs. BYOC vs. Serverless Cloud

Using the example of Confluent’s product offerings, we can see why there are three product categories for data streaming around Apache Kafka.

There is no silver bullet. Each deployment option for Apache Kafka has its pros and cons. The key differences are related to the trade-offs between “ease of management” and “level of control”.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

If we go into more detail, we see that different use cases require different configurations, security setups, and levels of control while also focusing on being cost effective and providing the right SLA and latency for each use case.

Trade-Offs of Confluent’s Deployment Options for Apache Kafka

On a high level, you need to figure out if you want or have to manage the data plane(s) and control plane of your data streaming infrastructure:

Confluent Deployment Types for Apache Kafka On Premise Edge and Public Cloud
Source: Confluent

If you follow my blog, you know that a key focus is exploring various use cases, architectures and success stories across all industries. And use cases such as log aggregation or IoT sensor analytics require very different deployment characteristics than an instant payment platform or fraud detection and prevention.

Choose the right Kafka deployment model for your use case. Even within one organization, you will probably need different deployments because of security, data privacy and compliance requirements, but also staying cost efficient for high-volume workloads.

BYOC for Apache Kafka with WarpStream

Self-managed Kafka and fully managed Kafka are pretty well understood by now. However, why is BYOC needed as a third option, and how do you do it right?

I had plenty of customer conversations across industries. Common feedback is that most organizations have a cloud-first strategy, but many also (have to) stay hybrid for security, latency or cost reasons.

And let’s be clear: If a data streaming project goes to the cloud, fully managed Kafka (and Flink) should always be the first option as it is much easier to manage and operate to focus on fast time to market and business innovation. Having said that, sometimes, security, cost or other reasons require BYOC.

How Is BYOC Implemented in WarpStream?

Let’s explore why WarpStream is an excellent option for Kafka as BYOC deployment and when to use it instead of serverless Kafka in the cloud:

  • WarpStream provides BYOC as a single-tenant service: each customer gets its own “instance” of Kafka (it uses the Kafka protocol, but it is not Apache Kafka under the hood).
  • However, under the hood, the system still uses cloud-native serverless systems like Amazon S3 for scalability, cost-efficiency and high availability (but the customer does not see this complexity and does not have to care about it).
  • As a result, the data plane is still customer-managed (that’s what they need for security or other reasons), but in contrast to self-managed Kafka, the customer does not need to worry about the complexity under the hood (like rebalancing, rolling upgrades, backups) – that is what S3 and the WarpStream service take over.
  • The magic is the stateless agents in the customer VPC. They make this solution scalable and still easy to operate (compared to the self-managed deployment option) while the customer keeps its own instance.
  • Many use cases are around lift and shift of existing Kafka deployments (like self-managed Apache Kafka or another vendor like Kafka as part of Cloudera or Red Hat). Some companies want to “lift and shift” and keep the feeling of control they are used to, while still offloading most of the management to the vendor.
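
Because WarpStream speaks the Kafka protocol, existing Kafka applications typically only need to point their bootstrap servers at the agents running in your own VPC. Here is a minimal Python sketch (not part of the original post; the agent endpoint and topic name are illustrative assumptions):

  # Minimal sketch: a standard Kafka client producing to agents deployed in the customer VPC.
  from confluent_kafka import Producer

  producer = Producer({
      # Hypothetical DNS name of the stateless agents running in your own VPC.
      "bootstrap.servers": "warpstream-agents.internal.example.com:9092",
  })

  producer.produce("clickstream", key="user-42", value='{"page": "/checkout"}')
  producer.flush()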

I wrote this summary after reading the excellent article of my colleague Jack Vanlightly: BYOC, Not “The Future Of Cloud Services” But A Pillar Of An Everywhere Platform. This article goes into much more technical detail and is a highly recommended read for any architect and developer.

Benefits of WarpStream’s BYOC Implementation for Kafka

Most vendors have dubious BYOC implementations.

For instance, some implementations require the vendor to access the customer’s VPC, which creates security concerns and headaches around responsibilities in the case of failures.

WarpStream’s BYOC-native implementation differs from other vendors and provides various benefits because of its novel architecture:

  • WarpStream does not need access to the customer VPC. The data plane (i.e., the brokers in the customer VPC) is stateless. The metadata/consensus is in the control plane (i.e., the cloud service in the WarpStream VPC).
  • The architecture solves sovereignty challenges and is a great fit for security and compliance requirements.
  • The cost of WarpStream’s BYOC offering is cheaper than self-managed Apache Kafka because it is built with cloud-native concepts and technologies in mind (e.g., zero disks and zero interzone networking fees, leveraging cloud object storage such as Amazon S3).
  • The stateless architecture in the customer VPC makes autoscaling and elasticity very easy to implement/configure.

The Main Drawbacks of BYOC for Apache Kafka

BYOC is an excellent choice if you have specific security, compliance or cost requirements that need this deployment option. However, there are some drawbacks:

  • The latency is worse than self-managed Kafka or serverless Kafka as WarpStream directly touches the Amazon S3 object storage (in contrast to “normal Kafka”).
  • Kafka using BYOC is NOT fully managed like, e.g., Confluent Cloud, so operating it requires more effort. Also, keep in mind that most Kafka cloud services are NOT serverless but just provision Kafka for you, and you still need to operate it.
  • Additional components of the data streaming platform (such as Kafka Connect connectors and stream processors such as Kafka Streams or Apache Flink) are not part of the BYOC offering (yet). This adds some complexity to operations and development.

Therefore, once again, I recommend only looking at BYOC options for Apache Kafka in the public cloud if a fully managed and serverless data streaming platform does NOT work for you because of cost, security or compliance reasons!

BYOC Complements Self-Managed and Serverless Apache Kafka – But BYOC Should NOT be the First Choice!

BYOC (Bring Your Own Cloud) offers a flexible and powerful deployment model, particularly beneficial for businesses with specific security or compliance needs. By allowing organizations to manage applications within their own cloud VPCs, BYOC combines the advantages of cloud infrastructure control with the flexibility of third-party service integration.

But once again: If a data streaming project goes to the cloud, fully managed Kafka (and Flink) should always be the first option as it is much easier to manage and operate to focus on fast time to market and business innovation. Choose BYOC only if fully managed does not work for you, e.g. because of security requirements.

In the data streaming domain around Apache Kafka, the BYOC model complements existing self-managed and fully managed options. It offers a middle ground that balances ease of operation with enhanced privacy and security. Ultimately, BYOC helps companies tailor their cloud environments to meet diverse and developing business requirements.

What is your deployment option for Apache Kafka? A self-managed deployment in the data center or at the edge? Serverless Cloud with a service such as Confluent Cloud? Or did you (have to) choose BYOC? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud) appeared first on Kai Waehner.

]]>
The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming https://www.kai-waehner.de/blog/2024/06/15/the-shift-left-architecture-from-batch-and-lakehouse-to-real-time-data-products-with-data-streaming/ Sat, 15 Jun 2024 06:12:44 +0000 https://www.kai-waehner.de/?p=6480 Data integration is a hard challenge in every enterprise. Batch processing and Reverse ETL are common practices in a data warehouse, data lake or lakehouse. Data inconsistency, high compute cost, and stale information are the consequences. This blog post introduces a new design pattern to solve these problems: The Shift Left Architecture enables a data mesh with real-time data products to unify transactional and analytical workloads with Apache Kafka, Flink and Iceberg. Consistent information is handled with streaming processing or ingested into Snowflake, Databricks, Google BigQuery, or any other analytics / AI platform to increase flexibility, reduce cost and enable a data-driven company culture with faster time-to-market building innovative software applications.

The post The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming appeared first on Kai Waehner.

]]>
Data integration is a hard challenge in every enterprise. Batch processing and Reverse ETL are common practices in a data warehouse, data lake or lakehouse. Data inconsistency, high compute cost, and stale information are the consequences. This blog post introduces a new design pattern to solve these problems: The Shift Left Architecture enables a data mesh with real-time data products to unify transactional and analytical workloads with Apache Kafka, Flink and Iceberg. Consistent information is handled with streaming processing or ingested into Snowflake, Databricks, Google BigQuery, or any other analytics / AI platform to increase flexibility, reduce cost and enable a data-driven company culture with faster time-to-market building innovative software applications.

The Shift Left Architecture

Data Products – The Foundation of a Data Mesh

A data product is a crucial concept in the context of a data mesh that represents a shift from traditional centralized data management to a decentralized approach.

McKinsey finds that “when companies instead manage data like a consumer product—be it digital or physical—they can realize near-term value from their data investments and pave the way for quickly getting more value tomorrow. Creating reusable data products and patterns for piecing together data technologies enables companies to derive value from data today and tomorrow”:

McKinsey - Why Handle Data as a Product

According to McKinsey, the benefits of the data product approach can be significant:

  • New business use cases can be delivered as much as 90 percent faster.
  • The total cost of ownership, including technology, development, and maintenance, can decline by 30 percent.
  • The risk and data-governance burden can be reduced.

Data Product from a Technical Perspective

Here’s what a data product entails in a data mesh from a technical perspective:

  1. Decentralized Ownership: Each data product is owned by a specific domain team. Applications are truly decoupled.
  2. Sourced from Operational and Analytical Systems: Data products include information from any data source, including the most critical systems and analytics/reporting platforms.
  3. Self-contained and Discoverable: A data product includes not only the raw data but also the associated metadata, documentation, and APIs.
  4. Standardized Interfaces: Data products adhere to standardized interfaces and protocols, ensuring that they can be easily accessed and utilized by other data products and consumers within the data mesh.
  5. Data Quality: Most use cases benefit from real-time data. A data product ensures data consistency across real-time and batch applications.
  6. Value-Driven: The creation and maintenance of data products are driven by business value.

In essence, a data product in a data mesh framework transforms data into a managed, high-quality asset that is easily accessible and usable across an organization, fostering a more agile and scalable data ecosystem.

Anti-Pattern: Batch Processing and Reverse ETL

The “Modern” Data Stack leverages traditional ETL tools or data streaming for ingestion into a data lake, data warehouse or lakehouse. The consequence is a spaghetti architecture with various integration tools for batch and real-time workloads mixing analytical and operational technologies:

Data at Rest and Reverse ETL

Reverse ETL is required to get information out of the data lake into operational applications and other analytical tools. As I have written about it previously, the combination of data lakes and Reverse ETL is an anti-pattern for the enterprise architecture largely due to the economic and organizational inefficiencies Reverse ETL creates. Event-driven data products enable a much simpler and more cost-efficient architecture.

One key reason for the need of batch processing and Reverse ETL patterns is the common use of the Lambda architecture: A data processing architecture that handles real-time and batch processing separately using different layers. This still widely exists in enterprise architectures. Not just for big data use cases like Hadoop/Spark and Kafka, but also for the integration with transactional systems like file-based legacy monoliths or Oracle databases.

Contrary, the Kappa Architecture handles both real-time and batch processing using a single technology stack. Learn more about “Kappa replacing Lambda Architecture” in its own article. TL;DR: The Kappa architecture is possible by bringing even legacy technologies into an event-driven architecture using a data streaming platform. Change Data Capture (CDC) is one of the most common helpers for this.
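
As a hedged example of such a CDC integration (not part of the original post), the following Python sketch registers a Debezium connector through the Kafka Connect REST API. Hostnames, credentials, and table names are illustrative assumptions, and the exact configuration keys depend on the Debezium version:

  # Minimal sketch: turn database changes into Kafka events with a Debezium CDC connector.
  import json
  import urllib.request

  connector = {
      "name": "orders-postgres-cdc",
      "config": {
          "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
          "database.hostname": "legacy-db.internal.example.com",
          "database.port": "5432",
          "database.user": "cdc_user",
          "database.password": "secret",
          "database.dbname": "erp",
          "table.include.list": "public.orders",
          "topic.prefix": "erp",   # change events land on the topic 'erp.public.orders'
      },
  }

  req = urllib.request.Request(
      "http://connect:8083/connectors",         # Kafka Connect REST endpoint (assumption)
      data=json.dumps(connector).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  urllib.request.urlopen(req)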

Traditional ELT in the Data Lake, Data Warehouse, Lakehouse

It seems like nobody does data warehouse anymore today. Everyone talks about a lakehouse merging data warehouse and data lake. Whatever term you use or prefer… The integration process these days looks like the following:

ETL and ELT Data Integration to Data Lake Warehouse Lakehouse in Batch

Just ingesting all the raw data into a data warehouse / data lake / lakehouse has several challenges:

  • Slower Updates: The longer the data pipeline and the more tools are used, the slower the update of the data product.
  • Longer Time-to-Market: Development efforts are repeated because each business unit needs to do the same or similar processing steps again instead of consuming from a curated data product.
  • Increased Cost: The cash cow of analytics platforms is compute, not storage. The more your business units use DBT, the better for the analytics SaaS provider.
  • Repeating Efforts: Most enterprises have multiple analytics platforms, including different data warehouses, data lakes, and AI platforms. ELT means doing the same processing again, again, and again.
  • Data Inconsistency: Reverse ETL, Zero ETL, and other integration patterns mean that your analytical and especially your operational applications see inconsistent information. You cannot connect a real-time consumer or mobile app API to a batch layer and expect consistent results.

Data Integration, Zero ETL and Reverse ETL with Kafka, Snowflake, Databricks, BigQuery, etc.

These disadvantages are real! I have not met a single customer in the past months who disagreed and told me these challenges do not exist. To learn more, check out my blog series about data streaming with Apache Kafka and analytics with Snowflake:

  1. Snowflake Integration Patterns: Zero ETL and Reverse ETL vs. Apache Kafka
  2. Snowflake Data Integration Options for Apache Kafka (including Iceberg)
  3. Apache Kafka + Flink + Snowflake: Cost Efficient Analytics and Data Governance

The blog series can be applied to any other analytics engine. It is a worthwhile read, no matter if you use Snowflake, Databricks, Google BigQuery, or a combination of several analytics and AI platforms.

The solution to this data mess of data inconsistency, outdated information, and ever-growing cost is the Shift Left Architecture.

Shift Left to Data Streaming for Operational AND Analytical Data Products

The Shift Left Architecture enables consistent information from reliable, scalable data products, reduces the compute cost, and allows much faster time-to-market for operational and analytical applications with any kind of technology (Java, Python, iPaaS, Lakehouse, SaaS, “you-name-it”) and communication paradigm (real-time, batch, request-response API):

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Shifting the data processing to the data streaming platform enables:

  • Capture and stream data continuously when the event is created
  • Create data contracts for downstream compatibility and promotion of trust with any application or analytics / AI platform
  • Continuously cleanse, curate and quality check data upstream with data contracts and policy enforcement
  • Shape data into multiple contexts on-the-fly to maximize reusability (and still allow downstream consumers to choose between raw and curated data products)
  • Build trustworthy data products that are instantly valuable, reusable and consistent for any transactional and analytical consumer (no matter if consumed in real-time or later via batch or request-response API)
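
As a minimal illustration of shifting processing to the left (not part of the original post), the following PyFlink sketch turns a raw Kafka topic into a curated, reusable data product topic with Flink SQL. Topic names, fields, and the broker address are illustrative assumptions, and the flink-sql-connector-kafka jar must be on the classpath:

  # Minimal sketch: cleanse and curate once, upstream; every consumer reads the curated product.
  from pyflink.table import EnvironmentSettings, TableEnvironment

  t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

  t_env.execute_sql("""
      CREATE TABLE raw_orders (
          order_id STRING, customer_id STRING, amount DOUBLE, currency STRING
      ) WITH (
          'connector' = 'kafka', 'topic' = 'raw.orders',
          'properties.bootstrap.servers' = 'localhost:9092',
          'scan.startup.mode' = 'earliest-offset', 'format' = 'json'
      )
  """)

  t_env.execute_sql("""
      CREATE TABLE curated_orders (
          order_id STRING, customer_id STRING, amount_eur DOUBLE
      ) WITH (
          'connector' = 'kafka', 'topic' = 'curated.orders',
          'properties.bootstrap.servers' = 'localhost:9092',
          'format' = 'json'
      )
  """)

  # Filter out invalid events and publish the curated data product topic.
  t_env.execute_sql("""
      INSERT INTO curated_orders
      SELECT order_id, customer_id, amount AS amount_eur
      FROM raw_orders
      WHERE amount > 0 AND currency = 'EUR'
  """)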

While shifting some workloads to the left, it is crucial to understand that developers, data engineers and data scientists can usually still use their favorite interface like SQL or a programming language such as Java or Python.

Data streaming is the core foundation of the Shift Left Architecture to enable reliable, scalable real-time data products with good data quality. The following architecture shows how Apache Kafka and Flink connect any data source, curate data sets (aka stream processing / streaming ETL) and share the processed events with any operational or analytical data sink:

Shift Left Architecture with Apacke Kafka Flink and Iceberg

The architecture shows an Apache Iceberg table as an alternative consumer. Apache Iceberg is an open table format designed for managing large-scale datasets in a highly performant and reliable way, providing ACID transactions, schema evolution, and partitioning capabilities. It optimizes data storage and query performance, making it ideal for data lakes and complex analytical workflows. Iceberg is evolving into the de facto standard with support from most major vendors in the cloud and data management space, including AWS, Azure, GCP, Snowflake, Confluent, and many more coming (like Databricks after its acquisition of Tabular).

From the data streaming perspective, the Iceberg table is just a button click away from the Kafka Topic and its Schema (using Confluent’s Tableflow – I am sure other vendors will follow soon with their own solutions). The big advantage of Iceberg is that data needs to be stored only once (typically in a cost-efficient and scalable object store like Amazon S3). Each downstream application can consume the data with its own technology without any need for additional coding or connectors. This includes data lakehouses like Snowflake or Databricks AND data streaming engines like Apache Flink.
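
As a hedged sketch of how a Flink job could then read such an Iceberg table (not part of the original post), the catalog type, warehouse path, and table name below are illustrative assumptions, and the iceberg-flink-runtime jar must be on the classpath:

  # Minimal sketch: query an Iceberg table (stored once in object storage) from Flink.
  from pyflink.table import EnvironmentSettings, TableEnvironment

  t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

  t_env.execute_sql("""
      CREATE CATALOG lakehouse WITH (
          'type' = 'iceberg',
          'catalog-type' = 'hadoop',
          'warehouse' = 's3://my-bucket/warehouse'
      )
  """)

  # The same table can also be read by Snowflake, Databricks, Spark, and other engines.
  t_env.execute_sql("SELECT * FROM lakehouse.analytics.curated_orders LIMIT 10").print()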

Video: Shift Left Architecture

I summarized the above architectures and examples for the Shift Left Architecture in a short ten minute video if you prefer listening to content:

Apache Iceberg – The New De Facto Standard for Lakehouse Table Format?

Apache Iceberg is such a huge topic and a real game changer for enterprise architectures, end users and cloud vendors. I will write another dedicated blog, including interesting topics such as:

  • Confluent’s product strategy to embed Iceberg tables into its data streaming platform
  • Snowflake’s open source Iceberg project Polaris
  • Databricks’ acquisition of Tabular (the company behind Apache Iceberg) and the relation to Delta Lake and open sourcing its Unity Catalog
  • The (expected) future of table format standardization, catalog wars, and other additional solutions like Apache Hudi or Apache XTable for omni-directional interoperability across lakehouse table formats.

Stay tuned and subscribe to my newsletter to receive new articles.

Business Value of the Shift Left Architecture

Apache Kafka is the de facto standard for data streaming building a Kappa Architecture. The Data Streaming Landscape shows various open source technologies and cloud vendors. Data Streaming is a new software category. Forrester published “The Forrester Wave™: Streaming Data Platforms, Q4 2023“. The leaders are Microsoft, Google and Confluent, followed by Oracle, Amazon, Cloudera, and a few others.

Building data products more left in the enterprise architecture with a data streaming platform and technologies such as Kafka and Flink creates huge business value:

  • Cost Reduction: Reducing the compute cost in one or even multiple data platforms (data lake, data warehouse, lakehouse, AI platform, etc.).
  • Less Development Effort: Streaming ETL, data curation and data quality control already executed instantly (and only once) after the event creation.
  • Faster Time to Market: Focus on new business logic instead of doing repeated ETL jobs.
  • Flexibility: Best of breed approach for choosing the best and/or most cost-efficient technology per use case.
  • Innovation: Business units can choose any programming language, tool or SaaS to do real-time or batch consumption from data products to try and fail or scale fast.

The unification of transactional and analytical workloads is finally possible, enabling good data quality, faster time to market for innovation and reduced cost of the entire data pipeline. Data consistency matters across all applications and databases… A Kafka Topic with a data contract (= Schema with policies) brings data consistency out of the box!

What does your data architecture look like today? Does the Shift Left Architecture make sense to you? What is your strategy to get there? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming appeared first on Kai Waehner.

]]>
Data Streaming is not a Race, it is a Journey! https://www.kai-waehner.de/blog/2023/04/13/data-streaming-is-not-a-race-it-is-a-journey/ Thu, 13 Apr 2023 07:15:31 +0000 https://www.kai-waehner.de/?p=5352 Data Streaming is not a race, it is a journey! Event-driven architectures and technologies like Apache Kafka or Apache Flink require a mind shift in architecting, developing, deploying, and monitoring applications. This blog post explores success stories from data streaming journeys across industries, including banking, retail, insurance, manufacturing, healthcare, energy & utilities, and software companies.

The post Data Streaming is not a Race, it is a Journey! appeared first on Kai Waehner.

]]>
Data Streaming is not a Race, it is a Journey! Event-driven architectures and technologies like Apache Kafka or Apache Flink require a mind shift in architecting, developing, deploying, and monitoring applications. Legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud setups are the norm, not an exception. This blog post explores success stories from data streaming journeys across industries, including banking, retail, insurance, manufacturing, healthcare, energy & utilities, and software companies.

Data Streaming is not a Race it is a Journey

Data Streaming is a Journey, not a Race!

Confluent’s maturity model is used across thousands of customers to analyze the status quo, deploy real-time infrastructure and applications, and plan for a strategic event-driven architecture to ensure success and flexibility in the future of the multi-year data streaming journey:

Data Streaming Maturity Model
Source: Confluent

The following sections show success stories from various companies across industries that moved through the data streaming journey. Each journey looks different. Each company has different technologies, vendors, strategies, and legal requirements.

Before we begin, I must stress that these journeys are NOT just successful because of the newest technologies like Apache Kafka, Kubernetes, and public cloud providers like AWS, Azure, and GCP. Success comes with a combination of technologies and expertise.

Consulting – Expertise from Software Vendors and System Integrators

All the below success stories combined open source technologies, software vendors, cloud providers, internal business and technical experts, system integrators, and consultants of software vendors.

Long story short: Technology and expertise are required to make your data streaming journey successful.

We not only sell data streaming products and cloud services but also offer advice and essential support. Note that the bundle numbers (1 to 5) in the following diagram are related to the above data streaming maturity curve:

Professional Services along the Data Streaming Journey
Source: Confluent

Other vendors have similar strategies to support you. The same is true for the system integrators. Learn together. Bring people from different companies into the room to solve your business problems.

With this background, let’s look at the fantastic data streaming journeys we heard about at past Kafka Summits, Data in Motion events, Confluent blog posts, or other similar public knowledge-sharing alternatives.

Customer Journeys across Industries powered by Apache Kafka

Apache Kafka is the DE FACTO standard for data streaming. However, each data streaming journey looks different. Don’t underestimate how much you can learn from other industries. These companies might have different legal or compliance requirements, but the overall IT challenges are often very similar.

We cover stories from various industries, including banks, retailers, insurers, manufacturers, healthcare organizations, energy providers, and software companies.

The following data streaming journeys are explored in the below sections:

  1. AO.com: Real-Time Clickstream Analytics
  2. Nordstrom: Event-driven Analytics Platform
  3. Migros: End-to-End Supply Chain with IoT
  4. NordLB: Bank-wide Data Streaming for Core Banking
  5. Raiffeisen Bank International: Strategic Real-Time Data Integration Platform Across 13 Countries
  6. Allianz: Legacy Modernization and Cloud-Native Innovation
  7. Optum (UnitedHealth Group): Self-Service Data Streaming
  8. Intel: Real-Time Cyber Intelligence at Scale with Kafka and Splunk
  9. Bayer: Transition from On-Premise to Multi-Cloud
  10. Tesla: Migration from a Message Broker (RabbitMQ) to Data Streaming (Kafka)
  11. Siemens: Integration between On-Premise SAP and Salesforce CRM in the Cloud
  12. 50Hertz: Transition from Proprietary SCADA to Cloud-Native Industrial IoT

There is no specific order besides industries. If you read the stories, you see that all are different, but you can still learn a lot from them, no matter what industry your business is in.

For more information about data streaming in a specific industry, just search my blog for case studies and architectures. Recent blog posts focus on the state of data streaming in manufacturing, in financial services, and so on.

AO.com – Real-Time Clickstream Analytics

AO.com is an electrical retailer. The hyper-personalized online retail experience turns each customer visit into a one-on-one marketing opportunity. The technical implementation correlates historical customer data with real-time digital signals.

Years ago, this kind of clickstream analytics was the “Hello World” example for Apache Hadoop and Apache Spark. AO.com does real-time clickstream analytics powered by data streaming.

AO.com’s journey started with relatively simple data streaming pipelines leveraging the Schema Registry for decoupling and API contracts. Over time, the platform thinking of data streaming added more business value, and operations shifted to the fully managed Confluent Cloud so teams could focus on implementing business applications, not operating infrastructure.

Apache Kafka at AO for Retail and Customer 360 Omnichannel Clickstream Analytics
Source: AO.com

Nordstrom – Event-driven Analytics Platform

Nordstrom is an American luxury department store chain. In the meantime, 50+% of revenue comes from online sales. They built the Nordstrom Analytical Platform (NAP) as the heart of its event-driven architecture. Nordstrom can use a singular event for analytical, functional, operational, and model-building purposes.

As Nordstrom said at Kafka Summit: “If it’s not in NAP, it didn’t happen.”

Nordstrom’s journey started in 1901 with the first stores. The last 25 years brought the retailer into the digital world and online retail. NAP has been a core component since 2017 for real-time analytics. The future is all automated decision-making in real-time.

Nordstrom Analytics Platform with Apache Kafka and Data Lake for Online Retail
Source: Nordstrom

Migros – End-to-End Supply Chain with IoT

Migros is Switzerland’s largest retail company, the largest supermarket chain, and the largest employer. They leverage data streaming with Confluent to distribute master data across many systems.

Migros’ supply chain is optimized with a single data streaming pipeline (including replaying entire days of events). For instance, real-time transportation information is visualized with MQTT and Kafka. Afterward, more advanced business logic was implemented, like forecasting truck arrival times and planning or rescheduling truck tours.

The data streaming journey started with a single Kafka cluster and grew into various independent Kafka clusters and a strategic Kafka-powered enterprise integration platform.

History of Apache Kafka for Retail and Logistics at Migros
Source: Migros

NordLB – Bank-wide Data Streaming for Core Banking

Norddeutsche Landesbank (NordLB) is one of the largest commercial banks in Germany. They implemented an enterprise-wide transformation. The new Confluent-powered core banking platform enables event-based and truly decoupled stream processing. Use cases include improved real-time analytics, fraud detection, and customer retention.

Unfortunately, NordLB’s slide is only available in German. But I guess you can still follow their data streaming journey moving over the years from on-premise big data batch processing with Hadoop to real-time data streaming and analytics in the cloud with Kafka and Confluent:

Enterprise Wide Apache Kafka Adoption at NordLB for Banking and Financial Services
Source: Norddeutsche Landesbank

Raiffeisen Bank International – Strategic Real-Time Data Integration Platform across 13 Countries

Raiffeisen Bank International (RBI) operates as a corporate and investment bank in Austria and as a universal bank in Central and Eastern Europe (CEE).

RBI’s bank transformation across 13 countries includes various components:

  • Bank-wide transformation program
  • Realtime Integration Center of Excellence (“RICE“)
  • Central platform and reference architecture for self-service re-use
  • Event-driven integration platform (fully-managed cloud and on-premise)
  • Group-wide API contracts (=schemas) and data governance

While I don’t have a nice diagram of RBI’s data streaming journey over the past few years, I can show you one of the most impressive migration stories. The context is horrible: RBI had to move from its Ukrainian data centers to the public cloud after the Russian invasion started in early 2022. From the IT perspective, the journey was super impressive, as the migration happened without downtime or data loss.

RBI’s CTO recapped:

“We did this migration in three months, because we didn’t have a choice.”

Migration from On Premise to Cloud in Ukraine at Raiffeisen Bank International
Source: McKinsey

Allianz – Legacy Modernization and Cloud-Native Innovation

Allianz is a European multinational financial services company headquartered in Munich, Germany. Its core businesses are insurance and asset management. The company is one of the largest insurers and financial services groups.

A large organization like Allianz does not have just one enterprise architecture. This section has two independent stories of two separate Allianz business units.

One team of Allianz has built a so-called Core Insurance Service Layer (CISL). The Kafka-powered enterprise architecture is flexible and evolutionary. Why? Because it needs to integrate with hundreds of old and new applications. Some are running on the mainframe or use a batch file transfer. Others are cloud-native, running in containers or as a SaaS application in the public cloud.

Data streaming ensures the decoupling of applications via events and the reliable integration of different backend applications, APIs and communication paradigms. Legacy and new applications are connected and can run in parallel (old V1 and new V2) before migrating away from and shutting down the legacy component.

Allianz Insurance Core Integration Layer with Data Streaming and Kafka
Source: Allianz

Allianz Direct is a separate Kafka deployment. This business unit started with a greenfield approach to build insurance as a service in the public cloud. The cloud-native platform is elastic, scalable, and open. Hence, one platform can be used across countries with different legal, compliance, and data governance requirements.

This data streaming journey is best described by looking at a quote from Allianz Direct’s CTO:

Data Streaming is a Cloud Journey at Allianz Direct
Source: Linkedin (November 2022)

Optum – Self-Service Data Streaming

Optum is an American pharmacy benefit manager and healthcare provider (UnitedHealth Group subsidiary). Optum started with a single data source connecting to Kafka. Today, they provide Kafka as a Service within the UnitedHealth Group. The service is centrally managed and used by over 200 internal application teams.

This data streaming approach is a repeatable, scalable, and cost-efficient way to standardize data. Optum leverages data streaming for many use cases, from mainframe integration via change data capture (CDC) to modern data processing and analytics tools.

IT Modernization with Apache Kafka and Data Streaming in Healthcare at Optum
Source: Optum

Intel – Real-Time Cyber Intelligence at Scale with Kafka and Splunk

Intel Corporation (commonly known as Intel) is an American multinational corporation and technology company headquartered in Santa Clara, California. It is one of the world’s largest semiconductor chip manufacturers by revenue.

Intel’s Cyber Intelligence Platform leverages the entire Kafka-native ecosystem, including Kafka Connect, Kafka Streams, Multi-Region Clusters (MRC), and more…

Their Kafka maturity curve shows how Intel started with a few data pipelines, for instance, getting data from various sources into Splunk for situational awareness and threat intelligence use cases. Later, Intel added Kafka-native stream processing for streaming ETL (to pre-process, filter, and aggregate data) instead of ingesting raw data (with high $$$ bills) and for advanced analytics by combining streaming analytics with Machine Learning.

Cybersecurity Intelligence Platform with Data Streaming Apache Kafka Confluent and Splunk at Intel
Source: Intel

Brent Conran (Vice President and Chief Information Security Officer, Intel) described the benefits of data streaming:

“Kafka enables us to deploy more advanced techniques in-stream, such as machine learning models that analyze data and produce new insights. This helps us reduce the meantime to detect and respond.”

Bayer AG – Transition from On-Premise to Multi-Cloud

Bayer AG is a German multinational pharmaceutical and biotechnology company and one of the largest pharmaceutical companies in the world. They adopted a cloud-first strategy and started a multi-year transition to the cloud.

While Bayer AG has various data streaming projects powered by Kafka, my favorite is the success story of their Monsanto acquisition. The Kafka-based cross-data center DataHub was created to facilitate migration and to drive a shift to real-time stream processing.

Instead of using many words, let’s look at Bayer’s impressive data streaming journey from on-premise to multi-cloud, connecting various cloud-native Kafka and non-Kafka technologies over the years.

Hybrid Multi Cloud Data Streaming Journey at Bayer in Healthcare and Pharma
Source: Bayer AG

Tesla – Migration from a Message Broker (RabbitMQ) to Data Streaming (Kafka)

Tesla is a carmaker, utility company, insurance provider, and more. The company processes trillions of messages per day for IoT use cases.

Tesla’s data streaming journey is fascinating because it focuses on migrating from a message broker to stream processing. The initial reasons for Kafka were the scalability and reliability of processing high volumes of IoT sensor data.

But as you can see, more use cases were added quickly, such as Kafka Connect for data integration.

Tesla Automotive Journey from RabbitMQ to Kafka Connect
Source: Tesla

Tesla’s blog post details its advanced stream processing use cases. I often call streaming analytics the “secret sauce” (as data is only valuable if you correlate it in real-time instead of just ingesting it into a batch data warehouse or data lake).

Siemens – Integration between On-Premise SAP and Salesforce CRM in the Cloud

Siemens is a German multinational conglomerate corporation and the largest industrial manufacturing company in Europe.

One strategic data-streaming use case allowed Siemens to move “from batch to faster” data processing. For instance, Siemens connected its SAP PMD to Kafka on-premise. The SAP infrastructure is very complex, with 80% “customized ERP”. They improved the business processes and integration workflow from daily or weekly batches to real-time communication by integrating SAP’s proprietary IDoc messages within the event streaming platform.

Siemens later migrated from self-managed on-premise Kafka to Confluent Cloud via Confluent Replicator. Integrating Salesforce CRM via Kafka Connect was the first step of Siemens’ cloud strategy. As usual, more and more projects and applications join the data streaming journey as it is super easy to tap into the event stream and connect it to your favorite tools, APIs, and SaaS products.

Siemens Apache Kafka and Data Streaming Journey to the Cloud across the Supply Chain
Source: Siemens

A few important notes on this migration:

  • Confluent Cloud was chosen because it enables focusing on business logic and handing over complex operations to the experts, including mission-critical SLAs and support.
  • Migration from one Kafka cluster to another (in this case, from open source to Confluent Cloud) is possible with zero downtime and no data loss. Confluent Replicator and Cluster Linking, in conjunction with the expertise and skills of consultants, make such a migration simple and safe.

50Hertz – Transition from Proprietary SCADA to Cloud-Native Industrial IoT

50Hertz is a transmission system operator for electricity in Germany. They built a cloud-native 24/7 SCADA system with Confluent. The solution is developed and tested in the public cloud but deployed in safety-critical air-gapped edge environments. A unidirectional hardware gateway helps replicate data from the edge into the cloud safely and securely.

This data streaming journey is one of my favorites as it combines so many challenges well known in any OT/IT environment across industries like manufacturing, energy and utilities, automotive, healthcare, etc.:

  • From monolithic and proprietary systems to open architectures and APIs
  • From batch with mainframe, files, and old Windows servers to real-time data processing and analytics
  • From on-premise data centers (and air-gapped production facilities) to a hybrid edge and cloud strategy, often with a cloud-first focus for new use cases.

The following graphic is fantastic. It shows the data streaming journey from left to right, including the bridge in the middle (as such a project is not a big bang but usually takes years, at a minimum):

50Hertz Journey with Data Streaming for OT IT Modernization
Source: 50Hertz

Technical Evolution during the Data Streaming Journey

The different data streaming journeys show a few common themes: It is not a big bang. Old and new technologies and processes need to work together. And only the combination of technology AND expertise drives successful projects.

Therefore, I want to share a few more resources to think about this from a technical perspective:

The data streaming journey never ends. Apache Kafka and Apache Flink increasingly join forces to build innovative real-time stream processing applications. We are just at the beginning.

What does your data streaming journey look like? Where are you right now, and what are your plans for the next few years? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Data Streaming is not a Race, it is a Journey! appeared first on Kai Waehner.

]]>
When NOT to choose Amazon MSK Serverless for Apache Kafka? https://www.kai-waehner.de/blog/2022/08/30/when-not-to-choose-amazon-msk-serverless-for-apache-kafka/ Tue, 30 Aug 2022 05:54:21 +0000 https://www.kai-waehner.de/?p=4777 Apache Kafka became the de facto standard for data streaming. Various cloud offerings emerged and improved in the last years. Amazon MSK Serverless is the latest Kafka product from AWS. This blog post looks at its capabilities to explore how it relates to "the normal" partially managed Amazon MSK, when the serverless version is a good choice, and when other fully-managed cloud services like Confluent Cloud are the better option.

The post When NOT to choose Amazon MSK Serverless for Apache Kafka? appeared first on Kai Waehner.

]]>
Apache Kafka became the de facto standard for data streaming. Various cloud offerings have emerged and improved in recent years. Amazon MSK Serverless is the latest Kafka product from AWS. This blog post looks at its capabilities to explore how it relates to “the normal” partially managed Amazon MSK, when the serverless version is a good choice, and when other fully-managed cloud services like Confluent Cloud are the better option.

Is Amazon MSK Serverless for Apache Kafka a Self-Driving Car or just a Car Engine

Disclaimer: I work for Confluent. While AWS is a strong strategic partner, it also offers the competitive offering Amazon MSK. This post is not about comparing every feature but explaining the concepts behind the alternatives. Read articles and docs from the different vendors to make your own evaluation and decision. View this post as a list of criteria to not forget important aspects in your cloud service selection.

What is a fully-managed Kafka cloud offering?

Before we get started looking at Kafka cloud services like Amazon MSK and Confluent Cloud, it is important to understand what fully managed actually means:

Fully managed and serverless Apache Kafka

Make sure to evaluate the technical solution. Most Kafka cloud solutions market their offering as “fully managed”. However, almost all Kafka cloud offerings are only partially managed! In most cases, the customer must operate the Kafka infrastructure, fix bugs, and optimize scalability and performance.

Cloud-native Apache Kafka re-engineered for the cloud

Operating Apache Kafka as a fully-managed offering in the cloud requires several components in addition to the core of open-source Kafka. A cloud-native Kafka SaaS has features like:

  • The latest stable version with non-disruptive rolling upgrades
  • Elastic scale (up and down!)
  • Self-balancing clusters that take over the complexity and risk of rebalancing partitions across Kafka brokers
  • Tiered storage for cost-efficient long-term storage and better scalability (as the cold storage does not require a rebalancing of partitions and other complex operations tasks)
  • Complete solution “on top of the infrastructure”, including connectors, stream processing, security, and data governance – all in a single fully-managed SaaS

To learn more about building a cloud-native Kafka service, I highly recommend reading the following paper: “The Cloud-Native Chasm: Lessons Learned from Reinventing Apache Kafka as a Cloud-Native, Online Service”.

Comparison of Apache Kafka products and cloud services

Apache Kafka became the de facto standard for data streaming. The open-source community is vast. Various vendors added Kafka and related tooling to their offerings or provide a Kafka cloud service. I wrote a blog post in 2021: “Comparison of Open Source Apache Kafka vs. Vendors including Confluent, Cloudera, Red Hat, Amazon MSK“:

Comparison of Apache Kafka Products and Services including Confluent Amazon MSK Cloudera IBM Red Hat

The article uses a car analogy – from the motor engine to the self-driving car – to explore the different Kafka offerings available on the market. I also cover a few other vehicles, meaning (partly) Kafka-compatible technologies. The goal is not a feature-by-feature comparison (that would be outdated the day after the publication). Instead, the intention is to educate about the different deployment models, product strategies, and trade-offs from the options.

The above post is worth reading to understand how comparing different Kafka solutions makes sense. However, products develop and innovate… Tech comparisons get outdated quickly. In the meantime, AWS released a new product: Amazon MSK Serverless. This blog post explores what it is, when to use it, and how it differs from other Kafka products. In particular, it compares Amazon MSK (the partially managed service) and Confluent Cloud (a fully-managed competitor to Amazon MSK Serverless).

How does Amazon MSK Serverless fit into the Kafka portfolio?

Keeping the car analogy of my previous post, I wonder: Is it a self-driving car, a complete car you drive by yourself, or just a car engine to build your own car? Interestingly, you can argue for all three. 🙂 Let’s explore this in the following sections.

Is Amazon MSK Serverless a Self-Driving Car of Apache Kafka

Introducing Amazon MSK Serverless

Amazon MSK Serverless is a cluster type for Amazon MSK to run Apache Kafka without having to manage and scale cluster capacity. MSK Serverless automatically provisions and scales compute and storage resources. Thus, you can use Apache Kafka on demand and pay for the data you stream and retain.

Amazon MSK is one of the hundreds of cloud services that AWS provides. AWS is a one-stop shop for all cloud needs. That’s definitely a key strength of AWS (and similar to Azure and GCP).

Amazon MSK Serverless is built to solve the problems that come with Amazon MSK (the partially managed Kafka service that is marketed as a fully-managed solution even though it is not): A lot of hidden ops, infra, and downtime costs. This AWS podcast has a great episode that introduces Amazon MSK Serverless and when to use it as a replacement for Amazon MSK.

What Amazon does NOT tell you about MSK Serverless

AWS has great websites, documentation, and videos for its cloud services. This is not different for Amazon MSK. However, a few important details are not obvious… 🙂 Let’s explore a few key points to make sure everybody understands what Amazon MSK Serverless is and what it is not.

Amazon MSK Serverless is incomplete Kafka

If you follow my blogs, then this might be boring. Despite that, too many people think about Kafka as a message queue and data transportation pipeline. That’s what it is, but Kafka is much more:

  • Real-time messaging at any scale
  • Data integration with Kafka Connect
  • Data processing (aka stream processing) with Kafka Streams (or 3rd party Kafka-native components like KSQL)
  • True decoupling (the most underestimated feature of Kafka because of its built-in storage capabilities) and replayability of events with flexible retention times
  • Data governance with service contracts using Schema Registry (to be fair, this is not part of open source Kafka, but a separate component and accessible from GitHub or by vendors like Confluent or Red Hat – but it is used in almost all serious Kafka projects)
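
To make the replayability point above concrete, here is a minimal Java sketch (assuming a local broker and a hypothetical “orders” topic) of a consumer that rewinds to the earliest retained offset and reprocesses the complete history – something a classic message queue cannot offer because messages are gone after consumption.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.*;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-demo");             // hypothetical consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    consumer.seekToBeginning(partitions); // rewind to the earliest retained offset
                }
            });
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```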

As I won’t repeat myself, here are a few articles explaining why Kafka is more than a message queue like you find it in Amazon MSK Serverless:

TL;DR: AWS provides a cloud service for every problem. You can glue them together to build a solution. However, similar to a monolithic application that provides inflexibility in a single package, a mesh of too many independent glued services using different technologies is also very hard to operate and maintain. And the cost for so many services plus networking and data transfer will bring up many surprises in such a setup.

You should ask yourself a few questions:

  • How do you implement data integration and business logic with Amazon MSK Serverless?
  • What’s the consequence regarding cost, SLAs, end-to-end latency, and delivery guarantees when combining Amazon MSK Serverless with various other products like Amazon Kinesis Data Analytics, AWS Glue, AWS Data Pipeline, or a 3rd party integration tool?
  • What is your security and data governance strategy around streaming data? How do you build an event-based data hub that enforces compliant communication between data producers and independent downstream consumers?

Spoilt for Choice: Amazon MSK and Amazon MSK Serverless are different products

Amazon MSK is NOT fully-managed. It is partially managed. After AWS provisions the brokers, you still need to deploy, operate, and monitor Kafka brokers and Kafka Connect connectors, and handle rebalancing with the open source tool Cruise Control. Check out AWS’ latest MSK sizing post: “Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost”. Seriously? A ten-page-long and very technical article explaining how to operate a “fully-managed cloud Kafka service”?

You might think that Amazon MSK Serverless is the successor of Amazon MSK to solve these problems. However, there are now two products to choose from: Amazon MSK and Amazon MSK Serverless.

Amazon does NOT recommend using Amazon MSK Serverless for all use cases! It is recommended if you don’t know the workloads or if they often change in volume.
Amazon recommends “the normal” Amazon MSK for predictable workloads as it is more cost-effective (and because MSK Serverless is often not workable due to its many tough limitations). MSK Connect is also not supported with MSK Serverless yet and is coming at some point in the future.

It is totally okay to provide different products for different use cases. Confluent also has different offerings for different SLAs and functional requirements in its cloud offering. Multi-tenant basic clusters and dedicated clusters are available, but you never have to self-manage the cluster or fix bugs or performance issues yourself.

You should ask yourself a few questions:

  • Which projects require Amazon MSK and which require Amazon MSK Serverless?
  • How will the project scale as your workloads grow?
  • What’s the migration/upgrade plan if your workload exceeds MSK Serverless partition/retention limits?
  • What is the total cost of ownership (TCO) for MSK plus all the other cloud services I need to combine it with?

Amazon MSK Serverless excludes Kafka support

Amazon MSK service level agreements say: “The Service Commitment DOES NOT APPLY to any unavailability, suspension or termination … caused by the underlying Apache Kafka or Apache Zookeeper engine software that leads to request failures …”

Amazon MSK Serverless is part of the Amazon MSK product and has the same limitation. Excluding Kafka support from the MSK offering is (or should be) a blocker for any serious data streaming project!

Not much more to add here… Do you really want to buy a specific product that excludes the support for its core capability? Please also ask your manager if they agree and take the risk.

You should ask yourself a few questions:

  • Who is responsible and takes the risk if you hit a Kafka issue in your project using Amazon MSK or Amazon MSK Serverless?
  • How do you react to security incidents related to the Apache Kafka open source project?
  • How do you fix performance or scalability issues (on both the client and server side)?

When NOT to use Amazon MSK Serverless?

Let’s go back to the car analogy. Is Amazon MSK Serverless a self-driving car?

Obviously, Amazon MSK Serverless is self-driving. That’s what a serverless product is. Similar to Amazon S3 for object storage or AWS Lambda for serverless functions.

However, Amazon MSK Serverless is NOT a complete car! It does not provide enterprise support for its functionality. And it does not provide more than just the core of data streaming.

Therefore, Amazon MSK Serverless is a great self-driving AWS product for some use cases. But you should evaluate the following facts before deciding for or against this cloud service.

24/7 Enterprise support for the product

AWS excludes Kafka support from its Amazon MSK product. Amazon MSK Serverless is part of Amazon MSK and uses the same SLAs.

I am amazed at how many enterprises use Amazon MSK without reading the SLAs. Most people are not aware that Kafka support is excluded from the product.

This makes Amazon MSK Serverless a car engine, not a complete car, right? Do you really want to build your own car and take over the burden and risk of failures while driving on the street?

If you need to deploy mission-critical workloads with 24/7 SLAs, you can stop reading and qualify out Amazon MSK (including Amazon MSK Serverless) until AWS adds serious SLAs to this product in the future.

Complete data streaming platform

AWS has a service for everything. You can glue them together. In our car analogy, this would mean many cars or vehicles in your enterprise architecture. Most of us learned the hard way that distributed microservices are no free lunch.

The monolithic data lake (now pitched as a lakehouse) from vendors like Databricks and Snowflake is no better approach. Use the right technology for a problem or data product. Finding the right mix between focus and independence is crucial. Kafka’s role is the central or decentralized real-time data hub to transport events. This includes data integration and processing, as well as decoupling systems from each other.

A modern data flow requires a simple, reliable and governed way to integrate and process data. Leveraging Kafka’s ecosystem like Kafka Connect and Kafka Streams enables mission-critical end-to-end latency and SLAs in a cost-efficient infrastructure. Development, operations, and monitoring are much harder and more costly if you glue together several services to build a real-time data hub.

However, Kafka is not a silver bullet. Hence, you need to understand when NOT to use Kafka and how it relates to data warehouses, data lakes, and other applications.

After a long introduction to this aspect, long story short: If you use Amazon MSK Serverless, it is the data ingestion component in your enterprise architecture. There are no fully managed components other than Kafka and no native integrations to other 1st party AWS cloud services like S3 or Redshift, or 3rd party cloud services like Snowflake, Databricks, or MongoDB. You must combine Amazon MSK Serverless with several other AWS services for event processing and storage. Additionally, connectivity needs to be implemented and operated by your project team using Kafka Connect connectors, other 1st or 3rd party ETL tools, or custom glue code.

Amazon MSK Serverless only supports AWS Identity and Access Management (IAM) authentication, which limits you to Java clients only. There is no way to use the open source clients for other programming languages. Python, C++, .NET, Go, JavaScript, etc. are not supported with Amazon MSK Serverless.
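
For illustration, the client-side configuration for IAM authentication with a Java Kafka client looks roughly like the sketch below. It is based on the aws-msk-iam-auth library that AWS documents for MSK; treat the property values as assumptions and verify them against the current AWS documentation, as names and versions evolve.

```java
import java.util.Properties;

public class MskIamClientConfig {
    // Settings typically required for IAM auth against MSK Serverless, based on the
    // aws-msk-iam-auth library; verify class and property names in the current AWS docs.
    public static Properties iamClientProperties(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers); // your serverless bootstrap endpoint
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "AWS_MSK_IAM");
        props.put("sasl.jaas.config", "software.amazon.msk.auth.iam.IAMLoginModule required;");
        props.put("sasl.client.callback.handler.class",
                  "software.amazon.msk.auth.iam.IAMClientCallbackHandler");
        return props;
    }
}
```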

MSK Connect allows deploying Kafka Connect connectors (that are available open source, licensed from other vendors, or self-built) into this platform. Similar to Amazon MSK, this is not a fully-managed product. You deploy, operate, and monitor the connectors by yourself. Look at the fully managed connectors in Confluent Cloud to understand the difference. Also, note that AWS will only support the Connect workers, not the connectors themselves, even if they run on MSK Connect.

Event-driven architecture with true decoupling between the microservices

An event-driven architecture powered by data streaming is great for a single integration infrastructure. However, the story goes far beyond this. Modern enterprise architectures leverage principles like microservices, domain-driven design, and data mesh to build decentralized applications and data products.

A streaming data exchange enables such a decentralized architecture with real-time data sharing. A critical capability for such a strategic enterprise component is long-term data storage. It

  • decouples independent applications
  • handles backpressure for slow consumers (like batch systems or request-response web services)
  • enables replayability of historical events (e.g., for a Python consumer in the machine learning platform from data engineers).

Data Mesh with Apache Kafka

The storage capability of Kafka is a key differentiator against message queues like IBM MQ, RabbitMQ, or AWS SQS. Retention time is an important feature to set the right storage options per Kafka topic. Confluent makes Kafka even better by providing Tiered Storage to separate storage from compute for a much more cost-efficient and scalable solution with infinite storage capabilities.
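
As a small, hedged example of per-topic retention, the following Java sketch uses the Kafka AdminClient to set retention.ms for a hypothetical “payments” topic to 30 days; the broker address is a placeholder.

```java
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class TopicRetentionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Keep events on the hypothetical 'payments' topic for 30 days.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "payments");
            AlterConfigOp setRetention = new AlterConfigOp(
                new ConfigEntry("retention.ms", String.valueOf(30L * 24 * 60 * 60 * 1000)),
                AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
        }
    }
}
```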

Amazon MSK Serverless has a limited retention time of 24 hours. This is good enough for many data ingestion use cases but not for building a real-time data mesh across business units or even across organizations. Another tough limitation of Amazon MSK Serverless is the cap of 120 partitions. Not really a limit that allows building a strategic platform around it.

As Amazon MSK Serverless is a new product, expect the limitations to change and improve over time. Check the docs for updates. UPDATE Q1 2023: Amazon MSK Serverless added unlimited retention time and support for more partitions. That’s excellent news for this service. With this update, Amazon MSK Serverless has been a stronger option since 2023 if retention time is your critical criterion. However, check the storage costs and compare different cloud offerings for this.

But anyway, these limitations prove how hard it is to build a fully-managed Kafka offering (like Confluent Cloud) compared to a partially managed Kafka offering (like “the normal” Amazon MSK).

Hybrid AWS and multi-cloud Kafka deployments

The most obvious point: Amazon MSK Serverless is only a reasonable discussion if you run your apps in the public AWS cloud. For anything else, like multi-cloud with Azure or GCP, AWS edge offerings like AWS Outposts or Wavelength, hybrid environments, or edge deployments like a factory or retail store, Amazon MSK Serverless is no option.

If you need to deploy outside the public AWS cloud, check my comparison of Kafka offerings, including Confluent, IBM, Red Hat, and Cloudera.

I want to emphasize that no product or service is 100% cloud agnostic. For instance, building Confluent Cloud on AWS, Azure, and GCP includes unique challenges under the hood. Confluent Cloud is built on Kubernetes. Hence, the template and many automation mechanisms can be reused across cloud vendors. But storage, compute, pricing, support, and many other characteristics and features differ at each cloud service provider.

Having said this, when you leverage a SaaS like Confluent Cloud, you have no knowledge of or access to the technical infrastructure. You don’t see these issues under the hood. On the developer level, you produce and consume messages with the Kafka API and configure additional features like fully-managed connectors, data governance, or cluster linking. All the operations complexity is handled by the vendor. No matter which cloud you run on.

Coopetition: The winners are AWS and Confluent Cloud

The reason for this post was the evolution of the Amazon MSK product line. Hence, if you read this a year later, the product features and limitations might look completely different once again. Use blog posts like this to understand how to evaluate different solutions and SaaS offerings. Then do your own accurate research before making a product decision.

Amazon MSK Serverless is a great new AWS service that helps AWS customers with some types of projects. But it has tough limitations for some other projects. Additionally, Amazon MSK (including Amazon MSK Serverless) excludes Kafka support! And it is not a complete data streaming platform. Be careful not to create a mess of glue code between tens of serverless cloud services and applications. Confluent Cloud is the much more sophisticated fully-managed Kafka cloud offering (on AWS and everywhere else). I am not saying this because I am a Confluent employee but because almost everybody agrees on this 🙂 And it is not really a surprise, as Confluent focuses only on data streaming with 2,000 employees and employs many full-time committers to the Apache Kafka open source project. Amazon has zero, by the way 🙂

By the way: Did you know you can use your AWS credits to consume Confluent Cloud like any other native AWS service? This is because of the strong partnership between Confluent and AWS. Yes, there is coopetition. That’s how the world looks today…

Confluent Cloud provides a complete cloud-native platform including 99.99% SLA, fully managed connectors and stream processing, and maybe most interesting to readers of this post, integration with AWS services (S3, Redshift, Lambda, Kinesis, etc.) plus AWS security and networking (VPC Peering, Private Link, Transit Gateway, etc.). Confluent and AWS work closely together on hybrid deployments, leveraging AWS edge services like AWS Wavelength for 5G scenarios.

Which Kafka cloud service do you use today? What are your pros and cons? Do you plan a migration – e.g., from Amazon MSK to Confluent Cloud or from open source Kafka to Amazon MSK Serverless? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When NOT to choose Amazon MSK Serverless for Apache Kafka? appeared first on Kai Waehner.

]]>
Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure https://www.kai-waehner.de/blog/2022/07/15/data-warehouse-data-lake-modernization-from-legacy-on-premise-to-cloud-native-saas-with-data-streaming/ Fri, 15 Jul 2022 06:03:28 +0000 https://www.kai-waehner.de/?p=4649 The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let's explore this dilemma in a blog series. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.

The post Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure appeared first on Kai Waehner.

]]>
The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.

Data Warehouse and Data Lake Modernization with Data Streaming

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Data warehouse modernization: From legacy on-premise to cloud-native infrastructure

Many people talk about data warehouse modernization when they move to a cloud-native data warehouse. But what does data warehouse modernization mean? Why do people move away from their on-premise data warehouse? What are the benefits?

Many projects I have seen in the wild went through the following steps:

  1. Select a cloud-native data warehouse
  2. Get data into the new data warehouse
  3. [Optional] Migrate from the old to the new data warehouse

Let’s explore these steps in more detail and understand the technology and architecture options.

1. Selection of a cloud-native data warehouse

Many years ago, cloud computing was a game-changer for operating infrastructure. AWS innovated by providing not just EC2 virtual machines but also storage, like AWS S3 as a service.

Cloud-native data warehouse offerings are built on the same fundamental change. Cloud providers brought their analytics cloud services, such as AWS Redshift, Azure Synapse, or GCP BigQuery. Independent vendors rolled out a cloud-native data warehouse or data lake SaaS such as Snowflake, Databricks, and many more. While each solution has its trade-offs, a few general characteristics are true for most of them:

  • Cloud-native: A modern data warehouse is elastic, scales for small up to extreme workloads, and automates most business processes around development, operations, and monitoring.
  • Fully managed: The vendor takes over the operations burden. This includes scaling, failover handling, upgrades, and performance tuning. Some offerings are truly serverless, while many services require capacity planning and manual or automated scaling up and down.
  • Consumption-based pricing: Pay-as-you-go enables getting started in minutes and scaling costs with broader software usage. Most enterprise deployments allow commitment to getting price discounts.
  • Data sharing: Replicating data sets across regions and environments is a common feature to offer data locality, privacy, lower latency, and regulatory compliance.
  • Multi-cloud and hybrid deployments: While cloud providers usually only offer the 1st party service on their cloud infrastructure, 3rd party vendors provide a multi-cloud strategy. Some vendors even offer hybrid environments, including on-premise and edge deployments.

Plenty of comparisons exist in the community, plus analyst research from Gartner, Forrester, et al. Looking at vendor information and trying out the various cloud products using free credits is crucial, too. Finding the right cloud-native data warehouse is its own challenge and not the focus of this blog post.

2. Data streaming as (near) real-time and hybrid integration layer

Data ingestion into data warehouses and data lakes was already covered in part two of this blog series. The more real-time, the better for most business applications. Near real-time ingestion is possible with specific tools (like AWS Kinesis or Kafka) or as part of the data fabric (the streaming data hub where a tool like Kafka plays a bigger role than just data ingestion).

The often more challenging part is data integration. Most data warehouse and data lake pipelines require ETL to ingest data. As the next-generation analytics platform is crucial for making the right business decisions, the data ingestion and integration platform must also be cloud-native! Tools like Kafka provide the reliable and scalable integration layer to get all required data into the data warehouse.

Integration of legacy on-premise data into the cloud-native data warehouse

In a greenfield project, the project team is lucky. Data sources run in the same cloud, using open and modern APIs, and scale as well as the cloud-native data warehouse.

Unfortunately, the reality is brownfield almost always, even if all applications run in public cloud infrastructure. Therefore, the integration and replication of data from legacy and on-premise applications is a general requirement.

Data is typically consumed from legacy databases, data warehouses, applications, and other proprietary platforms. The replication into the cloud data warehouse usually needs to be near real-time and reliable.

A data streaming platform like Kafka is perfect for replicating data across data centers, regions, and environments because of its elastic scalability and true decoupling capabilities. Kafka enables connectivity to modern AND legacy systems via connectors, proprietary APIs, programming languages, or open REST interfaces:

Accelerate modernization from on-prem to AWS with Kafka and Confluent Cluster Linking

A common scenario in such a brownfield project is the clear separation of concerns and true decoupling between legacy on-premise and modern cloud workloads. Here, Kafka is deployed on-premise to connect to legacy applications.

Tools like MirrorMaker, Replicator, or Confluent Cluster Linking replicate events in real-time into the Kafka cluster in the cloud. The Kafka brokers provide access to the incoming events. Downstream consumers read the data into the data sinks at their own pace; real-time, near real-time, batch, or request-response via API. Streaming ETL is possible at any site – where it makes the most sense from a business or security perspective and is the most cost-efficient.

Example: Confluent Cloud + Snowflake = Cloud-native Data Warehouse Modernization

Here is a concrete example of data warehouse modernization using cloud-native data streaming and data warehousing with Confluent Cloud and Snowflake:

Cloud-native Data Warehouse Modernization with Apache Kafka Confluent Snowflake

For modernizing the data warehouse, data is ingested from various legacy and modern data sources using different communication paradigms, APIs, and integration strategies. The data is transmitted in motion from data sources via Kafka (and optional preprocessing) into the Snowflake data warehouse. The whole end-to-end pipeline is scalable, reliable, and fully managed, including the connectivity and ingestion between the Kafka and Snowflake clusters.

However, there is more to the integration and ingestion layer: The data streaming platform stores the data for true decoupling and slow downstream applications; not every consumer is or can be real-time. Most enterprise architectures do not ingest data into a single data warehouse or data lake or lakehouse. The reality is that different downstream applications need access to the same information; even though vendors of data warehouses and data lakes tell you differently, of course 🙂

By consuming events from the streaming data hub, each application domain decides by itself if it

  • processes events within Kafka with stream processing tools like Kafka Streams or ksqlDB (see the sketch after this list)
  • builds its own downstream applications with its code and technologies (like Java, .NET, Golang, Python)
  • integrates with 3rd party business applications like Salesforce or SAP
  • ingests the raw or preprocessed and curated data from Kafka into the sink system (like a data warehouse or data lake)
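
Here is the sketch referenced in the first option above: a minimal Kafka Streams topology that curates raw events in-stream before a sink connector ingests them into the data warehouse. The topic names, application id, and the trivial transformation are hypothetical placeholders, not a recommendation for a specific pipeline.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class CurationTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dwh-curation");      // hypothetical application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("orders.raw");          // hypothetical source topic
        raw.filter((key, value) -> value != null && !value.isBlank())        // drop empty events
           .mapValues(value -> value.trim())                                 // trivial stand-in for real curation logic
           .to("orders.curated");                                            // a sink connector reads from here

        new KafkaStreams(builder.build(), props).start();
    }
}
```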

3. Data warehouse modernization and migration from legacy to cloud-native

An often misunderstood concept is the buzz around data warehouse modernization: Companies rarely take the data of the existing on-premise data warehouse or data lake, write a few ETL jobs, and put the data into a cloud-native data warehouse for the sake of doing it.

If you think about a one-time lift-and-shift from an on-premise data warehouse to the cloud, then a traditional ETL tool or a replication cloud service might be the easiest option. However, data warehouse modernization is usually more than that!

What is data warehouse modernization?

A data warehouse modernization can mean many things, including replacing and migrating the existing data warehouse, building a new cloud-native data warehouse from scratch, or optimizing a legacy ETL pipeline of a cloud-native data warehouse.

In all these cases, data warehouse modernization requires business justification, for instance:

  • Application issues in the legacy data warehouse, such as slow data processing with legacy batch workloads, result in wrong or conflicting information for business users.
  • Scalability issues in the on-premise data warehouse as the data volume grows too much.
  • Cost issues because the legacy data warehouse does not offer reasonable pricing with pay-as-you-go models.
  • Connectivity issues as legacy data warehouses were not built with an open API and data sharing in mind. Cloud-native data warehouses run on cost-efficient and scalable object storage, separate storage from computing, and allow data consumption and sharing. (but keep in mind that Reverse ETL is often an anti-pattern!)
  • A strategic move to the cloud with all infrastructure. The analytics platform is no exception if all old and new applications go to the cloud.

Cloud-native applications usually come with innovation, i.e., new business processes, data formats, and data-driven decision-making. From a data warehouse perspective, the best modernization is to start from scratch. Consume data directly from the existing data sources, ETL it, and do business intelligence on top of the new data structures.

I have seen many more projects where customers use change data capture (CDC) from Oracle databases (i.e., the leading core system) instead of trying to replicate data from the legacy data warehouse (i.e., the analytics platform), because scalability, cost, and the later shutdown of legacy infrastructure all benefit from this approach.

Data warehouse migration: Continuous vs. cut-over

The project is usually a cut-over when you need to do a real modernization (i.e., migration) from a legacy data warehouse to a cloud-native one. This way, the first project phase integrates the legacy data sources with the new data warehouse. The old and new data warehouse platforms operate in parallel, so that old and new business processes go on. After some time (months or years later), when the business is ready to move, the old data warehouse will be shut down after legacy applications are either migrated to the new data warehouse or replaced with new applications:

Data Warehouse Offloading Integration and Replacement with Data Streaming 

My article “Mainframe Integration, Offloading and Replacement with Apache Kafka” illustrates this offloading and long-term migration process. Just scroll to the section “Mainframe Offloading and Replacement in the Next 5 Years” in that post and replace the term ‘mainframe’ with ‘legacy data warehouse’ in your mind.

A migration and cut-over is its own project and may or may not include the legacy data warehouse. Data lake modernization (e.g., from a self- or partially managed Cloudera cluster running on-premise in the data center to a fully managed Databricks or Snowflake cloud infrastructure) follows the same principles. And mixing the data warehouse (reporting) and data lake (big data analytics) into a single infrastructure does not change this either.

Data warehouse modernization is NOT a big bang and NOT a single tool approach!

Most data warehouse modernization projects are ongoing efforts over a long period. You must select a cloud-native data warehouse, get data into the new data warehouse from various sources, and optionally migrate away from legacy on-premise infrastructure.

Data streaming for data ingestion, business applications, or data sharing in real-time should always be a separate component in the enterprise architecture. It has very different requirements regarding SLAs, uptime, throughput, latency, etc. Putting all real-time and analytical workloads into the same cluster makes little sense from a cost, risk, or business value perspective. The idea of a modern data flow and building a data mesh is the separation of concerns with domain-driven design and focusing on data products (using different, independent APIs, technologies, and cloud services).

For more details, browse other posts of this blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

What cloud-native data warehouse(s) do you use? How does data streaming fit into your journey? Did you integrate or replace your legacy on-premise data warehouse(s); or start from greenfield in the cloud? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure appeared first on Kai Waehner.

]]>
Comparison: JMS Message Queue vs. Apache Kafka https://www.kai-waehner.de/blog/2022/05/12/comparison-jms-api-message-broker-mq-vs-apache-kafka/ Thu, 12 May 2022 05:13:19 +0000 https://www.kai-waehner.de/?p=4430 Comparing JMS-based message queue (MQ) infrastructures and Apache Kafka-based data streaming is a widespread topic. Unfortunately, the battle is an apple-to-orange comparison that often includes misinformation and FUD from vendors. This blog post explores the differences, trade-offs, and architectures of JMS message brokers and Kafka deployments. Learn how to choose between JMS brokers like IBM MQ or RabbitMQ and open-source Kafka or serverless cloud services like Confluent Cloud.

The post Comparison: JMS Message Queue vs. Apache Kafka appeared first on Kai Waehner.

]]>
Comparing JMS-based message queue (MQ) infrastructures and Apache Kafka-based data streaming is a widespread topic. Unfortunately, the battle is an apple-to-orange comparison that often includes misinformation and FUD from vendors. This blog post explores the differences, trade-offs, and architectures of JMS message brokers and Kafka deployments. Learn how to choose between JMS brokers like IBM MQ or RabbitMQ and open-source Kafka or serverless cloud services like Confluent Cloud.

JMS Message Queue vs Apache Kafka Comparison

Motivation: The battle of apples vs. oranges

I have to discuss the differences and trade-offs between JMS message brokers and Apache Kafka every week in customer meetings. What annoys me most is the common misunderstandings and (sometimes) intentional FUD in various blogs, articles, and presentations about this discussion.

I recently discussed this topic with Clement Escoffier from Red Hat in the “Coding over Cocktails” Podcast: JMS vs. Kafka: Technology Smackdown. A great conversation with more agreement than you might expect from such an episode where I picked the “Kafka proponent” while Clement took over the role of the “JMS proponent”.

These aspects motivated me to write a blog series about “JMS, Message Queues, and Apache Kafka”:

I will link the other posts here as soon as they are available. Please follow my newsletter to get updated in real-time about new posts. (no spam or ads)

Special thanks to my colleague and long-term messaging and data streaming expert Heinz Schaffner for technical feedback and review of this blog series. He has worked for TIBCO, Solace, and Confluent for 25 years.

10 comparison criteria: JMS vs. Apache Kafka

This blog post explores ten comparison criteria. The goal is to explain the differences between message queues and data streaming, clarify some misunderstandings about what an API or implementation is, and give some technical background to do your evaluation to find the right tool for the job.

The list of products and cloud services is long for JMS implementations and Kafka offerings. A few examples:

  • JMS implementations of the JMS API (open source and commercial offerings): Apache ActiveMQ, Apache Qpid (using AMQP), IBM MQ (formerly MQSeries, then WebSphere MQ), JBoss HornetQ, Oracle AQ, RabbitMQ, TIBCO EMS, Solace, etc.
  • Apache Kafka products, cloud services, and rewrites (beyond the valid option of using just open-source Kafka): Confluent, Cloudera, Amazon MSK, Red Hat, Redpanda, Azure Event Hubs, etc.

Here are the criteria for comparing JMS message brokers vs. Apache Kafka and its related products/cloud services:

  1. Message broker vs. data streaming platform
  2. API Specification vs. open-source protocol implementation
  3. Transactional vs. analytical workloads
  4. Push vs. pull message consumption
  5. Simple vs. powerful and complex API
  6. Storage for durability vs. true decoupling
  7. Server-side data-processing vs. decoupled continuous stream processing
  8. Complex operations vs. serverless cloud
  9. Java/JVM vs. any programming language
  10. Single deployment vs. multi-region (including hybrid and multi-cloud) replication

Let’s now explore the ten comparison criteria.

1. Message broker vs. data streaming platform

TL;DR: JMS message brokers provide messaging capabilities to produce and consume messages. Apache Kafka is a data streaming platform that combines messaging, storage, data integration, and stream processing capabilities.

The most important aspect first: The comparison of JMS and Apache Kafka is an apple-to-orange comparison for several reasons. I would even go further and say that they are not even both fruit, as they are so different from each other.

JMS API (and implementations like IBM MQ, RabbitMQ, et al)

JMS (Java Message Service) is a Java application programming interface (API) that provides generic messaging models. The API handles the producer-consumer problem, which can facilitate the sending and receiving of messages between software systems.

Therefore, the central capability of JMS message brokers (that implement the JMS API) is to send messages from a source application to another destination in real-time. That’s it. And if that’s what you need, then JMS is the right choice for you! But keep in mind that projects must use additional tools for data integration and advanced data processing tasks.
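
To illustrate how lean this messaging model is, here is a minimal sketch using the standardized JMS 2.0 API. The queue name is hypothetical, and the ConnectionFactory always comes from a vendor-specific client library (IBM MQ, ActiveMQ, TIBCO EMS, etc.).

```java
import javax.jms.ConnectionFactory;
import javax.jms.JMSContext;
import javax.jms.Queue;

public class SimpleJmsExample {
    public static void sendAndReceive(ConnectionFactory factory) {
        // The ConnectionFactory comes from a vendor-specific client library;
        // the code below only uses the standardized JMS 2.0 API.
        try (JMSContext context = factory.createContext()) {
            Queue queue = context.createQueue("orders");                    // hypothetical queue name

            context.createProducer().send(queue, "order-4711");             // produce a text message

            String body = context.createConsumer(queue)
                                  .receiveBody(String.class, 1000);         // wait up to 1 second
            System.out.println("Received: " + body);
        }
    }
}
```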

Apache Kafka (open source and vendors like Confluent, Cloudera, Red Hat, Amazon, et al)

Apache Kafka is an open-source protocol implementation for data streaming. It includes:

  • Apache Kafka is the core for distributed messaging and storage. High throughput, low latency, high availability, secure.
  • Kafka Connect is an integration framework for connecting external sources/destinations to Kafka.
  • Kafka Streams is a simple Java library that enables streaming application development within the Kafka framework.

This combination of capabilities enables the building of end-to-end data pipelines and applications. That’s much more than what you can do with a message queue.
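
For comparison, producing an event with the Kafka client API in Java is similarly compact. Broker address, topic, and record values below are placeholders; real pipelines add schemas, security settings, and proper error handling.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class SimpleKafkaProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for full replication before acknowledging

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-4711", "created"),
                (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.printf("Written to %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                    }
                });
        }
    }
}
```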

2. JMS API specification vs. Apache Kafka open-source protocol implementation

TL;DR: JMS is a specification that vendors implement and extend in their opinionated way. Apache Kafka is the open-source implementation of the underlying specified Kafka protocol.

It is crucial to clarify the terms first before you evaluate JMS and Kafka:

  • Standard API: Industry consortiums or other industry-neutral (often global) groups or organizations specify standard APIs. Requires compliance tests for all features and complete certifications to become standard-compliant. Example: OPC-UA.
  • De facto standard API: Originates from an existing successful solution (an open-source framework, a commercial product, or a cloud service). Examples: Amazon S3 (proprietary from a single vendor). Apache Kafka (open source from the vibrant community).
  • API Specification: A specification document to define how vendors can implement a related product. There are no complete compliance tests or complete certifications for the implementation of all features. The consequence is a “standard API” but no portability between implementations. Example: JMS. Specifically for JMS, note that in order to be able to use the compliance suite for JMS, a commercial vendor has to sign up to very onerous reporting requirements towards Oracle.

The alternative kinds of standards have trade-offs. If you want to learn more, check out how Apache Kafka became the de facto standard for data streaming in the last few years.

Portability and migrations became much more relevant in hybrid and multi-cloud environments than in the past decades where you had your workloads in a single data center.

JMS is a specification for message-oriented middleware

JMS is a specification currently maintained under the Java Community Process as JSR 343. The latest (not yet released) version JMS 3.0 is under early development as part of Jakarta EE and rebranded to Jakarta Messaging API. Today, JMS 2.0 is the specification used in prevalent message broker implementations. Nobody knows where JMS 3.0 will go at all. Hence, this post focuses on the JMS 2.0 specification to solve real-world problems today.

I often use the term “JMS message broker” in the following sections as JMS (i.e., the API) does not specify or implement many features you know in your favorite JMS implementation. Usually, when people talk about JMS, they mean JMS message broker implementations, not the JMS API specification.

JMS message brokers and the JMS portability myth

The JMS specification was developed to provide a common Java library to access different messaging vendor’s brokers. It was intended to act as a wrapper to the messaging vendor’s proprietary APIs in the same way JDBC provided similar functionality for database APIs.

Unfortunately, this simple integration turned out not to be the case. The migration of the JMS code from one vendor’s broker to another is quite complex for several reasons:

  • Not all JMS features are mandatory (security, topic/queue labeling, clustering, routing, compression, etc.)
  • There is no JMS specification for transport
  • No specification to define how persistence is implemented
  • No specification to define how fault tolerance or high availability is implemented
  • Different interpretations of the JMS specification by different vendors result in potentially other behaviors for the same JMS functions
  • No specification for security
  • There is no specification for value-added features in the brokers (such as topic to queue bridging, inter-broker routing, access control lists, etc.)

Therefore, simple source code migration and interoperability between JMS vendors is a myth! This sounds crazy, doesn’t it?

Vendors provide a great deal of unique functionality within the broker (such as topic-to-queue mapping, broker routing, etc.) that provide architectural functionality to the application but are part of the broker functionality and not the application or part of the JMS specification.

Apache Kafka is an open-source protocol implementation for data streaming

Apache Kafka is an implementation to do reliable and scalable data streaming in real-time. The project is open-source and available under Apache 2.0 license, and is driven by a vast community.

Apache Kafka is NOT a standard like OPC-UA or a specification like JMS. However, Kafka at least provides the source code reference implementation, protocol and API definitions, etc.

Kafka established itself as the de facto standard for data streaming. Today, over 100,000 organizations use Apache Kafka. The Kafka API became the de facto standard for event-driven architectures and event streaming. Use cases across all industries and infrastructure. Including various kinds of transactional and analytic workloads. Edge, hybrid, multi-cloud. I collected a few examples across verticals that use Apache Kafka to show the prevalence across markets.

Now, hold on. I used the term Kafka API in the above section. Let’s clarify this: As discussed, Apache Kafka is an implementation of a distributed data streaming platform including the server-side and client-side and various APIs for producing and consuming events, configuration, security, operations, etc. The Kafka API is relevant, too, as Kafka rewrites like Azure Event Hubs and Redpanda use it.

Portability of Apache Kafka – yet another myth?

If you use Apache Kafka as an open-source project, this is the complete Kafka implementation. Some vendors use the full Apache Kafka implementation and build a more advanced product around it.

Here, the migration is super straightforward, as Kafka is not just a specification that each vendor implements differently. Instead, it is the same code, libraries, and packages.

For instance, I have seen several successful migrations from Cloudera to Confluent deployments or from self-managed Apache Kafka open-source infrastructure to serverless Confluent Cloud.

The Kafka API – Kafka rewrites like Azure Event Hubs, Redpanda, Apache Pulsar

With the global success of Kafka, some vendors and cloud services did not build a product on top of the Apache Kafka implementation. Instead, they made their implementation on top of the Kafka API. The underlying implementation is proprietary (like in Azure’s cloud service Event Hubs) or open-source (like Apache Pulsar’s Kafka bridge or Redpanda’s rewrite in C++).

Be careful and analyze if vendors integrate the whole Apache Kafka project or rewrote the complete API. Contrary to the battle-tested Apache Kafka project, a Kafka rewrite using the Kafka API is a completely new implementation!

Many vendors even exclude some components or APIs (like Kafka Connect for data integration or Kafka Streams for stream processing) completely or exclude critical features like exactly-once semantics or long-term storage in their support terms and conditions.

It is up to you to evaluate the different Kafka offerings and their limitations. Recently, I compared Kafka vendors such as Confluent, Cloudera, Red Hat, or Amazon MSK and related technologies like Azure Event Hubs, AWS Kinesis, Redpanda, or Apache Pulsar.

Just battle-test the requirements by yourself. If you find a Kafka-to-XYZ bridge with less than a hundred lines of code, or if you find a .exe Windows Kafka server download from a middleware vendor, be skeptical! 🙂

All that glitters is not gold. Some frameworks or vendors sound too good to be true. Just saying you support the Kafka API, you provide a fully managed serverless Kafka offering, or you scale much better is not trustworthy if you are constantly forced to provide fear, uncertainty, and doubt (FUD) on Kafka and that you are much better. For instance, I was annoyed by Pulsar always trying to be better than Kafka by creating a lot of FUDs and myths in the open-source community. I responded in my Apache Pulsar vs. Kafka comparison two years ago. FUD is the wrong strategy for any vendor. It does not work. For that reason, Kafka’s adoption still grows like crazy while Pulsar grows much slower percentage-wise (even though the download numbers are on a much lower level anyway).

3. Transactional vs. analytical workloads

TL;DR: A JMS message broker provides transactional capabilities for low volumes of messages. Apache Kafka supports low and high volumes of messages supporting transactional and analytical workloads.

JMS – Session and two-phase commit (XA) transactions

Most JMS message brokers have good support for transactional workloads.

A transacted session supports a single series of transactions. Each transaction groups a set of produced messages and a set of consumed messages into an atomic unit of work.

Two-phase commit transactions (XA transactions) work on a limited scale. They are used to integrate with other systems like Mainframe CICS / DB2 or Oracle databases. But they are hard to operate and cannot scale beyond a few transactions per second.

It is important to note that support for XA transactions is not mandatory with the JMS 2.0 specification. This differs from the session transaction.

Kafka – Exactly-once semantics and transaction API

Kafka is a distributed, fault-tolerant system that is resilient by nature (if you deploy and operate it correctly). Zero downtime and zero data loss can be guaranteed, just like with your favorite database, mainframe, or other core platforms.

And even better: Kafka’s Transaction API, i.e., Exactly-Once Semantics (EOS), has been available since Kafka 0.11 (GA’ed many years ago). EOS makes building transactional workloads even easier as you don’t need to handle duplicates anymore.

Kafka supports atomic writes across multiple partitions through the transactions API. This allows a producer to send a batch of messages to multiple partitions. Either all messages in the batch are eventually visible to any consumer, or none are ever visible to consumers.
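Here is a minimal Java sketch of Kafka’s transaction API. The broker address, topic names, and transactional ID are hypothetical; the structure follows the usual producer transaction pattern.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-service-tx");  // hypothetical id; enables EOS

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Writes to different topics/partitions become visible atomically.
                producer.send(new ProducerRecord<>("orders", "order-42", "created"));
                producer.send(new ProducerRecord<>("payments", "order-42", "authorized"));
                producer.commitTransaction();
            } catch (ProducerFencedException e) {
                producer.close();            // fatal: another producer with the same transactional.id took over
            } catch (KafkaException e) {
                producer.abortTransaction(); // abortable error: none of the messages become visible
            }
        }
    }
}
```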

Kafka transactions work very differently than JMS transactions. But the goal is the same: Each consumer receives the produced event exactly once. Find more details in the blog post “Analytics vs. Transactions in Data Streaming with Apache Kafka”.

4. Push vs. pull message consumption

TL;DR: JMS message brokers push messages to consumer applications. Kafka consumers pull messages providing true decoupling and backpressure handling for independent consumer applications.

Pushing messages seems to be the obvious choice for a real-time messaging system like JMS-based message brokers. However, push-based messaging has various drawbacks regarding decoupling and scalability.

JMS expects the broker to provide backpressure and implement a “pre-fetch” capability, but this is not mandatory. If it is used, the broker controls the backpressure, and the consuming application cannot influence it.

With Kafka, the consumer controls the backpressure. Each Kafka consumer consumes events in real-time, batch, or only on demand – in the way the particular consumer supports and can handle the data stream. This is an enormous advantage for many inflexible and non-elastic environments.
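A minimal Java sketch of this pull model: the consumer decides when to poll, how many records to fetch per poll (e.g., via max.poll.records), and when to commit offsets. Broker address, group ID, and topic name are assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PullConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "slow-consumer");           // hypothetical group
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");             // limit the batch size per poll
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sensor-events")); // hypothetical topic
            while (true) {
                // The consumer decides when to pull and how much to pull.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process at the consumer's own pace
                }
                consumer.commitSync(); // commit offsets only after successful processing
            }
        }
    }
}
```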

So while JMS has some kind of backpressure (the producer stops if the queue is full), in Kafka the consumer controls the backpressure. There is also no way to scale a producer with JMS (as there are no partitions in a JMS queue or topic).

JMS consumers can be scaled, but then you lose guaranteed ordering. Guaranteed ordering in JMS message brokers only works via a single producer, single consumer, and transaction.

5. Simple JMS API vs. powerful and complex Kafka API

TL;DR: The JMS API provides simple operations to produce and consume messages. Apache Kafka has a more granular API that brings additional power and complexity.

JMS vendors hide all the cool stuff in the implementation under the spec. You only get the 5% defined by the API (no control, built by the vendor) and have to build the rest yourself. On the other hand, Kafka exposes everything, even though most developers only need 5% of it.

In summary, be aware that JMS message brokers are built to send messages from a data source to one or more data sinks. Kafka is a data streaming platform that provides many more capabilities, features, event patterns, and processing options at a much larger scale. With that in mind, it is no surprise that the APIs are very different and differ in complexity.

If your use case only requires sending a few messages per second from A to B, JMS is the right choice and simple to use! If you need a streaming data hub at any scale, including data integration and data processing, only Kafka provides that.
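A short, hedged side-by-side sketch illustrates the difference in philosophy. The connection factory, queue, broker address, and topic names are assumptions; the JMS part uses the simplified JMS 2.0 API, while the Kafka part exposes configuration decisions to the developer.

```java
import java.util.Properties;
import javax.jms.ConnectionFactory;
import javax.jms.JMSContext;
import javax.jms.Queue;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SendComparison {

    // JMS 2.0 simplified API: essentially one line, broker specifics hidden behind the spec.
    static void sendWithJms(ConnectionFactory connectionFactory, Queue queue) {
        try (JMSContext context = connectionFactory.createContext()) {
            context.createProducer().send(queue, "Hello JMS");
        }
    }

    // Kafka producer: more knobs (acks, batching, serializers, ...) are exposed to the developer.
    static void sendWithKafka() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // durability vs. latency is your decision
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("hello-topic", "key", "Hello Kafka"));
        }
    }
}
```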

Asynchronous request-reply vs. data in motion

One of the most common wishes of JMS developers is a request-response function in Kafka. Note that this design pattern differs in messaging systems from an RPC (remote procedure call) as you know it from legacy tools like CORBA or web service standards like SOAP/WSDL or HTTP. Request-reply in message brokers is an asynchronous communication that leverages a correlation ID.

Asynchronous messaging to get events from a producer (say, a mobile app) to a consumer (say, a database) is a very traditional workflow. No matter if you do fire-and-forget or request-reply, you put data at rest for further processing. JMS supports request-reply out of the box. The API is very simple.

Data in motion with event streaming continuously processes data. The Kafka log is durable. The Kafka application maintains and queries the state in real-time or in batch. Data streaming is a paradigm shift for most developers and architects. The design patterns are very different. Don’t try to reimplement your JMS application within Kafka using the same pattern and API. That is likely to fail! That is an anti-pattern.

Request-reply is inefficient and can suffer from high latency depending on the use case. HTTP or, better, gRPC is suitable for some of these use cases. With Kafka, request-reply is replaced by the CQRS (Command and Query Responsibility Segregation) pattern for streaming data. CQRS is not possible with the JMS API, since JMS provides no state capabilities and lacks event sourcing support.
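To make the CQRS idea more concrete, here is a minimal Kafka Streams sketch (all topic, store, and application names are hypothetical): commands land in the durable log, the stream continuously materializes the current state, and clients query that state store instead of sending a blocking request and waiting for a reply.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class CqrsWithKafkaStreams {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-cqrs");        // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Command side: commands/events are appended to the durable Kafka log.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orderCommands = builder.stream("order-commands"); // hypothetical topic
        // Continuously materialize the current state per key into a queryable store.
        orderCommands.groupByKey()
                     .count(Materialized.as("orders-per-key"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Query side: read the materialized state instead of doing a blocking request-reply.
        // In practice, wait until the application is RUNNING and the store is queryable.
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType("orders-per-key", QueryableStoreTypes.keyValueStore()));
        Long eventsForOrder = store.get("order-42"); // hypothetical key
    }
}
```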

A Kafka example for the request-response pattern

CQRS is the better design pattern for many Kafka use cases. Nevertheless, the request-reply pattern can be implemented with Kafka, too, but differently. Trying to do it like in a JMS message broker (with temporary queues, etc.) will ultimately kill the Kafka cluster (because it works differently).

The Spring project shows how to do it better. The Spring Kafka template libraries have a great example of the request-reply pattern built with Kafka.

Check out “org.springframework.kafka.requestreply.ReplyingKafkaTemplate”. It creates request/reply applications using the Kafka API easily. The example is interesting since it implements the asynchronous request/reply, which is more complicated to write if you are using, for example, the JMS API. Another nice DZone article talks about synchronous request/reply using Spring Kafka templates.

The Spring documentation for Kafka templates has a lot of details about the request/reply pattern for Kafka. So if you are using Spring, the request/reply pattern is pretty simple to implement with Kafka. If you are not using Spring, you can learn how to do request-reply with Kafka in your framework of choice.
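For illustration, here is a hedged sketch of how the ReplyingKafkaTemplate is typically used. The request topic name is hypothetical, and the template itself is assumed to be configured as a Spring bean with a producer factory and a reply listener container.

```java
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.springframework.kafka.requestreply.ReplyingKafkaTemplate;
import org.springframework.kafka.requestreply.RequestReplyFuture;

public class KafkaRequestReplyClient {

    private final ReplyingKafkaTemplate<String, String, String> template;

    // The template is typically configured as a Spring bean with a producer factory
    // and a reply container listening on a dedicated reply topic.
    public KafkaRequestReplyClient(ReplyingKafkaTemplate<String, String, String> template) {
        this.template = template;
    }

    public String request(String payload) throws Exception {
        // Hypothetical request topic; correlation-id header and reply topic are handled by the template.
        ProducerRecord<String, String> record = new ProducerRecord<>("requests", payload);
        RequestReplyFuture<String, String, String> future = template.sendAndReceive(record);
        ConsumerRecord<String, String> reply = future.get(10, TimeUnit.SECONDS); // blocks until the reply arrives
        return reply.value();
    }
}
```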

6. Storage for durability vs. true decoupling

TL;DR: JMS message brokers use a storage system to provide high availability. Kafka’s storage system is much more advanced, enabling long-term storage, backpressure handling, and replayability of historical events.

Kafka storage is more than just the persistence feature you know from JMS

When I explain the Kafka storage system to experienced JMS developers, I almost always get the same response: “Our JMS message broker XYZ also has storage under the hood. I don’t see the benefit of using Kafka!”

JMS uses an ephemeral storage system, where messages are only persisted until they are processed. Long-term storage and replayability of messages are not a concept JMS was designed for.

The core Kafka principles of append-only logs, offsets, guaranteed ordering, retention time, compacted topics, and so on provide many additional benefits beyond the durability guarantees of a JMS. Backpressure handling, true decoupling between consumers, the replayability of historical events, and more are huge differentiators between JMS and Kafka.
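Replayability is easy to see in code. The following Java sketch (broker address and topic name are assumptions) re-reads a partition from the beginning of the retained log, something a JMS queue simply cannot offer once messages are acknowledged.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayHistoricalEvents {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("orders", 0); // hypothetical topic
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition)); // replay the retained history

            ConsumerRecords<String, String> history = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : history) {
                // reprocess old events, e.g., to bootstrap a new application or rebuild state
            }
        }
    }
}
```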

Check the Kafka docs for a deep dive into the Kafka storage system. I won’t even go into how Tiered Storage for Kafka changes the game even further by providing better scalability and cost-efficient long-term storage within the Kafka log.

7. Server-side data-processing with JMS vs. decoupled continuous stream processing with Kafka

TL;DR: JMS message brokers provide simple server-side event processing, like filtering or routing based on the message content. Kafka brokers are dumb; data processing is executed in decoupled applications/microservices.

Server-side JMS filtering and routing

Most JMS message brokers provide some features for server-side event processing. These features are handy for some workloads!

Just be careful that server-side processing usually comes with a cost. For instance:

  • JMS pre-filtering scalability issues: The broker has to handle many additional tasks. This can kill the broker in a hidden fashion.
  • JMS selectors (= routing) performance issues: They can kill 40-50% of the broker’s performance.

Again, sometimes the drawbacks are acceptable. Then server-side filtering and routing is great functionality.

Kafka – Dumb pipes and smart endpoints

Kafka intentionally does not provide server-side processing. The brokers are dumb. The processing happens at the smart endpoints. This is a very well-known design pattern: Dumb pipes and smart endpoints.

The drawback is that you need separate applications/microservices/data products to implement the logic. This is not a big issue in serverless environments (like using a ksqlDB process running in Confluent Cloud for data processing). It gets more complex in self-managed environments.
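As a concrete example of the “smart endpoint” side, here is a minimal Kafka Streams sketch (topic and application names are hypothetical) that does the filtering and routing a JMS selector would do on the broker, but in a separately deployed, independently scalable application.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FilterAtTheEndpointApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-filter");    // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments"); // hypothetical topic
        // Filtering/routing runs in this decoupled application, not on the broker.
        payments.filter((key, value) -> value.contains("\"currency\":\"EUR\""))
                .to("payments-eur");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```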

However, the massive benefit of this architecture is the true decoupling between applications/technologies/programming languages, separation of concerns between business units for building business logic and operations of infrastructure, and the much better scalability and elasticity.

Would I like to see a few server-side processing capabilities in Kafka, too? Yes, absolutely. Especially for small workloads, the performance and scalability impact should be acceptable! The risk, though, is that people would then misuse such features. The future will show whether Kafka gets there or not.

8. Complex operations vs. serverless cloud

TL;DR: Self-managed operations of scalable JMS message brokers or Kafka clusters are complex. Serverless offerings (should) take over the operations burden.

Operating a cluster is complex – no matter if JMS or Kafka

A basic JMS message broker is relatively easy to operate (including active/passive setups). However, this limits scalability and availability. The JMS API was designed to talk to a single broker or an active/passive pair for high availability. This concept covers a single application domain.

Anything beyond that (= clustering) is very complex with JMS message brokers. More advanced message broker clusters from commercial vendors are more powerful but much harder to operate.

Kafka is a powerful, distributed system. Therefore, operating a Kafka cluster is not easy by nature. Cloud-native tools like an operator for Kubernetes take over some burdens like rolling upgrades or handling fail-over.

Both JMS message brokers and Kafka clusters become more challenging to operate the more scale and reliability your SLAs demand. The JMS API is not specified for a central data hub (using a cluster). Kafka is intentionally built for the strategic enterprise architecture, not just for a single business application.

Fully managed serverless cloud for the rescue

As the JMS API was designed to talk to a single broker, it is hard to build a serverless cloud offering that provides scalability. Hence, in JMS cloud services, the consumer has to set up the routing and role-based access control for the specific brokers. Such a cloud offering is not serverless but cloud-washing! But there is no other option, as the JMS API was not designed for one big distributed cluster like Kafka.

In Kafka, the situation is different. As Kafka is a scalable distributed system, cloud providers can build cloud-native serverless offerings. Building such a fully managed infrastructure is still super hard. Hence, evaluate the product, not just the marketing slogans!

Every Kafka cloud service is marketed as “fully managed” or “serverless”, but most are NOT. Instead, most vendors just provision the infrastructure, let you operate the cluster, and leave the support risk with you. On the other hand, some fully managed Kafka offerings are super limited in functionality (like allowing only a very small number of partitions).

Some cloud vendors even exclude Kafka support from their Kafka cloud offerings. Insane, but true. Check the terms and conditions as part of your evaluation.

9. Java/JVM vs. any programming language

TL;DR: JMS focuses on the Java ecosystem for JVM programming languages. Kafka is independent of programming languages.

As the name JMS (= Java Message Service) says: JMS was officially written only for Java. Some broker vendors support their own APIs and clients. These are proprietary to that vendor. Almost all serious JMS projects I have seen in the past use Java code.

The Apache Kafka project also only provides a Java client. But vendors and the community provide language bindings for almost every programming language, plus a REST API for HTTP communication to produce/consume events to/from Kafka. For instance, check out the blog post “12 Programming Languages Walk into a Kafka Cluster” to see code examples in Java, Python, Go, .NET, Ruby, node.js, Groovy, etc.

The true decoupling of the Kafka backend enables very different client applications to speak with each other, no matter what programming languages one uses. This flexibility allows for building a proper domain-driven design (DDD) with a microservices architecture leveraging Kafka as the central nervous system.

10. Single JMS deployment vs. multi-region (including hybrid and multi-cloud) Kafka replication

TL;DR: The JMS API is a client specification for communication between the application and the broker. Kafka is a distributed system that enables various architectures for hybrid and multi-cloud use cases.

JMS is a client specification, while multi-data center replication is a broker function. I won’t go deep here and put it simply: JMS message brokers are not built for replication scenarios across regions, continents, or hybrid/multi-cloud environments.

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Various scenarios require multi-cluster Kafka solutions, each with specific requirements and trade-offs that need to be evaluated.

Kafka technologies like MirrorMaker (open source) or Confluent Cluster Linking (commercial) enable use cases such as disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka deployments.

I covered hybrid cloud architectures in various other blog posts. “Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure” is a great example.

Slide deck and video recording

I created a slide deck and video recording if you prefer learning or sharing that kind of material instead of a blog post:


JMS and Kafka solve distinct problems!

The ten comparison criteria show that JMS and Kafka are very different things. While both overlap (e.g., messaging, real-time, mission-critical workloads), they use very different technical capabilities, features, and architectures to support additional use cases.

In short, use a JMS broker for simple and low-volume messaging from A to B. Kafka is usually a real-time data hub between many data sources and data sinks. Many people call it the central real-time nervous system of the enterprise architecture.

The data integration and data processing capabilities of Kafka at any scale with true decoupling and event replayability are the major differences from JMS-based MQ systems.

However, especially in the serverless cloud, don’t fear Kafka being too powerful (and complex). Serverless Kafka projects often start very cheaply at a very low volume, with no operations burden. Then it can scale with your growing business without the need to re-architect the application.

Understand the technical differences between a JMS-based message broker and data streaming powered by Apache Kafka. Evaluate both options to find the right tool for the problem. Then do further detailed evaluations within the messaging or data streaming category. Every message broker is different, even though they are all JMS compliant. In the same way, all Kafka products and cloud services differ regarding features, support, and cost.

Do you use JMS-compliant message brokers? What are the use cases and limitations? When did you or do you plan to use Apache Kafka instead? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Comparison: JMS Message Queue vs. Apache Kafka appeared first on Kai Waehner.

Machine Learning and Data Science with Kafka in Healthcare https://www.kai-waehner.de/blog/2022/04/18/machine-learning-data-science-with-kafka-in-healthcare-pharma/ Mon, 18 Apr 2022 11:44:09 +0000 https://www.kai-waehner.de/?p=4446 IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part five: Machine Learning and Data Science. Examples include Recursion and Humana.

The post Machine Learning and Data Science with Kafka in Healthcare appeared first on Kai Waehner.

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part five: Machine Learning and Data Science. Examples include Recursion and Humana.

Machine Learning and Data Science with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains. Use cases often overlap. I tried to categorize a few real-world deployments into different technical scenarios and added a few real-world examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Machine Learning and Data Science with Data Streaming using Apache Kafka

The relationship between Apache Kafka and machine learning (ML) is getting more and more traction for data engineering at scale and robust model deployment with low latency.

The Kafka ecosystem helps in different ML use cases for model training, model serving, and model monitoring. The core of most ML projects requires reliable and scalable data engineering pipelines across

  • different technologies
  • communication paradigms (REST, gRPC, data streaming)
  • programming languages (like Python for the data scientist or Java/Go/C++ for the production engineer)
  • APIs
  • commercial products
  • SaaS offerings

Here is an architecture diagram that shows how Kafka helps in data science projects:

The beauty of Kafka is that it combines real-time data processing with extreme scalability and true decoupling between systems.

Tiered Storage adds cost-efficient storage of big data sets and replayability with guaranteed ordering.
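As a hedged illustration of the model-serving side, here is a minimal Java sketch of embedding a trained model into a Kafka Streams application for real-time scoring. The AnomalyModel interface, topic names, and application ID are hypothetical placeholders for whatever framework (e.g., TensorFlow) exports the trained model.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamingModelScoringApp {

    // Hypothetical wrapper around a trained model (e.g., exported from TensorFlow).
    interface AnomalyModel {
        double score(String sensorEventJson);
    }

    public static KafkaStreams build(AnomalyModel model) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "patient-anomaly-scoring"); // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // assumption
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> vitals = builder.stream("patient-vitals"); // hypothetical input topic
        // Score each incoming event in real time with the embedded model.
        vitals.mapValues(value -> String.valueOf(model.score(value)))
              .to("anomaly-scores");                                       // hypothetical output topic

        return new KafkaStreams(builder.build(), props);
    }
}
```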

I’ve written about this relationship between Kafka and Machine Learning in various articles:

Let’s look at a few real-world deployments for Apache Kafka and Machine Learning in the healthcare sector.

Humana – Real-Time Interoperability at the Point of Care

Humana Inc. is a for-profit American health insurance company. They leverage data streaming with Apache Kafka to improve real-time interoperability at the point of care.

The interoperability platform enables the transition from an insurance company with elements of health to truly a health company with elements of insurance.

Their core principles include:

  • Consumer-centric
  • Health plan agnostic
  • Provider agnostic
  • Cloud resilient
  • Elastic scale
  • Event-driven and real-time

A critical characteristic is inter-organization data sharing (known as “data exchange/data sharing”).

Humana’s use cases include real-time updates of health information, for instance:

  • connecting health care providers to pharmacies
  • reducing pre-authorizations from 20-30 minutes to 1 minute
  • real-time home healthcare assistant communication

The Humana interoperability platform combines data streaming (= the Kafka ecosystem) with artificial intelligence and machine learning (= IBM Watson) to correlate data, train analytic models, and act on new events in real-time.

Humana’s data journey is described in this diagram from their Kafka Summit talk:

Real-Time Healthcare Insurance at Humana with Apache Kafka Data Streaming

Learn more details about Humana’s use cases and architecture in the keynote of another Kafka Summit session.

Recursion – Industrial Revolution of Drug Discovery with Kafka and Deep Learning

Recursion is a clinical-stage biotechnology company that built the “industrial revolution of drug discovery“. They decode biology by integrating technological innovations across biology, chemistry, automation, machine learning, and engineering to industrialize drug discovery.

Industrial pharma revolution - accelerate drug discovery at recursion

Kafka-powered data streaming speeds up the pharma processes significantly. Recursion has already made significant strides in accelerating drug discovery, with over 30 disease models in discovery, another nine in preclinical development, and two in clinical trials.

With serverless Confluent Cloud and the new data streaming approach, the company has built a platform that makes it possible to screen much larger experiments, with thousands of compounds against hundreds of disease models, in minutes and at lower cost than alternative discovery approaches.

From a technical perspective, Recursion finds drug treatments by processing biological images. A massively parallel system combines experimental biology, artificial intelligence, automation, and real-time data streaming:

Apache Kafka and Machine Learning at Recursion for Drug Discovery in Pharma

Recursion went from ‘drug discovery in manual and slow, not scalable, bursty BATCH MODE’ to ‘drug discovery in automated, scalable, reliable REAL-TIME MODE’.

Recursion leverages Dagger, an event-driven workflow and orchestration library for Kafka Streams that enables engineers to orchestrate services by defining workloads as high-level data structures. Dagger combines Kafka topics and schemas with external tasks for actions completed outside of the Kafka Streams applications.

Drug Discovery in automated, scalable, reliable real time Mode

In the meantime, Recursion did not just migrate from manual batch workloads to Kafka but also migrated to serverless Kafka, leveraging Confluent Cloud to focus its resources on business problems instead of infrastructure operations.

Machine Learning and Data Science with Kafka for Intelligent Healthcare Applications

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka ecosystem for machine learning infrastructures. Real-world deployments from Humana and Recursion showed how enterprises successfully deploy Kafka together with machine learning frameworks like TensorFlow for their use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Machine Learning and Data Science with Kafka in Healthcare appeared first on Kai Waehner.

Kafka for Real-Time Replication between Edge and Hybrid Cloud https://www.kai-waehner.de/blog/2022/01/26/kafka-cluster-linking-for-hybrid-replication-between-edge-cloud/ Wed, 26 Jan 2022 12:45:05 +0000 https://www.kai-waehner.de/?p=4131 Not all workloads should go to the cloud! Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration. This blog post explores hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell edge hardware and serverless Confluent Cloud.

The post Kafka for Real-Time Replication between Edge and Hybrid Cloud appeared first on Kai Waehner.

Not all workloads should go to the cloud! Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration. This blog post explores architectures and design patterns for software and hardware considerations to deploy hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell edge hardware and serverless Confluent Cloud.

Real-Time Edge Computing and Hybrid Cloud with Apache Kafka Confluent and Hivecell

Not every workload should go into the cloud

Almost every company has a cloud-first strategy in the meantime. Nevertheless, not all workloads should be deployed in the public cloud. A few reasons why IT applications still run at the edge or in a local data center:

  • Cost-efficiency: The more data produced at the edge, the more costly it is to transfer everything to the cloud. Such a data transfer often makes no sense for high volumes of raw sensor and telemetry data.
  • Low latency: Some use cases require data processing and correlation in real-time in milliseconds. Communication with remote locations increases the response time significantly.
  • Bad, unstable internet connection: Some environments do not provide good connectivity to the cloud or are entirely disconnected all the time or for some time of the day.
  • Cybersecurity with air-gapped environments: The disconnected Edge is common in safety-critical environments. Controlled data replication is only possible via unidirectional hardware gateways or manual human copy tasks within the site.

Here is a great recent example of why not all workloads should go to the cloud: an AWS outage created enormous issues for visitors to Disney World because the mobile app features run online in the cloud. Business continuity is not possible if the connection to the cloud is offline:

AWS Outage at Disney World

The Edge is…

To be clear: The term ‘edge’ needs to be defined at the beginning of every conversation. I define the edge as having the following characteristics and needs:

  • Edge is NOT a data center
  • Offline business continuity
  • Often 100+ locations
  • Low-footprint and low-touch
  • Hybrid integration

Hybrid cloud for Kafka is the norm; not an exception

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Several scenarios require multi-cluster solutions. Real-world examples have different requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments, and global Kafka.

Global Event Streaming with Apache Kafka Confluent Multi Region Clusters Replication and Confluent Cloud

I posted about this in the past. Check out “architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments“.

Apache Kafka at the Edge

From a Kafka perspective, the edge can mean two things:

  • Kafka clients at the edge connecting directly to the Kafka cluster in a remote data center or public cloud, connecting via a native client (Java, C++, Python, etc.) or a proxy (MQTT Proxy, HTTP / REST Proxy)
  • Kafka clients AND the Kafka broker(s) deployed at the edge, not just the client applications

Both alternatives are acceptable and have their trade-offs. This post is about the whole Kafka infrastructure at the edge (potentially replicating to another remote Kafka cluster via MirrorMaker, Confluent Replicator, or Cluster Linking).

Check out my Infrastructure Checklist for Apache Kafka at the edge for more details.

I also covered various Apache Kafka use cases for the edge and hybrid cloud across industries like manufacturing, transportation, energy, mining, retail, entertainment, etc.

Hardware for edge computing and analytics

Edge hardware has some specific requirements to be successful in a project:

  • No special equipment for power, air conditioning, or networking
  • No technicians are required on-site to install, configure, or maintain the hardware and software
  • Start with the smallest footprint possible to show ROI
  • Easily add more compute power as workload expands
  • Deploy and operate simple or distributed software for containers, middleware, event streaming, business applications, and machine learning
  • Monitor, manage and upgrade centrally via fleet management and automation, even when behind a firewall

Devon Energy: Edge and Hybrid Cloud with Kafka, Confluent, and Hivecell

Devon Energy (formerly named WPX Energy) is a company in the oil & gas industry. The digital transformation creates many opportunities to improve processes and reduce costs in this vertical. WPX leverages Confluent Platform on Hivecell edge hardware to realize edge processing and replication to the cloud in real-time at scale.

The solution is designed for real-time decision-making and future closed-loop control optimization. Devon Energy conducts edge stream processing to enable real-time decision-making at the well sites. They also replicate business-relevant data streams produced by machine learning models and analytical preprocessed data at the well site to the cloud, enabling Devon Energy to harness the full power of its real-time events:

Devon Energy Apache Kafka and Confluent at the Edge with Hivecell and Cluster Linking to the Cloud

A few interesting notes about this hybrid Edge to cloud deployment:

  • Improved drilling and well completion operations
  • Edge stream processing / analytics + closed-loop control ready
  • Vendor agnostic (pumping, wireline, coil, offset wells, drilling operations, producing wells)
  • Replication to the cloud in real-time at scale
  • Cloud agnostic (AWS, GCP, Azure)

Live Demo – How to deploy a Kafka Cluster in production on your desk or anywhere

Confluent and Hivecell delivered on the promise of bringing a piece of Confluent Cloud right to your desk, providing managed Kafka on a cloud-native Kubernetes cluster at the edge. For the first time, Kafka deployments run at scale at the edge, enabling local Kafka clusters at oil drilling sites, on ships, in factories, or in quick-service restaurants.

In this webinar, we showed how it works during a live demo – where we deploy an edge Confluent cluster, stream edge data, and synchronize it with Confluent Cloud across regions and even continents:

Hybrid Kafka Edge to Cloud Replication with Confluent and Hivecell

Dominik and I had our Hivecell clusters at home in Texas, USA, and Bavaria, Germany, respectively. We synchronized events across continents to a central Confluent Cloud cluster and simulated errors and cloud-native self-healing by “killing” one of my Hivecell nodes in Germany.

The webinar covered the following topics:

  • Edge computing as the next significant paradigm shift in IT
  • Apache Kafka and Confluent use cases at the edge in IoT environments
  • An easy way of setting up Kafka clusters wherever you need them, including fleet management and error handling
  • Hands-on examples of Kafka cluster deployment and data synchronization

Slides and on-demand video recording

Here are the slides:

And the on-demand video recording:

Video Recording - Kafka, Confluent and Hivecell at the Edge and Hybrid Cloud

Real-time data streaming everywhere requires hybrid edge-to-cloud data streaming!

A cloud-first strategy makes sense. Elastic scaling, agile development, and cost-efficient infrastructure allow innovation. However, not all workloads should go to the cloud for latency, cost, or security reasons.

Apache Kafka can be deployed everywhere. Essential for most projects is the successful deployment and management at the edge and the uni- or bidirectional synchronization in real-time between the edge and the cloud. This post showed how Confluent Cloud, Kafka at the Edge on Hivecell edge hardware, and Cluster Linking enable hybrid streaming data exchanges.

How do you use Apache Kafka? Do you deploy in the public cloud, in your data center, or at the edge outside a data center? How do you process and replicate the data streams? What other technologies do you combine with Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka for Real-Time Replication between Edge and Hybrid Cloud appeared first on Kai Waehner.
