Data Streaming Meets the SAP Ecosystem and Databricks – Insights from SAP Sapphire Madrid
https://www.kai-waehner.de/blog/2025/05/28/data-streaming-meets-the-sap-ecosystem-and-databricks-insights-from-sap-sapphire-madrid/
Wed, 28 May 2025 05:17:50 +0000

SAP Sapphire 2025 in Madrid brought together global SAP users, partners, and technology leaders to showcase the future of enterprise data strategy. Key themes included SAP’s Business Data Cloud (BDC) vision, Joule for Agentic AI, and the deepening SAP-Databricks partnership. A major topic throughout the event was the increasing need for real-time integration across SAP and non-SAP systems—highlighting the critical role of event-driven architectures and data streaming platforms like Confluent. This blog shares insights on how data streaming enhances SAP ecosystems, supports AI initiatives, and enables industry-specific use cases across transactional and analytical domains.

I had the opportunity to attend SAP Sapphire 2025 in Madrid—an impressive gathering of SAP customers, partners, and technology leaders from around the world. It was a massive event, bringing the global SAP community together to explore the company’s future direction, innovations, and growing ecosystem.

A key highlight was SAP’s deepening integration of Databricks as an OEM partner for AI and analytics within the SAP Business Data Cloud—showing how the ecosystem is evolving toward more open, composable architectures.

At the same time, conversations around Confluent and data streaming highlighted the critical role real-time integration plays in connecting SAP systems (including ERP, MES, DataSphere, Databricks, etc.) with the rest of the enterprise. As always, it was a great place to learn, connect, and discuss where enterprise data architecture is heading—and how technologies like data streaming are enabling that transformation.

Data Streaming with Confluent Meets SAP and Databricks for Agentic AI at Sapphire in Madrid

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter). And make sure to download my free book about data streaming use cases, focusing on industry scenarios, success stories, and business value.

SAP’s Vision: Business Data Cloud, Joule, and Strategic Ecosystem Moves

SAP presented a broad and ambitious strategy centered around the SAP Business Data Cloud (BDC), SAP Joule (including its Agentic AI initiative), and strategic collaborations like SAP Databricks, SAP DataSphere, and integrations across multiple cloud platforms. The vision is clear: SAP wants to connect business processes with modern analytics, AI, and automation.

SAP ERP with Business Technology Platform BTP and Joule for Agentic AI in the Cloud
Source: SAP

For those of us working in data streaming and integration, these developments present a major opportunity. Most customers I meet globally use SAP ERP or other SAP products like MES, SuccessFactors, or Ariba. The relevance of real-time data streaming in this space is undeniable—and it’s growing.

Building the Bridge: Event-Driven Architecture + SAP

One of the most exciting things about SAP Sapphire is seeing how event-driven architecture is becoming more relevant—even if the conversations don’t start with “Apache Kafka” or “Data Streaming.” In the SAP ecosystem, discussions often focus on business outcomes first, then architecture second. And that’s exactly how it should be.

Many SAP customers are moving toward hybrid cloud environments, where data lives in SAP systems, Salesforce, Workday, ServiceNow, and more. There’s no longer a belief in a single, unified data model. Master Data Management (MDM) as a one-size-fits-all solution has lost its appeal, simply because the real world is more complex.

This is where data streaming with Apache Kafka, Apache Flink, etc. fits in perfectly. Event streaming enables organizations to connect their SAP solutions with the rest of the enterprise—for real-time integration across operational systems, analytics platforms, AI engines, and more. It supports transactional and analytical use cases equally well and can be tailored to each industry’s needs.

Data Streaming with Confluent as Integration Middleware for SAP ERP DataSphere Joule Databricks with Apache Kafka

In the SAP ecosystem, customers typically don’t look for open source frameworks to assemble their own solutions—they look for a reliable, enterprise-grade platform that just works. That’s why Confluent’s data streaming platform is an excellent fit: it combines the power of Kafka and Flink with the scalability, security, governance, and cloud-native capabilities enterprises expect.

SAP, Databricks, and Confluent – A Triangular Partnership

At the event, I had some great conversations—often literally sitting between leaders from SAP and Databricks. Watching how these two players are evolving—and where Confluent fits into the picture—was eye-opening.

SAP and Databricks are working closely together, especially with the SAP Databricks OEM offering that integrates Databricks into the SAP Business Data Cloud as an embedded AI and analytics engine. SAP DataSphere also plays a central role here, serving as a gateway into SAP’s structured data.

Meanwhile, Databricks is expanding into the operational domain, not just the analytical lakehouse. After acquiring Neon (a Postgres-compatible, cloud-native database), Databricks is expected to announce its own additional transactional OLTP offering soon. This shows how rapidly they’re moving beyond batch analytics into the world of operational workloads—areas where Kafka and event streaming have traditionally provided the backbone.

Enterprise Architecture with Confluent and SAP and Databricks for Analytics and AI

This trend opens up a significant opportunity for data streaming platforms like Confluent to play a central role in modern SAP data architectures. As platforms like Databricks expand their capabilities, the demand for real-time, multi-system integration and cross-platform data sharing continues to grow.

Confluent is uniquely positioned to meet this need—offering not just data movement, but also the ability to process, govern, and enrich data in motion using tools like Apache Flink, plus a broad ecosystem of connectors covering transactional systems such as SAP ERP, Oracle databases, and IBM mainframes, as well as cloud services like Snowflake, ServiceNow, and Salesforce.

Data Products, Not Just Pipelines

The term “data product” was mentioned in nearly every conversation—whether from the SAP angle (business semantics and ownership), Databricks (analytics-first), or Confluent (independent, system-agnostic, streaming-native). The key message? Everyone wants real-time, reusable, discoverable data products.

Data Product - The Domain Driven Microservice for Data

This is where an event-driven architecture powered by a data streaming platform shines: Data Streaming connects everything and distributes data to both operational and analytical systems, with governance, durability, and flexibility at the core.

Confluent’s data streaming platform enables the creation of data products from a wide range of enterprise systems, complementing the SAP data products being developed within the SAP Business Data Cloud. The strength of the partnership lies in the ability to combine these assets—bringing together SAP-native data products with real-time, event-driven data products built from non-SAP systems connected through Confluent. This integration creates a unified, scalable foundation for both operational and analytical use cases across the enterprise.

Industry-Specific Use Cases to Explore the Business Value of SAP and Data Streaming

One major takeaway: in the SAP ecosystem, generic messaging around cutting-edge technologies such as Apache Kafka does not work. Success comes from being well-prepared—knowing which SAP systems are involved (ECC, S/4HANA, on-prem, or cloud) and what role they play in the customer’s architecture. The conversations must be use case-driven, often tailored to industries like manufacturing, retail, logistics, or the public sector.

This level of specificity is new to many people working in the technical world of Kafka, Flink, and data streaming. Developers and architects often approach integration from a tool- or framework-centric perspective. However, SAP customers expect business-aligned solutions that address concrete pain points in their domain—whether it’s real-time order tracking in logistics, production analytics in manufacturing, or spend transparency in the public sector.

Understanding the context of SAP’s role in the business process, along with industry regulations, workflows, and legacy system constraints, is key to having meaningful conversations. For the data streaming community, this is a shift in mindset—from building pipelines to solving business problems—and it represents a major opportunity to bring strategic value to enterprise customers.

You are lucky: I just published a free ebook about data streaming use cases focusing on industry scenarios and business value: “The Ultimate Data Streaming Guide”.

Looking Forward: SAP, Data Streaming, AI, and Open Table Formats

Another theme to watch: data lake and format standardization. All cloud providers and data vendors like Databricks, Confluent or Snowflake are investing heavily in supporting open table formats like Apache Iceberg (alongside Delta Lake at Databricks) to standardize analytical integrations and reduce storage costs significantly.

SAP’s investment in Agentic AI through SAP Joule reflects a broader trend across the enterprise software landscape, with vendors like Salesforce, ServiceNow, and others embedding intelligent agents into their platforms. This creates a significant opportunity for Confluent to serve as the streaming backbone—enabling real-time coordination, integration, and decision-making across these diverse, distributed systems.

An event-driven architecture powered by data streaming is crucial for the success of Agentic AI with SAP Joule, Databricks AI agents, and other operational systems that need to be integrated into the business processes. The strategic partnership between Confluent and Databricks makes it even easier to implement end-to-end AI pipelines across the operational and analytical estates.

SAP Sapphire Madrid was a valuable reminder that data streaming is no longer a niche technology—it’s a foundation for digital transformation. Whether it’s SAP ERP, Databricks AI, or new cloud-native operational systems, a Data Streaming Platform connects them all in real time to enable new business models, better customer experiences, and operational agility.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter). And make sure to download my free book about data streaming use cases, focusing on industry scenarios, success stories, and business value.

Databricks and Confluent in the World of Enterprise Software (with SAP as Example)
https://www.kai-waehner.de/blog/2025/05/12/databricks-and-confluent-in-the-world-of-enterprise-software-with-sap-as-example/
Mon, 12 May 2025 11:26:54 +0000

Enterprise data lives in complex ecosystems—SAP, Oracle, Salesforce, ServiceNow, IBM Mainframes, and more. This article explores how Confluent and Databricks integrate with SAP to bridge operational and analytical workloads in real time. It outlines architectural patterns, trade-offs, and use cases like supply chain optimization, predictive maintenance, and financial reporting, showing how modern data streaming unlocks agility, reuse, and AI-readiness across even the most SAP-centric environments.

Modern enterprises rely heavily on operational systems like SAP ERP, Oracle, Salesforce, ServiceNow and mainframes to power critical business processes. But unlocking real-time insights and enabling AI at scale requires bridging these systems with modern analytics platforms like Databricks. This blog explores how Confluent’s data streaming platform enables seamless integration between SAP, Databricks, and other systems to support real-time decision-making, AI-driven automation, and agentic AI use cases. It shows how Confluent delivers the real-time backbone needed to build event-driven, future-proof enterprise architectures—supporting everything from inventory optimization and supply chain intelligence to embedded copilots and autonomous agents.

Enterprise Application Integration with Confluent and Databricks for Oracle SAP Salesforce ServiceNow et al

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Future articles will explore how these platforms affect data use in businesses. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter). And download my free book about data streaming use cases, including technical architectures and their relation to other operational and analytical platforms like SAP and Databricks.

Most Enterprise Data Is Operational

Enterprise software systems generate a constant stream of operational data across a wide range of domains. This includes orders and inventory from SAP ERP systems, often extended with real-time production data from SAP MES. Oracle databases capture transactional data critical to core business operations, while MongoDB contributes operational data—frequently used as a CDC source or, in some cases, as a sink for analytical queries. Customer interactions are tracked in platforms like Salesforce CRM, and financial or account-related events often originate from IBM mainframes. 

Together, these systems form the backbone of enterprise data, requiring seamless integration for real-time intelligence and business agility. This data is often not immediately available for analytics or AI unless it’s integrated into downstream systems.

Confluent is built to ingest and process this kind of operational data in real time. Databricks can then consume it for AI and machine learning, dashboards, or reports. Together, SAP, Confluent and Databricks create a real-time architecture for enterprise decision-making.
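
To make this concrete, here is a minimal sketch of how an operational event, for example a sales order emitted by an ERP integration layer, could be published to a Kafka topic with the confluent-kafka Python client. The broker address, topic name, and payload fields are illustrative assumptions, not a prescribed SAP integration pattern.

```python
import json
from confluent_kafka import Producer

# Illustrative connection settings; real deployments would use Confluent Cloud
# credentials plus Schema Registry for governed schemas.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")

# Hypothetical order event as it might be emitted by an ERP integration layer.
order_event = {
    "order_id": "4711",
    "material": "M-100",
    "quantity": 25,
    "plant": "DE01",
    "event_type": "ORDER_CREATED",
}

producer.produce(
    topic="erp.orders",                      # assumed topic name
    key=order_event["order_id"],
    value=json.dumps(order_event),
    on_delivery=delivery_report,
)
producer.flush()
```

Downstream consumers such as Flink jobs, Databricks ingestion, or other operational services can then subscribe to the same topic independently, without the producer knowing about them.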

SAP Product Landscape for Operational and Analytical Workloads

SAP plays a foundational role in the enterprise data landscape—not just as a source of business data, but as the system of record for core operational processes across finance, supply chain, HR, and manufacturing.

At a high level, the SAP product portfolio today has three categories: SAP Business AI, SAP Business Data Cloud (BDC), and SAP Business Applications powered by SAP Business Technology Platform (BTP).

SAP Product Portfolio Categories
Source: SAP

To support both operational and analytical needs, SAP offers a portfolio of platforms and tools, while also partnering with best-in-class technologies like Databricks and Confluent.

Operational Workloads (Transactional Systems):

  • SAP S/4HANA – Modern ERP for core business operations
  • SAP ECC – Legacy ERP platform still widely deployed
  • SAP CRM / SCM / SRM – Domain-specific business systems
  • SAP Business One / Business ByDesign – ERP solutions for mid-market and subsidiaries

Analytical Workloads (Data & Analytics Platforms):

  • SAP Datasphere – Unified data fabric to integrate, catalog, and govern SAP and non-SAP data
  • SAP Analytics Cloud (SAC) – Visualization, reporting, and predictive analytics
  • SAP BW/4HANA – Data warehousing and modeling for SAP-centric analytics

SAP Business Data Cloud (BDC)

SAP Business Data Cloud (BDC) is a strategic initiative within SAP Business Technology Platform (BTP) that brings together SAP’s data and analytics capabilities into a unified cloud-native experience. It includes:

  • SAP Datasphere as the data fabric layer, enabling seamless integration of SAP and third-party data
  • SAP Analytics Cloud (SAC) for consuming governed data via dashboards and reports
  • SAP’s partnership with Databricks to allow SAP data to be analyzed alongside non-SAP sources in a lakehouse architecture
  • Real-time integration scenarios enabled through Confluent and Apache Kafka, bringing operational data in motion directly into SAP and Databricks environments

Together, this ecosystem supports real-time, AI-powered, and governed analytics across operational and analytical workloads—making SAP data more accessible, trustworthy, and actionable within modern cloud data architectures.

SAP Databricks OEM: Limited Scope, Full Control by SAP

SAP recently announced an OEM partnership with Databricks, embedding parts of Databricks’ serverless infrastructure into the SAP ecosystem. While this move enables tighter integration and simplified access to AI workloads within SAP, it comes with significant trade-offs. The OEM model is narrowly scoped, optimized primarily for ML and GenAI scenarios on SAP data, and lacks the openness and flexibility of native Databricks.

This integration is not intended for full-scale data engineering. Core capabilities such as workflows, streaming, Delta Live Tables, and external data connections (e.g., Snowflake, S3, MS SQL) are missing. The architecture is based on data at rest and does not embrace event-driven patterns. Compute options are limited to serverless only, with no infrastructure control. Pricing is complex and opaque, with customers often needing to license Databricks separately to unlock full capabilities.

Critically, SAP controls the entire data integration layer through its BDC Data Products, reinforcing a vendor lock-in model. While this may benefit SAP-centric organizations focused on embedded AI, it restricts broader interoperability and long-term architectural flexibility. In contrast, native Databricks, i.e., outside of SAP, offers a fully open, scalable platform with rich data engineering features across diverse environments.

Whichever Databricks option you prefer, this is where Confluent adds value—offering a truly event-driven, decoupled architecture that complements both SAP Datasphere and Databricks, whether used within or outside the SAP OEM framework.

Confluent and SAP Integration

Confluent provides native and third-party connectors to integrate with SAP systems to enable continuous, low-latency data flow across business applications.

SAP ERP Confluent Data Streaming Integration Access Patterns
Source: Confluent

This powers modern, event-driven use cases that go beyond traditional batch-based integrations:

  • Low-latency access to SAP transactional data
  • Integration with other operational source systems like Salesforce, Oracle, IBM Mainframe, MongoDB, or IoT platforms
  • Synchronization between SAP DataSphere and other data warehouse and analytics platforms such as Snowflake, Google BigQuery or Databricks 
  • Decoupling of applications for modular architecture
  • Data consistency across real-time, batch and request-response APIs
  • Hybrid integration across any edge, on-premise or multi-cloud environments

SAP Datasphere and Confluent

To expand its role in the modern data stack, SAP introduced SAP Datasphere—a cloud-native data management solution designed to extend SAP’s reach into analytics and data integration. Datasphere aims to simplify access to SAP and non-SAP data across hybrid environments.

SAP Datasphere simplifies data access within the SAP ecosystem, but it has key drawbacks when compared to open platforms like Databricks, Snowflake, or Google BigQuery:

  • Closed Ecosystem: Optimized for SAP, but lacks flexibility for non-SAP integrations.
  • No Event Streaming: Focused on data at rest, with limited support for real-time processing or streaming architectures.
  • No Native Stream Processing: Relies on batch methods, adding latency and complexity for hybrid or real-time use cases.

Confluent alleviates these drawbacks and supports this strategy through bi-directional integration with SAP Datasphere. This enables real-time streaming of SAP data into Datasphere and back out to operational or analytical consumers via Apache Kafka. It allows organizations to enrich SAP data, apply real-time processing, and ensure it reaches the right systems in the right format—without waiting for overnight batch jobs or rigid ETL pipelines.
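
As a rough illustration of enriching SAP data in motion, the sketch below consumes SAP-originated change events from a Kafka topic and republishes an enriched version to a downstream topic. Topic names, fields, and the in-memory lookup table are assumptions; a production pipeline would typically use connectors, governed schemas from Schema Registry, and Flink rather than a hand-rolled loop.

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "sap-enrichment-demo",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["sap.material.changes"])   # assumed source topic

# Hypothetical reference data; in practice this could be a Flink lookup join
# against another topic or a database changelog.
plant_regions = {"DE01": "EMEA", "US10": "AMER"}

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        event["region"] = plant_regions.get(event.get("plant"), "UNKNOWN")
        producer.produce("sap.material.changes.enriched", json.dumps(event))
        producer.poll(0)  # serve delivery callbacks
finally:
    consumer.close()
    producer.flush()
```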

Confluent for Agentic AI with SAP Joule and Databricks

SAP is laying the foundation for agentic AI architectures with a vision centered around Joule—its generative AI copilot—and a tightly integrated data stack that includes SAP Databricks (via OEM), SAP Business Data Cloud (BDC), and a unified knowledge graph. On top of this foundation, SAP is building specialized AI agents for use cases such as customer 360, creditworthiness analysis, supply chain intelligence, and more.

SAP ERP with Business Technology Platform BTP and Joule for Agentic AI in the Cloud
Source: SAP

The architecture combines:

  • SAP Joule as the interface layer for generative insights and decision support
  • SAP’s foundational models and domain-specific knowledge graph
  • SAP BDC and SAP Databricks as the data and ML/AI backbone
  • Data from both SAP systems (ERP, CRM, HR, logistics) and non-SAP systems (e.g., clickstream, IoT, partner data, social media), integrated through the partnership with Confluent

But here’s the catch: what happens when agents need to communicate with one another to deliver a workflow? Such agentic systems require continuous, contextual, and event-driven data exchange—not just point-to-point API calls and nightly batch jobs.

This is where Confluent’s data streaming platform comes in as critical infrastructure.

Agentic AI with Apache Kafka as Event Broker

Confluent provides the real-time data streaming platform that connects the operational world of SAP with the analytical and AI-driven world of Databricks, enabling the continuous movement, enrichment, and sharing of data across all layers of the stack.

Agentic AI with Confluent as Event Broker for Databricks SAP and Oracle

The above is a conceptual view of the architecture. The AI agents on the left side could be built with SAP Joule, Databricks, or any “outside” GenAI framework.

The data streaming platform helps connect the AI agents with the rest of the enterprise architecture—within SAP and Databricks, but also beyond:

  • Real-time data integration from non-SAP systems (e.g., mobile apps, IoT devices, mainframes, web logs) into SAP and Databricks
  • True decoupling of services and agents via an event-driven architecture (EDA), replacing brittle RPC or point-to-point API calls
  • Event replay and auditability—critical for traceable AI systems operating in regulated environments
  • Streaming pipelines for feature engineering and inference: stream-based model triggering with low-latency SLAs
  • Support for bi-directional flows: e.g., operational triggers in SAP can be enriched by AI agents running in Databricks and pushed back into SAP via Kafka events

Without Confluent, SAP’s agentic architecture risks becoming a patchwork of stateless services bound by fragile REST endpoints—lacking the real-time responsiveness, observability, and scalability required to truly support next-generation AI orchestration.

Confluent turns the SAP + Databricks vision into a living, breathing ecosystem—where context flows continuously, agents act autonomously, and enterprises can build future-proof AI systems that scale.
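
The following sketch illustrates the event-driven coordination pattern described above: one agent emits a task event to a Kafka topic, and a specialist agent consumes it and publishes its result to a reply topic. Topic names, the task type, and the decision logic are hypothetical; real agents built with SAP Joule, Databricks, or another framework would sit behind these topics.

```python
import json
import uuid
from confluent_kafka import Producer, Consumer

BOOTSTRAP = {"bootstrap.servers": "localhost:9092"}

def request_credit_check(customer_id: str) -> None:
    """An orchestrating agent emits a task event instead of calling a REST API."""
    producer = Producer(BOOTSTRAP)
    event = {
        "task_id": str(uuid.uuid4()),
        "task_type": "credit_check",          # hypothetical task type
        "customer_id": customer_id,
    }
    producer.produce("agent.tasks", key=customer_id, value=json.dumps(event))
    producer.flush()

def run_credit_agent() -> None:
    """A specialist agent consumes tasks and publishes results to a reply topic."""
    consumer = Consumer({**BOOTSTRAP, "group.id": "credit-agent",
                         "auto.offset.reset": "earliest"})
    producer = Producer(BOOTSTRAP)
    consumer.subscribe(["agent.tasks"])
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        task = json.loads(msg.value())
        if task.get("task_type") != "credit_check":
            continue
        # Placeholder decision logic; a real agent would call a model or rules engine.
        result = {"task_id": task["task_id"],
                  "customer_id": task["customer_id"],
                  "approved": True}
        producer.produce("agent.results", key=task["customer_id"], value=json.dumps(result))
        producer.poll(0)
```

Because both agents only know the topics, they can be replaced, scaled, audited, or replayed independently, which is exactly the decoupling argument made above.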

Data Streaming Use Cases Across SAP Product Suites

With Confluent, organizations can support a wide range of use cases across SAP product suites, including:

  1. Real-Time Inventory Visibility: Live updates of stock levels across warehouses and stores by streaming material movements from SAP ERP and SAP EWM, enabling faster order fulfillment and reduced stockouts.
  2. Dynamic Pricing and Promotions: Stream sales orders and product availability in real time to trigger pricing adjustments or dynamic discounting via integration with SAP ERP and external commerce platforms.
  3. AI-Powered Supply Chain Optimization: Combine data from SAP ERP, SAP Ariba, and external logistics platforms to power ML models that predict delays, optimize routes, and automate replenishment.
  4. Shop Floor Event Processing: Stream sensor and machine data alongside order data from SAP MES, enabling real-time production monitoring, alerting, and throughput optimization.
  5. Employee Lifecycle Automation: Stream employee events (e.g., onboarding, role changes) from SAP SuccessFactors to downstream IT systems (e.g., Active Directory, badge systems), improving HR operations and compliance.
  6. Order-to-Cash Acceleration: Connect order intake (via web portals or Salesforce) to SAP ERP in real time, enabling faster order validation, invoicing, and cash flow.
  7. Procure-to-Pay Automation: Integrate procurement events from SAP Ariba and supplier portals with ERP and financial systems to streamline approvals and monitor supplier performance continuously.
  8. Customer 360 and CRM Synchronization: Synchronize customer master data and transactions between SAP ERP, SAP CX, and third-party CRMs like Salesforce to enable unified customer views.
  9. Real-Time Financial Reporting: Stream financial transactions from SAP S/4HANA into cloud-based lakehouses or BI tools for near-instant reporting and compliance dashboards.
  10. Cross-System Data Consistency: Ensure consistent master data and business events across SAP and non-SAP environments by treating SAP as a real-time event source—not just a system of record.

Example Use Case and Architecture with SAP, Databricks and Confluent

Consider a manufacturing company using SAP ERP for inventory management and Databricks for predictive maintenance. The combination of SAP Datasphere and Confluent enables seamless data integration from SAP systems, while the addition of Databricks supports advanced AI/ML applications—turning operational data into real-time, predictive insights.

With Confluent as the real-time backbone:

  • Machine telemetry (via MQTT or OPC-UA) and ERP events (e.g., stock levels, work orders) are streamed in real time.
  • Apache Flink enriches and filters the event streams—adding context like equipment metadata or location.
  • Tableflow publishes clean, structured data to Databricks as Delta tables for analytics and ML processing.
  • A predictive model hosted in Databricks detects potential equipment failure before it happens; a Flink application calls the remote model with low latency.
  • The resulting prediction is streamed back to Kafka, triggering an automated work order in SAP via event integration.

Enterprise Architecture with Confluent and SAP and Databricks for Analytics and AI

This bi-directional, event-driven pattern illustrates how Confluent enables seamless, real-time collaboration across SAP, Databricks, and IoT systems—supporting both operational and analytical use cases with a shared architecture.
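
To sketch the final step of this loop, the example below consumes prediction events and emits a work-order trigger event that an SAP integration (for example, a connector or integration service) could turn into an actual maintenance order. Topic names, fields, and the probability threshold are assumptions for illustration.

```python
import json
from confluent_kafka import Consumer, Producer

conf = {"bootstrap.servers": "localhost:9092"}
consumer = Consumer({**conf, "group.id": "maintenance-trigger",
                     "auto.offset.reset": "latest"})
producer = Producer(conf)
consumer.subscribe(["ml.failure.predictions"])   # assumed topic fed by the model

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    prediction = json.loads(msg.value())
    # Only act on high-confidence failure predictions (threshold is illustrative).
    if prediction.get("failure_probability", 0.0) > 0.8:
        work_order = {
            "equipment_id": prediction["equipment_id"],
            "type": "PREVENTIVE_MAINTENANCE",
            "reason": "Predicted failure",
        }
        # A connector or integration service would map this event to an SAP work order.
        producer.produce("sap.workorder.requests", json.dumps(work_order))
        producer.poll(0)
```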

Going Beyond SAP with Data Streaming

This pattern applies to other enterprise systems:

  • Salesforce: Stream customer interactions for real-time personalization through Salesforce Data Cloud
  • Oracle: Capture transactions via CDC (Change Data Capture)
  • ServiceNow: Monitor incidents and automate operational responses
  • Mainframe: Offload events from legacy applications without rewriting code
  • MongoDB: Sync operational data in real time to support responsive apps
  • Snowflake: Stream enriched operational data into Snowflake for near real-time analytics, dashboards, and data sharing across teams and partners
  • OpenAI (or other GenAI platforms): Feed real-time context into LLMs for AI-assisted recommendations or automation
  • “You name it”: Confluent’s prebuilt connectors and open APIs enable event-driven integration with virtually any enterprise system

Confluent provides the backbone for streaming data across all of these platforms—securely, reliably, and in real time.

Strategic Value for the Enterprise of Event-based Real-Time Integration with Data Streaming

Enterprise software platforms are essential. But they are often closed, slow to change, and not designed for analytics or AI.

Confluent provides real-time access to operational data from platforms like SAP. SAP Datasphere and Databricks enable analytics and AI on that data. Together, they support modern, event-driven architectures.

  • Use Confluent for real-time data streaming from SAP and other core systems
  • Use SAP Datasphere and Databricks to build analytics, reports, and AI on that data
  • Use Tableflow to connect the two platforms seamlessly

This modern approach to data integration delivers tangible business value, especially in complex enterprise environments. It enables real-time decision-making by allowing business logic to operate on live data instead of outdated reports. Data products become reusable assets, as a single stream can serve multiple teams and tools simultaneously. By reducing the need for batch layers and redundant processing, the total cost of ownership (TCO) is significantly lowered. The architecture is also future-proof, making it easy to integrate new systems, onboard additional consumers, and scale workflows as business needs evolve.

Beyond SAP: Enabling Agentic AI Across the Enterprise

The same architectural discussion applies across the enterprise software landscape. As vendors embed AI more deeply into their platforms, the effectiveness of these systems increasingly depends on real-time data access, continuous context propagation, and seamless interoperability.

Without an event-driven foundation, AI agents remain limited—trapped in siloed workflows and brittle API chains. Confluent provides the scalable, reliable backbone needed to enable true agentic AI in complex enterprise environments.

Examples of AI solutions driving this evolution include:

  • SAP Joule / Business AI – Context-aware agents and embedded AI across ERP, finance, and supply chain
  • Salesforce Einstein / Copilot Studio – Generative AI for CRM, service, and marketing automation built on top of Salesforce Data Cloud
  • ServiceNow Now Assist – Intelligent workflows and predictive automation in ITSM and Ops
  • Oracle Fusion AI / OCI AI Services – Embedded machine learning in ERP, HCM, and SCM
  • Microsoft Copilot (Dynamics / Power Platform) – AI copilots across business and low-code apps
  • Workday AI – Smart recommendations for finance, workforce, and HR planning
  • Adobe Sensei GenAI – GenAI for content creation and digital experience optimization
  • IBM watsonx – Governed AI foundation for enterprise use cases and data products
  • Infor Coleman AI – Industry-specific AI for supply chain and manufacturing systems
  • All the “traditional” cloud providers and data platforms such as Snowflake with Cortex, Microsoft Azure Fabric, AWS SageMaker, AWS Bedrock, and GCP Vertex AI

Each of these platforms benefits from a streaming-first architecture that enables real-time decisions, reusable data, and smarter automation across the business.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter). And download my free book about data streaming use cases, including technical architectures and their relation to other operational and analytical platforms like SAP and Databricks.

Shift Left Architecture for AI and Analytics with Confluent and Databricks
https://www.kai-waehner.de/blog/2025/05/09/shift-left-architecture-for-ai-and-analytics-with-confluent-and-databricks/
Fri, 09 May 2025 06:03:07 +0000

Confluent and Databricks enable a modern data architecture that unifies real-time streaming and lakehouse analytics. By combining shift-left principles with the structured layers of the Medallion Architecture, teams can improve data quality, reduce pipeline complexity, and accelerate insights for both operational and analytical workloads. Technologies like Apache Kafka, Flink, and Delta Lake form the backbone of scalable, AI-ready pipelines across cloud and hybrid environments.

Modern enterprise architectures are evolving. Traditional batch data pipelines and centralized processing models are being replaced by more flexible, real-time systems. One of the driving concepts behind this change is the Shift Left approach. This blog compares Databricks’ Medallion Architecture with a Shift Left Architecture popularized by Confluent. It explains where each concept fits best—and how they can work together to create a more complete, flexible, and scalable architecture.

Shift Left Architecture with Confluent Data Streaming and Databricks Lakehouse Medallion

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Future articles will explore how these platforms affect data use in businesses. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter). And download my free book about data streaming use cases, including more details about the shift left architecture with data streaming and lakehouses.

Medallion Architecture: Structured, Proven, but Not Always Optimal

The Medallion Architecture, popularized by Databricks, is a well-known design pattern for organizing and processing data within a lakehouse. It provides structure, modularity, and clarity across the data lifecycle by breaking pipelines into three logical layers:

  • Bronze: Ingest raw data in its original format (often semi-structured or unstructured)
  • Silver: Clean, normalize, and enrich the data for usability
  • Gold: Aggregate and transform the data for reporting, dashboards, and machine learning
Databricks Medallion Architecture for Lakehouse ETL
Source: Databricks

This layered approach is valuable for teams looking to establish governed and scalable data pipelines. It supports incremental refinement of data and enables multiple consumers to work from well-defined stages.
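
For readers less familiar with the pattern, here is a minimal PySpark sketch of the three layers using Delta tables. Table names, paths, and columns are assumptions; real Medallion pipelines usually run as scheduled jobs or Delta Live Tables rather than a single script.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # on Databricks a session already exists

# Bronze: land raw events as-is (path and schema are illustrative).
raw = spark.read.json("/mnt/raw/orders/")
raw.write.format("delta").mode("append").saveAsTable("bronze_orders")

# Silver: clean and normalize for downstream consumers.
silver = (
    spark.read.table("bronze_orders")
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("quantity") > 0)
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")

# Gold: aggregate for reporting and ML features.
gold = silver.groupBy("plant").agg(F.sum("quantity").alias("total_quantity"))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_order_volumes")
```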

Challenges of the Medallion Architecture

The Medallion Architecture also introduces challenges:

  • Pipeline delays: Moving data from Bronze to Gold can take minutes or longer—too slow for operational needs
  • Infrastructure overhead: Each stage typically requires its own compute and storage footprint
  • Redundant processing: Data transformations are often repeated across layers
  • Limited operational use: Data is primarily at rest in object storage; using it for real-time operational systems often requires inefficient reverse ETL pipelines.

For use cases that demand real-time responsiveness and/or critical SLAs—such as fraud detection, personalized recommendations, or IoT alerting—this traditional batch-first model may fall short. In such cases, an event-driven streaming-first architecture, powered by a data streaming platform like Confluent, enables faster, more cost-efficient pipelines by performing validation, enrichment, and even model inference before data reaches the lakehouse.

Importantly, this data streaming approach doesn’t replace the Medallion pattern—it complements it. It allows you to “shift left” critical logic, reducing duplication and latency while still feeding trusted, structured data into Delta Lake or other downstream systems for broader analytics and governance.

In other words, shifting data processing left (i.e., before it hits a data lake or Lakehouse) is especially valuable when the data needs to serve multiple downstream systems—operational and analytical alike—because it avoids duplication, reduces latency, and ensures consistent, high-quality data is available wherever it’s needed.

Shift Left Architecture: Process Earlier, Share Faster

In a Shift Left Architecture, data processing happens earlier—closer to the source, both physically and logically. This often means:

  • Transforming and validating data as it streams in
  • Enriching and filtering in real time
  • Sharing clean, usable data quickly across teams AND different technologies/applications

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

This is especially useful for:

  • Reducing time to insight
  • Improving data quality at the source
  • Creating reusable, consistent data products
  • Operational workloads with critical SLAs

How Confluent Enables Shift Left with Databricks

In a Shift Left setup, Apache Kafka provides scalable, low-latency, and truly decoupled ingestion of data across operational and analytical systems, forming the backbone for unified data pipelines.

Schema Registry and data governance policies enforce consistent, validated data across all streams, ensuring high-quality, secure, and compliant data delivery from the very beginning.

Apache Flink enables early data processing — closer to where data is produced. This reduces complexity downstream, improves data quality, and allows real-time decisions and analytics.

Shift Left Architecture with Confluent Databricks and Delta Lake

Data Quality Governance via Data Contracts and Schema Validation

Flink can enforce data contracts by validating incoming records against predefined schemas (e.g., using JSON Schema, Apache Avro or Protobuf with Schema Registry). This ensures structurally valid data continues through the pipeline. In cases where schema violations occur, records can be automatically routed to a Dead Letter Queue (DLQ) for inspection.

Confluent Schema Registry for good Data Quality, Policy Enforcement and Governance using Apache Kafka

Additionally, data contracts can enforce policy-based rules at the schema level—such as field-level encryption, masking of sensitive data (PII), type coercion, or enrichment defaults. These controls help maintain compliance and reduce risk before data reaches regulated or shared environments.
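
The validate-or-route pattern can be sketched end to end with the Confluent Schema Registry client libraries. In the architecture described here the enforcement typically happens in Flink or via Schema Registry rules; this Python sketch only demonstrates the principle, and the endpoints, topics, and header layout are assumptions.

```python
import json
from confluent_kafka import Consumer, Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Illustrative endpoints; real deployments use authenticated Confluent Cloud URLs.
sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})
deserializer = AvroDeserializer(sr_client)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "contract-validation-demo",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["payments"])                      # assumed topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        record = deserializer(msg.value(),
                              SerializationContext(msg.topic(), MessageField.VALUE))
        # Record matches the registered schema; pass it on to the validated topic.
        producer.produce("payments.validated", json.dumps(record))
    except Exception as exc:
        # Schema violation or corrupt payload: route raw bytes to a DLQ for inspection.
        producer.produce("payments.dlq", msg.value(),
                         headers=[("error", str(exc).encode())])
    producer.poll(0)
```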

Flink can perform the following tasks before data ever lands in a data lake or warehouse:

Filtering and Routing

Events can be filtered based on business rules and routed to the appropriate downstream system or Kafka topic. This allows different consumers to subscribe only to relevant data, optimizing both performance and cost.
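
A hedged Flink SQL sketch of such filtering and routing, expressed through PyFlink, might look as follows. The table definitions, topic names, and the threshold are illustrative, and the job assumes the Flink Kafka SQL connector is available on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Kafka-backed source table (names and options are illustrative).
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        amount   DOUBLE,
        country  STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Kafka-backed sink table for the routed subset of events.
t_env.execute_sql("""
    CREATE TABLE high_value_orders (
        order_id STRING,
        amount   DOUBLE,
        country  STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders.high-value',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Route only the events that downstream consumers actually care about.
t_env.execute_sql("""
    INSERT INTO high_value_orders
    SELECT order_id, amount, country
    FROM orders
    WHERE amount > 10000
""").wait()
```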

Metric Calculation

Use Flink to compute rolling aggregates (e.g., counts, sums, averages, percentiles) over windows of data in motion. This is useful for business metrics, anomaly detection, or feeding real-time dashboards—without waiting for batch jobs.

Real-Time Joins and Enrichment

Flink supports both stream-stream and stream-table joins. This enables real-time enrichment of incoming events with contextual information from reference data (e.g., user profiles, product catalogs, pricing tables), often sourced from Kafka topics, databases, or external APIs.

Streaming ETL with Apache Flink SQL

By shifting this logic to the beginning of the pipeline, teams can reduce duplication, avoid unnecessary storage and compute costs in downstream systems, and ensure that data products are clean, policy-compliant, and ready for both operational and analytical use—as soon as they are created.

Example: A financial application might use Flink to calculate running balances, detect anomalies, and enrich records with reference data before pushing to Databricks for reporting and training analytic models.
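
A simplified version of that financial example, again as Flink SQL via PyFlink, could combine a windowed metric with a reference-data join. The table and column names are assumptions, and the DDL for sources and sinks is omitted because it mirrors the filtering sketch above.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assumed: Kafka-backed tables `transactions` (account_id, amount, tx_time),
# `accounts` (account_id, risk_tier), and the two sink tables already exist.
# `tx_time` must be declared as an event-time attribute with a watermark for
# the window below to work.

# Rolling one-minute totals per account, e.g. as input for anomaly detection.
t_env.execute_sql("""
    INSERT INTO account_metrics
    SELECT
        account_id,
        TUMBLE_START(tx_time, INTERVAL '1' MINUTE) AS window_start,
        SUM(amount) AS total_amount,
        COUNT(*)    AS tx_count
    FROM transactions
    GROUP BY account_id, TUMBLE(tx_time, INTERVAL '1' MINUTE)
""")

# Enrich each transaction with reference data before it reaches the lakehouse.
# In production this is often modeled as a temporal or lookup join instead.
t_env.execute_sql("""
    INSERT INTO transactions_enriched
    SELECT t.account_id, t.amount, t.tx_time, a.risk_tier
    FROM transactions AS t
    JOIN accounts AS a ON t.account_id = a.account_id
""")
```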

In addition to enhancing data quality and reducing time-to-insight in the lakehouse, this approach also makes data products immediately usable for operational workloads and downstream applications—without building separate pipelines.

Learn more about stateless and stateful stream processing in real-time architectures using Apache Flink in this in-depth blog post.

Combining Shift Left with Medallion Architecture

These architectures are not mutually exclusive. Shift Left is about processing data earlier. Medallion is about organizing data once it arrives.

You can use Shift Left principles to:

  • Pre-process operational data before it enters the Bronze layer
  • Ensure clean, validated data enters Silver with minimal transformation needed
  • Reduce the need for redundant processing steps between layers

Confluent’s Tableflow bridges the two worlds. It converts Kafka streams into Delta tables, integrating cleanly with the Medallion model while supporting real-time flows.

Shift Left with Delta Lake, Iceberg, and Tableflow

Confluent Tableflow makes it easy to publish Kafka streams into Delta Lake or Apache Iceberg formats. These can be discovered and queried inside Databricks via Unity Catalog.

This integration:

  • Simplifies integration, governance and discovery
  • Enables live updates to AI features and dashboards
  • Removes the need to manage Spark streaming jobs

This is a natural bridge between a data streaming platform and the lakehouse.

Confluent Tableflow to Unify Operational and Analytical Workloads with Apache Iceberg and Delta Lake
Source: Confluent
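
Once a stream has been materialized this way, consuming it from Databricks is ordinary Spark against a governed table. The catalog, schema, and table names below are assumptions; the point is that no Kafka client or offset handling appears in the analytics code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Databricks notebook

# Three-level name (catalog.schema.table) is an assumption; Tableflow-produced
# tables show up in Unity Catalog under whatever names your setup registers.
orders = spark.read.table("streaming_catalog.kafka.orders")

daily_revenue = (
    orders
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
    .orderBy("order_date")
)
daily_revenue.show()  # or display(daily_revenue) in a Databricks notebook
```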

AI Use Cases for Shift Left with Confluent and Databricks

The Shift Left model benefits both predictive and generative AI:

  • Model training: Real-time data pipelines can stream features to Delta Lake
  • Model inference: In some cases, predictions can happen in Confluent (via Flink) and be pushed back to operational systems instantly
  • Agentic AI: Real-time event-driven architectures are well suited for next-gen, stateful agents

Databricks supports model training and hosting via MosaicML. Confluent can integrate with these models, or run lightweight inference directly from the stream processing application.
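
As a rough sketch of stream-side inference against a hosted model, the loop below scores each telemetry event via a REST call to a Databricks Model Serving endpoint and publishes the prediction back to Kafka. The endpoint URL, token, topic names, and the request payload shape are assumptions to verify against the Model Serving documentation for your workspace.

```python
import json
import requests
from confluent_kafka import Consumer, Producer

# Placeholder values; check your workspace URL, endpoint name, and the request
# schema expected by your Databricks Model Serving endpoint.
ENDPOINT_URL = "https://<workspace-host>/serving-endpoints/failure-model/invocations"
TOKEN = "<databricks-token>"

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "inference-demo",
    "auto.offset.reset": "latest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["machine.telemetry"])             # assumed feature topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    features = json.loads(msg.value())
    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"dataframe_records": [features]},       # payload shape is an assumption
        timeout=5,
    )
    score = response.json()
    # Publish the prediction back to Kafka so operational systems can react.
    producer.produce("ml.failure.predictions", json.dumps(score))
    producer.poll(0)
```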

Data Warehouse Use Cases for Shift Left with Confluent and Databricks

  • Batch reporting: Continue using Databricks for traditional BI
  • Real-time analytics: Flink or real-time OLAP engines (e.g., Apache Pinot, Apache Druid) may be a better fit for sub-second insights
  • Hybrid: Push raw events into Databricks for historical analysis and use Flink for immediate feedback

Where you do the data processing depends on the use case.

Architecture Benefits Beyond Technology

Shift Left also brings architectural benefits:

  • Cost Reduction: Processing early can lower storage and compute usage
  • Faster Time to Market: Data becomes usable earlier in the pipeline
  • Reusability: Processed streams can be reused and consumed by multiple technologies/applications (not just Databricks teams)
  • Compliance and Governance: Validated data with lineage can be shared with confidence

These are important for strategic enterprise data architectures.

Bringing in New Types of Data

Shift Left with a data streaming platform supports a wider range of data sources:

  • Operational databases (like Oracle, DB2, SQL Server, Postgres, MongoDB)
  • ERP systems (SAP et al)
  • Mainframes and other legacy technologies
  • IoT interfaces (MQTT, OPC-UA, proprietary IIoT gateway, etc.)
  • SaaS platforms (Salesforce, ServiceNow, and so on)
  • Any other system that does not directly fit into the “table-driven analytics perspective” of a Lakehouse

With Confluent, these interfaces can be connected in real time, enriched at the edge or in transit, and delivered to analytics platforms like Databricks.

This expands the scope of what’s possible with AI and analytics.

Shift Left Using ONLY Databricks

A shift left architecture with only Databricks is possible, too. A Databricks consultant took my Shift Left slide and adapted it accordingly:

Shift Left Architecture with Databricks and Delta Lake

Relying solely on Databricks for a “Shift Left Architecture” can work if all workloads stay (and should stay) within the platform, but it’s a poor fit for many real-world scenarios.

Databricks focuses on ELT, not true ETL, and lacks native support for operational workloads like APIs, low-latency apps, or transactional systems. This forces teams to rely on reverse ETL tools, a clear anti-pattern in enterprise architecture, just to get data where it’s actually needed. The result: added complexity, latency, and tight coupling.

The Shift Left Architecture is valuable, but in most cases it requires a modular approach, where streaming, operational, and analytical components work together — not a monolithic platform.

That said, shift left principles still apply within Databricks. Processing data as early as possible improves data quality, reduces overall compute cost, and minimizes downstream data engineering effort. For teams that operate fully inside the Databricks ecosystem, shifting left remains a powerful strategy to simplify pipelines and accelerate insight.

Meesho: Scaling a Real-Time Commerce Platform with Confluent and Databricks

Many high-growth digital platforms adopt a shift-left approach out of necessity—not as a buzzword, but to reduce latency, improve data quality, and scale efficiently by processing data closer to the source.

Meesho, one of India’s largest online marketplaces, relies on Confluent and Databricks to power its hyper-growth business model focused on real-time e-commerce. As the company scaled rapidly, supporting millions of small businesses and entrepreneurs, the need for a resilient, scalable, and low-latency data architecture became critical.

To handle massive volumes of operational events — from inventory updates to order management and customer interactions — Meesho turned to Confluent Cloud. By adopting a fully managed data streaming platform using Apache Kafka, Meesho ensures real-time event delivery, improved reliability, and faster application development. Kafka serves as the central nervous system for their event-driven architecture, connecting multiple services and enabling instant, context-driven customer experiences across mobile and web platforms.

Alongside their data streaming architecture, Meesho migrated from Amazon Redshift to Databricks to build a next-generation analytics platform. Databricks’ lakehouse architecture empowers Meesho to unify operational data from Kafka with batch data from other sources, enabling near real-time analytics at scale. This migration not only improved performance and scalability but also significantly reduced costs and operational overhead.

With Confluent managing real-time event processing and ingestion, and Databricks providing powerful, scalable analytics, Meesho is able to:

  • Deliver real-time personalized experiences to customers
  • Optimize operational workflows based on live data
  • Enable faster, data-driven decision-making across business teams

By combining real-time data streaming with advanced lakehouse analytics, Meesho has built a flexible, future-ready data infrastructure to support its mission of democratizing online commerce for millions across India.

Shift Left: Reducing Complexity, Increasing Value for the Lakehouse (and other Operational Systems)

Shift Left is not about replacing Databricks. It’s about preparing better data earlier in the pipeline—closer to the source—and reducing end-to-end complexity.

  • Use Confluent for real-time ingestion, enrichment, and transformation
  • Use Databricks for advanced analytics, reporting, and machine learning
  • Use Tableflow and Delta Lake to govern and route high-quality data to the right consumers

This architecture not only improves data quality for the lakehouse, but also enables the same real-time data products to be reused across multiple downstream systems—including operational, transactional, and AI-powered applications.

The result: increased agility, lower costs, and scalable innovation across the business.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter). And download my free book about data streaming use cases, including more details about the shift left architecture with data streaming and lakehouses.

Confluent Data Streaming Platform vs. Databricks Data Intelligence Platform for Data Integration and Processing
https://www.kai-waehner.de/blog/2025/05/05/confluent-data-streaming-platform-vs-databricks-data-intelligence-platform-for-data-integration-and-processing/
Mon, 05 May 2025 03:47:21 +0000

This blog explores how Confluent and Databricks address data integration and processing in modern architectures. Confluent provides real-time, event-driven pipelines connecting operational systems, APIs, and batch sources with consistent, governed data flows. Databricks specializes in large-scale batch processing, data enrichment, and AI model development. Together, they offer a unified approach that bridges operational and analytical workloads. Key topics include ingestion patterns, the role of Tableflow, the shift-left architecture for earlier data validation, and real-world examples like Uniper’s energy trading platform powered by Confluent and Databricks.

Many organizations use both Confluent and Databricks. While these platforms serve different primary goals—real-time data streaming vs. analytical processing—there are areas where they overlap. This blog explores how the Confluent Data Streaming Platform (DSP) and the Databricks Data Intelligence Platform handle data integration and processing. It explains their different roles, where they intersect, and when one might be a better fit than the other.

Confluent and Databricks for Data Integration and Stream Processing

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Future articles will explore how these platforms affect data use in businesses. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter). And download my free book about data streaming use cases, including technical architectures and their relation to analytical platforms like Databricks.

Data Integration and Processing: Shared Space, Different Strengths

Confluent is focused on continuous, event-based data movement and processing. It connects to hundreds of real-time and non-real-time data sources and targets. It enables low-latency stream processing using Apache Kafka and Flink, forming the backbone of an event-driven architecture. Databricks, on the other hand, combines data warehousing, analytics, and machine learning on a unified, scalable architecture.

Confluent: Event-Driven Integration Platform

Confluent is increasingly used as modern operational middleware, replacing traditional message queues (MQ) and enterprise service buses (ESB) in many enterprise architectures.

Thanks to its event-driven foundation, it supports not just real-time event streaming but also integration with request/response APIs and batch-based interfaces. This flexibility allows enterprises to standardize on the Kafka protocol as the data hub—bridging asynchronous event streams, synchronous APIs, and legacy systems. The immutable event store and true decoupling of producers and consumers help maintain data consistency across the entire pipeline, regardless of whether data flows in real-time, in scheduled batches or via API calls.

Batch Processing vs Event-Driven Architecture with Continuous Data Streaming

Databricks: Batch-Driven Analytics and AI Platform

Databricks excels in batch processing and traditional ELT workloads. It is optimized for storing data first and then transforming it within its platform, but it’s not built as a real-time ETL tool for directly connecting to operational systems or handling complex, upstream data mappings.

Databricks enables data transformations at scale, supporting complex joins, aggregations, and data quality checks over large historical datasets. Its Medallion Architecture (Bronze, Silver, Gold layers) provides a structured approach to incrementally refine and enrich raw data for analytics and reporting. The engine is tightly integrated with Delta Lake and Unity Catalog, ensuring governed and high-performance access to curated datasets for data science, BI, and machine learning.

For most use cases, the right choice is simple.

  • Confluent is ideal for building real-time pipelines and unifying operational systems.
  • Databricks is optimized for batch analytics, warehousing, and AI development.

Together, Confluent and Databricks cover both sides of the modern data architecture—streaming and batch, operational and analytical. And Confluent’s Tableflow and a shift-left architecture enable native integration with earlier data validation, simplified pipelines, and faster access to AI-ready data.

Data Ingestion Capabilities

Databricks recently introduced LakeFlow Connect and acquired Arcion to strengthen its capabilities around Change Data Capture (CDC) and data ingestion into Delta Lake. These are good steps toward improving integration, particularly for analytical use cases.

However, Confluent is the industry leader in operational data integration, serving as modern middleware for connecting mainframes, ERP systems, IoT devices, APIs, and edge environments. Many enterprises have already standardized on Confluent to move and process operational data in real time with high reliability and low latency.

Introducing yet another tool—especially for ETL and ingestion—creates unnecessary complexity. It risks a return to Lambda-style architectures, where separate pipelines must be built and maintained for real-time and batch use cases. This increases engineering overhead, inflates cost, and slows time to market.

Lambda Architecture - Separate ETL Pipelines for Real Time and Batch Processing

In contrast, Confluent supports a Kappa architecture model: a single, unified event-driven data streaming pipeline that powers both operational and analytical workloads. This eliminates duplication, simplifies the data flow, and enables consistent, trusted data delivery from source to sink.

Kappa Architecture - Single Data Integration Pipeline for Real Time and Batch Processing

Confluent for Data Ingestion into Databricks

Confluent’s integration capabilities provide:

  • 100+ enterprise-grade connectors, including SAP, Salesforce, and mainframe systems
  • Native CDC support for Oracle, SQL Server, PostgreSQL, MongoDB, Salesforce, and more
  • Flexible integration via Kafka Clients for any relevant programming language, REST/HTTP, MQTT, JDBC, and other APIs
  • Support for operational sinks (not just analytics platforms)
  • Built-in governance, durability, and replayability

A good example: Confluent’s Oracle CDC Connector uses Oracle’s XStream API and delivers “GoldenGate-level performance”, with guaranteed ordering, high throughput, and minimal latency. This enables real-time delivery of operational data into Kafka, Flink, and downstream systems like Databricks.

Bottom line: Confluent offers the most mature, scalable, and flexible ingestion capabilities into Databricks—especially for real-time operational data. For enterprises already using Confluent as the central nervous system of their architecture, adding another ETL layer specifically for the lakehouse integration with weaker coverage and SLAs only slows progress and increases cost.

Stick with a unified approach—fewer moving parts, faster implementation, and end-to-end consistency.

Real-Time vs. Batch: When to Use Each

Batch ETL is well understood. It works fine when data does not need to be processed immediately—e.g., for end-of-day reports, monthly audits, or historical analysis.

Streaming ETL is best when data must be processed in motion. This enables real-time dashboards, live alerts, or AI features based on the latest information.

Confluent DSP is purpose-built for streaming ETL. Kafka and Flink allow filtering, transformation, enrichment, and routing in real time.
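
As a sketch of what streaming ETL looks like in practice, the PyFlink program below declares a Kafka source and sink with Flink SQL and continuously filters, enriches, and routes events. The topic names, schema, and thresholds are illustrative assumptions rather than a reference pipeline.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Kafka-backed source table (broker, topic, and schema are placeholders).
t_env.execute_sql("""
    CREATE TABLE payments (
        payment_id STRING,
        customer_id STRING,
        amount DECIMAL(10, 2),
        currency STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'payments',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Kafka-backed sink table for the routed, enriched events.
t_env.execute_sql("""
    CREATE TABLE flagged_payments (
        payment_id STRING,
        customer_id STRING,
        amount DECIMAL(10, 2),
        risk_level STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'flagged-payments',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json'
    )
""")

# Filter, transform, and route in real time: only relevant payments are
# forwarded, enriched with a simple risk classification.
t_env.execute_sql("""
    INSERT INTO flagged_payments
    SELECT payment_id, customer_id, amount,
           CASE WHEN amount > 10000 THEN 'HIGH' ELSE 'MEDIUM' END AS risk_level
    FROM payments
    WHERE amount > 1000
""")
```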

Databricks supports batch ELT natively. Delta Live Tables offers a managed way to build data pipelines on top of Spark, letting you declaratively define how data should be transformed and processed using SQL or Python. Spark Structured Streaming, on the other hand, can handle streaming data in near real time, but it still requires persistent clusters and infrastructure management.
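
For comparison, a typical Spark Structured Streaming job that lands Kafka events in Delta Lake looks roughly like the sketch below; broker, topic, paths, and checkpoint locations are placeholders. The logic is straightforward, but the job runs on a cluster that you provision, size, and operate yourself.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

# Read a Kafka topic as an unbounded stream (broker and topic are placeholders).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "latest")
    .load()
    .select(
        col("key").cast("string"),
        col("value").cast("string"),
        col("timestamp"),
    )
)

# Continuously append the events to a Delta table; the checkpoint tracks offsets.
query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .outputMode("append")
    .start("/tmp/delta/orders_raw")
)

query.awaitTermination()
```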

If you’re already invested in Spark, Structured Streaming or Delta Live Tables might be sufficient. But if you’re starting fresh—or looking to simplify your architecture—Confluent’s Tableflow provides a more streamlined, Kafka-native alternative. Tableflow represents Kafka streams as Delta Lake tables. No cluster management. No offset handling. Just discoverable, governed data in Databricks Unity Catalog.

Real-Time and Batch: A Perfect Match at Walmart for Replenishment Forecasting in the Supply Chain

Walmart demonstrates how real-time and batch processing can work together to optimize a large-scale, high-stakes supply chain.

At the heart of this architecture is Apache Kafka, powering Walmart’s real-time inventory management and replenishment system.

Kafka serves as the central data hub, continuously streaming inventory updates, sales transactions, and supply chain events across Walmart’s physical stores and digital channels. This enables real-time replenishment to ensure product availability and timely fulfillment for millions of online and in-store customers.

Batch processing plays an equally important role. Apache Spark processes historical sales, seasonality trends, and external factors in micro-batches to feed forecasting models. These forecasts are used to generate accurate daily order plans across Walmart’s vast store network.

Replenishment Supply Chain Logistics at Walmart Retail with Apache Kafka and Spark
Source: Walmart

This hybrid architecture brings significant operational and business value:

  • Kafka provides not just low latency, but true decoupling between systems, enabling seamless integration across real-time streams, batch pipelines, and request-response APIs—ensuring consistent, reliable data flow across all environments
  • Spark delivers scalable, high-performance analytics to refine predictions and improve long-term planning
  • The result: reduced cycle times, better accuracy, increased scalability and elasticity, improved resiliency, and substantial cost savings

Walmart’s supply chain is just one of many use cases where Kafka powers real-time business processes, decisioning and workflow orchestration at global scale—proof that combining streaming and batch is key to modern data infrastructure.

Apache Flink supports both streaming and batch processing within the same engine. This enables teams to build unified pipelines that handle real-time events and batch-style computations without switching tools or architectures. In Flink, batch is treated as a special case of streaming—where a bounded stream (or a complete window of events) can be processed once all data has arrived.

This approach simplifies operations by avoiding the need for parallel pipelines or separate orchestration layers. It aligns with the principles of the shift-left architecture, allowing earlier processing, validation, and enrichment—closer to the data source. As a result, pipelines are more maintainable, scalable, and responsive.
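
A small PyFlink sketch illustrates the idea: the same Table API pipeline runs in streaming or batch execution mode just by switching the environment settings. The filesystem source, path, and schema are illustrative stand-ins for any bounded input.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

def aggregate_readings(settings):
    """Build the same pipeline under the given execution mode."""
    t_env = TableEnvironment.create(settings)
    # Bounded filesystem source for the example; in streaming mode the same
    # DDL could point at a Kafka topic instead (path and schema are illustrative).
    t_env.execute_sql("""
        CREATE TABLE sensor_readings (
            device_id STRING,
            temperature DOUBLE
        ) WITH (
            'connector' = 'filesystem',
            'path' = '/data/sensor_readings',
            'format' = 'csv'
        )
    """)
    return t_env.sql_query(
        "SELECT device_id, AVG(temperature) AS avg_temp "
        "FROM sensor_readings GROUP BY device_id"
    )

# Identical logic, two execution modes: batch simply treats the input as a
# bounded stream that is processed once all data has arrived.
streaming_table = aggregate_readings(EnvironmentSettings.in_streaming_mode())
batch_table = aggregate_readings(EnvironmentSettings.in_batch_mode())
batch_table.execute().print()
```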

That said, batch processing is not going away—nor should it. For many use cases, batch remains the most practical solution. Examples include:

  • Daily financial reconciliations
  • End-of-day retail reporting
  • Weekly churn model training
  • Monthly compliance and audit jobs

In these cases, latency is not critical, and workloads often involve large volumes of historical data or complex joins across datasets.

This is where Databricks excels—especially with its Delta Lake and Medallion architecture, which structures raw, refined, and curated data layers for high-performance analytics, BI, and AI/ML training.

In summary, Flink offers the flexibility to consolidate streaming and batch pipelines, making it ideal for unified data processing. But when batch is the right choice—especially at scale or with complex transformations—Databricks remains a best-in-class platform. The two technologies are not mutually exclusive. They are complementary parts of a modern data stack.

Streaming CDC and Lakehouse Analytics

Streaming CDC is a key integration pattern. It captures changes from operational databases and pushes them into analytics platforms. But CDC isn’t limited to databases. CDC is just as important for business applications like Salesforce, where capturing customer updates in real time enables faster, more responsive analytics and downstream actions.

Confluent is well suited for this. Kafka Connect and Flink can continuously stream changes. These change events are sent to Databricks as Delta tables using Tableflow. Streaming CDC ensures:

  • Data consistency across operational and analytical workloads leveraging a single data pipeline
  • Reduced ETL / ELT lag
  • Near real-time updates to BI dashboards
  • Timely training of AI/ML models

Streaming CDC also avoids data duplication, reduces latency, and minimizes storage costs.
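
To make the pattern tangible, the consumer sketch below reads change events from a CDC topic and inspects the typical Debezium-style envelope with its before and after row images and an operation code. The broker address, topic name, and field access are illustrative assumptions.

```python
import json
from confluent_kafka import Consumer

# Broker, consumer group, and topic are placeholders for this sketch.
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "cdc-inspector",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["crm.accounts.cdc"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Typical CDC envelope: 'before' and 'after' row images plus an
        # operation code ('c' = create, 'u' = update, 'd' = delete).
        print(f"op={event.get('op')} after={event.get('after')}")
finally:
    consumer.close()
```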

Reverse ETL: An (Anti) Pattern to Avoid with Confluent and Databricks

Some architectures push data from data lakes or warehouses back into operational systems using reverse ETL. While this may appear to bridge the analytical and operational worlds, it often leads to increased latency, duplicate logic, and fragile point-to-point workflows. These tools typically reprocess data that was already transformed once, leading to inefficiencies, governance issues, and unclear data lineage.

Reverse ETL is an architectural anti-pattern. It violates the principles of an event-driven system. Rather than reacting to events as they happen, reverse ETL introduces delays and additional moving parts—pushing stale insights back into systems that expect real-time updates.

Data at Rest and Reverse ETL

With the upcoming bidirectional integration of Tableflow with Delta Lake, these issues can be avoided entirely. Insights generated in Databricks—from analytics, machine learning, or rule-based engines—can be pushed directly back into Kafka topics.

This approach removes the need for reverse ETL tools, reduces system complexity, and ensures that both operational and analytical layers operate on a shared, governed, and timely data foundation.

It also brings lineage, schema enforcement, and observability into both directions of data flow—streamlining feedback loops and enabling true event-driven decisioning across the enterprise.

In short: Don’t pull data back into operational systems after the fact. Push insights forward at the speed of events.
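
To make “push insights forward” concrete, the sketch below publishes a model score as an event the moment it is produced, using the confluent-kafka Python client; the topic name and payload shape are illustrative assumptions.

```python
import json
from confluent_kafka import Producer

# Broker address and topic are placeholders for this sketch.
producer = Producer({"bootstrap.servers": "broker:9092"})

def publish_insight(customer_id: str, churn_score: float) -> None:
    """Publish an analytical insight as an event for operational consumers."""
    event = {"customer_id": customer_id, "churn_score": churn_score}
    producer.produce(
        topic="customer-churn-scores",
        key=customer_id,
        value=json.dumps(event),
    )

# An insight computed in the analytical layer becomes an event that fraud
# detection, CRM, or notification services can react to immediately.
publish_insight("C-1042", 0.87)
producer.flush()
```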

Multi-Cloud and Hybrid Integration with an Event-Driven Architecture

Confluent is designed for distributed data movement across environments in real-time for operational and analytical use cases:

  • On-prem, cloud, and edge
  • Multi-region and multi-cloud
  • Support for SaaS, BYOC, and private networking

Features like Cluster Linking and Schema Registry ensure consistent replication and governance across environments.
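
Schema Registry is a large part of what makes that governance portable: a schema registered once can be enforced for producers and consumers in every environment. The snippet below is a minimal sketch with the confluent-kafka client, assuming a reachable registry URL and an illustrative subject name.

```python
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

# Registry URL and subject name are placeholders for this sketch.
client = SchemaRegistryClient({"url": "https://schema-registry.example.com"})

order_schema = Schema(
    schema_str="""
    {
      "type": "record",
      "name": "Order",
      "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"}
      ]
    }
    """,
    schema_type="AVRO",
)

# Registering under a subject enforces compatibility rules for every producer
# and consumer of the corresponding topic, regardless of where it runs.
schema_id = client.register_schema("orders-value", order_schema)
print(f"Registered schema id: {schema_id}")
```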

Databricks runs only in the cloud. It supports hybrid access and partner integrations. But the platform is not built for event-driven data distribution across hybrid environments.

In a hybrid architecture, Confluent acts as the bridge. It moves operational data securely and reliably. Then, Databricks can consume it for analytics and AI use cases. Here is an example architecture for industrial IoT use cases:

Data Streaming and Lakehouse with Confluent and Databricks for Hybrid Cloud and Industrial IoT

Uniper: Real-Time Energy Trading with Confluent and Databricks

Uniper, a leading international energy company, leverages Confluent and Databricks to modernize its energy trading operations.

Uniper - The beating heart of energy

I covered the value of data streaming with Apache Kafka and Flink for energy trading in a dedicated blog post already.

Confluent Cloud with Apache Kafka and Apache Flink provides a scalable real-time data streaming foundation for Uniper, enabling efficient ingestion and processing of market data, IoT sensor inputs, and operational events. This setup supports the full trading lifecycle, improving decision-making, risk management, and operational agility.

Apache Kafka and Flink integrated into the Uniper IT landscape

Within its Azure environment, Uniper uses Databricks to empower business users to rapidly build trading decision-support tools and advanced analytics applications. By combining a self-service data platform with scalable processing power, Uniper significantly reduces the lead time for developing data apps—from weeks to just minutes.

To deliver real-time insights to its teams, Uniper also leverages Plotly’s Dash Enterprise, creating interactive dashboards that consolidate live data from Databricks, Kafka, Snowflake, and various databases. This end-to-end integration enables dynamic, collaborative workflows, giving analysts and traders fast, actionable insights that drive smarter, faster trading strategies.

By combining real-time data streaming, advanced analytics, and intuitive visualization, Uniper has built a resilient, flexible data architecture that meets the demands of today’s fast-moving energy markets.

From Ingestion to Insight: Modern Data Integration and Processing for AI with Confluent and Databricks

While both platforms can handle integration and processing, their roles are different:

  • Use Confluent when you need real-time ingestion and processing of operational and analytical workloads, or data delivery across systems and clouds.
  • Use Databricks for AI workloads, analytics and data warehousing.

When used together, Confluent and Databricks form a complete data integration and processing pipeline for AI and analytics:

  1. Confluent ingests and processes operational data in real time.
  2. Tableflow pushes this data into Delta Lake in a discoverable, secure format.
  3. Databricks performs analytics and model development.
  4. Tableflow (bidirectional) pushes insights or AI models back into Kafka for use in operational systems.

This is the foundation for modern data and AI architectures—real-time pipelines feeding intelligent applications.

Stay tuned for deep dives into how these platforms are shaping the future of data-driven enterprises. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

The post Confluent Data Streaming Platform vs. Databricks Data Intelligence Platform for Data Integration and Processing appeared first on Kai Waehner.

]]>
The Past, Present, and Future of Confluent (The Kafka Company) and Databricks (The Spark Company) https://www.kai-waehner.de/blog/2025/05/02/the-past-present-and-future-of-confluent-the-kafka-company-and-databricks-the-spark-company/ Fri, 02 May 2025 07:10:42 +0000 https://www.kai-waehner.de/?p=7755 Confluent and Databricks have redefined modern data architectures, growing beyond their Kafka and Spark roots. Confluent drives real-time operational workloads; Databricks powers analytical and AI-driven applications. As operational and analytical boundaries blur, native integrations like Tableflow and Delta Lake unify streaming and batch processing across hybrid and multi-cloud environments. This blog explores the platforms’ evolution and how, together, they enable enterprises to build scalable, data-driven architectures. The Michelin success story shows how combining real-time data and AI unlocks innovation and resilience.

The post The Past, Present, and Future of Confluent (The Kafka Company) and Databricks (The Spark Company) appeared first on Kai Waehner.

]]>
Confluent and Databricks are two of the most influential platforms in modern data architectures. Both have roots in open source. Both focus on enabling organizations to work with data at scale. And both have expanded their mission well beyond their original scope.

Confluent and Databricks are often described as serving different parts of the data architecture—real-time vs. batch, operational vs. analytical, data streaming vs. artificial intelligence (AI). But the lines are not always clear. Confluent can run batch workloads and embed AI. Databricks can handle (near) real-time pipelines. With Flink, Confluent supports both operational and analytical processing. Databricks can run operational workloads, too—if latency, availability, and delivery guarantees meet the project’s requirements. 

This blog explores where these platforms came from, where they are now, how they complement each other in modern enterprise architectures—and why their roles are future-proof in a data- and AI-driven world.

Data Streaming and Lakehouse - Comparison of Confluent with Apache Kafka and Flink and Databricks with Spark

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Stay tuned for deep dives into how these platforms are shaping the future of data-driven enterprises. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

Operational vs. Analytical Workloads

Confluent and Databricks were designed for different workloads, but the boundaries are not always strict.

Confluent was built for operational workloads—moving and processing data in real time as it flows through systems. This includes use cases like real-time payments, fraud detection, system monitoring, and streaming pipelines.

Databricks focuses on analytical workloads—enabling large-scale data processing, machine learning, and business intelligence.

That said, there is no clear black and white separation. Confluent, especially with the addition of Apache Flink, can support analytical processing on streaming data. Databricks can handle operational workloads too, provided the SLAs—such as latency, uptime, and delivery guarantees—are sufficient for the use case.

With Tableflow and Delta Lake, both platforms can now be natively connected, allowing real-time operational data to flow into analytical environments, and AI insights to flow back into real-time systems—effectively bridging operational and analytical workloads in a unified architecture.

From Apache Kafka and Spark to (Hybrid) Cloud Platforms

Confluent and Databricks both have strong open source roots—Kafka and Spark, respectively—but have taken different branding paths.

Confluent: From Apache Kafka to a Data Streaming Platform (DSP)

Confluent is well known as “The Kafka Company.” It was founded by the original creators of Apache Kafka over ten years ago. Kafka is now widely adopted for event streaming in over 150,000 organizations worldwide. Confluent operates tens of thousands of clusters with Confluent Cloud across all major cloud providers, and also in customers’ data centers and edge locations.

But Confluent has become much more than just Kafka. It offers a complete data streaming platform (DSP).

Confluent Data Streaming Platform (DSP) Powered by Apache Kafka and Flink
Source: Confluent

This includes:

  • Apache Kafka as the core messaging and persistence layer
  • Data integration via Kafka Connect for databases and business applications, a REST/HTTP proxy for request-response APIs and clients for all relevant programming languages
  • Stream processing via Apache Flink and Kafka Streams (read more about the past, present and future of stream processing)
  • Tableflow for native integration with lakehouses that support the open table format standard via Delta Lake and Apache Iceberg
  • 24/7 SLAs, security, data governance, disaster recovery – for the most critical workloads companies run
  • Deployment options: Everywhere (not just cloud) – SaaS, on-prem, edge, hybrid, stretched across data centers, multi-cloud, BYOC (bring your own cloud)

Databricks: From Apache Spark to a Data Intelligence Platform

Databricks has followed a similar evolution. Known initially as “The Spark Company,” it is the original force behind Apache Spark. But Databricks no longer emphasizes Spark in its branding. Spark is still there under the hood, but it’s no longer the dominant story.

Today, it positions itself as the Data Intelligence Platform, focused on AI and analytics.

Databricks Data Intelligence Platform and Lakehouse
Source: Databricks

Key components include:

  • Fully cloud-native deployment model—Databricks is now a cloud-only platform providing BYOC and Serverless products
  • Delta Lake and Unity Catalog for table format standardization and governance
  • Model development and AI/ML tools
  • Data warehouse workloads
  • Tools for data scientists and data engineers

Together, Confluent and Databricks meet a wide range of enterprise needs and often complement each other in shared customer environments from the edge to multi-cloud data replication and analytics.

Real-Time vs. Batch Processing

A major point of comparison between Confluent and Databricks lies in how they handle data processing—real-time versus batch—and how they increasingly converge through shared formats and integrations.

Data Processing and Data Sharing “In Motion” vs. “At Rest”

A key difference between the platforms lies in how they process and share data.

Confluent focuses on data in motion—real-time streams that can be filtered, transformed, and shared across systems as they happen.

Databricks focuses on data at rest—data that has landed in a lakehouse, where it can be queried, aggregated, and used for analysis and modeling.

Data Streaming versus Lakehouse

Both platforms offer native capabilities for data sharing. Confluent provides Stream Sharing, which enables secure, real-time sharing of Kafka topics across organizations and environments. Databricks offers Delta Sharing, an open protocol for sharing data from Delta Lake tables with internal and external consumers.
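
On the data-at-rest side, the Delta Sharing Python connector gives an external consumer straightforward access to a shared table via a profile file issued by the provider. The profile path and the share, schema, and table coordinates below are illustrative placeholders.

```python
import delta_sharing

# Profile file issued by the data provider; coordinates are placeholders.
profile_path = "/path/to/provider.share"
table_url = f"{profile_path}#sales_share.retail.orders"

# Load the shared Delta table into a pandas DataFrame for local analysis.
orders_df = delta_sharing.load_as_pandas(table_url)
print(orders_df.head())
```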

In many enterprise architectures, the two vendors work together. Kafka and Flink handle continuous real-time processing for operational workloads and data ingestion into the lakehouse. Databricks handles AI workloads (model training and some of the model inference), business intelligence (BI), and reporting. Both do data integration: ETL in Confluent and ELT in Databricks.

Many organizations still use Databricks’ Apache Spark Structured Streaming to connect Kafka and Databricks. That’s a valid pattern, especially for teams with Spark expertise.

Flink is available as a serverless offering in Confluent Cloud that can scale down to zero when idle, yet remains highly scalable—even for complex stateful workloads. It supports multiple languages, including Python, Java, and SQL. 

For self-managed environments, Kafka Streams offers a lightweight alternative to running Flink in a self-managed Confluent Platform. But be aware that Kafka Streams is limited to Java and operates as a client library embedded directly within the application. Read my dedicated article to learn about the trade-offs between Apache Flink and Kafka Streams.

Stream and Batch Data Processing with Kafka Streams, Apache Flink and Spark

In short: use what works. If Spark Structured Streaming is already in place and meets your needs, keep it. For new use cases, Apache Flink or Kafka Streams might be the better choice for stream processing workloads. And make sure to understand the concepts and value of stateless and stateful stream processing before defaulting to batch pipelines.

Confluent Tableflow: Unify Operational and Analytic Workloads with Open Table Formats (such as Apache Iceberg and Delta Lake)

Databricks is actively investing in Delta Lake and Unity Catalog to structure, govern, and secure data for analytical applications. The acquisition of Tabular—founded by the original creators of Apache Iceberg—demonstrates Databricks’ commitment to supporting open standards.

Confluent’s Tableflow materializes Kafka streams into Apache Iceberg or Delta Lake tables—automatically, reliably, and efficiently. This native integration between Confluent and Databricks is faster, simpler, and more cost-effective than using a Spark connector or other ETL tools.

Tableflow reads the Kafka segments, checks the schema against the Schema Registry, and creates Parquet files and table metadata.

Confluent Tableflow Architecture to Integrate Apache Kafka with Iceberg and Delta Lake for Databricks
Source: Confluent

Native stream processing with Apache Flink also plays a growing role. It enables unified real-time and batch stream processing in a single engine. Flink’s ability to “shift left” data processing (closer to the source) supports early validation, enrichment, and transformation. This simplifies the architecture and reduces the need for always-on Spark clusters, which can drive up cost.

These developments highlight how Databricks and Confluent address different but complementary layers of the data ecosystem.

Confluent + Databricks = A Strategic Partnership for Future-Proof AI Architectures

Confluent and Databricks are not competing platforms—they’re complementary. While they serve different core purposes, there are areas where their capabilities overlap. In those cases, it’s less about which is better and more about which fits best for your architecture, team expertise, SLA or latency requirements. The real value comes from understanding how they work together and where you can confidently choose the platform that serves your use case most effectively.

Confluent and Databricks recently deepened their partnership with Tableflow integration with Delta Lake and Unity Catalog. This integration makes real-time Kafka data available inside Databricks as Delta tables. It reduces the need for custom pipelines and enables fast access to trusted operational data.

The architecture supports AI end to end—from ingesting real-time operational data to training and deploying models—all with built-in governance and flexibility. Importantly, data can originate from anywhere: mainframes, on-premise databases, ERP systems, IoT and edge environments or SaaS cloud applications.

With this setup, you can:

  • Feed data from 100+ Confluent sources (Mainframe, Oracle, SAP, Salesforce, IoT, HTTP/REST applications, and so on) into Delta Lake
  • Use Databricks for AI model development and business intelligence
  • Push models back into Kafka and Flink for real-time model inference with critical, operational SLAs and latency

Both directions will be supported. Governance and security metadata flows alongside the data.

Confluent and Databricks Partnership and Bidirectional Integration for AI and Analytics
Source: Confluent

Michelin: Real-Time Data Streaming and AI Innovation with Confluent and Databricks

A great example of how Confluent and Databricks complement each other in practice is Michelin’s digital transformation journey. As one of the world’s largest tire manufacturers, Michelin set out to become a data-first and digital enterprise. To achieve this, the company needed a foundation for real-time operational data movement and a scalable analytical platform to unlock business insights and drive AI initiatives.

Confluent @ Michelin: Real-Time Data Streaming Pipelines

Confluent Cloud plays a critical role at Michelin by powering real-time data pipelines across their global operations. Migrating from self-managed Kafka to Confluent Cloud on Microsoft Azure enabled Michelin to reduce operational complexity by 35%, meet strict 99.99% SLAs, and speed up time to market by up to nine months. Real-time inventory management, order orchestration, and event-driven supply chain processes are now possible thanks to a fully managed data streaming platform.

Databricks @ Michelin: Centralized Lakehouse

Meanwhile, Databricks empowers Michelin to democratize data access across the organization. By building a centralized lakehouse architecture, Michelin enabled business users and IT teams to independently access, analyze, and develop their own analytical use cases—from predicting stock outages to reducing carbon emissions in logistics. With Databricks’ lakehouse capabilities, they scaled to support hundreds of use cases without central bottlenecks, fostering a vibrant community of innovators across the enterprise.

The synergy between Confluent and Databricks at Michelin is clear:

  • Confluent moves operational data in real time, ensuring fresh, trusted information flows across systems (including Databricks).
  • Databricks transforms data into actionable insights, using powerful AI, machine learning, and analytics capabilities.

Confluent + Databricks @ Michelin = Cloud-Native Data-Driven Enterprise

Together, Confluent and Databricks allow Michelin to shift from batch-driven, siloed legacy systems to a cloud-native, real-time, data-driven enterprise—paving the road toward higher agility, efficiency, and customer satisfaction.

As Yves Caseau, Group Chief Digital & Information Officer at Michelin, summarized: “Confluent plays an integral role in accelerating our journey to becoming a data-first and digital business.”

And as Joris Nurit, Head of Data Transformation, added: “Databricks enables our business users to better serve themselves and empowers IT teams to be autonomous.”

The Michelin success story perfectly illustrates how Confluent and Databricks, when used together, bridge operational and analytical workloads to unlock the full value of real-time, AI-powered enterprise architectures.

Confluent and Databricks: Better Together!

Confluent and Databricks are both leaders in different – but connected – layers of the modern data stack.

If you want real-time, event-driven data pipelines, Confluent is the right platform. If you want powerful analytics, AI, and ML, Databricks is a great fit.

Together, they allow enterprises to bridge operational and analytical workloads—and to power AI systems with live, trusted data.

In the next post, I will explore how Confluent’s Data Streaming Platform compares to the Databricks Data Intelligence Platform for data integration and processing.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

The post The Past, Present, and Future of Confluent (The Kafka Company) and Databricks (The Spark Company) appeared first on Kai Waehner.

]]>
Shift Left Architecture at Siemens: Real-Time Innovation in Manufacturing and Logistics with Data Streaming https://www.kai-waehner.de/blog/2025/04/11/shift-left-architecture-at-siemens-real-time-innovation-in-manufacturing-and-logistics-with-data-streaming/ Fri, 11 Apr 2025 12:32:50 +0000 https://www.kai-waehner.de/?p=7475 Industrial enterprises face increasing pressure to move faster, automate more, and adapt to constant change—without compromising reliability. Siemens Digital Industries addresses this challenge by combining real-time data streaming, modular design, and Shift Left principles to modernize manufacturing and logistics. This blog outlines how technologies like Apache Kafka, Apache Flink, and Confluent Cloud support scalable, event-driven architectures. A real-world example from Siemens’ Modular Intralogistics Platform illustrates how this approach improves data quality, system responsiveness, and operational agility.

The post Shift Left Architecture at Siemens: Real-Time Innovation in Manufacturing and Logistics with Data Streaming appeared first on Kai Waehner.

]]>
Industrial enterprises are under pressure to modernize. They need to move faster, automate more, and adapt to constant change—without sacrificing reliability or control. Siemens Digital Industries is meeting this challenge head-on by combining software, edge computing, and cloud-native technologies into a new architecture. This blog explores how Siemens is using data streaming, modular design, and Shift Left thinking to enable real-time decision-making, improve data quality, and unlock scalable, reusable data products across manufacturing and logistics operations. A real-world example for industrial IoT, intralogistics and shop floor manufacturing illustrates the architecture and highlights the business value behind this transformation.

Shift Left Architecture at Siemens with Stream Processing using Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including customer stories across all industries.

The Data Streaming Use Case Show: Episode #1 – Manufacturing and Automotive

These Siemens success stories are part of The Data Streaming Use Case Show, a new industry webinar series hosted by me.

In the first episode, we focus on the manufacturing and automotive industries. It features:

  • Experts from Siemens Digital Industries and Siemens Healthineers
  • The Founder of ‘IoT Use Case’, a content and community platform focused on real-world industrial IoT applications
  • Deep insights into how industrial companies combine OT, IT, cloud, and data streaming with the shift left architecture.

The Data Streaming Industry Use Case Show by Confluent with Host Kai Waehner

The series explores real-world solutions across industries, showing how leaders turn data into action through open architectures and real-time platforms.

Siemens Digital Industries: Company and Vision

Siemens Digital Industries is the technology and software arm of Siemens AG, focused on advancing industrial automation and digitalization. It empowers manufacturers and machine builders to become more agile, efficient, and resilient through intelligent software and integrated systems.

Its business model bridges the physical and digital worlds—combining operational technology (OT) with modern information technology (IT). From programmable logic controllers to industrial IoT, Siemens delivers end-to-end solutions across industries.

Today, the company is transforming itself into a software- and cloud-driven organization, focusing strongly on edge computing, real-time analytics, and data streaming as key enablers of modern manufacturing.

With edge and cloud working in harmony, Siemens helps industrial enterprises break up monoliths and develop toward modular, flexible architectures. These software-driven approaches make plants and factories more adaptive, intelligent, and autonomous.

Data Streaming at Industrial Companies

In industrial settings, data is continuously generated by machines, production systems, robots, and logistics processes. But traditional batch-oriented IT systems are not designed to handle this in real time.

To make smarter, faster decisions, companies need to process data as it is generated. That’s where data streaming comes in.

Apache Kafka and Apache Flink enable event-driven architectures. These allow industrial data to flow in real time, from edge to cloud, across hybrid environments.

Event-driven Architecture with Data Streaming using Kafka and Flink in Industrial IoT and Manufacturing

Check out my other blogs about use cases and architecture for manufacturing and Industrial IoT powered by data streaming.

Edge and Hybrid Cloud as a Standard

Modern industrial use cases are increasingly hybrid by design. Machines and controllers produce data at the edge. Decisions must be made close to the source. However, cloud platforms offer powerful compute and AI capabilities.

Industrial IoT Data Streaming Everywhere Edge Hybrid Cloud with Apache Kafka and Flink

Siemens leverages edge devices to capture and preprocess data on-site. Data streaming with Confluent provides Siemens a real-time backbone for integrating this data with cloud-based systems, including Snowflake, SAP, Salesforce, and others.

This hybrid architecture supports low latency, high availability, and full control over data processing and analytics workflows.

The Shift Left Architecture for Industrial IoT

In many industrial architectures, Kafka has traditionally been used to ingest data into analytics platforms like Snowflake or Databricks. Processing, transformation, and enrichment happened late in the data pipeline.

ETL and ELT Data Integration to Data Lake Warehouse Lakehouse in Batch

But Siemens is shifting that model.

The Shift Left Architecture moves processing closer to the source, directly into the streaming layer. Instead of waiting to transform data in a data warehouse, Siemens now applies stream processing in real time, using Confluent Cloud and Kafka topics.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

This shift enables faster decision-making, better data quality, and broader reuse of high-quality data across both analytical and operational systems.

For a deeper look at how Shift Left is transforming industrial architectures, read the full article about the Shift Left Architecture with Data Streaming.

Siemens Data Streaming Success Story: Modular Intralogistics Platform

A key example of this new architecture is Siemens’ Modular Intralogistics Platform, used in manufacturing plants for material handling and supply chain optimization. I explored the shift left architecture in our data streaming use case show with Stefan Baer, Senior Key Expert – Data Streaming at Siemens IT.

Traditionally, intralogistics systems were tightly coupled, with rigid integrations between:

  • Enterprise Resource Planning (ERP): Order management, master data
  • Manufacturing Operations Management (MOM): Production scheduling, quality, maintenance
  • Warehouse Execution System (EWM): Inventory, picking, warehouse automation
  • Execution Management System (eMS): Transport control, automated guided vehicle (AGV) orchestration, conveyor logic

The new approach breaks this down into packaged business capabilities—each one modular, orchestrated, and connected through Confluent Cloud.

Key benefits:

  • Real-time orchestration of logistics operations
  • Automated material delivery—no manual reordering required
  • ERP and MOM systems integrated flexibly via Kafka
  • High adaptability through modular components
  • GenAI used for package station load optimization

Stream processing with Apache Flink transforms events in motion. For example, when a production order changes or material shortages occur, the system reacts instantly—adjusting delivery routes, triggering alerts, or rebalancing station loads using AI.
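
As an illustrative sketch of this kind of logic (not Siemens’ actual implementation), a single Flink SQL statement can turn a stream of inventory events into immediate replenishment alerts; the broker, topics, and field names are assumptions.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Kafka-backed source of stock events (broker, topic, and schema are placeholders).
t_env.execute_sql("""
    CREATE TABLE material_stock (
        station_id STRING,
        material_id STRING,
        stock_level INT,
        reorder_threshold INT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'material-stock',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Kafka-backed sink for alerts that downstream AGV or ordering systems consume.
t_env.execute_sql("""
    CREATE TABLE replenishment_alerts (
        station_id STRING,
        material_id STRING,
        stock_level INT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'replenishment-alerts',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json'
    )
""")

# React the moment stock falls below its reorder threshold.
t_env.execute_sql("""
    INSERT INTO replenishment_alerts
    SELECT station_id, material_id, stock_level
    FROM material_stock
    WHERE stock_level < reorder_threshold
""")
```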

Architecture: Data Products + Shift Left

At the heart of the solution is a combination of data products and stream processing:

  • Kafka Topics serve as real-time interfaces and persistence layer between business domains.
  • Confluent Cloud hosts the event streaming infrastructure as a fully-managed service with low latency, elasticity, and critical SLAs.
  • Stream processing with serverless Flink logic enriches and transforms data in motion.
  • Snowflake receives curated, ready-to-use data for analytics.
  • Other operational and analytical downstream consumers—such as GenAI modules or shop floor dashboards—access the same consistent data in real time.
Siemens Digital Industries - Modular Intralogistics Platform 
Source: Siemens Digital Industries

This reuse of data products ensures consistent semantics, reduces duplication, and simplifies governance.

By processing data earlier in the pipeline, Siemens improves both data quality and system responsiveness. This model replaces brittle, point-to-point integrations with a more sustainable, scalable platform architecture.

Siemens Shift Left Architecture and Data Products with Data Streaming using Apache Kafka and Flink
Source: Siemens Digital Industries

Business Value of Data Streaming and Shift Left at Siemens Digital Industries

The combination of real-time data streaming, modular data products, and Shift Left design principles unlocks significant value:

  • Faster response to dynamic events in production and logistics
  • Improved operational resilience and agility
  • Higher quality data for both analytics and AI
  • Reuse across multiple consumers (analytics, operations, automation)
  • Lower integration costs and easier scaling

This approach is not just technically superior—it supports measurable business outcomes like shorter lead times, lower stock levels, and increased manufacturing throughput.

Siemens Healthineers: Shift Left with IoT, Data Streaming, AI/ML, Confluent and Snowflake in Manufacturing and Healthcare

In a recent blog post, I explored how Siemens Healthineers uses Apache Kafka and Flink to transform both manufacturing and healthcare with a wide range of data streaming use cases. From predictive maintenance to real-time logistics, their approach is a textbook example of how to modernize complex environments with an event-driven architecture and data streaming, even if they don’t explicitly label it “shift left.”

Siemens Healthineers Data Cloud Technology Stack with Apache Kafka and Snowflake
Source: Siemens Healthineers

Their architecture enables proactive decision-making by pushing real-time insights and automation earlier in the process. Examples include telemetry streaming from medical devices, machine integration with SAP and KUKA robots, and logistics event streaming from SAP for faster packaging and delivery. Each use case shows how real-time data—combined with cloud-native platforms like Confluent and Snowflake—improves efficiency, reliability, and responsiveness.

Just like the intralogistics example from Siemens Digital Industries, Healthineers applies shift-left thinking by enabling teams to act on data sooner, reduce latency, and prevent costly delays. This approach enhances not only operational workflows but also outcomes that matter, like patient care and regulatory compliance.

This is shift left in action: embedding intelligence and quality controls early, where they have the greatest impact.

Rethinking Industrial Data Architectures with Data Streaming and Shift Left Architecture

Siemens Digital Industries is demonstrating what’s possible when you rethink the data architecture beyond just analytics in a data lake.

With data streaming leveraging Confluent Cloud, data products for modular software, and a Shift Left approach, Siemens is transforming traditional factories into intelligent, event-driven operations. A data streaming platform based on Apache Kafka is no longer just an ingestion layer. It is a central nervous system for real-time processing and decision-making.

This is not about chasing trends. It’s about building resilient, scalable, and future-proof industrial systems. And it’s just the beginning.

To learn more, watch the on-demand industry use case show with Siemens Digital Industries and Siemens Healthineers or connect with us to explore what data streaming can do for your organization.

Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. And download my free book about data streaming use cases.

The post Shift Left Architecture at Siemens: Real-Time Innovation in Manufacturing and Logistics with Data Streaming appeared first on Kai Waehner.

]]>
The Importance of Focus: Why Software Vendors Should Specialize Instead of Doing Everything (Example: Data Streaming) https://www.kai-waehner.de/blog/2025/04/07/the-importance-of-focus-why-software-vendors-should-specialize-instead-of-doing-everything-example-data-streaming/ Mon, 07 Apr 2025 03:31:55 +0000 https://www.kai-waehner.de/?p=7527 As real-time technologies reshape IT architectures, software vendors face a critical decision: specialize deeply in one domain or build a broad, general-purpose stack. This blog examines why a focused approach—particularly in the world of data streaming—delivers greater innovation, scalability, and reliability. It compares leading platforms and strategies, from specialized providers like Confluent to generalist cloud ecosystems, and highlights the operational risks of fragmented tools. With data streaming emerging as its own software category, enterprises need clarity, consistency, and deep expertise. In this post, we argue that specialization—not breadth—is what powers mission-critical, real-time applications at global scale.

The post The Importance of Focus: Why Software Vendors Should Specialize Instead of Doing Everything (Example: Data Streaming) appeared first on Kai Waehner.

]]>
As technology landscapes evolve, software vendors must decide whether to specialize in a core area or offer a broad suite of services. Some companies take a highly focused approach, investing deeply in a specific technology, while others attempt to cover multiple use cases by integrating various tools and frameworks. Both strategies have trade-offs, but history has shown that specialization leads to deeper innovation, better performance, and stronger customer trust. This blog explores why focus matters in the context of data streaming software, the challenges of trying to do everything, and how companies that prioritize one thing—data streaming—can build best-in-class solutions that work everywhere.

The Importance of Focus for Software and Cloud Vendors - Data Streaming with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including customer stories across all industries.

Specialization vs. Generalization: Why Data Streaming Requires a Focused Approach

Data streaming enables real-time processing of continuous data flows, allowing businesses to act instantly rather than relying on batch updates. This shift from traditional databases and APIs to event-driven architectures has become essential for modern IT landscapes.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Data streaming is no longer just a technique—it is a new software category. The 2023 Forrester Wave for Streaming Data Platforms confirms its role as a core component of scalable, real-time architectures. Technologies like Apache Kafka and Apache Flink have become industry standards. They power cloud, hybrid, and on-premise environments for real-time data movement and analytics.

Businesses increasingly adopt streaming-first architectures, focusing on:

  • Hybrid and multi-cloud streaming for real-time edge-to-cloud integration
  • AI-driven analytics powered by continuous optimization and inference using machine learning models
  • Streaming data contracts to ensure governance and reliability across the entire data pipeline
  • Converging operational and analytical workloads to replace inefficient batch processing and Lambda architectures built on multiple separate data pipelines

The Data Streaming Landscape

As data streaming becomes a core part of modern IT, businesses must choose the right approach: adopt a purpose-built data streaming platform or piece together multiple tools with limitations. Event-driven architectures demand scalability, low latency, cost efficiency, and strict SLAs to ensure real-time data processing meets business needs.

Some solutions may be “good enough” for specific use cases, but they often lack the performance, reliability, and flexibility required for large-scale, mission-critical applications.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

The Data Streaming Landscape highlights the differences—while some vendors provide basic capabilities, others offer a complete Data Streaming Platform (DSP) designed to handle complex, high-throughput workloads with enterprise-grade security, governance, and real-time analytics. Choosing the right platform is essential for staying competitive in an increasingly data-driven world.

The Challenge of Doing Everything

Many software vendors and cloud providers attempt to build a comprehensive technology stack, covering everything from data lakes and AI to real-time data streaming. While this offers customers flexibility, it often leads to overlapping services, inconsistent long-term investment, and complexity in adoption.

A few examples (from the perspective of data streaming solutions).

Amazon AWS: Multiple Data Streaming Services, Multiple Choices

AWS has built the most extensive cloud ecosystem, offering services for nearly every aspect of modern IT, including data lakes, AI, analytics, and real-time data streaming. While this breadth provides flexibility, it also leads to overlapping services, evolving strategies, and complexity in decision-making for customers, resulting in frequent solution ambiguity.

Amazon provides several options for real-time data streaming and event processing, each with different capabilities:

  • Amazon SQS (Simple Queue Service): One of AWS’s oldest and most widely adopted messaging services. It’s reliable for basic decoupling and asynchronous workloads, but it lacks native support for real-time stream processing, ordering, replayability, and event-time semantics.
  • Amazon Kinesis Data Streams: A managed service for real-time data ingestion and simple event processing, but lacks the full event streaming capabilities of a complete data streaming platform.
  • Amazon MSK (Managed Streaming for Apache Kafka): A partially managed Kafka service that mainly focuses on Kafka infrastructure management. It leaves customers to handle critical operational support (MSK does NOT provide SLAs or support for Kafka itself) and misses capabilities such as stream processing, schema management, and governance.
  • AWS Glue Streaming ETL: A stream processing service built for data transformations but not designed for high-throughput, real-time event streaming.
  • Amazon Flink (formerly Kinesis Data Analytics): AWS’s attempt to offer a fully managed Apache Flink service for real-time event processing, competing directly with open-source Flink offerings.

Each of these services targets different real-time use cases, but they lack a unified, end-to-end data streaming platform. Customers must decide which combination of AWS services to use, increasing integration complexity, operational overhead, and costs.

Strategy Shift and Rebranding with Multiple Product Portfolios

AWS has introduced, rebranded, and developed its real-time streaming services over time:

  • Kinesis Data Analytics was originally AWS’s solution for stream processing but was later rebranded as Amazon Flink, acknowledging Flink’s dominance in modern stream processing.
  • MSK Serverless was introduced to simplify Kafka adoption but brings various additional product limitations and cost challenges.
  • AWS Glue Streaming ETL overlaps with Flink’s capabilities, adding confusion about the best choice for real-time data transformations.

As AWS expands its cloud-native services, customers must navigate a complex mix of technologies—often requiring third-party solutions to fill gaps—while assessing whether AWS’s flexible but fragmented approach meets their real-time data streaming needs or if a specialized, fully integrated platform is a better fit.

Google Cloud: Multiple Approaches to Streaming Analytics

Google Cloud is known for its powerful analytics and AI/ML tools, but its strategy in real-time stream processing has been inconsistent:

Customers looking for stream processing in Google Cloud now have four competing services:

  • Google Managed Service for Apache Kafka (Google MSK) (a managed Kafka offering). Google MSK is very early stage in the maturity curve and has many limitations.
  • Google Dataflow (built on Apache Beam)
  • Google Pub/Sub (event messaging)
  • Apache Flink on Dataproc (a managed service)

While each of these services has its use cases, they introduce complexity for customers who must decide which option is best for their workloads.

BigQuery Flink was introduced to extend Google’s analytics capabilities into real-time processing but was later discontinued before ever exiting the preview stage.

Microsoft Azure: Shifting Strategies in Data Streaming

Microsoft Azure has taken multiple approaches to real-time data streaming and analytics, with an evolving strategy that integrates various tools and services.

  • Azure Event Hubs has been a core event streaming service within Azure, designed for high-throughput data ingestion. It supports the Apache Kafka protocol (through Kafka version 3.0, so its feature set lags considerably), making it a flexible choice for (some) real-time workloads. However, it primarily focuses on event ingestion rather than event storage, data processing, and integration, which are additional capabilities of a complete data streaming platform.
  • Azure Stream Analytics was introduced as a serverless stream processing solution, allowing customers to analyze data in motion. Despite its capabilities, its adoption has remained limited, particularly as enterprises seek more scalable, open-source alternatives like Apache Flink.
  • Microsoft Fabric is now positioned as an all-in-one data platform, integrating business intelligence, data engineering, real-time streaming, and AI. While this brings together multiple analytics tools, it also shifts the focus away from dedicated, specialized solutions like Stream Analytics.

While Microsoft Fabric aims to simplify enterprise data infrastructure, its broad scope means that customers must adapt to yet another new platform rather than continuing to rely on long-standing, specialized services. The combination of Azure Event Hubs, Stream Analytics, and Fabric presents multiple options for stream processing, but also introduces complexity, limitations and increased cost for a combined solution.

Microsoft’s approach highlights the challenge of balancing broad platform integration with long-term stability in real-time streaming technologies. Organizations using Azure must evaluate whether their streaming workloads require deep, specialized solutions or can fit within a broader, integrated analytics ecosystem.

I wrote an entire blog series to demystify what Microsoft Fabric really is.

Instaclustr: Too Many Technologies, Not Enough Depth

Instaclustr has positioned itself as a managed platform provider for a wide array of open-source technologies, including Apache Cassandra, Apache Kafka, Apache Spark, Apache ZooKeeper, OpenSearch, PostgreSQL, Redis, and more. While this broad portfolio offers customers choices, it reflects a horizontal expansion strategy that lacks deep specialization in any one domain.

For organizations seeking help with real-time data streaming, Instaclustr’s Kafka offering may appear to be a viable managed service. However, unlike purpose-built data streaming platforms, Instaclustr’s Kafka solution is just one of many services, with limited investment in stream processing, schema governance, or advanced event-driven architectures.

Because Instaclustr splits its engineering and support resources across so many technologies, customers often face challenges in:

  • Getting deep technical expertise for Kafka-specific issues
  • Relying on long-term roadmaps and support for evolving Kafka features
  • Building integrated event streaming pipelines that require more than basic Kafka infrastructure

This generalist model may be appealing for companies looking for low-cost, basic managed services—but it falls short when mission-critical workloads demand real-time reliability, zero data loss, SLAs, and advanced stream processing capabilities. Without a singular focus, platforms like Instaclustr risk becoming jacks-of-all-trades but masters of none—especially in the demanding world of real-time data streaming.

Cloudera: A Broad Portfolio Without a Clear Focus

Cloudera has adopted a distinct strategy by incorporating various open-source frameworks into its platform, including:

  • Apache Kafka (event streaming)
  • Apache Flink (stream processing)
  • Apache Iceberg (data lake table format)
  • Apache Hadoop (big data storage and batch processing)
  • Apache Hive (SQL querying)
  • Apache Spark (batch and near real-time processing and analytics)
  • Apache NiFi (data flow management)
  • Apache HBase (NoSQL database)
  • Apache Impala (real-time SQL engine)
  • Apache Pulsar (event streaming, via a partnership with StreamNative)

While this provides flexibility, it also introduces significant complexity:

  • Customers must determine which tools to use for specific workloads.
  • Integration between different components is not always seamless.
  • The broad scope makes it difficult to maintain deep expertise in each area.

Rather than focusing on one core area, Cloudera’s strategy appears to be adding whatever is trending in open source, which can create challenges in long-term support and roadmap clarity.

Splunk: Repeated Attempts at Data Streaming

Splunk, known for log analytics, has tried multiple times to enter the data streaming market:

Initially, Splunk built a proprietary streaming solution that never gained widespread adoption.

Later, Splunk acquired Streamlio to leverage Apache Pulsar as its streaming backbone. This Pulsar-based strategy ultimately failed, leading to a lack of a clear real-time streaming offering.

Splunk’s challenges highlight a key lesson: successful data streaming requires long-term investment and specialization, not just acquisitions or technology integrations.

Why a Focused Approach Works Better for Data Streaming

Some vendors take a more specialized approach, focusing on one core capability and doing it better than anyone else. For data streaming, Confluent became the leader in this space by focusing on improving the vision of a complete data streaming platform.

Confluent: Focused on Data Streaming, Built for Everywhere

At Confluent, the focus is clear: real-time data streaming. Unlike many other vendors and the cloud providers that offer fragmented or overlapping services, Confluent specializes in one thing and ensures it works everywhere:

  • Cloud: Deploy across AWS, Azure, and Google Cloud with deep native integrations.
  • On-Premise: Enterprise-grade deployments with full control over infrastructure.
  • Edge Computing: Real-time streaming at the edge for IoT, manufacturing, and remote environments.
  • Hybrid Cloud: Seamless data streaming across edge, on-prem, and cloud environments.
  • Multi-Region: Built-in disaster recovery and globally distributed architectures.

More Than Just “The Kafka Company”

While Confluent is often recognized as “the Kafka company,” it has grown far beyond that. Today, Confluent is a complete data streaming platform, combining Apache Kafka for event streaming, Apache Flink for stream processing, and many additional components for data integration, governance and security to power critical workloads.

However, Confluent remains laser-focused on data streaming—it does NOT compete with BI, AI model training, search platforms, or databases. Instead, it integrates and partners with best-in-class solutions in these domains to ensure businesses can seamlessly connect, process, and analyze real-time data within their broader IT ecosystem.

The Right Data Streaming Platform for Every Use Case

Confluent is not just one product—it matches the specific needs, SLAs, and cost considerations of different streaming workloads:

  • Fully Managed Cloud (SaaS)
    • Dedicated and multi-tenant Enterprise Clusters: Low latency, strict SLAs for mission-critical workloads.
    • Freight Clusters: Optimized for high-volume, relaxed latency requirements.
  • Bring Your Own Cloud (BYOC)
    • WarpStream: Bring Your Own Cloud for flexibility and cost efficiency.
  • Self-Managed
    • Confluent Platform: Deploy anywhere—customer cloud VPC, on-premise, at the edge, or across multi-region environments.

Confluent is built for organizations that require more than just “some” data streaming—it is for businesses that need a scalable, reliable, and deeply integrated event-driven architecture. Whether operating in a cloud, hybrid, or on-premise environment, Confluent ensures real-time data can be moved, processed, and analyzed seamlessly across the enterprise.

By focusing only on data streaming, Confluent ensures seamless integration with best-in-class solutions across both operational and analytical workloads. Instead of competing across multiple domains, Confluent partners with industry leaders to provide a best-of-breed architecture that avoids the trade-offs of an all-in-one compromise.

Deep Integrations Across Key Ecosystems

A purpose-built data streaming platform plays well with cloud providers and other data platforms. A few examples:

  • Cloud Providers (AWS, Azure, Google Cloud): While all major cloud providers offer some data streaming capabilities, Confluent takes a different approach by deeply integrating into their ecosystems. Confluent’s managed services can be:
    • Consumed via cloud credits through the cloud provider marketplace
    • Integrated natively into the cloud providers’ security and networking services
    • Connected out-of-the-box to cloud provider services like object storage, lakehouses, and databases via fully managed connectors
  • MongoDB: A leader in NoSQL and operational workloads, MongoDB integrates with Confluent via Kafka-based change data capture (CDC), enabling real-time event streaming between transactional databases and event-driven applications.
  • Databricks: A powerhouse in AI and analytics, Databricks integrates bi-directionally with Confluent, either via Kafka and Apache Spark, or via object storage and the open table formats Iceberg and Delta Lake using Tableflow. This enables businesses to stream data for AI model training in Databricks and perform real-time model inference directly within the streaming platform.

Rather than attempting to own the entire data stack, Confluent specializes in data streaming and integrates seamlessly with the best cloud, AI, and database solutions.

Beyond the Leader: Specialized Vendors Shaping Data Streaming

Confluent is not alone in recognizing the power of focus. A handful of other vendors have also chosen to specialize in data streaming—each with their own vision, strengths, and approaches.

WarpStream, recently acquired by Confluent, is a Kafka-compatible infrastructure solution designed for Bring Your Own Cloud (BYOC) environments. It re-architects Kafka by running the protocol directly on cloud object storage like Amazon S3, removing the need for traditional brokers or persistent compute. This model dramatically reduces operational complexity and cost—especially for high-ingest, elastic workloads. While WarpStream is now part of the Confluent portfolio, it remains a distinct offering focused on lightweight, cost-efficient Kafka infrastructure.

StreamNative is the commercial steward of Apache Pulsar, aiming to provide a unified messaging and streaming platform. Built for multi-tenancy and geo-replication, it offers some architectural differentiators, particularly in use cases where separation of compute and storage is a must. However, adoption remains niche, and the surrounding ecosystem still lacks maturity and standardization.

Redpanda positions itself as a Kafka-compatible alternative with a focus on performance, especially in low-latency and resource-constrained environments. Its C++ foundation and single-binary architecture make it appealing for edge and latency-sensitive workloads. Yet, Redpanda still needs to mature in areas like stream processing, integrations, and ecosystem support to serve as a true platform.

AutoMQ re-architects Apache Kafka for the cloud by separating compute and storage using object storage like S3. It aims to simplify operations and reduce costs for high-throughput workloads. Though fully Kafka-compatible, AutoMQ concentrates on infrastructure optimization and currently lacks broader platform capabilities like governance, processing, or hybrid deployment support.

Bufstream is experimenting with lightweight approaches to real-time data movement using modern developer tooling and APIs. While promising in niche developer-first scenarios, it has yet to demonstrate scalability, production maturity, or a robust ecosystem around complex stream processing and governance.

Ververica focuses on stream processing with Apache Flink. It offers Ververica Platform to manage Flink deployments at scale, especially on Kubernetes. While it brings deep expertise in Flink operations, it does not provide a full data streaming platform and must be paired with other components, like Kafka for ingestion and delivery.

Great Ideas Are Born From Market Pressure

Each of these companies brings interesting ideas to the space. But building and scaling a complete, enterprise-grade data streaming platform is no small feat. It requires not just infrastructure, but capabilities for processing, governance, security, global scale, and integrations across complex environments.

That’s where Confluent continues to lead—by combining deep technical expertise, a relentless focus on one problem space, and the ability to deliver a full platform experience across cloud, on-prem, and hybrid deployments.

In the long run, the data streaming market will reward not just technical innovation, but consistency, trust, and end-to-end excellence. For now, the message is clear: specialization matters—but execution matters even more. Let’s see where the others go.

How Customers Benefit from Specialization

A well-defined focus provides several advantages for customers, ensuring they get the right tool for each job without the complexity of navigating overlapping services.

  • Clarity in technology selection: No need to evaluate multiple competing services; purpose-built solutions ensure the right tool for each use case.
  • Deep technical investment: Continuous innovation focused on solving specific challenges rather than spreading resources thin.
  • Predictable long-term roadmap: Stability and reliability with no sudden service retirements or shifting priorities.
  • Better performance and reliability: Architectures optimized for the right workloads through the deep experience in the software category.
  • Seamless ecosystem integration: Works natively with leading cloud providers and other data platforms for a best-of-breed approach.
  • Deployment flexibility: Not bound to a single environment like one cloud provider; businesses can run workloads on-premise, in any cloud, at the edge, or across hybrid environments.

Rather than adopting a broad but shallow set of solutions, businesses can achieve stronger outcomes by choosing vendors that specialize in one core competency and deliver it everywhere.

Why Deep Expertise Matters: Supporting 24/7, Mission-Critical Data Streaming

For mission-critical workloads, where downtime, data loss, and compliance failures are not an option, deep expertise is not just an advantage; it is a necessity.

Data streaming is a high-performance, real-time infrastructure that requires continuous reliability, strict SLAs, and rapid response to critical issues. When something goes wrong at the core of an event-driven architecture—whether in Apache Kafka, Apache Flink, or the surrounding ecosystem—only specialized vendors with proven expertise can ensure immediate, effective solutions.

The Challenge with Generalist Cloud Services

Many cloud providers offer some level of data streaming, but their approach is different from a dedicated data streaming platform. Take Amazon MSK as an example:

  • Amazon MSK provides managed Kafka clusters, but does NOT offer Kafka support itself. If an issue arises deep within Kafka, customers are responsible for troubleshooting it—or must find external experts to resolve the problem.
  • The terms and conditions of Amazon MSK explicitly exclude Kafka support, meaning that, for mission-critical applications requiring uptime guarantees, compliance, and regulatory alignment, MSK is not a viable choice.
  • This lack of core Kafka support poses a serious risk for enterprises relying on event streaming for financial transactions, real-time analytics, AI inference, fraud detection, and other high-stakes applications.

For companies that cannot afford failure, a data streaming vendor with direct expertise in the underlying technology is essential.

Why Specialized Vendors Are Essential for Mission-Critical Workloads

A complete data streaming platform is much more than a hosted Kafka cluster or a managed Flink service. Specialized vendors like Confluent offer end-to-end operational expertise, covering:

  • 24/7 Critical Support: Direct access to Kafka and Flink experts, ensuring immediate troubleshooting for core-level issues.
  • Guaranteed SLAs: Strict uptime commitments, ensuring that mission-critical applications are always running.
  • No Data Loss Architecture: Built-in replication, failover, and durability to prevent business-critical data loss.
  • Security & Compliance: Encryption, access control, and governance features designed for regulated industries.
  • Professional Services & Advisory: Best practices, architecture reviews, and operational guidance tailored for real-time streaming at scale.

This level of deep, continuous investment in operational excellence separates a general-purpose cloud service from a true data streaming platform.

The Power of Specialization: Deep Expertise Beats Broad Offerings

Software vendors will continue expanding their offerings, integrating new technologies, and launching new services. However, focus remains a key differentiator in delivering best-in-class solutions, especially for operational systems with critical SLAs—where low latency, 24/7 uptime, no data loss, and real-time reliability are non-negotiable.

For companies investing in strategic data architectures, choosing a vendor with deep expertise in one core technology—rather than one that spreads across multiple domains—ensures stability, predictable performance, and long-term success.

In a rapidly evolving technology landscape, clarity, specialization, and seamless integration are the foundations of lasting innovation. Businesses that prioritize proven, mission-critical solutions will be better equipped to handle the demands of real-time, event-driven architectures at scale.

How do you see the world of software? Is it better to specialize or to become an all-rounder? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. And download my free book about data streaming use cases.

The post The Importance of Focus: Why Software Vendors Should Specialize Instead of Doing Everything (Example: Data Streaming) appeared first on Kai Waehner.

]]>
Data Streaming as the Technical Foundation for a B2B Marketplace https://www.kai-waehner.de/blog/2025/03/05/data-streaming-as-the-technical-foundation-for-a-b2b-marketplace/ Wed, 05 Mar 2025 06:26:59 +0000 https://www.kai-waehner.de/?p=7288 A B2B data marketplace empowers businesses to exchange, monetize, and leverage real-time data through self-service platforms featuring subscription management, usage-based billing, and secure data sharing. Built on data streaming technologies like Apache Kafka and Flink, these marketplaces deliver scalable, event-driven architectures for seamless integration, real-time processing, and compliance. By exploring successful implementations like AppDirect, this post highlights how organizations can unlock new revenue streams and foster innovation with modern data marketplace solutions.

The post Data Streaming as the Technical Foundation for a B2B Marketplace appeared first on Kai Waehner.

]]>
A B2B data marketplace is a groundbreaking platform enabling businesses to exchange, monetize, and use data in real time. Beyond the basic promise of data sharing, these marketplaces are evolving into self-service platforms with features such as subscription management, usage-based billing, and secure data monetization. This post explores the core technical and functional aspects of building a data marketplace for subscription commerce using data streaming technologies like Apache Kafka. Drawing inspiration from real-world implementations like AppDirect, the post examines how these capabilities translate into a robust and scalable architecture.

Data Streaming with Apache Kafka and Flink as the Backbone for a B2B Data Marketplace

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

Subscription Commerce with a Digital Marketplace

Subscription commerce refers to business models where customers pay recurring fees—monthly, annually, or usage-based—for access to products or services, such as SaaS, streaming platforms, or subscription boxes.

Digital marketplaces are online platforms where multiple vendors can sell their products or services to customers, often incorporating features like catalog management, payment processing, and partner integrations.

Together, subscription commerce and digital marketplaces enable businesses to monetize recurring offerings efficiently, manage customer relationships, and scale through multi-vendor ecosystems. These solutions enable organizations to sell their own or third-party recurring technology services through a white-labeled marketplace, or to streamline procurement with an internal IT marketplace to manage and acquire services. The platform empowers digital growth for businesses of all sizes across direct and indirect go-to-market channels.

The Competitive Landscape for Subscription Commerce

The subscription commerce and digital marketplace space includes several prominent players offering specialized solutions.

Zuora leads in enterprise-grade subscription billing and revenue management, while Chargebee and Recurly focus on flexible billing and automation for SaaS and SMBs. Paddle provides global payment and subscription management tailored to SaaS businesses. AppDirect stands out for enabling SaaS providers and enterprises to manage subscriptions, monetize offerings, and build partner ecosystems through a unified platform.

For marketplace platforms, CloudBlue (from Ingram Micro) enables as-a-service ecosystems for telcos and cloud providers, and Mirakl excels at building enterprise-level B2B and B2C marketplaces.

Solutions like ChannelAdvisor and Vendasta cater to resellers and localized businesses with marketplace and subscription tools. Each platform offers unique capabilities, making the choice dependent on specific needs like scalability, industry focus, and integration requirements.

What Makes a B2B Data Marketplace Technically Unique?

A data marketplace is more than a repository; it is a dynamic, decentralized platform that enables the continuous exchange of data streams across organizations. Its key distinguishing features include:

  1. Real-Time Data Sharing: Enables instantaneous exchange and consumption of data streams.
  2. Decentralized Design: Avoids reliance on centralized data hubs, reducing latency and risk of single points of failure.
  3. Fine-Grained Access Control: Ensures secure and compliant data sharing.
  4. Self-Service Capabilities: Simplifies the discovery and consumption of data through APIs and portals.
  5. Usage-Based Billing and Monetization: Tracks data usage in real time to enable flexible pricing models.

These characteristics require a scalable, fault-tolerant, and real-time data processing backbone. Enter data streaming with the de facto standard Apache Kafka.

Data Streaming as the Backbone of a B2B Data Marketplace

At the heart of a B2B data marketplace lies data streaming, a technology paradigm enabling continuous data flow and processing. Kafka’s publish-subscribe architecture aligns seamlessly with the marketplace model, where data producers publish streams that consumers can subscribe to in real time.
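
To make this concrete, here is a minimal sketch of that publish-subscribe pattern with the confluent-kafka Python client. The broker address, topic name, and payload structure are hypothetical placeholders for illustration, not a prescribed marketplace schema.

```python
# Minimal sketch: a data provider publishes events, a consumer subscribes.
# Broker address, topic name, and payload are hypothetical placeholders.
import json
from confluent_kafka import Producer, Consumer

BROKER = "localhost:9092"          # assumption: local demo cluster
TOPIC = "marketplace.weather.v1"   # hypothetical data product topic

# Data provider: publishes an event to the marketplace topic
producer = Producer({"bootstrap.servers": BROKER})
event = {"station": "MAD-01", "temperature_c": 21.4, "ts": "2025-01-01T12:00:00Z"}
producer.produce(TOPIC, key="MAD-01", value=json.dumps(event))
producer.flush()

# Data consumer: subscribes to the same topic in its own consumer group
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "analytics-subscriber",   # each subscriber uses its own group
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print("Received:", json.loads(msg.value()))
consumer.close()
```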

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Why Apache Kafka for a Data Marketplace?

A data streaming platform uniquely combines different characteristics and capabilities:

  1. Scalability and Fault Tolerance: Kafka’s distributed architecture allows for handling large volumes of data streams, ensuring high availability even during failures.
  2. Event-Driven Design: Kafka provides a natural fit for event-driven architectures, where data exchanges trigger workflows, such as subscription activation or billing.
  3. Stream Processing with Kafka Streams or ksqlDB: Real-time transformation, filtering, and enrichment of data streams can be performed natively, ensuring the data is actionable as it flows.
  4. Integration with Ecosystem: Kafka’s connectors enable seamless integration with external systems such as billing platforms, monitoring tools, and data lakes.
  5. Security and Compliance: Built-in features like TLS encryption, SASL authentication, and fine-grained ACLs ensure the marketplace adheres to strict security standards.

I wrote a separate article that explores how an Event-driven Architecture (EDA) and Apache Kafka build the foundation of a streaming data exchange.

Architecture Overview

Modern architectures for data marketplaces are often inspired by Domain-Driven Design (DDD), microservices, and the principles of a data mesh.

  • Domain-Driven Design helps structure the platform around distinct business domains, ensuring each part of the marketplace aligns with its core functionality, such as subscription management or billing.
  • Microservices decompose the marketplace into independently deployable services, promoting scalability and modularity.
  • A Data mesh decentralizes data ownership, empowering individual teams or providers to manage and share their datasets while adhering to shared governance policies.

Decentralised Data Products with Data Streaming leveraging Apache Kafka in a Data Mesh

Together, these principles create a flexible, scalable, and business-aligned architecture. A high-level architecture for such a marketplace involves:

  1. Data Providers: Publish real-time data streams to Kafka Topics. Use Kafka Connect to ingest data from external sources.
  2. Data Marketplace Platform: A front-end portal backed by Kafka for subscription management, search, and discovery. Kafka Streams or Apache Flink for real-time processing (e.g., billing, transformation). Integration with billing systems, identity management, and analytics platforms.
  3. Data Consumers: Subscribe to Kafka Topics, consuming data tailored to their needs. Integrate the marketplace streams into their own analytics or operational workflows.

Data Sharing Beyond Kafka with Stream Sharing and Self-Service Data Portal

A data streaming platform enables simple and secure data sharing within or across organizations, with built-in chargeback capabilities to support cost APIs and new business models. The following is an implementation leveraging Confluent’s Stream Sharing functionality in Confluent Cloud:

Confluent Stream Sharing for Data Sharing Beyond Apache Kafka
Source: Confluent

Data Marketplace Features and Their Technical Implementation

A robust B2B data marketplace should offer the following vendor-agnostic features:

Self-Service Data Discovery

Real-Time Subscription Management

  • Functionality: Enables users to subscribe to data streams with customizable preferences, such as data filters or frequency of updates.
  • Technical Implementation: Use Kafka’s consumer groups to manage subscriptions. Implement filtering logic with Kafka Streams or ksqlDB to tailor streams to user preferences (see the sketch below).
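
The post points to Kafka Streams or ksqlDB for this filtering; the sketch below shows the same consume-filter-republish idea with the plain confluent-kafka Python client so the mechanics are visible. Topic names, the consumer group, and the filter rule are assumptions.

```python
# Sketch: apply a subscriber-specific filter and republish to a dedicated topic.
# Topic names, consumer group, and filter rule are hypothetical placeholders.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "subscriber-acme",      # one consumer group per subscription
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["marketplace.weather.v1"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Hypothetical subscriber preference: only events above 30 degrees Celsius
        if event.get("temperature_c", 0) > 30:
            producer.produce("subscriber.acme.filtered", value=json.dumps(event))
finally:
    producer.flush()
    consumer.close()
```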

Usage-Based Billing

  • Functionality: Tracks the volume or type of data consumed by each user and generates invoices dynamically.
  • Technical Implementation: Use Kafka’s log retention and monitoring tools to track data consumption. Integrate with a billing engine via Kafka Connect or RESTful APIs for real-time invoice generation (see the sketch below).
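
As a simple illustration of usage metering, the sketch below counts the bytes delivered to a subscriber over a fixed window and emits a usage record that a billing engine could consume. Topic names, the window size, and the single hard-coded subscriber are assumptions for brevity.

```python
# Sketch: windowed usage metering emitted to a billing topic.
# Topic names, window size, and subscriber id are hypothetical placeholders.
import json
import time
from collections import defaultdict
from confluent_kafka import Consumer, Producer

WINDOW_SECONDS = 60
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "usage-metering",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["subscriber.acme.filtered"])   # hypothetical delivery topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

bytes_per_subscriber = defaultdict(int)
window_start = time.time()

while True:
    msg = consumer.poll(1.0)
    if msg is not None and msg.error() is None:
        bytes_per_subscriber["acme"] += len(msg.value())   # single subscriber assumed
    if time.time() - window_start >= WINDOW_SECONDS:
        for subscriber, usage in bytes_per_subscriber.items():
            record = {"subscriber": subscriber, "bytes": usage, "window_start": window_start}
            producer.produce("billing.usage", value=json.dumps(record))
        producer.flush()
        bytes_per_subscriber.clear()
        window_start = time.time()
```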

Monetization and Revenue Sharing

  • Functionality: Facilitates revenue sharing between data providers and marketplace operators.
  • Technical Implementation: Build a revenue-sharing logic layer using Kafka Streams or Apache Flink, processing data usage metrics. Store provider-specific pricing models in a database connected via Kafka Connect.

Compliance and Data Governance

  • Functionality: Ensures data sharing complies with regulations (e.g., GDPR, HIPAA) and provides an audit trail.
  • Technical Implementation: Leverage Kafka’s immutable event log as an auditable record of all data exchanges. Implement data contracts for Kafka Topics with policies, role-based access control (RBAC), and encryption for secure sharing.

Dynamic Pricing Models

Marketplace Analytics

  • Functionality: Offers insights into usage patterns, revenue streams, and operational metrics.
  • Technical Implementation: Aggregate Kafka stream data into analytics platforms such as Snowflake, Databricks, Elasticsearch, or Microsoft Fabric.

Real-World Success Story: AppDirect’s Subscription Commerce Platform Powered by a Data Streaming Platform

AppDirect is a leading subscription commerce platform that helps businesses monetize and manage software, services, and data through a unified digital marketplace. It provides tools for subscription billing, usage tracking, partner management, and revenue sharing, enabling seamless B2B transactions.

AppDirect B2B Data Marketplace for Subscription Commerce
Source: AppDirect

AppDirect serves customers across industries such as telecommunications (e.g., Telstra, Deutsche Telekom), technology (e.g., Google, Microsoft), and cloud services, powering ecosystems for software distribution and partner-driven monetization.

The Challenge

AppDirect enables SaaS providers to monetize their offerings, but faced significant challenges in scaling its platform to handle the growing complexity of real-time subscription billing and data flow management.

As the number of vendors and consumers on the platform increased, ensuring accurate, real-time tracking of usage and billing became increasingly difficult. Additionally, the legacy systems struggled to support seamless integration, dynamic pricing models, and real-time updates required for a competitive marketplace experience.

The Solution

AppDirect implemented a data streaming backbone with Apache Kafka leveraging Confluent’s data streaming platform. This enabled:

  • Real-time billing for subscription services.
  • Accurate usage tracking and monetization.
  • Improved scalability with a distributed, event-driven architecture.

The Outcome

  • 90% reduction in time-to-market for new features.
  • Enhanced customer experience with real-time updates.
  • Seamless scaling to handle increasing vendor participation and data loads.

Advantages Over Competitors in the Subscription Commerce and Data Marketplace Business

Powered by an event-driven architecture and a data streaming platform, AppDirect distinguishes itself from competitors in the subscription commerce and data marketplace business:

  • A unified approach to subscription management, billing, and partner ecosystem enablement.
  • Strong focus on the telecommunications and technology sectors.
  • Deep integrations for vendor and reseller ecosystems.

Data Streaming Revolutionizes B2B Data Sharing

The technical backbone of a B2B data marketplace relies on data streaming to deliver real-time data sharing, scalable subscription management, and secure monetization. Platforms like Apache Kafka and Confluent enable these features through their distributed, event-driven architecture, ensuring resilience, compliance, and operational efficiency.

By implementing these principles, organizations can build a modern, self-service data marketplace that fosters innovation and collaboration. The success of AppDirect highlights the potential of this approach, offering a blueprint for businesses looking to capitalize on the power of data streaming.

Whether you’re a data provider seeking additional revenue streams or a business aiming to harness external insights, a well-designed data marketplace is your gateway to unlocking value in the digital economy.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And make sure to download my free book about data streaming use cases.

The post Data Streaming as the Technical Foundation for a B2B Marketplace appeared first on Kai Waehner.

]]>
How Siemens Healthineers Leverages Data Streaming with Apache Kafka and Flink in Manufacturing and Healthcare https://www.kai-waehner.de/blog/2024/12/17/how-siemens-healthineers-leverages-data-streaming-with-apache-kafka-and-flink-in-manufacturing-and-healthcare/ Tue, 17 Dec 2024 05:58:17 +0000 https://www.kai-waehner.de/?p=7036 Siemens Healthineers, a global leader in medical technology, delivers solutions that improve patient outcomes and empower healthcare professionals. A significant aspect of their technological prowess lies in their use of data streaming to unlock real-time insights and optimize processes. This blog post explores how Siemens Healthineers uses data streaming with Apache Kafka and Flink, their cloud-focused technology stack, and the use cases that drive tangible business value, such as real-time logistics, robotics, SAP ERP integration, AI/ML, and more.

The post How Siemens Healthineers Leverages Data Streaming with Apache Kafka and Flink in Manufacturing and Healthcare appeared first on Kai Waehner.

]]>
Siemens Healthineers, a global leader in medical technology, delivers solutions that improve patient outcomes and empower healthcare professionals. As part of the Siemens AG family, Siemens Healthineers stands out with innovative products, data-driven solutions, and services designed to optimize workflows, improve precision, and enhance efficiency in healthcare systems worldwide. A significant aspect of their technological prowess lies in their use of data streaming to unlock real-time insights and optimize processes. This blog post explores how Siemens Healthineers uses data streaming with Apache Kafka and Flink, their cloud-focused technology stack, and the use cases that drive tangible business value such as real-time logistics, robotics, SAP ERP integration, AI/ML, and more.

Data Streaming with Apache Kafka and Flink in Healthcare and Manufacturing at Siemens Healthineers

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch.

Siemens Healthineers: Shaping the Future of Healthcare Technology

Who They Are

Siemens AG, a global powerhouse in industrial manufacturing, energy, and technology, has been a leader in innovation for over 170 years. Known for its groundbreaking contributions across sectors, Siemens combines engineering expertise with digitalization to shape industries worldwide. Within this ecosystem, Siemens Healthineers stands out as a pivotal player in healthcare technology.

Siemens Healthineers Company Overview
Source: Siemens Healthineers

With over 71,000 employees operating in 70+ countries, Siemens Healthineers supports critical clinical decisions in healthcare. Over 90% of leading hospitals worldwide collaborate with them, and their technologies influence over 70% of critical clinical decisions.

Their Vision

Siemens Healthineers focuses on innovation through data and AI, aiming to streamline healthcare delivery. With more than 24,000 technical intellectual property rights, including 15,000 granted patents, their technological foundation enables precision medicine, enhanced diagnostics, and patient-centric solutions.

Smart Logistics and Manufacturing at Siemens
Source: Siemens Healthineers

Siemens Healthineers and Data Streaming for Healthcare and Manufacturing

Siemens is a large conglomerate. I already covered a few data streaming use cases at other Siemens divisions. For instance, the integration project from SAP ERP on-premise to Salesforce CRM in the cloud.

At the Data in Motion Tour 2024 in Frankfurt, Arash Attarzadeh (“Apache Kafka Jedi”) from Siemens Healthineers presented several interesting success stories that leverage data streaming using Apache Kafka, Flink, Confluent, and the surrounding ecosystem.

Healthcare and manufacturing processes generate massive volumes of real-time data. Whether it’s monitoring devices on production floors, analyzing telemetry data from hospitals, or optimizing logistics, Siemens Healthineers recognizes that data streaming enables:

  • Real-time insights: Immediate and continuous action on events as they happen.
  • Improved decision-making: Faster and more accurate responses.
  • Cost efficiency: Reduced downtime and optimized operations.

Healthineers Data Cloud

The Siemens Healthineers Data Cloud serves as the backbone of their data strategy. Built on a robust technology stack, it facilitates real-time data ingestion, transformation, and analytics using tools like Confluent Cloud (including Apache Kafka and Flink) and Snowflake.

Siemens Healthineers Data Cloud Technology Stack with Apache Kafka and Snowflake for Healthcare
Source: Siemens Healthineers

This combination of leading SaaS solutions enables seamless integration of streaming data with batch processes and diverse analytics platforms.

Technology Stack: Healthineers Data Cloud

Key Components

  • Confluent Cloud (Apache Kafka): For real-time data ingestion, data integration and stream processing.
  • Snowflake: A centralized warehouse for analytics and reporting.
  • Matillion: Batch ETL processes for structured and semi-structured data.
  • IoT Data Integration: Sensors and PLCs collect data from manufacturing floors, often via MQTT.

Machine Monitoring and Streaming Analytics with MQTT Confluent Kafka and TensorFlow AI ML in Healthcare and Manufacturing
Source: Siemens Healthineers

Many other solutions are critical for some use cases. Siemens Healthineers also uses Databricks, dbt, OPC-UA, and many other systems for the end-to-end data pipelines.

Diverse Data Ingestion

  • Real-Time Streaming: IoT data (sensors, PLCs) is ingested within minutes.
  • Batch Processing: Structured and semi-structured data from SAP systems.
  • Change Data Capture (CDC): Data changes in SAP sources are captured and available in under 30 minutes.

Not every data integration process is or can be real-time. Data consistency is still one of the most underrated capabilities of data streaming. Apache Kafka supports real-time, batch and request-response APIs communicating with each other in a consistent way.

Use Cases for Data Streaming at Siemens Healthineers

Siemens Healthineers described six different use cases that leverage data streaming together with various other IoT, software and cloud services:

  1. Machine monitoring and predictive maintenance
  2. Data integration layer for analytics
  3. Machine and robot integration
  4. Telemetry data processing for improved diagnostics
  5. Real-time logistics with SAP events for better supply chain efficiency
  6. Track and Trace Orders for improved customer satisfaction and ensured compliance

Let’s take a look at them in the following subsections.

1. Machine Monitoring and Predictive Maintenance in Manufacturing

Goal: To ensure the smooth operation of production devices through predictive maintenance.

Using data streaming, real-time IoT data from drill machines is ingested into Kafka topics, where it’s analyzed to predict maintenance needs. By using a TensorFlow machine learning model for inference with Apache Kafka (a simplified sketch follows below), Siemens Healthineers can:

  • Reduce machine downtime.
  • Optimize maintenance schedules.
  • Increase productivity in manufacturing CT scanners.
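
To illustrate the pattern, here is a minimal Python sketch of streaming model inference: sensor events are consumed from Kafka, scored with a pre-trained TensorFlow model, and alerts are published to another topic. Topic names, the feature layout, the model file, and the alert threshold are assumptions for illustration, not Siemens Healthineers’ actual implementation.

```python
# Sketch: consume sensor events, score them with a model, publish alerts.
# Topics, feature layout, model path, and threshold are hypothetical.
import json
import numpy as np
import tensorflow as tf
from confluent_kafka import Consumer, Producer

model = tf.keras.models.load_model("drill_anomaly_model.keras")  # hypothetical model file

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "predictive-maintenance",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["manufacturing.drill.sensors"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    reading = json.loads(msg.value())
    # Assumed feature layout: vibration, temperature, rpm
    features = np.array([[reading["vibration"], reading["temperature"], reading["rpm"]]])
    failure_probability = float(model.predict(features, verbose=0)[0][0])
    if failure_probability > 0.8:  # assumed alert threshold
        alert = {"machine_id": reading.get("machine_id"), "risk": failure_probability}
        producer.produce("maintenance.alerts", value=json.dumps(alert))
        producer.flush()
```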

Business Value: Predictive maintenance reduces operational costs and prevents production halts, ensuring timely delivery of critical medical equipment.

2. IQ-Data Intelligence from IoT and SAP to Cloud

Goal: Develop an end-to-end data integration layer for analytics.

Data from various lifecycle phases (e.g., SAP systems, IoT interfaces via MQTT using Mosquitto, external sources) is streamed into a consistent model using stream processing with ksqlDB. The resulting data backend supports the development of MLOps architectures and enables advanced analytics.
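
As an illustration of this kind of IoT ingestion, the following sketch bridges MQTT messages from a Mosquitto broker into a Kafka topic using paho-mqtt and the confluent-kafka Python client. Broker addresses and topic names are placeholders, not the actual Siemens setup.

```python
# Sketch: bridge MQTT (Mosquitto) messages into a Kafka topic.
# Broker addresses and topic names are hypothetical placeholders.
import json
import paho.mqtt.client as mqtt
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
KAFKA_TOPIC = "iot.machine.telemetry"      # hypothetical topic name

def on_message(client, userdata, msg):
    # Forward each MQTT message to Kafka, keyed by the MQTT topic
    payload = {"mqtt_topic": msg.topic, "payload": msg.payload.decode("utf-8")}
    producer.produce(KAFKA_TOPIC, key=msg.topic, value=json.dumps(payload))
    producer.poll(0)  # serve delivery callbacks without blocking

mqtt_client = mqtt.Client()                # paho-mqtt 1.x style; 2.x adds a CallbackAPIVersion argument
mqtt_client.on_message = on_message
mqtt_client.connect("localhost", 1883)     # Mosquitto broker
mqtt_client.subscribe("plant/+/machine/#") # hypothetical MQTT topic filter
mqtt_client.loop_forever()
```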

AI MLOps with Kafka Stream Processing Qlik Tableau BI at Siemens Healthineers
Source: Siemens Healthineers

Business Value: Streamlined data integration accelerates the development of AI applications, helping data scientists and analysts make quicker, more informed decisions.

3. Machine Integration with SAP and KUKA Robots

Goal: Integrate machine data for analytics and real-time insights.

Data from SAP systems (such as SAP ME and SAP PCO) and machines like KUKA robots is streamed into Snowflake for analytics. MQTT brokers and Apache Kafka manage real-time data ingestion and facilitate predictive analytics.

Siemens Machine Integration with SAP KUKA Jungheinrich Kafka Confluent Cloud Snowflake
Source: Siemens Healthineers

Business Value: Enhanced machine integration improves production quality and supports the shift toward smart manufacturing processes.

4. Digital Healthcare Service Operations using Data Streaming

Goal: Stream telemetry data from Siemens Healthineers products for analytics.

Telemetry data from hospital devices is streamed via WebSockets to Kafka and combined with ksqlDB for continuous stream processing. Insights are fed back to clients for improved diagnostics.
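
Below is a minimal sketch of such a WebSocket-to-Kafka bridge, assuming the websockets and confluent-kafka Python libraries; the endpoint URL and topic name are hypothetical placeholders.

```python
# Sketch: read device telemetry from a WebSocket endpoint and forward it to Kafka.
# The WebSocket URL and Kafka topic are hypothetical placeholders.
import asyncio
import websockets
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

async def bridge():
    uri = "wss://devices.example.com/telemetry"   # hypothetical device gateway
    async with websockets.connect(uri) as ws:
        async for message in ws:                  # each message is one telemetry event
            producer.produce("device.telemetry", value=message)
            producer.poll(0)

asyncio.run(bridge())
```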

Business Value: By leveraging real-time device data, Siemens Healthineers enhances the reliability of its medical equipment and improves patient outcomes.

5. Real-Time Logistics with SAP Events and Confluent Cloud

Goal: Stream SAP logistics event data for real-time packaging and shipping updates.

Using Confluent Cloud, Siemens Healthineers reduces delays in packaging and shipping by enabling real-time insights into logistics processes.

SAP Logistics Integration with Apache Kafka for Real-Time Shipping Points
Source: Siemens Healthineers

Business Value: Improved packaging planning reduces delivery times and enhances supply chain efficiency, ensuring faster deployment of medical devices.

6. Track and Trace Orders with Apache Kafka and Snowflake

Goal: Real-time order tracking using streaming data.

Data from Siemens Healthineers orders is streamed into Snowflake using Kafka for real-time monitoring. This enables detailed tracking of orders throughout the supply chain.

Business Value: Enhanced order visibility improves customer satisfaction and ensures compliance with regulatory requirements.

Real-Time Data as a Catalyst for Healthcare and Manufacturing Innovation at Siemens Healthineers

Siemens Healthineers’ innovative use of data streaming exemplifies how real-time insights can drive efficiency, reliability, and innovation in healthcare and manufacturing. By leveraging tools like Confluent (including Apache Kafka and Flink), MQTT, and Snowflake, and by transitioning some workloads to the cloud, they’ve built a robust infrastructure to handle diverse data streams, improve decision-making, and deliver tangible business outcomes.

From predictive maintenance to enhanced supply chain visibility, the adoption of data streaming unlocks value at every stage of the production and service lifecycle. For Siemens Healthineers, these advancements translate into better patient care, streamlined operations, and a competitive edge in the dynamic healthcare industry.

To learn more about the relationship between these key technologies and their applications in different use cases, explore the related articles on this blog.

Do you have similar use cases and architectures like Siemens Healthineers to leverage data streaming with Apache Kafka and Flink in the healthcare and manufacturing sector? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post How Siemens Healthineers Leverages Data Streaming with Apache Kafka and Flink in Manufacturing and Healthcare appeared first on Kai Waehner.

]]>
Top Trends for Data Streaming with Apache Kafka and Flink in 2025 https://www.kai-waehner.de/blog/2024/12/02/top-trends-for-data-streaming-with-apache-kafka-and-flink-in-2025/ Mon, 02 Dec 2024 14:02:07 +0000 https://www.kai-waehner.de/?p=6923 Apache Kafka and Apache Flink are leading open-source frameworks for data streaming that serve as the foundation for cloud services, enabling organizations to unlock the potential of real-time data. Over recent years, trends have shifted from batch-based data processing to real-time analytics, scalable cloud-native architectures, and improved data governance powered by these technologies. Looking ahead to 2025, the data streaming ecosystem is set to undergo even greater changes. Here are the top trends shaping the future of data streaming for businesses.

The post Top Trends for Data Streaming with Apache Kafka and Flink in 2025 appeared first on Kai Waehner.

]]>
The evolution of data streaming has transformed modern business infrastructure, establishing real-time data processing as a critical asset across industries. At the forefront of this transformation, Apache Kafka and Apache Flink stand out as leading open-source frameworks that serve as the foundation for cloud services, enabling organizations to unlock the potential of real-time data. Over recent years, trends have shifted from batch-based data processing to real-time analytics, scalable cloud-native architectures, and improved data governance powered by these technologies. Looking ahead to 2025, the data streaming ecosystem is set to undergo even greater changes. Here are the top trends shaping the future of data streaming for businesses.

Data Streaming Trends for 2025 - Leading with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch.

The Top Data Streaming Trends

Some followers might notice that this became a series with articles about the top 5 data streaming trends for 2021, the top 5 for 2022, the top 5 for 2023, and the top 5 for 2024. Trends change over time, but the huge value of having a scalable real-time infrastructure as the central data hub stays. Data streaming with Apache Kafka is a journey and evolution to set data in motion.

I recently explored the past, present, and future of data streaming tools and strategies from the past decades. Data streaming is becoming more and more mature and standardized, but also innovative.

Let’s now look at the top trends coming up more regularly in conversations with customers, prospects, and the broader data streaming community across the globe:

  1. The Democratization of Kafka: Apache Kafka has transitioned from a specialized tool to a key pillar in modern data infrastructure.
  2. Kafka Protocol as the Standard: Vendors adopt the Kafka wire protocol, enabling flexibility with compatibility and performance trade-offs.
  3. BYOC Deployment Model: Bring Your Own Cloud gains traction for balancing security, compliance, and managed services.
  4. Flink Becomes the Standard for Stream Processing: Apache Flink rises as the premier framework for stream processing, building integration pipelines and business applications.
  5. Data Streaming for Real-Time Predictive AI and GenAI: Real-time model inference drives predictive and generative AI applications.
  6. Data Streaming Organizations: Companies unify real-time data strategies to standardize processes, tools, governance, and collaboration.

The following sections describe each trend in more detail. The trends are relevant for many scenarios; no matter if you use open-source frameworks like Apache Kafka and Flink, a commercial platform, or a fully managed cloud service like Confluent Cloud.

Trend 1: The Democratization of Kafka

In the last decade, Apache Kafka has become the standard for data streaming, evolving from a specialized tool to an essential utility in the modern tech stack. With over 150,000 organizations using Kafka today, it has become the de facto choice for event streaming. Yet, with a market crowded by offerings from AWS, Microsoft Azure, Google GCP, IBM, Oracle, Confluent, and various startups, companies can no longer rely solely on Kafka for differentiation. The vast array of Kafka-compatible solutions means that businesses face more choices than ever, but also new challenges in selecting the solution that balances cost, performance, and features.

The Challenge: Finding the Right Fit in a Crowded Kafka Market

For end users, choosing the right Kafka solution is becoming increasingly complex. Basic Kafka offerings cover standard streaming needs but may lack advanced features, such as enhanced security, data governance, or integration and processing capabilities, that are essential for specific industries. In such a diverse market, businesses must navigate trade-offs, considering whether a low-cost option meets their needs or whether investing in a premium solution with added capabilities provides better long-term value.

The Solution: Prioritizing Features for Your Strategic Needs

As Kafka solutions evolve, users must look beyond price and consider features that offer real strategic value. For example, companies handling sensitive customer data might benefit from Kafka products with top-tier security features. Those focused on analytics may look for solutions with strong integrations into data platforms and low cost for high throughput. By carefully selecting a Kafka product that aligns with industry-specific requirements, businesses can leverage the full potential of Kafka while optimizing for cost and capabilities.

For instance, look at Confluent’s various cluster types for different requirements and use cases in the cloud:

Confluent Cloud Cluster Types for Different Requirements and Use Cases
Source: Confluent

As an example, Freight Clusters were introduced to provide an offering with up to 90 percent lower cost. The major trade-off is higher latency, which is perfectly acceptable for high-volume workloads such as log analytics at GB/sec scale.

The Business Value: Affordable and Customized Data Streaming

Kafka’s commoditization means more affordable, customizable options for businesses of all sizes. This competition reduces costs, making high-performance data streaming more accessible, even to smaller organizations. By choosing a tailored solution, businesses can enhance customer satisfaction, speed up decision-making, and innovate faster in a competitive landscape.

Trend 2: The Kafka Protocol, not Apache Kafka, is the New Standard for Data Streaming

With the rise of cloud-native architectures, many vendors have shifted to supporting the Kafka protocol rather than the open-source Kafka framework itself, allowing for greater flexibility and cloud optimization. This change enables businesses to choose Kafka-compatible tools that better align with specific needs, moving away from a one-size-fits-all approach.

Confluent introduced its KORA engine, i.e., Kafka re-architected to be cloud-native. A deep technical whitepaper goes into the details (this is not a marketing document but really for software engineers).

Confluent KORA - Apache Kafka Re-Architected to be Cloud Native
Source: Confluent

Other players followed Confluent and introduced their own cloud-native “data streaming engines”. For instance, StreamNative has URSA powered by Apache Pulsar, Redpanda talks about its R1 Engine implementing the Kafka protocol, and Ververica recently announced VERA for its Flink-based platform.

Some vendors rely only on the Kafka protocol with a proprietary engine from the beginning. For instance, Azure Event Hubs or WarpStream. Amazon MSK also goes in this direction by adding proprietary features like Tiered Storage or even introducing completely new product options such as Amazon MSK Express brokers.

The Challenge: Limited Compatibility Across Kafka Solutions

When vendors implement the Kafka protocol instead of the entire Kafka framework, it can lead to compatibility issues, especially if the solution doesn’t fully support the Kafka APIs. For end users, this can complicate integration, particularly for advanced features like Exactly-Once Semantics, the Transaction API, Compacted Topics, Kafka Connect, or Kafka Streams, which may not be supported or may not work as expected.

The Solution: Evaluating Kafka Protocol Solutions Critically

To fully leverage the flexibility of Kafka protocol-based solutions, a thorough evaluation is essential. Businesses should carefully assess the capabilities and compatibility of each option, ensuring it meets their specific needs. Key considerations include verifying the support of required features and APIs (such as the Transaction API, Kafka Streams, or Connect).

It is also crucial to evaluate the level of product support provided, including 24/7 availability, uptime SLAs, and compatibility with the latest versions of open-source Apache Kafka. This detailed evaluation ensures that the chosen solution integrates seamlessly into existing architectures and delivers the reliability and performance required for modern data streaming applications.

The Business Value: Expanded Options and Cost-Efficiency

Kafka protocol-based solutions offer greater flexibility, allowing businesses to select Kafka-compatible services optimized for their specific environments. This flexibility opens doors for innovation, enabling companies to experiment with new tools without vendor lock-in.

For instance, consider innovations such as a “direct write to S3 object store” architecture, as seen in WarpStream, Confluent Freight Clusters, and offerings from other data streaming startups that build proprietary engines around the Kafka protocol. The result is a more cost-effective approach to data streaming, though it may come with trade-offs, such as increased latency. Check out this video about the evolution of Kafka Storage to learn more.

Trend 3: BYOC (Bring Your Own Cloud) as a New Deployment Model for Security and Compliance

As data security and compliance concerns grow, the Bring Your Own Cloud (BYOC) model is gaining traction as a new way to deploy Apache Kafka. BYOC allows businesses to host Kafka in their own Virtual Private Cloud (VPC) while the vendor manages the control plane to handle complex orchestration tasks like partitioning, replication, and failover.

This BYOC approach offers organizations enhanced control over their data while retaining the operational benefits of a managed service. BYOC provides a middle ground between self-managed and fully managed solutions, addressing specific regulatory and security needs without sacrificing scalability or flexibility.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

The Challenge: Balancing Security and Ease of Use

Ensuring data sovereignty and compliance is non-negotiable for organizations in highly regulated industries. However, traditional fully managed cloud solutions can pose risks due to vendor access to sensitive data and infrastructure. Many BYOC solutions claim to address these issues but fall short when it comes to minimizing external access to customer environments. Common challenges include:

  • Vendor Access to VPCs: Many BYOC offerings require vendors to have access to customer VPCs for deployment, cluster management, and troubleshooting. This introduces potential security vulnerabilities.
  • IAM Roles and Elevated Privileges: Cross-account Identity and Access Management (IAM) roles are often necessary for managing BYOC clusters, which can expose sensitive systems to unnecessary risks.
  • VPC Peering Complexity: Traditional BYOC solutions often rely on VPC peering, a complex and expensive setup that increases operational overhead and opens additional points of failure.

These limitations create significant challenges for security-conscious organizations, as they undermine the core promise of BYOC: control over the data environment.

The Solution: Gaining Control with a “Zero Access” BYOC Model

WarpStream redefines the BYOC model with a “zero access” architecture, addressing the challenges of traditional BYOC solutions. Unlike other BYOC offerings using the Kafka protocol, WarpStream ensures that no data leaves the customer’s environment, delivering a truly secure-by-default platform. Hence this section discusses specifically WarpStream, not BYOC Kafka offerings in general.

WarpStream BYOC Zero Access Kafka Architecture with Control and Data Plane
Source: WarpStream

Key features of WarpStream include:

  • Zero Access to Customer VPCs: WarpStream eliminates vendor access by deploying stateless agents within the customer’s environment, handling compute operations locally without requiring cross-account IAM roles or elevated privileges, which reduces security risks.
  • Data/Metadata Separation: Raw data remains entirely within the customer’s network for full sovereignty, while only metadata is sent to WarpStream’s control plane for centralized management, ensuring data security and compliance.
  • Simplified Infrastructure: WarpStream avoids complex setups like VPC peering and cross-IAM roles, minimizing operational overhead while maintaining high performance.

Comparison with Other BYOC Solutions using the Kafka protocol:

Unlike most other BYOC offerings (e.g., Redpanda), WarpStream doesn’t require direct VPC access or elevated permissions, avoiding risks like data exposure or remote troubleshooting vulnerabilities. Its “zero access” architecture ensures unparalleled security and compliance.

The Business Value: Secure, Compliant, and Scalable Data Streaming

WarpStream’s innovative approach to BYOC delivers exceptional business value by addressing security and compliance concerns while maintaining operational simplicity and scalability:

  • Uncompromised Security: The zero-access architecture ensures that raw data remains entirely within the customer’s environment, meeting the strictest security and compliance requirements for regulated industries like finance, healthcare, and government.
  • Operational Efficiency: By eliminating the need for VPC peering, cross-IAM roles, and remote vendor access, WarpStream simplifies BYOC deployments and reduces operational complexity.
  • Cost Optimization: WarpStream’s reliance on cloud-native technologies like object storage reduces infrastructure costs compared to traditional disk-based approaches. Stateless agents also enable efficient scaling without unnecessary overhead.
  • Data Sovereignty: The data/metadata split guarantees that data never leaves the customer’s environment, ensuring compliance with regulations such as GDPR and HIPAA.
  • Peace of Mind for Security Teams: With no vendor access to the VPC or object storage, WarpStream’s zero-access model eliminates concerns about external breaches or elevated privileges, making it easier to gain buy-in from security and infrastructure teams.

BYOC Strikes the Balance Between Control and Managed Services

BYOC offers businesses the ability to strike a balance between control and managed services, but not all BYOC solutions are created equal. WarpStream’s “zero access” architecture sets a new standard, addressing the critical challenges of security, compliance, and operational simplicity. By ensuring that raw data never leaves the customer’s environment and eliminating the need for vendor access to VPCs, WarpStream delivers a BYOC model that meets the highest standards of security and performance. For organizations seeking a secure, scalable, and compliant approach to data streaming, WarpStream represents the future of BYOC data streaming.

But just to be clear: If a data streaming project goes to the cloud, fully managed Kafka (and Flink) should always be the first option as it is much easier to manage and operate to focus on fast time-to-market and business innovation. Choose BYOC only if fully managed does not work for you because of security requirements.

Trend 4: Apache Flink Becomes the Standard for Stream Processing

Apache Flink has emerged as the premier choice for organizations seeking a robust and versatile framework for continuous stream processing. Its ability to handle complex data pipelines with high throughput, low latency, and advanced stateful operations has solidified its position as the de facto standard for stream processing. Flink’s support for Java, Python, and SQL further enhances its appeal, enabling developers to build powerful data-driven applications using familiar tools.
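
To make this more concrete, here is a minimal PyFlink sketch of such a stateful streaming job expressed in Flink SQL: it reads payment events from a Kafka topic and aggregates them in one-minute tumbling windows. The topic name, schema, and broker address are assumptions for illustration, and the Flink Kafka SQL connector must be available on the classpath.

```python
# Sketch: stateful windowed aggregation over a Kafka topic with PyFlink / Flink SQL.
# Topic, schema, and broker address are hypothetical; requires the Kafka SQL connector jar.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: payment events with event-time watermarks
t_env.execute_sql("""
    CREATE TABLE payments (
        card_id STRING,
        amount  DOUBLE,
        ts      TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'payments',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-demo',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Sink: print to stdout for the sketch; a real job would write back to Kafka or a lakehouse
t_env.execute_sql("""
    CREATE TABLE card_activity (
        card_id      STRING,
        window_start TIMESTAMP(3),
        tx_count     BIGINT,
        total_amount DOUBLE
    ) WITH ('connector' = 'print')
""")

# Stateful, low-latency aggregation: one-minute tumbling windows per card.
# Downstream logic could flag cards whose tx_count exceeds a threshold.
t_env.execute_sql("""
    INSERT INTO card_activity
    SELECT card_id,
           TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*)    AS tx_count,
           SUM(amount) AS total_amount
    FROM payments
    GROUP BY card_id, TUMBLE(ts, INTERVAL '1' MINUTE)
""").wait()
```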

Apache Flink Adoption Curve Compared to Kafka

As Flink adoption grows, it increasingly complements Apache Kafka as part of the modern data streaming ecosystem, while the Kafka Streams (Java-only) library remains relevant for lightweight, application-embedded use cases.

The Challenge: Handling Complex, High-Throughput Data Streams

Modern businesses increasingly rely on real-time data for both operational and analytical needs, spanning mission-critical applications like fraud detection, predictive maintenance, and personalized customer experiences, as well as streaming ETL for integrating and transforming data. These diverse use cases demand stream processing capabilities that combine low latency, high throughput, and stateful processing.

Apache Flink’s versatility makes it uniquely positioned to meet the demands of both streaming ETL for data integration and building real-time business applications. Flink provides:

  • Low Latency: Near-instantaneous processing is crucial for enabling real-time decision-making in business applications, timely updates in analytical systems, and supporting transactional workloads where rapid processing and immediate consistency are essential for ensuring smooth operations and seamless user experiences.
  • High Throughput and Scalability: The ability to process millions of events per second, whether for aggregating operational metrics or moving massive volumes of data into data lakes or warehouses, without bottlenecks.
  • Stateful Processing: Support for maintaining and querying the state of data streams, essential for performing complex operations like aggregations, joins, and pattern detection in business applications, as well as data transformations and enrichment in ETL pipelines.
  • Multiple Programming Languages: Support for Java, Python, and SQL ensures accessibility for a wide range of developers, enabling efficient implementation across various use cases.

The rise of cloud services has further propelled Flink’s adoption, with offerings from major providers like Confluent, Amazon, IBM, and emerging startups. These cloud-native solutions simplify Flink deployments, making it easier for organizations to operationalize real-time analytics.

While Apache Flink has emerged as the de facto standard for stream processing, other frameworks like Apache Spark and its streaming module, Structured Streaming, continue to compete in this space. However, Spark Streaming has notable limitations that make it less suitable for many of the complex, high-throughput workloads modern enterprises demand.

The Challenges with Spark Streaming

Apache Spark, originally designed as a batch processing framework, introduced Spark Streaming and later Structured Streaming to address real-time processing needs. However, its batch-oriented roots present inherent challenges:

  • Micro-Batch Architecture: Spark Structured Streaming relies on micro-batches, where data is divided into small time intervals for processing. This approach, while effective for certain workloads, introduces higher latency compared to Flink’s true streaming architecture. Applications requiring millisecond-level processing or transactional workloads may find Spark unsuitable.
  • Limited Stateful Processing: While Spark supports stateful operations, its reliance on micro-batches adds complexity and latency. This makes Spark Streaming less efficient for use cases that demand continuous state updates, such as fraud detection or complex event processing (CEP).
  • Fault Tolerance Complexity: Spark’s recovery model is rooted in its lineage-based approach to fault tolerance, which can be less efficient for long-running streaming applications. Flink, by contrast, uses checkpointing and savepoints to handle failures more gracefully to ensure state consistency with minimal overhead.
  • Performance Overhead: Spark’s general-purpose design often results in higher resource consumption compared to Flink, which is purpose-built for stream processing. This can lead to increased infrastructure costs for high-throughput workloads.
  • Scalability Challenges for Stateful Workloads: While Spark scales effectively for batch jobs, its scalability for complex stateful stream processing is more limited, as distributed state management in micro-batches can become a bottleneck under heavy load.

By addressing these limitations, Apache Flink provides a more versatile and efficient solution than Apache Spark for organizations looking to handle complex, real-time data processing at scale.

Flink’s architecture is purpose-built for streaming, offering native support for stateful processing, low-latency event handling, and fault-tolerant operation, making it the preferred choice for modern real-time applications. But to be clear: Apache Spark, including Spark Streaming, still has its place in data lakes and lakehouses for analytical workloads.
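
As a concrete illustration of this fault-tolerance model, the following small configuration sketch shows how a Flink job enables periodic checkpoints; the interval, timeout, and trivial pipeline are placeholder values chosen purely for illustration.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 10 seconds with exactly-once guarantees.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Leave some headroom between checkpoints and fail a checkpoint that takes too long.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);
        env.getCheckpointConfig().setCheckpointTimeout(60_000);

        // A trivial pipeline so the sketch is runnable end to end.
        env.fromElements("payment-1", "payment-2").print();

        env.execute("checkpointing-sketch");
    }
}
```

On failure, Flink restores the job from the latest completed checkpoint, so aggregations and joins resume with consistent state instead of being recomputed from lineage.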

Flink’s technical capabilities bring tangible business benefits, making it an essential tool for modern enterprises. By providing real-time insights, Flink enables businesses to respond to events as they occur, such as detecting and mitigating fraudulent transactions instantly, reducing losses, and enhancing customer trust.

Flink’s support for both transactional workloads (e.g., fraud detection or payment processing) and analytical workloads (e.g., real-time reporting or trend analysis) ensures versatility across a range of critical business functions. Scalability and resource optimization keep infrastructure costs manageable, even for demanding, high-throughput workloads, while features like checkpointing streamline failure recovery and upgrades, minimizing operational overhead.

Flink stands out with its dual focus on streaming ETL for data integration and building business applications powered by real-time analytics. Its rich APIs for Java, Python, and SQL make it easy for developers to implement complex workflows, accelerating time-to-market for new applications.
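
To show the streaming-ETL side with the SQL-centric API, here is a hedged sketch using Flink’s Table API from Java. The topic names, schema, and connector options are assumptions for illustration, and the Kafka-backed table requires the flink-connector-kafka dependency on the classpath.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingEtlSketch {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical source table backed by a Kafka topic of order events.
        tableEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id STRING, amount DOUBLE, country STRING" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'orders'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'format' = 'json'," +
                "  'scan.startup.mode' = 'latest-offset'" +
                ")");

        // Print sink for the sketch; a real pipeline would target a warehouse or lakehouse connector.
        tableEnv.executeSql(
                "CREATE TABLE orders_de (" +
                "  order_id STRING, amount DOUBLE, country STRING" +
                ") WITH ('connector' = 'print')");

        // The continuous ETL step: filter and route events as they arrive.
        tableEnv.executeSql(
                "INSERT INTO orders_de SELECT order_id, amount, country FROM orders WHERE country = 'DE'");
    }
}
```

Because the query runs continuously, new orders flow from source to sink as they arrive rather than in scheduled batches.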

Data streaming has powered AI/ML infrastructure for many years because of its ability to scale to high volumes, process data in real-time, and integrate with transactional (payments, orders, ERP, etc.) and analytical (data warehouse, data lake, lakehouse) systems. My first article about Apache Kafka and Machine Learning was published in 2017: “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka”.

As AI continues to evolve, real-time model inference powered by data streaming is opening up new possibilities for predictive and generative AI applications. By integrating model inference with stream processors such as Apache Flink, businesses can perform on-demand predictions for fraud detection, customer personalization, and more.

The Challenge: Providing Context for AI Applications in Real Time

Traditional batch-based AI inference is too slow for many applications, delaying responses and leading to missed opportunities or wrong business decisions. To fully harness AI in real-time, businesses need to embed model inference directly within streaming pipelines.

Generative AI (GenAI) demands new design patterns like Retrieval Augmented Generation (RAG) to ensure accuracy, relevance, and reliability in its outputs. Without data streaming, RAG struggles to provide large language models (LLMs) with the real-time, domain-specific context they need, leading to outdated or hallucinated responses. Context is essential to ensure that LLMs deliver accurate and trustworthy outputs by grounding them in up-to-date and precise information.

GenAI Remote Model Inference with Stream Processing using Apache Kafka and Flink

Apache Flink enables real-time model inference by connecting data streams to external AI models through APIs. This setup allows companies to use centralized model servers for inference, providing flexibility and scalability while keeping data streams fast and responsive. By processing data in real-time, Flink ensures that generative AI models operate with the most current and relevant context, reducing errors and hallucinations.

Flink’s real-time processing capabilities also support advanced machine learning workflows. This enables use cases like predictive analytics, anomaly detection, and generative AI applications that require instantaneous decision-making. The ability to join live data streams with historical or external datasets enriches the context for model inference, enhancing both accuracy and relevance.

Additionally, Flink facilitates feature extraction and data preprocessing directly within the stream to ensure that the inputs to AI models are optimized for performance. This seamless integration with model servers and vector databases allows organizations to scale their AI systems effectively, leveraging real-time insights to drive innovation and deliver immediate business value.
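
As a sketch of this remote-inference pattern, the following Java job uses Flink’s Async I/O operator to call an external model-serving endpoint for each event without blocking the stream. The endpoint URL, payload format, and in-memory source are hypothetical; the pattern itself (an AsyncFunction wired in via AsyncDataStream) is standard Flink.

```java
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.ResultFuture;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
import java.util.concurrent.TimeUnit;

public class RemoteInferenceSketch {

    // Calls a hypothetical model server for each transaction; requests run asynchronously
    // so slow responses do not stall the rest of the stream.
    static class ScoreTransaction implements AsyncFunction<String, String> {
        private transient HttpClient client;

        @Override
        public void asyncInvoke(String transactionJson, ResultFuture<String> resultFuture) {
            if (client == null) {
                client = HttpClient.newHttpClient(); // created lazily on the task manager
            }
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://model-server:8080/v1/score")) // hypothetical endpoint
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(transactionJson))
                    .build();

            client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                    .thenAccept(response -> resultFuture.complete(Collections.singleton(response.body())))
                    .exceptionally(error -> {
                        resultFuture.completeExceptionally(error);
                        return null;
                    });
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in events; in production this would typically be a Kafka topic of transactions.
        DataStream<String> transactions = env.fromElements(
                "{\"txId\":\"1\",\"amount\":42.0}",
                "{\"txId\":\"2\",\"amount\":9999.0}");

        // Up to 100 in-flight requests, each with a 5-second timeout.
        AsyncDataStream
                .unorderedWait(transactions, new ScoreTransaction(), 5, TimeUnit.SECONDS, 100)
                .print();

        env.execute("remote-model-inference-sketch");
    }
}
```

The same shape applies to retrieval steps in a RAG pipeline, where the call targets an embedding service or vector database instead of a scoring endpoint.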

The Business Value: Immediate, Actionable AI Insights

Real-time AI model inference with Flink enables businesses to provide personalized customer experiences, detect fraud as it happens, and perform predictive maintenance with minimal latency. This real-time responsiveness empowers companies to make AI-driven decisions in milliseconds, improving customer satisfaction and operational efficiency.

By integrating Flink with event-driven architectures like Apache Kafka, businesses can ensure that AI systems are always fed with up-to-date and trustworthy data, further enhancing the reliability of predictions.
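
The glue between Kafka and Flink in such an architecture is typically the Kafka source connector; a brief sketch follows, with the topic, consumer group, and cluster address as assumed placeholder values (and flink-connector-kafka as a required dependency).

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToFlinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical topic and cluster address.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("transactions")
                .setGroupId("flink-ai-enrichment")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> transactions =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-transactions");

        // Downstream, these events could be enriched and scored in real time.
        transactions.print();

        env.execute("kafka-into-flink");
    }
}
```

From here, the events can be enriched and scored, for example with the async inference sketch shown earlier, before results are written back to another Kafka topic.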

The integration of Flink and data streaming offers a clear path to measurable business impact. By aligning real-time AI capabilities with organizational goals, businesses can drive innovation while reducing operational costs, for example by automating customer support to lower reliance on service agents.

Furthermore, Flink’s ability to process and enrich data streams at scale supports strategic initiatives like hyper-personalized marketing or optimizing supply chains in real-time. These benefits directly translate into enhanced competitive positioning, faster time-to-market for AI-driven solutions, and the ability to make more confident, data-driven decisions at the speed of business.

Trend 6: Becoming a Data Streaming Organization

To harness the full potential of data streaming, companies are shifting toward structured, enterprise-wide data streaming strategies. Moving from a tactical, ad-hoc approach to a cohesive top-down strategy enables businesses to align data streaming with organizational goals, driving both efficiency and innovation.

The Challenge: Fragmented Data Streaming Efforts

Many companies face challenges due to disjointed streaming efforts, leading to data silos and inconsistencies that prevent them from reaping the full benefits of real-time data processing. At Confluent, we call this the enterprise adoption barrier:

Data Streaming Maturity Model - The Enterprise Adoption Barrier
Source: Confluent

This fragmentation results in inefficiencies, duplication of efforts, and a lack of standardized processes. Without a unified approach, organizations struggle with:

  • Data Silos: Limited data sharing across teams creates bottlenecks for broader use cases.
  • Inconsistent Standards: Different teams often use varying schemas, patterns, and practices, leading to integration challenges and data quality issues.
  • Governance Gaps: A lack of defined roles, responsibilities, and policies results in limited oversight, increasing the risk of data misuse and compliance violations.

These challenges prevent organizations from scaling their data streaming capabilities and realizing the full value of their real-time data investments.

The Solution: Building an Integrated Data Streaming Organization

By adopting a comprehensive data streaming strategy, businesses can create a unified data platform with standardized tools and practices. A dedicated streaming platform team, often called the Center of Excellence (CoE), ensures consistent operations. An internal developer platform provides governed, self-serve access to streaming resources.

Key elements of a data streaming organization include:

  • Unified Platform: Move from disparate tools and approaches to a single, standardized data streaming platform. This includes consistent policies for cluster management, multi-tenancy, and topic naming, ensuring a reliable foundation for data streaming initiatives.
  • Self-Service: Provide APIs, UIs, and other interfaces for teams to onboard, create, and manage data streaming resources. Self-service capabilities ensure governed access to topics, schemas, and streaming capabilities, empowering developers while maintaining compliance and security.
  • Data as a Product: Adopt a product-oriented mindset where data streams are treated as reusable assets. This includes formalizing data products with clear contracts, ownership, and metadata, making them discoverable and consumable across the organization.
  • Alignment: Define clear roles and responsibilities, from platform operators and developers to data product owners. Establishing an enterprise-wide data streaming function ensures coordination and alignment across teams.
  • Governance: Implement automated guardrails for compliance, quality, and access control. This ensures that data streaming efforts remain secure, trustworthy, and scalable.

The Business Value: Consistent, Scalable, and Agile Data Streaming

Becoming a Data Streaming Organization unlocks significant value by turning data streaming into a strategic asset. The benefits include:

  • Enhanced Agility: A unified platform reduces time-to-market for new data-driven products and services, allowing businesses to respond quickly to market trends and customer demands.
  • Operational Efficiency: Streamlined processes and self-service capabilities reduce the overhead of managing multiple tools and teams, improving productivity and cost-effectiveness.
  • Scalable Innovation: Standardized patterns and reusable data products enable the rapid development of new use cases, fostering a culture of innovation across the enterprise.
  • Improved Governance: Clear policies and automated controls ensure data quality, security, and compliance, building trust with customers and stakeholders.
  • Cross-Functional Collaboration: By breaking down silos, organizations can leverage data streams across teams, creating a network effect that accelerates value creation.

To successfully adopt a Data Streaming Organization model, companies must combine technical capabilities with cultural and structural change. This involves not just deploying tools but establishing shared goals, metrics, and education to bring teams together around the value of real-time data. As organizations embrace data streaming as a strategic function, they position themselves to thrive in a data-driven world.

Embracing the Future of Data Streaming

As data streaming continues to mature, it has become the backbone of modern digital enterprises. It enables real-time decision-making, operational efficiency, and transformative AI applications. Trends such as the commoditization of Kafka, the adoption of the Kafka protocol, BYOC deployment models, and the rise of Flink as the standard for stream processing demonstrate the rapid evolution and growing importance of this technology. These innovations not only streamline infrastructure but also empower organizations to harness real-time insights, foster agility, and remain competitive in the ever-changing digital landscape.

These advancements in data streaming present a unique opportunity to redefine data strategy. Leveraging data streaming as a central pillar of IT architecture allows businesses to break down silos, integrate machine learning into critical workflows, and deliver unparalleled customer experiences. The convergence of data streaming with generative AI, particularly through frameworks like Flink, underscores the importance of embracing a real-time-first approach to data-driven innovation.

Looking ahead, organizations that invest in scalable, secure, and strategic data streaming infrastructures will be positioned to lead in 2025 and beyond. By adopting these trends, enterprises can unlock the full potential of their data, drive business transformation, and solidify their place as leaders in the digital era. The journey to set data in motion is not just about technology—it’s about building the foundation for a future where real-time intelligence powers every decision and every experience.

What trends do you see for data streaming? Which ones are your favorites? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

