How Penske Logistics Transforms Fleet Intelligence with Data Streaming and AI
https://www.kai-waehner.de/blog/2025/06/02/how-penske-logistics-transforms-fleet-intelligence-with-data-streaming-and-ai/
Mon, 02 Jun 2025 04:44:37 +0000

Real-time visibility has become essential in logistics. As supply chains grow more complex, providers must shift from delayed, batch-based systems to event-driven architectures. Data Streaming technologies like Apache Kafka and Apache Flink enable this shift by allowing continuous processing of data from telematics, inventory systems, and customer interactions. Penske Logistics is leading the way—using Confluent’s platform to stream and process 190 million IoT messages daily. This powers predictive maintenance, faster roadside assistance, and higher fleet uptime. The result: smarter operations, improved service, and a scalable foundation for the future of logistics.

Real-time visibility is no longer a competitive advantage in logistics—it’s a business necessity. As global supply chains become more complex and customer expectations rise, logistics providers must respond with agility and precision. That means shifting away from static, delayed data pipelines toward event-driven architectures built around real-time data.

Technologies like Apache Kafka and Apache Flink are at the heart of this transformation. They allow logistics companies to capture, process, and act on streaming data as it’s generated—from vehicle sensors and telematics systems to inventory platforms and customer applications. This enables new use cases in predictive maintenance, live fleet tracking, customer service automation, and much more.

A growing number of companies across the supply chain are embracing this model. Whether it’s real-time shipment tracking, automated compliance reporting, or AI-driven optimization, the ability to stream, process, and route data instantly is proving vital.

One standout example is Penske Logistics—a transportation leader using Confluent’s data streaming platform (DSP) to transform how it operates and delivers value to customers.

How Penske Logistics Transforms Fleet Intelligence with Kafka and AI

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

Why Real-Time Data Matters in Logistics and Transportation

Transportation and logistics operate on tighter margins and stricter timelines than almost any other sector. Delays ripple through supply chains, disrupting manufacturing schedules, customer deliveries, and retail inventories. Traditional data integration methods—batch ETL, manual syncing, and siloed systems—simply can’t meet the demands of today’s global logistics networks.

Data streaming enables organizations in the logistics and transportation industry to ingest and process information in real time, while the data is still fresh and valuable. Vehicle diagnostics, route updates, inventory changes, and customer interactions can all be captured and acted upon in real time. This leads to faster decisions, more responsive services, and smarter operations.

Real-time data also lays the foundation for advanced use cases in automation and AI, where outcomes depend on immediate context and up-to-date information. And for logistics providers, it unlocks a powerful competitive edge.

Apache Kafka serves as the backbone for real-time messaging—connecting thousands of data producers and consumers across enterprise systems. Apache Flink adds stateful stream processing to the mix, enabling continuous pattern recognition, enrichment, and complex business logic in real time.

Event-driven Architecture with Data Streaming in Logistics and Transportation using Apache Kafka and Flink
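
To make this concrete, here is a minimal sketch of how a telematics gateway might publish vehicle sensor readings to a Kafka topic using the confluent-kafka Python client. The broker address, topic name, and event fields are illustrative assumptions, not any specific vendor setup.

```python
import json
import random
import time

from confluent_kafka import Producer  # pip install confluent-kafka

# Hypothetical broker and topic for illustration only.
producer = Producer({"bootstrap.servers": "localhost:9092"})
TOPIC = "vehicle.telemetry"

while True:
    event = {
        "vehicle_id": "TRUCK-4711",
        "ts_ms": int(time.time() * 1000),
        "engine_temp_c": round(random.uniform(70, 120), 1),
        "brake_wear_pct": round(random.uniform(0, 100), 1),
    }
    # Key by vehicle ID so all events for one truck land in the same partition (ordering).
    producer.produce(TOPIC, key=event["vehicle_id"], value=json.dumps(event).encode("utf-8"))
    producer.poll(0)   # serve delivery callbacks
    time.sleep(1.0)    # simulate a 1 Hz sensor feed
```

Downstream, a Flink job or any other Kafka consumer can subscribe to this topic to enrich, aggregate, or alert on the readings in real time.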

In the logistics industry, this event-driven architecture supports use cases such as:

  • Continuous monitoring of vehicle health and sensor data
  • Proactive maintenance scheduling
  • Real-time fleet tracking and route optimization
  • Integration of telematics, ERP, WMS, and customer systems
  • Instant alerts for service delays or disruptions
  • Predictive analytics for capacity and demand forecasting

This isn’t just theory. Leading logistics organizations are deploying these capabilities at scale.

Data Streaming Success Stories Across the Logistics and Transportation Industry

Many transportation and logistics firms are already using Kafka-based architectures to modernize their operations. A few examples:

  • LKW Walter relies on data streaming to optimize its full truck load (FTL) freight exchanges and enable digital freight matching.
  • Uber Freight leverages real-time telematics, pricing models, and dynamic load assignment across its digital logistics platform.
  • Instacart uses event-driven systems to coordinate live order delivery, matching customer demand with available delivery slots.
  • Maersk incorporates streaming data from containers and ports to enhance shipping visibility and supply chain planning.

These examples show the diversity of value that real-time data brings—across first mile, middle mile, and last mile operations.

An increasing number of companies are using data streaming as the event-driven control tower for their supply chains. It’s not only about real-time insights—it’s also about ensuring consistent data across real-time messaging, HTTP APIs, and batch systems. Learn more in this article: A Real-Time Supply Chain Control Tower powered by Kafka.

Supply Chain Control Tower powered by Data Streaming with Apache Kafka

Penske Logistics: A Leader in Transportation, Fleet Services, and Supply Chain Innovation

Penske Transportation Solutions is one of North America’s most recognizable logistics brands. It provides commercial truck leasing, rental, and fleet maintenance services, operating a fleet of over 400,000 vehicles. Its logistics arm offers freight management, supply chain optimization, and warehousing for enterprise customers.

Penske Logistics
Source: Penske Logistics

But Penske is more than a fleet and logistics company. It’s a data-driven operation where technology plays a central role in service delivery. From vehicle telematics to customer support, Penske is leveraging data streaming and AI to meet growing demands for reliability, transparency, and speed.

Penske’s Data Streaming Success Story

Penske shared its data streaming journey at the Confluent Data in Motion Tour. Sarvant Singh, Vice President of Data and Emerging Solutions at Penske, explains the company’s motivation clearly: “We’re an information-intense business. A lot of information is getting exchanged between our customers, associates, and partners. In our business, vehicle uptime and supply chain visibility are critical.”

This focus on uptime is what drove Penske to adopt a real-time data streaming platform, powered by Confluent. Today, Penske ingests and processes around 190 million IoT messages every day from its vehicles.

Each truck contains hundreds of sensors (and thousands of sub-sensors) that monitor everything from engine performance to braking systems. With this volume of data, traditional architectures fell short. Penske turned to Confluent Cloud to leverage Apache Kafka at scale as a fully managed, elastic SaaS, eliminating the operational burden and unlocking true real-time capabilities.

By streaming sensor data through Confluent and into a proactive diagnostics engine, Penske can now predict when a vehicle may fail—before the problem arises. Maintenance can be scheduled in advance, roadside breakdowns avoided, and customer deliveries kept on track.
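
The consumption side can be sketched just as simply. The snippet below is not Penske’s diagnostics engine; it is a minimal Python consumer that flags readings above a hypothetical temperature threshold, purely to illustrate the pattern of acting on a sensor stream before a failure occurs.

```python
import json

from confluent_kafka import Consumer  # pip install confluent-kafka

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "group.id": "proactive-diagnostics",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["vehicle.telemetry"])

ENGINE_TEMP_LIMIT_C = 110  # illustrative threshold, not a real maintenance rule

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        reading = json.loads(msg.value())
        if reading["engine_temp_c"] > ENGINE_TEMP_LIMIT_C:
            # A production system would feed a diagnostics engine or ML model here;
            # this sketch simply raises an alert for the affected vehicle.
            print(f"Maintenance alert for {reading['vehicle_id']}: "
                  f"engine temperature {reading['engine_temp_c']} C")
finally:
    consumer.close()
```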

This approach has already prevented over 90,000 potential roadside incidents. The business impact is enormous, saving time, money, and reputation.

Other real-time use cases include:

  • Diagnosing issues instantly to dispatch roadside assistance faster
  • Triggering preventive maintenance alerts to avoid unscheduled downtime
  • Automating compliance for IFTA reporting using telematics data
  • Streamlining repair workflows through integration with electronic DVIRs (Driver Vehicle Inspection Reports)

Why Confluent for Apache Kafka?

Managing Kafka in-house was never the goal for Penske. After initially working with a different provider, they transitioned to Confluent Cloud to avoid the complexity and cost of maintaining open-source Kafka themselves.

“We’re not going to put mission-critical applications on an open source tech,” Singh noted. “Enterprise-grade applications require enterprise level support—and Confluent’s business value has been clear.”

Key reasons for choosing Confluent include:

  • The ability to scale rapidly without manual rebalancing
  • Enterprise tooling, including stream governance and connectors
  • Seamless integration with AI and analytics engines
  • Reduced time to market and improved uptime

Data Streaming and AI in Action at Penske

Penske’s investment in AI began in 2015, long before it became a mainstream trend. Early use cases included Erica, a virtual assistant that helps customers manage vehicle reservations. Today, AI is being used to reduce repair times, predict failures, and improve customer service experiences.

By combining real-time data with machine learning, Penske can offer more reliable services and automate decisions that previously required human intervention. AI-enabled diagnostics, proactive maintenance, and conversational assistants are already delivering measurable benefits.

The company is also exploring the role of generative AI. Singh highlighted the potential of technologies like ChatGPT for enterprise applications—but also stressed the importance of controls: “Configuration for risk tolerance is going to be the key. Traceability, explainability, and anomaly detection must be built in.”

Fleet Intelligence in Action: Measurable Business Value Through Data Streaming

For a company operating hundreds of thousands of vehicles, the stakes are high. Penske’s real-time architecture has improved uptime, accelerated response times, and empowered technicians and drivers with better tools.

The business outcomes are clear:

  • Fewer breakdowns and delays
  • Faster resolution of vehicle issues
  • Streamlined operations and reporting
  • Better customer and driver experience
  • Scalable infrastructure for new services, including electric vehicle fleets

With 165,000 vehicles already connected to Confluent and more being added as EV adoption grows, Penske is just getting started.

The Road Ahead: Agentic AI and the Next Evolution of Event-Driven Architecture Powered By Apache Kafka

The future of logistics will be defined by intelligent, real-time systems that coordinate not just vehicles, but entire networks. As Penske scales its edge computing and expands its use of remote sensing and autonomous technologies, the role of data streaming will only increase.

Agentic AI—systems that act autonomously based on real-time context—will require seamless integration of telematics, edge analytics, and cloud intelligence. This demands a resilient, flexible event-driven foundation. I explored the general idea in a dedicated article: How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time.

Agentic AI with Apache Kafka as Event Broker Combined with MCP and A2A Protocol

Penske’s journey shows that real-time data streaming is not only possible—it’s practical, scalable, and deeply transformative. The combination of a data streaming platform, sensor analytics, and AI allows the company to turn every vehicle into a smart, connected node in a global supply chain.

For logistics providers seeking to modernize, the path is clear. It starts with streaming data—and the possibilities grow from there. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

Agentic AI with the Agent2Agent Protocol (A2A) and MCP using Apache Kafka as Event Broker
https://www.kai-waehner.de/blog/2025/05/26/agentic-ai-with-the-agent2agent-protocol-a2a-and-mcp-using-apache-kafka-as-event-broker/
Mon, 26 May 2025 05:32:01 +0000

Agentic AI is emerging as a powerful pattern for building autonomous, intelligent, and collaborative systems. To move beyond isolated models and task-based automation, enterprises need a scalable integration architecture that supports real-time interaction, coordination, and decision-making across agents and services. This blog explores how the combination of Apache Kafka, Model Context Protocol (MCP), and Google’s Agent2Agent (A2A) protocol forms the foundation for Agentic AI in production. By replacing point-to-point APIs with event-driven communication as the integration layer, enterprises can achieve decoupling, flexibility, and observability—unlocking the full potential of AI agents in modern enterprise environments.

Agentic AI is gaining traction as a design pattern for building more intelligent, autonomous, and collaborative systems. Unlike traditional task-based automation, agentic AI involves intelligent agents that operate independently, make contextual decisions, and collaborate with other agents or systems—across domains, departments, and even enterprises.

In the enterprise world, agentic AI is more than just a technical concept. It represents a shift in how systems interact, learn, and evolve. But unlocking its full potential requires more than AI models and point-to-point APIs—it demands the right integration backbone.

That’s where Apache Kafka as an event broker for true decoupling comes into play, together with two emerging AI standards: Google’s Agent2Agent (A2A) Protocol and Anthropic’s Model Context Protocol (MCP), in an enterprise architecture for Agentic AI.

Agentic AI with Apache Kafka as Event Broker Combined with MCP and A2A Protocol

Inspired by my colleague Sean Falconer’s blog post, Why Google’s Agent2Agent Protocol Needs Apache Kafka, this blog post explores Agentic AI adoption in enterprises and how an event-driven architecture with Apache Kafka fits into the overall AI architecture.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including various AI examples across industries.

Business Value of Agentic AI in the Enterprise

For enterprises, the promise of agentic AI is compelling:

  • Smarter automation through self-directed, context-aware agents
  • Improved customer experience with faster and more personalized responses
  • Operational efficiency by connecting internal and external systems more intelligently
  • Scalable B2B interactions that span suppliers, partners, and digital ecosystems

But none of this works if systems are coupled by brittle point-to-point APIs, slow batch jobs, or disconnected data pipelines. Autonomous agents need continuous, real-time access to events, shared state, and a common communication fabric that scales across use cases.

Model Context Protocol (MCP) + Agent2Agent (A2A): New Standards for Agentic AI

The Model Context Protocol (MCP), introduced by Anthropic, offers a standardized, model-agnostic interface for context exchange between AI agents and external systems. Whether the interaction is streaming, batch, or API-based, MCP abstracts how agents retrieve inputs, send outputs, and trigger actions across services. This enables real-time coordination between models and tools—improving autonomy, reusability, and interoperability in distributed AI systems.

Model Context Protocol MCP by Anthropic
Source: Anthropic

Google’s Agent2Agent (A2A) protocol complements this by defining how autonomous software agents can interact with one another in a standard way. A2A enables scalable agent-to-agent collaboration—where agents discover each other, share state, and delegate tasks without predefined integrations. It’s foundational for building open, multi-agent ecosystems that work across departments, companies, and platforms.

Agent2Agent A2A Protocol by Google and MCP
Source: Google

Why Apache Kafka Is a Better Fit Than an API (HTTP/REST) for A2A and MCP

Most enterprises today use HTTP-based APIs to connect services—ideal for simple, synchronous request-response interactions.

In contrast, Apache Kafka is a distributed event streaming platform designed for asynchronous, high-throughput, and loosely coupled communication—making it a much better fit for multi-agent (A2A) and agentic AI architectures.

API-Based Integration             Kafka-Based Integration
Synchronous, blocking             Asynchronous, event-driven
Point-to-point coupling           Loose coupling with pub/sub topics
Hard to scale to many agents      Supports multiple consumers natively
No shared memory                  Kafka retains and replays event history
Limited observability             Full traceability with schema registry & DLQs

Kafka serves as the decoupling layer. It becomes the place where agents publish their state, subscribe to updates, and communicate changes—independently and asynchronously. This enables multi-agent coordination, resilience, and extensibility.

MCP + Kafka = Open, Flexible Communication

As the adoption of Agentic AI accelerates, there’s a growing need for scalable communication between AI agents, services, and operational systems. The Model Context Protocol (MCP) is emerging as a standard to structure these interactions—defining how agents access tools, send inputs, and receive results. But a protocol alone doesn’t solve the challenges of integration, scaling, or observability.

This is where Apache Kafka comes in.

By combining MCP with Kafka, agents can interact through a Kafka topic—fully decoupled, asynchronous, and in real time. Instead of direct, synchronous calls between agents and services, all communication happens through Kafka topics, using structured events based on the MCP format.

This model supports a wide range of implementations and tech stacks. For instance:

  • A Python-based AI agent deployed in a SaaS environment
  • A Spring Boot Java microservice running inside a transactional core system
  • A Flink application deployed at the edge performing low-latency stream processing
  • An API gateway translating HTTP requests into MCP-compliant Kafka events

Regardless of where or how an agent is implemented, it can participate in the same event-driven system. Kafka ensures durability, replayability, and scalability. MCP provides the semantic structure for requests and responses.

Agentic AI with Apache Kafka as Event Broker
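
As a rough illustration of this pattern, the sketch below publishes an agent request as a structured event to a Kafka topic using the confluent-kafka Python client. The envelope fields, topic names, and broker address are hypothetical and only loosely inspired by MCP-style tool requests; they are not the official MCP wire format.

```python
import json
import uuid

from confluent_kafka import Producer  # pip install confluent-kafka

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

# Hypothetical request envelope: an agent asks a tool/service to check inventory.
request = {
    "request_id": str(uuid.uuid4()),
    "agent": "inventory-agent",
    "tool": "check_stock",
    "input": {"sku": "ABC-123", "warehouse": "FRA-1"},
}

# Key by agent so all requests from one agent stay ordered within a partition.
producer.produce(
    "agent.requests",
    key=request["agent"],
    value=json.dumps(request).encode("utf-8"),
)
producer.flush()

# A responding agent or service would consume from "agent.requests" and publish
# its result to a separate topic such as "agent.responses"; both sides remain
# fully decoupled and can be replayed, audited, and scaled independently.
```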

The result is a highly flexible, loosely coupled architecture for Agentic AI—one that supports real-time processing, cross-system coordination, and long-term observability. This combination is already being explored in early enterprise projects and will be a key building block for agent-based systems moving into production.

Stream Processing as the Agent’s Companion

Stream processing technologies like Apache Flink or Kafka Streams allow agents to:

  • Filter, join, and enrich events in motion
  • Maintain stateful context for decisions (e.g., real-time credit risk)
  • Trigger new downstream actions based on complex event patterns
  • Apply AI directly within the stream processing logic, enabling real-time inference and contextual decision-making with embedded models or external calls to a model server, vector database, or any other AI platform

Agents don’t need to manage all logic themselves. The data streaming platform can pre-process information, enforce policies, and even trigger fallback or compensating workflows—making agents simpler and more focused.

Technology Flexibility for Agentic AI Design with Data Contracts

One of the biggest advantages of a Kafka-based, event-driven, and decoupled backend for agentic systems is that agents can be implemented in any stack:

  • Languages: Python, Java, Go, etc.
  • Environments: Containers, serverless, JVM apps, SaaS tools
  • Communication styles: Event streaming, REST APIs, scheduled jobs

The Kafka topic is the stable data contract for quality and policy enforcement. Agents can evolve independently, be deployed incrementally, and interoperate without tight dependencies.
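
One way such a data contract can be made concrete is by attaching a schema to the topic. The following minimal sketch assumes Confluent Schema Registry and the confluent-kafka Python client with the Avro extras installed; topic name, schema, and endpoint URLs are placeholders.

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Placeholder endpoints for illustration.
schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})

ORDER_SCHEMA = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount",   "type": "double"}
  ]
}
"""

serializer = AvroSerializer(schema_registry, ORDER_SCHEMA)
producer = Producer({"bootstrap.servers": "localhost:9092"})

order = {"order_id": "A-1001", "amount": 42.50}

# Serialization fails fast if the record violates the registered schema,
# so only contract-compliant events ever reach the topic.
payload = serializer(order, SerializationContext("orders.enriched", MessageField.VALUE))
producer.produce("orders.enriched", value=payload)
producer.flush()
```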

Microservices, Data Products, and Reusability – Agentic AI Is Just One Piece of the Puzzle

To be effective, Agentic AI needs to connect seamlessly with existing operational systems and business workflows.

Kafka topics enable the creation of reusable data products that serve multiple consumers—AI agents, dashboards, services, or external partners. This aligns perfectly with data mesh and microservice principles, where ownership, scalability, and interoperability are key.

Agent2Agent Protocol (A2A) and MCP via Apache Kafka as Event Broker for Truly Decoupled Agentic AI

A single stream of enriched order events might be consumed via a single data product by:

  • A fraud detection agent
  • A real-time alerting system
  • An agent triggering SAP workflow updates
  • A lakehouse for reporting and batch analytics

This one-to-many model is the opposite of traditional REST designs and crucial for enabling agentic orchestration at scale.
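
The fan-out itself needs no extra infrastructure: each downstream consumer simply uses its own Kafka consumer group and reads the same topic independently. A minimal sketch, with placeholder broker and topic names:

```python
from confluent_kafka import Consumer  # pip install confluent-kafka

def make_consumer(group_id: str) -> Consumer:
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # placeholder broker
        "group.id": group_id,                   # each group tracks its own offsets
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["orders.enriched"])     # the shared data product
    return consumer

# Every consumer group receives the full stream independently.
fraud_agent     = make_consumer("fraud-detection-agent")
alerting_system = make_consumer("real-time-alerting")
sap_trigger     = make_consumer("sap-workflow-agent")
lakehouse_sink  = make_consumer("lakehouse-ingestion")
```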

Agentic AI Needs Integration with Core Enterprise Systems

Agentic AI is not a standalone trend—it’s becoming an integral part of broader enterprise AI strategies. While this post focuses on architectural foundations like Kafka, MCP, and A2A, it’s important to recognize how this infrastructure complements the evolution of major AI platforms.

Leading vendors such as Databricks, Snowflake, and others are building scalable foundations for machine learning, analytics, and generative AI. These platforms often handle model training and serving. But to bring agentic capabilities into production—especially for real-time, autonomous workflows—they must connect with operational, transactional systems and other agents at runtime. (See also: Confluent + Databricks blog series | Apache Kafka + Snowflake blog series)

This is where Kafka as the event broker becomes essential: it links these analytical backends with AI agents, transactional systems, and streaming pipelines across the enterprise.

At the same time, enterprise application vendors are embedding AI assistants and agents directly into their platforms:

  • SAP Joule / Business AI – Embedded AI for finance, supply chain, and operations
  • Salesforce Einstein / Copilot Studio – Generative AI for CRM and sales automation
  • ServiceNow Now Assist – Predictive automation across IT and employee services
  • Oracle Fusion AI / OCI – ML for ERP, HCM, and procurement
  • Microsoft Copilot – Integrated AI across Dynamics and Power Platform
  • IBM watsonx, Adobe Sensei, Infor Coleman AI – Governed, domain-specific AI agents

Each of these solutions benefits from the same architectural foundation: real-time data access, decoupled integration, and standardized agent communication.

Whether deployed internally or sourced from vendors, agents need reliable event-driven infrastructure to coordinate with each other and with backend systems. Apache Kafka provides this core integration layer—supporting a consistent, scalable, and open foundation for agentic AI across the enterprise.

Agentic AI Requires Decoupling – Apache Kafka Supports A2A and MCP as an Event Broker

To deliver on the promise of agentic AI, enterprises must move beyond point-to-point APIs and batch integrations. They need a shared, event-driven foundation that enables agents (and other enterprise software) to work independently and together—with shared context, consistent data, and scalable interactions.

Apache Kafka provides exactly that. Combined with MCP and A2A for standardized Agentic AI communication, Kafka unlocks the flexibility, resilience, and openness needed for next-generation enterprise AI.

It’s not about picking one agent platform—it’s about giving every agent the same, reliable interface to the rest of the world. Kafka is that interface.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including various AI examples across industries.

Powering Fantasy Sports at Scale: How Dream11 Uses Apache Kafka for Real-Time Gaming
https://www.kai-waehner.de/blog/2025/05/19/powering-fantasy-sports-at-scale-how-dream11-uses-apache-kafka-for-real-time-gaming/
Mon, 19 May 2025 06:48:27 +0000

Fantasy sports has evolved into a data-driven, real-time digital industry with high stakes and massive user engagement. At the heart of this transformation is Dream11, India’s leading fantasy sports platform, which relies on Apache Kafka to deliver instant updates, seamless gameplay, and trustworthy user experiences for over 230 million fans. This blog post explores how Dream11 leverages Kafka to meet extreme traffic demands, scale infrastructure efficiently, and maintain real-time responsiveness—even during the busiest moments of live sports.

Fantasy sports has become one of the most dynamic and data-intensive digital industries of the past decade. What started as a casual game for sports fans has evolved into a massive business, blending real-time analytics, mobile engagement, and personalized gaming experiences. At the center of this transformation is Apache Kafka—a critical enabler for platforms like Dream11, where millions of users expect live scores, instant feedback, and seamless gameplay. This post explores how fantasy sports works, why real-time data is non-negotiable, and how Dream11 has scaled its Kafka infrastructure to handle some of the world’s most demanding user traffic patterns.

Real Time Gaming with Apache Kafka Powers Dream11 Fantasy Sports

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including several success stories around gaming, loyalty platforms, and personalized advertising.

Fantasy Sports: Real-Time Gaming Meets Real-World Sports

Fantasy sports allows users to create virtual teams based on real-life athletes. As matches unfold, players earn points based on the performance of their selected athletes. The better the team performs, the higher the user’s score—and the bigger the prize.

Key characteristics of fantasy gaming:

  • Multi-sport experience: Users can play across cricket, football, basketball, and more.
  • Live interaction: Scoring is updated in real time as matches progress.
  • Contests and leagues: Players join public or private contests, often with cash prizes.
  • Peak traffic patterns: Most activity spikes in the minutes before a match begins.

This user behavior creates a unique business and technology challenge. Millions of users make critical decisions at the same time, just before the start of each game. The result: extreme concurrency, massive request volumes, and a hard dependency on data accuracy and low latency.

Real-time infrastructure isn’t optional in this model. It’s fundamental to user trust and business success.

Dream11: A Fantasy Sports Giant with Massive Scale

Founded in India, Dream11 is the largest fantasy sports platform in the country—and one of the biggest globally. With over 230 million users, it dominates fantasy gaming across cricket and 11 other sports. The platform sees traffic that rivals the world’s largest digital services.

Dream11 Mobile App
Source: Dream11

Bipul Karnanit from Dream11 presented a very interesting overview at Current 2025 in Bangalore, India. Here are a few statistics about Dream11’s scale:

  • 230M users
  • 12 sports
  • 12,000 matches/year
  • 44TB data per day
  • 15M+ peak concurrent users
  • 43M+ peak transactions/day

During major events like the IPL, Dream11 experiences hockey-stick traffic curves, where tens of millions of users log in just minutes before a match begins—making lineup changes, joining contests, and waiting for live updates.

This creates a business-critical need for:

  • Low latency
  • Guaranteed data consistency
  • Fault tolerance
  • Real-time analytics and scoring
  • High developer productivity to iterate fast

Apache Kafka at the Heart of Dream11’s Platform

To meet these demands, Dream11 uses Apache Kafka as the foundation of its real-time data infrastructure. Kafka powers the messaging between services that manage user actions, match scores, payouts, leaderboards, and more.

Apache Kafka enables:

  • Event-driven microservices
  • Scalable ingestion and processing of user and game data
  • Loose coupling between systems with data products for operational and analytical consumers
  • High throughput with guaranteed ordering and durability

Event-driven Architecture with Data Streaming for Gaming using Apache Kafka and Flink

Solving Kafka Consumer Challenges at Scale

As the business grew, Dream11’s engineering team encountered challenges with Kafka’s standard consumer APIs, particularly around rebalancing, offset management, and processing guarantees under peak load.

To address these issues, Dream11 built a custom Java-based Kafka consumer library—a foundational component of its internal platform that simplifies Kafka integration across services and boosts developer productivity.

Dream11 Kafka Consumer Library:

  • Purpose: A custom-built Java library designed to handle high-volume Kafka message consumption at Dream11 scale.
  • Key Benefit: Abstracts away low-level Kafka consumer details, simplifying tasks like offset management, error handling, and multi-threading, allowing developers to focus on business logic.
  • Simple Interfaces: Provides easy-to-use interfaces for processing records.
  • Increased Developer Productivity: The standardized library leads to faster development and fewer errors.

This library plays a crucial role in enabling real-time updates and ensuring seamless gameplay—even under the most demanding user scenarios.

For deeper technical insights, including how Dream11 decoupled polling and processing, implemented at-least-once delivery, and improved throughput with custom worker pools, watch the Dream11 engineering session from Current India 2025 presented by Bipul Karnanit.
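
Dream11’s library itself is Java-based and internal, but the core pattern can be sketched in a few lines of Python with the confluent-kafka client: poll in batches, hand records to a worker pool, and commit offsets only after processing completes to get at-least-once semantics. All names, sizes, and settings below are illustrative assumptions, not Dream11’s implementation.

```python
from concurrent.futures import ThreadPoolExecutor

from confluent_kafka import Consumer  # pip install confluent-kafka

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "contest-scoring",
    "enable.auto.commit": False,            # commit manually after processing
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["user-actions"])

pool = ThreadPoolExecutor(max_workers=8)    # decouple processing from polling

def handle(value: bytes) -> None:
    # Business logic goes here (scoring, leaderboard updates, ...).
    pass

try:
    while True:
        batch = consumer.consume(num_messages=500, timeout=1.0)
        if not batch:
            continue
        futures = [pool.submit(handle, msg.value()) for msg in batch if not msg.error()]
        for future in futures:
            future.result()                 # wait until the whole batch is processed
        consumer.commit(asynchronous=False) # commit only now -> at-least-once delivery
finally:
    consumer.close()
```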

Fantasy Sports, Real-Time Expectations, and Business Value

Dream11’s business success is built on user trust, real-time responsiveness, and high-quality gameplay. With millions of users relying on accurate, timely updates, the platform can’t afford downtime, data loss, or delays.

Data Streaming with Apache Kafka enables Dream11 to:

  • React to user interactions instantly
  • Deliver consistent data across microservices and devices
  • Scale dynamically during live events
  • Streamline the development and deployment of new features

This is not just a backend innovation—it’s a competitive advantage in a space where milliseconds matter and trust is everything.

Dream11’s Kafka Journey: The Backbone of Fantasy Sports at Scale

Fantasy sports is one of the most demanding environments for real-time data platforms. Dream11’s approach—scaling Apache Kafka to serve hundreds of millions of events with precision—is a powerful example of aligning architecture with business needs.

As more industries adopt event-driven systems, Dream11’s journey offers a clear message: Apache Kafka is not just a messaging layer—it’s a strategic platform for building reliable, low-latency digital experiences at scale.

Whether you’re in gaming, finance, telecom, or logistics, there’s much to learn from the way fantasy sports leaders like Dream11 harness data streaming to deliver world-class services.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including several success stories around gaming, loyalty platforms, and personalized advertising.

Shift Left Architecture for AI and Analytics with Confluent and Databricks
https://www.kai-waehner.de/blog/2025/05/09/shift-left-architecture-for-ai-and-analytics-with-confluent-and-databricks/
Fri, 09 May 2025 06:03:07 +0000

Confluent and Databricks enable a modern data architecture that unifies real-time streaming and lakehouse analytics. By combining shift-left principles with the structured layers of the Medallion Architecture, teams can improve data quality, reduce pipeline complexity, and accelerate insights for both operational and analytical workloads. Technologies like Apache Kafka, Flink, and Delta Lake form the backbone of scalable, AI-ready pipelines across cloud and hybrid environments.

Modern enterprise architectures are evolving. Traditional batch data pipelines and centralized processing models are being replaced by more flexible, real-time systems. One of the driving concepts behind this change is the Shift Left approach. This blog compares Databricks’ Medallion Architecture with a Shift Left Architecture popularized by Confluent. It explains where each concept fits best—and how they can work together to create a more complete, flexible, and scalable architecture.

Shift Left Architecture with Confluent Data Streaming and Databricks Lakehouse Medallion

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Future articles will explore how these platforms affect data use in businesses. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including more details about the shift left architecture with data streaming and lakehouses.

Medallion Architecture: Structured, Proven, but Not Always Optimal

The Medallion Architecture, popularized by Databricks, is a well-known design pattern for organizing and processing data within a lakehouse. It provides structure, modularity, and clarity across the data lifecycle by breaking pipelines into three logical layers:

  • Bronze: Ingest raw data in its original format (often semi-structured or unstructured)
  • Silver: Clean, normalize, and enrich the data for usability
  • Gold: Aggregate and transform the data for reporting, dashboards, and machine learning
Databricks Medallion Architecture for Lakehouse ETL
Source: Databricks

This layered approach is valuable for teams looking to establish governed and scalable data pipelines. It supports incremental refinement of data and enables multiple consumers to work from well-defined stages.
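
As a rough sketch of the layering, the PySpark snippet below moves data from Bronze to Silver to Gold on Delta Lake. Table paths, column names, and cleansing rules are hypothetical, and a real pipeline would typically run incrementally rather than as full rewrites.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

# Bronze: raw events as ingested, schema-on-read.
bronze = spark.read.format("delta").load("/lake/bronze/orders")

# Silver: cleaned and normalized records.
silver = (
    bronze
    .dropDuplicates(["order_id"])
    .filter(F.col("amount") > 0)
    .withColumn("order_date", F.to_date("order_ts"))
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: business-level aggregates for reporting and ML features.
gold = silver.groupBy("region", "order_date").agg(F.sum("amount").alias("revenue"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/daily_revenue")
```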

Challenges of the Medallion Architecture

The Medallion Architecture also introduces challenges:

  • Pipeline delays: Moving data from Bronze to Gold can take minutes or longer—too slow for operational needs
  • Infrastructure overhead: Each stage typically requires its own compute and storage footprint
  • Redundant processing: Data transformations are often repeated across layers
  • Limited operational use: Data is primarily at rest in object storage; using it for real-time operational systems often requires inefficient reverse ETL pipelines.

For use cases that demand real-time responsiveness and/or critical SLAs—such as fraud detection, personalized recommendations, or IoT alerting—this traditional batch-first model may fall short. In such cases, an event-driven streaming-first architecture, powered by a data streaming platform like Confluent, enables faster, more cost-efficient pipelines by performing validation, enrichment, and even model inference before data reaches the lakehouse.

Importantly, this data streaming approach doesn’t replace the Medallion pattern—it complements it. It allows you to “shift left” critical logic, reducing duplication and latency while still feeding trusted, structured data into Delta Lake or other downstream systems for broader analytics and governance.

In other words, shifting data processing left (i.e., before it hits a data lake or Lakehouse) is especially valuable when the data needs to serve multiple downstream systems—operational and analytical alike—because it avoids duplication, reduces latency, and ensures consistent, high-quality data is available wherever it’s needed.

Shift Left Architecture: Process Earlier, Share Faster

In a Shift Left Architecture, data processing happens earlier—closer to the source, both physically and logically. This often means:

  • Transforming and validating data as it streams in
  • Enriching and filtering in real time
  • Sharing clean, usable data quickly across teams AND different technologies/applications

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

This is especially useful for:

  • Reducing time to insight
  • Improving data quality at the source
  • Creating reusable, consistent data products
  • Operational workloads with critical SLAs

How Confluent Enables Shift Left with Databricks

In a Shift Left setup, Apache Kafka provides scalable, low-latency, and truly decoupled ingestion of data across operational and analytical systems, forming the backbone for unified data pipelines.

Schema Registry and data governance policies enforce consistent, validated data across all streams, ensuring high-quality, secure, and compliant data delivery from the very beginning.

Apache Flink enables early data processing — closer to where data is produced. This reduces complexity downstream, improves data quality, and allows real-time decisions and analytics.

Shift Left Architecture with Confluent Databricks and Delta Lake

Data Quality Governance via Data Contracts and Schema Validation

Flink can enforce data contracts by validating incoming records against predefined schemas (e.g., using JSON Schema, Apache Avro or Protobuf with Schema Registry). This ensures structurally valid data continues through the pipeline. In cases where schema violations occur, records can be automatically routed to a Dead Letter Queue (DLQ) for inspection.

Confluent Schema Registry for good Data Quality, Policy Enforcement and Governance using Apache Kafka
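
The same enforcement pattern can be sketched outside Flink with a plain Kafka consumer, which keeps the example short. The snippet below validates JSON records against a schema and routes violations to a dead letter queue topic; the schema, topic names, and broker address are illustrative assumptions, and in a Confluent/Flink deployment this logic would live in the stream processing job or be handled by Schema Registry rules.

```python
import json

from confluent_kafka import Consumer, Producer
from jsonschema import ValidationError, validate  # pip install jsonschema

ORDER_SCHEMA = {
    "type": "object",
    "required": ["order_id", "amount"],
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
    },
}

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "contract-enforcer",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders.raw"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    try:
        record = json.loads(msg.value())
        validate(instance=record, schema=ORDER_SCHEMA)
        producer.produce("orders.valid", value=msg.value())
    except (ValueError, ValidationError) as err:
        # Contract violation: route to the dead letter queue with the error attached.
        producer.produce("orders.dlq", value=msg.value(),
                         headers=[("error", str(err).encode("utf-8"))])
    producer.poll(0)
```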

Additionally, data contracts can enforce policy-based rules at the schema level—such as field-level encryption, masking of sensitive data (PII), type coercion, or enrichment defaults. These controls help maintain compliance and reduce risk before data reaches regulated or shared environments.

Flink can perform the following tasks before data ever lands in a data lake or warehouse:

Filtering and Routing

Events can be filtered based on business rules and routed to the appropriate downstream system or Kafka topic. This allows different consumers to subscribe only to relevant data, optimizing both performance and cost.

Metric Calculation

Use Flink to compute rolling aggregates (e.g., counts, sums, averages, percentiles) over windows of data in motion. This is useful for business metrics, anomaly detection, or feeding real-time dashboards—without waiting for batch jobs.

Real-Time Joins and Enrichment

Flink supports both stream-stream and stream-table joins. This enables real-time enrichment of incoming events with contextual information from reference data (e.g., user profiles, product catalogs, pricing tables), often sourced from Kafka topics, databases, or external APIs.

Streaming ETL with Apache Flink SQL
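
The following PyFlink sketch combines these steps: it reads a Kafka topic, filters out invalid amounts, and computes one-minute totals per account into a second topic. It assumes the Flink Kafka SQL connector is available to the job; topic names, fields, and endpoints are placeholders. Stream-table joins against reference data (for example, a customer profile topic) follow the same SQL pattern with an additional JOIN clause.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Assumes the Flink Kafka SQL connector JAR is available to the job.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE payments (
        account_id STRING,
        amount     DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'payments.raw',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'shift-left-etl',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE payment_metrics (
        account_id   STRING,
        window_start TIMESTAMP(3),
        total_amount DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'payments.metrics',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Filter out non-positive amounts and compute one-minute rolling totals per account.
t_env.execute_sql("""
    INSERT INTO payment_metrics
    SELECT
        account_id,
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        SUM(amount) AS total_amount
    FROM payments
    WHERE amount > 0
    GROUP BY account_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""").wait()  # blocks while the streaming job runs
```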

By shifting this logic to the beginning of the pipeline, teams can reduce duplication, avoid unnecessary storage and compute costs in downstream systems, and ensure that data products are clean, policy-compliant, and ready for both operational and analytical use—as soon as they are created.

Example: A financial application might use Flink to calculate running balances, detect anomalies, and enrich records with reference data before pushing to Databricks for reporting and training analytic models.

In addition to enhancing data quality and reducing time-to-insight in the lakehouse, this approach also makes data products immediately usable for operational workloads and downstream applications—without building separate pipelines.

Learn more about stateless and stateful stream processing in real-time architectures using Apache Flink in this in-depth blog post.

Combining Shift Left with Medallion Architecture

These architectures are not mutually exclusive. Shift Left is about processing data earlier. Medallion is about organizing data once it arrives.

You can use Shift Left principles to:

  • Pre-process operational data before it enters the Bronze layer
  • Ensure clean, validated data enters Silver with minimal transformation needed
  • Reduce the need for redundant processing steps between layers

Confluent’s Tableflow bridges the two worlds. It converts Kafka streams into Delta tables, integrating cleanly with the Medallion model while supporting real-time flows.

Shift Left with Delta Lake, Iceberg, and Tableflow

Confluent Tableflow makes it easy to publish Kafka streams into Delta Lake or Apache Iceberg formats. These can be discovered and queried inside Databricks via Unity Catalog.

This integration:

  • Simplifies integration, governance and discovery
  • Enables live updates to AI features and dashboards
  • Removes the need to manage Spark streaming jobs

This is a natural bridge between a data streaming platform and the lakehouse.

Confluent Tableflow to Unify Operational and Analytical Workloads with Apache Iceberg and Delta Lake
Source: Confluent
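
On the Databricks side, a Tableflow-materialized topic then looks like any other governed table. A minimal sketch, assuming a hypothetical Unity Catalog table name and a Databricks runtime:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

# Hypothetical catalog.schema.table published via Tableflow.
orders = spark.read.table("main.streaming.orders_enriched")

orders.groupBy("status").count().show()
```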

AI Use Cases for Shift Left with Confluent and Databricks

The Shift Left model benefits both predictive and generative AI:

  • Model training: Real-time data pipelines can stream features to Delta Lake
  • Model inference: In some cases, predictions can happen in Confluent (via Flink) and be pushed back to operational systems instantly
  • Agentic AI: Real-time event-driven architectures are well suited for next-gen, stateful agents

Databricks supports model training and hosting via MosaicML. Confluent can integrate with these models, or run lightweight inference directly from the stream processing application.

Data Warehouse Use Cases for Shift Left with Confluent and Databricks

  • Batch reporting: Continue using Databricks for traditional BI
  • Real-time analytics: Flink or real-time OLAP engines (e.g., Apache Pinot, Apache Druid) may be a better fit for sub-second insights
  • Hybrid: Push raw events into Databricks for historical analysis and use Flink for immediate feedback

Where you do the data processing depends on the use case.

Architecture Benefits Beyond Technology

Shift Left also brings architectural benefits:

  • Cost Reduction: Processing early can lower storage and compute usage
  • Faster Time to Market: Data becomes usable earlier in the pipeline
  • Reusability: Processed streams can be reused and consumed by multiple technologies/applications (not just Databricks teams)
  • Compliance and Governance: Validated data with lineage can be shared with confidence

These are important for strategic enterprise data architectures.

Bringing in New Types of Data

Shift Left with a data streaming platform supports a wider range of data sources:

  • Operational databases (like Oracle, DB2, SQL Server, Postgres, MongoDB)
  • ERP systems (SAP et al)
  • Mainframes and other legacy technologies
  • IoT interfaces (MQTT, OPC-UA, proprietary IIoT gateway, etc.)
  • SaaS platforms (Salesforce, ServiceNow, and so on)
  • Any other system that does not directly fit into the “table-driven analytics perspective” of a Lakehouse

With Confluent, these interfaces can be connected in real time, enriched at the edge or in transit, and delivered to analytics platforms like Databricks.

This expands the scope of what’s possible with AI and analytics.

Shift Left Using ONLY Databricks

A shift left architecture built only with Databricks is possible, too. A Databricks consultant took my Shift Left slide and adapted it accordingly:

Shift Left Architecture with Databricks and Delta Lake

 

Relying solely on Databricks for a “Shift Left Architecture” can work if all workloads can (and should) stay within the platform — but it’s a poor fit for many real-world scenarios.

Databricks focuses on ELT, not true ETL, and lacks native support for operational workloads like APIs, low-latency apps, or transactional systems. This forces teams to rely on reverse ETL tools – a clear anti-pattern in enterprise architecture – just to get data where it’s actually needed. The result: added complexity, latency, and tight coupling.

The Shift Left Architecture is valuable, but in most cases it requires a modular approach, where streaming, operational, and analytical components work together — not a monolithic platform.

That said, shift left principles still apply within Databricks. Processing data as early as possible improves data quality, reduces overall compute cost, and minimizes downstream data engineering effort. For teams that operate fully inside the Databricks ecosystem, shifting left remains a powerful strategy to simplify pipelines and accelerate insight.

Meesho: Scaling a Real-Time Commerce Platform with Confluent and Databricks

Many high-growth digital platforms adopt a shift-left approach out of necessity—not as a buzzword, but to reduce latency, improve data quality, and scale efficiently by processing data closer to the source.

Meesho, one of India’s largest online marketplaces, relies on Confluent and Databricks to power its hyper-growth business model focused on real-time e-commerce. As the company scaled rapidly, supporting millions of small businesses and entrepreneurs, the need for a resilient, scalable, and low-latency data architecture became critical.

To handle massive volumes of operational events — from inventory updates to order management and customer interactions — Meesho turned to Confluent Cloud. By adopting a fully managed data streaming platform using Apache Kafka, Meesho ensures real-time event delivery, improved reliability, and faster application development. Kafka serves as the central nervous system for their event-driven architecture, connecting multiple services and enabling instant, context-driven customer experiences across mobile and web platforms.

Alongside their data streaming architecture, Meesho migrated from Amazon Redshift to Databricks to build a next-generation analytics platform. Databricks’ lakehouse architecture empowers Meesho to unify operational data from Kafka with batch data from other sources, enabling near real-time analytics at scale. This migration not only improved performance and scalability but also significantly reduced costs and operational overhead.

With Confluent managing real-time event processing and ingestion, and Databricks providing powerful, scalable analytics, Meesho is able to:

  • Deliver real-time personalized experiences to customers
  • Optimize operational workflows based on live data
  • Enable faster, data-driven decision-making across business teams

By combining real-time data streaming with advanced lakehouse analytics, Meesho has built a flexible, future-ready data infrastructure to support its mission of democratizing online commerce for millions across India.

Shift Left: Reducing Complexity, Increasing Value for the Lakehouse (and other Operational Systems)

Shift Left is not about replacing Databricks. It’s about preparing better data earlier in the pipeline—closer to the source—and reducing end-to-end complexity.

  • Use Confluent for real-time ingestion, enrichment, and transformation
  • Use Databricks for advanced analytics, reporting, and machine learning
  • Use Tableflow and Delta Lake to govern and route high-quality data to the right consumers

This architecture not only improves data quality for the lakehouse, but also enables the same real-time data products to be reused across multiple downstream systems—including operational, transactional, and AI-powered applications.

The result: increased agility, lower costs, and scalable innovation across the business.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including more details about the shift left architecture with data streaming and lakehouses.

Confluent Data Streaming Platform vs. Databricks Data Intelligence Platform for Data Integration and Processing
https://www.kai-waehner.de/blog/2025/05/05/confluent-data-streaming-platform-vs-databricks-data-intelligence-platform-for-data-integration-and-processing/
Mon, 05 May 2025 03:47:21 +0000

This blog explores how Confluent and Databricks address data integration and processing in modern architectures. Confluent provides real-time, event-driven pipelines connecting operational systems, APIs, and batch sources with consistent, governed data flows. Databricks specializes in large-scale batch processing, data enrichment, and AI model development. Together, they offer a unified approach that bridges operational and analytical workloads. Key topics include ingestion patterns, the role of Tableflow, the shift-left architecture for earlier data validation, and real-world examples like Uniper’s energy trading platform powered by Confluent and Databricks.

Many organizations use both Confluent and Databricks. While these platforms serve different primary goals—real-time data streaming vs. analytical processing—there are areas where they overlap. This blog explores how the Confluent Data Streaming Platform (DSP) and the Databricks Data Intelligence Platform handle data integration and processing. It explains their different roles, where they intersect, and when one might be a better fit than the other.

Confluent and Databricks for Data Integration and Stream Processing

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Future articles will explore how these platforms affect data use in businesses. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

Data Integration and Processing: Shared Space, Different Strengths

Confluent is focused on continuous, event-based data movement and processing. It connects to hundreds of real-time and non-real-time data sources and targets. It enables low-latency stream processing using Apache Kafka and Flink, forming the backbone of an event-driven architecture. Databricks, on the other hand, combines data warehousing, analytics, and machine learning on a unified, scalable architecture.

Confluent: Event-Driven Integration Platform

Confluent is increasingly used as modern operational middleware, replacing traditional message queues (MQ) and enterprise service buses (ESB) in many enterprise architectures.

Thanks to its event-driven foundation, it supports not just real-time event streaming but also integration with request/response APIs and batch-based interfaces. This flexibility allows enterprises to standardize on the Kafka protocol as the data hub—bridging asynchronous event streams, synchronous APIs, and legacy systems. The immutable event store and true decoupling of producers and consumers help maintain data consistency across the entire pipeline, regardless of whether data flows in real time, in scheduled batches, or via API calls.

Batch Processing vs Event-Driven Architecture with Continuous Data Streaming

Databricks: Batch-Driven Analytics and AI Platform

Databricks excels in batch processing and traditional ELT workloads. It is optimized for storing data first and then transforming it within its platform, but it’s not built as a real-time ETL tool for directly connecting to operational systems or handling complex, upstream data mappings.

Databricks enables data transformations at scale, supporting complex joins, aggregations, and data quality checks over large historical datasets. Its Medallion Architecture (Bronze, Silver, Gold layers) provides a structured approach to incrementally refine and enrich raw data for analytics and reporting. The engine is tightly integrated with Delta Lake and Unity Catalog, ensuring governed and high-performance access to curated datasets for data science, BI, and machine learning.

For most use cases, the right choice is simple.

  • Confluent is ideal for building real-time pipelines and unifying operational systems.
  • Databricks is optimized for batch analytics, warehousing, and AI development.

Together, Confluent and Databricks cover both sides of the modern data architecture—streaming and batch, operational and analytical. And Confluent’s Tableflow and a shift-left architecture enable native integration with earlier data validation, simplified pipelines, and faster access to AI-ready data.

Data Ingestion Capabilities

Databricks recently introduced LakeFlow Connect and acquired Arcion to strengthen its capabilities around Change Data Capture (CDC) and data ingestion into Delta Lake. These are good steps toward improving integration, particularly for analytical use cases.

However, Confluent is the industry leader in operational data integration, serving as modern middleware for connecting mainframes, ERP systems, IoT devices, APIs, and edge environments. Many enterprises have already standardized on Confluent to move and process operational data in real time with high reliability and low latency.

Introducing yet another tool—especially for ETL and ingestion—creates unnecessary complexity. It risks a return to Lambda-style architectures, where separate pipelines must be built and maintained for real-time and batch use cases. This increases engineering overhead, inflates cost, and slows time to market.

Lambda Architecture - Separate ETL Pipelines for Real Time and Batch Processing

In contrast, Confluent supports a Kappa architecture model: a single, unified event-driven data streaming pipeline that powers both operational and analytical workloads. This eliminates duplication, simplifies the data flow, and enables consistent, trusted data delivery from source to sink.

Kappa Architecture - Single Data Integration Pipeline for Real Time and Batch Processing

Confluent for Data Ingestion into Databricks

Confluent’s integration capabilities provide:

  • 100+ enterprise-grade connectors, including SAP, Salesforce, and mainframe systems
  • Native CDC support for Oracle, SQL Server, PostgreSQL, MongoDB, Salesforce, and more
  • Flexible integration via Kafka Clients for any relevant programming language, REST/HTTP, MQTT, JDBC, and other APIs
  • Support for operational sinks (not just analytics platforms)
  • Built-in governance, durability, and replayability

A good example: Confluent’s Oracle CDC Connector uses Oracle’s XStream API and delivers “GoldenGate-level performance”, with guaranteed ordering, high throughput, and minimal latency. This enables real-time delivery of operational data into Kafka, Flink, and downstream systems like Databricks.
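
For illustration, such a connector is typically registered through the Kafka Connect REST API. The Python sketch below shows the general pattern. The connector class and property names are assumptions modeled on Confluent's documented Oracle CDC Source connector and vary by version and environment, so treat them as placeholders rather than a definitive configuration.

```python
import json
import requests  # assumes a reachable Kafka Connect REST endpoint

CONNECT_URL = "http://localhost:8083/connectors"  # placeholder endpoint

# Illustrative config only: connector class and property names are assumptions
# and must be checked against the installed connector version and docs.
connector = {
    "name": "oracle-cdc-orders",
    "config": {
        "connector.class": "io.confluent.connect.oracle.cdc.OracleCdcSourceConnector",
        "oracle.server": "oracle.example.com",      # placeholder host
        "oracle.port": "1521",
        "oracle.sid": "ORCLCDB",
        "oracle.username": "cdc_user",
        "oracle.password": "********",
        "table.inclusion.regex": "ORCLCDB\\.SALES\\.ORDERS",
        "start.from": "snapshot",                   # initial load, then redo log
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://localhost:8081",
    },
}

# Register the connector; Connect starts streaming change events into Kafka.
resp = requests.post(CONNECT_URL, json=connector, timeout=30)
resp.raise_for_status()
print("Connector created:", resp.json()["name"])
```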

Bottom line: Confluent offers the most mature, scalable, and flexible ingestion capabilities into Databricks—especially for real-time operational data. For enterprises already using Confluent as the central nervous system of their architecture, adding another ETL layer specifically for the lakehouse integration with weaker coverage and SLAs only slows progress and increases cost.

Stick with a unified approach—fewer moving parts, faster implementation, and end-to-end consistency.

Real-Time vs. Batch: When to Use Each

Batch ETL is well understood. It works fine when data does not need to be processed immediately—e.g., for end-of-day reports, monthly audits, or historical analysis.

Streaming ETL is best when data must be processed in motion. This enables real-time dashboards, live alerts, or AI features based on the latest information.

Confluent DSP is purpose-built for streaming ETL. Kafka and Flink allow filtering, transformation, enrichment, and routing in real time.
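
As a minimal sketch of such streaming ETL, the following Python snippet uses the confluent-kafka client to filter and enrich payment events in flight. The topic names, field names, and enrichment rule are illustrative assumptions; in production this logic would typically live in Flink SQL or Kafka Streams rather than a hand-rolled consumer loop.

```python
import json
from confluent_kafka import Consumer, Producer

conf = {"bootstrap.servers": "localhost:9092"}  # placeholder cluster
consumer = Consumer({**conf, "group.id": "payments-etl", "auto.offset.reset": "earliest"})
producer = Producer(conf)

consumer.subscribe(["payments.raw"])  # assumed input topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())

        # Filter: drop sub-cent noise and test transactions (illustrative rule)
        if event.get("amount", 0) < 0.01 or event.get("env") == "test":
            continue

        # Enrich: add a derived flag before routing the event downstream
        event["high_value"] = event["amount"] > 10_000

        producer.produce("payments.curated", key=msg.key(), value=json.dumps(event))
        producer.poll(0)  # serve delivery callbacks
finally:
    consumer.close()
    producer.flush()
```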

Databricks supports batch ELT natively. Delta Live Tables offers a managed way to build data pipelines on top of Spark, letting you declaratively define how data should be transformed and processed using SQL or Python. Spark Structured Streaming, on the other hand, can handle streaming data in near real time, but it still requires persistent clusters and infrastructure management.

If you’re already invested in Spark, Structured Streaming or Delta Live Tables might be sufficient. But if you’re starting fresh—or looking to simplify your architecture — Confluent’s Tableflow provides a more streamlined, Kafka-native alternative. Tableflow represents Kafka streams as Delta Lake tables. No cluster management. No offset handling. Just discoverable, governed data in Databricks Unity Catalog.

Real-Time and Batch: A Perfect Match at Walmart for Replenishment Forecasting in the Supply Chain

Walmart demonstrates how real-time and batch processing can work together to optimize a large-scale, high-stakes supply chain.

At the heart of this architecture is Apache Kafka, powering Walmart’s real-time inventory management and replenishment system.

Kafka serves as the central data hub, continuously streaming inventory updates, sales transactions, and supply chain events across Walmart’s physical stores and digital channels. This enables real-time replenishment to ensure product availability and timely fulfillment for millions of online and in-store customers.

Batch processing plays an equally important role. Apache Spark processes historical sales, seasonality trends, and external factors in micro-batches to feed forecasting models. These forecasts are used to generate accurate daily order plans across Walmart’s vast store network.

Replenishment Supply Chain Logistics at Walmart Retail with Apache Kafka and Spark
Source: Walmart

This hybrid architecture brings significant operational and business value:

  • Kafka provides not just low latency, but true decoupling between systems, enabling seamless integration across real-time streams, batch pipelines, and request-response APIs—ensuring consistent, reliable data flow across all environments
  • Spark delivers scalable, high-performance analytics to refine predictions and improve long-term planning
  • The result: reduced cycle times, better accuracy, increased scalability and elasticity, improved resiliency, and substantial cost savings

Walmart’s supply chain is just one of many use cases where Kafka powers real-time business processes, decisioning and workflow orchestration at global scale—proof that combining streaming and batch is key to modern data infrastructure.

Apache Flink supports both streaming and batch processing within the same engine. This enables teams to build unified pipelines that handle real-time events and batch-style computations without switching tools or architectures. In Flink, batch is treated as a special case of streaming—where a bounded stream (or a complete window of events) can be processed once all data has arrived.

This approach simplifies operations by avoiding the need for parallel pipelines or separate orchestration layers. It aligns with the principles of the shift-left architecture, allowing earlier processing, validation, and enrichment—closer to the data source. As a result, pipelines are more maintainable, scalable, and responsive.
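
A small PyFlink sketch illustrates the point: the same Table API query runs in streaming or batch execution mode simply by switching the environment settings. The table definition and aggregation are assumptions chosen for illustration.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Same query, two execution modes: unbounded streams vs. a bounded run.
settings = EnvironmentSettings.in_streaming_mode()   # or: EnvironmentSettings.in_batch_mode()
t_env = TableEnvironment.create(settings)

# Illustrative source backed by files; a Kafka source would make it unbounded.
t_env.execute_sql("""
    CREATE TABLE orders (
        customer_id STRING,
        amount      DOUBLE
    ) WITH (
        'connector' = 'filesystem',
        'path'      = '/data/orders',
        'format'    = 'csv'
    )
""")

# Identical logic in both modes: Flink treats batch as a bounded stream.
t_env.execute_sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
""").print()
```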

That said, batch processing is not going away—nor should it. For many use cases, batch remains the most practical solution. Examples include:

  • Daily financial reconciliations
  • End-of-day retail reporting
  • Weekly churn model training
  • Monthly compliance and audit jobs

In these cases, latency is not critical, and workloads often involve large volumes of historical data or complex joins across datasets.

This is where Databricks excels—especially with its Delta Lake and Medallion architecture, which structures raw, refined, and curated data layers for high-performance analytics, BI, and AI/ML training.

In summary, Flink offers the flexibility to consolidate streaming and batch pipelines, making it ideal for unified data processing. But when batch is the right choice—especially at scale or with complex transformations—Databricks remains a best-in-class platform. The two technologies are not mutually exclusive. They are complementary parts of a modern data stack.

Streaming CDC and Lakehouse Analytics

Streaming CDC is a key integration pattern. It captures changes from operational databases and pushes them into analytics platforms. But CDC isn’t limited to databases. CDC is just as important for business applications like Salesforce, where capturing customer updates in real time enables faster, more responsive analytics and downstream actions.

Confluent is well suited for this. Kafka Connect and Flink can continuously stream changes. These change events are sent to Databricks as Delta tables using Tableflow. Streaming CDC ensures:

  • Data consistency across operational and analytical workloads leveraging a single data pipeline
  • Reduced ETL / ELT lag
  • Near real-time updates to BI dashboards
  • Timely training of AI/ML models

Streaming CDC also avoids data duplication, reduces latency, and minimizes storage costs.
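
To make the pattern concrete, the sketch below consumes Debezium-style change events from a CDC topic and forwards only the latest row state to an analytics-facing, compacted topic. The envelope fields (op, after) follow the common Debezium convention and are assumptions here; the exact payload depends on the connector in use.

```python
import json
from confluent_kafka import Consumer, Producer

conf = {"bootstrap.servers": "localhost:9092"}  # placeholder cluster
consumer = Consumer({**conf, "group.id": "cdc-to-lakehouse", "auto.offset.reset": "earliest"})
producer = Producer(conf)
consumer.subscribe(["cdc.salesforce.accounts"])  # assumed CDC topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    change = json.loads(msg.value())

    op = change.get("op")  # 'c' = create, 'u' = update, 'd' = delete (Debezium-style)
    if op in ("c", "u"):
        state = change.get("after", {})
        # Forward the latest row state; a compacted topic keeps one value per key.
        producer.produce("accounts.latest", key=msg.key(), value=json.dumps(state))
    elif op == "d":
        # A tombstone signals deletion to downstream consumers and log compaction.
        producer.produce("accounts.latest", key=msg.key(), value=None)
    producer.poll(0)
```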

Reverse ETL: An (Anti) Pattern to Avoid with Confluent and Databricks

Some architectures push data from data lakes or warehouses back into operational systems using reverse ETL. While this may appear to bridge the analytical and operational worlds, it often leads to increased latency, duplicate logic, and fragile point-to-point workflows. These tools typically reprocess data that was already transformed once, leading to inefficiencies, governance issues, and unclear data lineage.

Reverse ETL is an architectural anti-pattern. It violates the principles of an event-driven system. Rather than reacting to events as they happen, reverse ETL introduces delays and additional moving parts—pushing stale insights back into systems that expect real-time updates.

Data at Rest and Reverse ETL

With the upcoming bidirectional integration of Tableflow with Delta Lake, these issues can be avoided entirely. Insights generated in Databricks—from analytics, machine learning, or rule-based engines—can be pushed directly back into Kafka topics.

This approach removes the need for reverse ETL tools, reduces system complexity, and ensures that both operational and analytical layers operate on a shared, governed, and timely data foundation.

It also brings lineage, schema enforcement, and observability into both directions of data flow—streamlining feedback loops and enabling true event-driven decisioning across the enterprise.

In short: Don’t pull data back into operational systems after the fact. Push insights forward at the speed of events.

Multi-Cloud and Hybrid Integration with an Event-Driven Architecture

Confluent is designed for distributed data movement across environments in real-time for operational and analytical use cases:

  • On-prem, cloud, and edge
  • Multi-region and multi-cloud
  • Support for SaaS, BYOC, and private networking

Features like Cluster Linking and Schema Registry ensure consistent replication and governance across environments.
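
As an example of how Schema Registry enforces governance at produce time, here is a hedged Python sketch using the confluent-kafka Avro serializer. The schema, topic, and endpoints are illustrative assumptions; the point is that every record is validated against a registered, versioned schema before it enters the stream.

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Placeholder endpoints for the local cluster and Schema Registry.
sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})

# Illustrative schema for a telemetry event; registered and versioned centrally.
schema_str = """
{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "device_id", "type": "string"},
    {"name": "temperature", "type": "double"},
    {"name": "ts", "type": "long"}
  ]
}
"""

avro_serializer = AvroSerializer(sr_client, schema_str)
producer = Producer({"bootstrap.servers": "localhost:9092"})

reading = {"device_id": "edge-42", "temperature": 21.7, "ts": 1717315200000}
producer.produce(
    topic="iot.sensor.readings",
    key="edge-42",
    value=avro_serializer(reading, SerializationContext("iot.sensor.readings", MessageField.VALUE)),
)
producer.flush()
```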

Databricks runs only in the cloud. It supports hybrid access and partner integrations. But the platform is not built for event-driven data distribution across hybrid environments.

In a hybrid architecture, Confluent acts as the bridge. It moves operational data securely and reliably. Then, Databricks can consume it for analytics and AI use cases. Here is an example architecture for industrial IoT use cases:

Data Streaming and Lakehouse with Confluent and Databricks for Hybrid Cloud and Industrial IoT

Uniper: Real-Time Energy Trading with Confluent and Databricks

Uniper, a leading international energy company, leverages Confluent and Databricks to modernize its energy trading operations.

Uniper - The beating of energy

I covered the value of data streaming with Apache Kafka and Flink for energy trading in a dedicated blog post already.

Confluent Cloud with Apache Kafka and Apache Flink provides a scalable real-time data streaming foundation for Uniper, enabling efficient ingestion and processing of market data, IoT sensor inputs, and operational events. This setup supports the full trading lifecycle, improving decision-making, risk management, and operational agility.

Apache Kafka and Flink integrated into the Uniper IT landscape

Within its Azure environment, Uniper uses Databricks to empower business users to rapidly build trading decision-support tools and advanced analytics applications. By combining a self-service data platform with scalable processing power, Uniper significantly reduces the lead time for developing data apps—from weeks to just minutes.

To deliver real-time insights to its teams, Uniper also leverages Plotly’s Dash Enterprise, creating interactive dashboards that consolidate live data from Databricks, Kafka, Snowflake, and various databases. This end-to-end integration enables dynamic, collaborative workflows, giving analysts and traders fast, actionable insights that drive smarter, faster trading strategies.

By combining real-time data streaming, advanced analytics, and intuitive visualization, Uniper has built a resilient, flexible data architecture that meets the demands of today’s fast-moving energy markets.

From Ingestion to Insight: Modern Data Integration and Processing for AI with Confluent and Databricks

While both platforms can handle integration and processing, their roles are different:

  • Use Confluent when you need real-time ingestion and processing of operational and analytical workloads, or data delivery across systems and clouds.
  • Use Databricks for AI workloads, analytics and data warehousing.

When used together, Confluent and Databricks form a complete data integration and processing pipeline for AI and analytics:

  1. Confluent ingests and processes operational data in real time.
  2. Tableflow pushes this data into Delta Lake in a discoverable, secure format.
  3. Databricks performs analytics and model development.
  4. Tableflow (bidirectional) pushes insights or AI models back into Kafka for use in operational systems.

This is the foundation for modern data and AI architectures—real-time pipelines feeding intelligent applications.

Stay tuned for deep dives into how these platforms are shaping the future of data-driven enterprises. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

The post Confluent Data Streaming Platform vs. Databricks Data Intelligence Platform for Data Integration and Processing appeared first on Kai Waehner.

]]>
Real-Time Data Sharing in the Telco Industry for MVNO Growth and Beyond with Data Streaming https://www.kai-waehner.de/blog/2025/04/30/real-time-data-sharing-in-the-telco-industry-for-mvno-growth-and-beyond-with-data-streaming/ Wed, 30 Apr 2025 07:04:07 +0000 https://www.kai-waehner.de/?p=7786 The telecommunications industry is transforming rapidly as Telcos expand partnerships with MVNOs, IoT platforms, and enterprise customers. Traditional batch-driven architectures can no longer meet the demands for real-time, secure, and flexible data access. This blog explores how real-time data streaming technologies like Apache Kafka and Flink, combined with hybrid cloud architectures, enable Telcos to build trusted, scalable data ecosystems. It covers the key components of a modern data sharing platform, critical use cases across the Telco value chain, and how policy-driven governance and tailored data products drive new business opportunities, operational excellence, and regulatory compliance. Mastering real-time data sharing positions Telcos to turn raw events into strategic advantage faster and more securely than ever before.

The post Real-Time Data Sharing in the Telco Industry for MVNO Growth and Beyond with Data Streaming appeared first on Kai Waehner.

]]>
The telecommunications industry is entering a new era. Partnerships with MVNOs, IoT platforms, and enterprise customers demand flexible, secure, and real-time access to network and customer data. Traditional batch-driven architectures are no longer sufficient. Instead, real-time data streaming combined with policy-driven data sharing provides a powerful foundation for building scalable data products for internal and external consumers. A modern Telco must manage data collection, processing, governance, data sharing, and distribution with the same rigor as its core network services. Leading Telcos now operate centralized real-time data streaming platforms to integrate and share network events, subscriber information, billing records, and telemetry from thousands of data sources across the edge and core networks.

Data Sharing for MVNO Growth and Beyond with Data Streaming in the Telco Industry

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including a dedicated chapter about the telco industry.

Data Streaming in the Telco Industry

Telecommunications networks generate vast amounts of data every second. Every call, message, internet session, device interaction, and network event produces valuable information. Historically, much of this data was processed in batches — often hours or even days after it was collected. This delayed model no longer meets the needs of modern Telcos, partners, and customers.

Data streaming transforms how Telcos handle information. Instead of storing and processing data later, it is ingested, processed, and acted upon in real time as it is generated. This enables continuous intelligence across all parts of the network and business.

Learn more about “The Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming)“.

Business Value of Data Streaming in the Telecom Sector

Key benefits of data streaming for Telcos include:

  • Real-Time Visibility: Immediate insight into network health, customer behavior, fraud attempts, and service performance.
  • Operational Efficiency: Faster detection and resolution of issues reduces downtime, improves customer satisfaction, and lowers operating costs.
  • New Revenue Opportunities: Real-time data enables new services such as dynamic pricing, personalized offers, and proactive customer support.
  • Enhanced Security and Compliance: Immediate anomaly detection and instant auditability support regulatory requirements and protect against cyber threats.

Technologies like Apache Kafka and Apache Flink are now core components of Telco IT architectures. They allow Telcos to integrate massive, distributed data flows from radio access networks (RAN), 5G core systems, IoT ecosystems, billing and support platforms, and customer devices.

Modern Telcos use data streaming to not only improve internal operations but also to deliver trusted, secure, and differentiated services to external partners such as MVNOs, IoT platforms, and enterprise customers.

Learn More about Data Streaming in Telco

Learn more about data streaming in the telecommunications sector:

Data streaming is not a cure-all for every problem. Hence, a modern enterprise architecture combines data streaming with purpose-built telco-specific platforms and SaaS solutions, as well as data lakes/warehouses/lakehouses like Snowflake or Databricks for analytical workloads.

I already wrote about combining data streaming platforms like Confluent with Snowflake and Microsoft Fabric. A blog series about data streaming with Confluent combined with AI and analytics using Databricks follows right after this blog post.

Building a Real-Time Data Sharing Platform in the Telco Industry with Data Streaming

By mastering real-time data streaming, Telcos unlock the ability to share valuable insights securely and efficiently with internal divisions, IoT platforms, and enterprise customers.

Mobile Virtual Network Operators (MVNOs) — companies that offer mobile services without owning their own network infrastructure — are an equally important group of consumers. Because MVNOs deliver niche services, competitive pricing, and tailored customer experiences, real-time data sharing becomes essential to support their growth and enable differentiation in a highly competitive market.

Real-Time Data Sharing Between Organizations Is Necessary in the Telco Industry

A strong real-time data sharing platform in the telco industry integrates multiple types of components and stakeholders, organized into four critical areas:

Data Sources

A real-time data platform aggregates information from a wide range of technical systems across the Telco infrastructure.

  • Radio Access Network (RAN) Metrics: Capture real-time information about signal quality, handovers, and user session performance.
  • 5G Core Network Functions: Manage traffic flows, session lifecycles, and device mobility through UPF, SMF, and AMF components.
  • Operational Support Systems (OSS) and Business Support Systems (BSS): Provide data for service assurance, provisioning, customer management, and billing processes.
  • IoT Devices: Send continuous telemetry data from connected vehicles, industrial assets, healthcare monitors, and consumer electronics.
  • Customer Premises Equipment (CPE): Supply performance and operational data from routers, gateways, modems, and set-top boxes.
  • Billing Events: Stream usage records, real-time charging information, and transaction logs to support accurate billing.
  • Customer Profiles: Update subscription plans, user preferences, device types, and behavioral attributes dynamically.
  • Security Logs: Capture authentication events, threat detections, network access attempts, and audit trail information.

Stream Processing

Stream processing technologies ensure raw events are turned into enriched, actionable data products as they move through the system.

  • Real-Time Data Ingestion: Continuously collect and process events from all sources with low latency and high reliability.
  • Data Aggregation and Enrichment: Transform raw network, billing, and device data into structured, valuable datasets.
  • Actionable Data Products: Create enriched, ready-to-consume information for operational and business use cases across the ecosystem.
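
As a hedged sketch of such an aggregation and enrichment step, the PyFlink job below rolls raw RAN metrics up into per-cell signal quality over one-minute windows and writes the result back to a curated topic that downstream consumers can treat as a data product. Topic names, fields, and connector options are assumptions for illustration.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Raw RAN metrics landing in Kafka as JSON (assumed layout).
t_env.execute_sql("""
    CREATE TABLE ran_metrics (
        cell_id     STRING,
        rsrp_dbm    DOUBLE,
        event_time  TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'telco.ran.metrics',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Curated data product: per-cell signal quality in one-minute tumbling windows.
t_env.execute_sql("""
    CREATE TABLE cell_quality (
        cell_id      STRING,
        window_start TIMESTAMP(3),
        avg_rsrp_dbm DOUBLE,
        samples      BIGINT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'telco.cell.quality',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Submits a continuous streaming job that keeps the data product up to date.
t_env.execute_sql("""
    INSERT INTO cell_quality
    SELECT cell_id,
           TUMBLE_START(event_time, INTERVAL '1' MINUTE),
           AVG(rsrp_dbm),
           COUNT(*)
    FROM ran_metrics
    GROUP BY cell_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
```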

Data Governance

Effective governance frameworks guarantee that data sharing is secure, compliant, and aligned with commercial agreements.

  • Policy-Based Access Control: Enforce business, regulatory, and contractual rules on how data is shared internally and externally.
  • Data Protection Techniques: Apply masking, anonymization, and encryption to secure sensitive information at every stage.
  • Compliance Assurance: Meet regulatory requirements like GDPR, CCPA, and telecom-specific standards through real-time monitoring and enforcement.
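
Policy-based access control ultimately materializes as concrete authorization rules on topics and consumer groups. The sketch below uses the confluent-kafka AdminClient to grant an MVNO partner read access to a prefixed set of curated topics; the principal, topic prefix, and exact constructor arguments are assumptions and should be checked against the installed client version.

```python
from confluent_kafka.admin import (
    AdminClient, AclBinding, AclOperation, AclPermissionType,
    ResourcePatternType, ResourceType,
)

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder cluster

# Assumed policy: the MVNO partner may read all curated data products published
# under the "dataproducts.mvno." prefix, but nothing else.
acl = AclBinding(
    ResourceType.TOPIC,
    "dataproducts.mvno.",
    ResourcePatternType.PREFIXED,
    "User:mvno-partner-a",
    "*",
    AclOperation.READ,
    AclPermissionType.ALLOW,
)

futures = admin.create_acls([acl])
for binding, future in futures.items():
    future.result()  # raises if the broker rejected the ACL
    print("ACL created:", binding)
```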

Data Consumers

Multiple internal and external stakeholders rely on tailored, policy-controlled access to real-time data streams to achieve business outcomes.

  • MVNO Partners: Consume real-time network metrics, subscriber insights, and fraud alerts to offer better customer experiences and safeguard operations.
  • Internal Telco Divisions: Use operational data to improve network uptime, optimize marketing initiatives, and detect revenue leakage early.
  • IoT Platform Services: Rely on device telemetry and mobility data to improve fleet management, predictive maintenance, and automated operations.
  • Enterprise Customers: Integrate real-time network insights and SLA compliance monitoring into private network and corporate IT systems.
  • Regulatory and Compliance Bodies: Access live audit streams, security incident data, and privacy-preserving compliance reports as required by law.

Key Data Products Driving Value for Data Sharing in the Telco Industry

In modern Telco architectures, data products act as the building blocks for a data mesh approach, enabling decentralized ownership, scalable integration with microservices, and direct access for consumers across the business and partner ecosystem.

Data Sharing in Telco with a Data Mesh and Data Products using Data Streaming with Apache Kafka

The right data products accelerate time-to-insight and enable additional revenue streams. Leading Telcos typically offer:

  • Network Quality Metrics: Monitoring service degradation, latency spikes, and coverage gaps continuously.
  • Customer Behavior Analytics: Tracking app usage, mobility patterns, device types, and engagement trends.
  • Fraud and Anomaly Detection Feeds: Capturing unusual usage, SIM swaps, or suspicious roaming activities in real time.
  • Billing and Charging Data Streams: Delivering session records and consumption details instantly to billing systems or MVNO partners.
  • Device Telemetry and Health Data: Providing operational status and error signals from smartphones, CPE, and IoT devices.
  • Subscriber Profile Updates: Streaming changes in service plans, device upgrades, or user preferences.
  • Location-Aware Services Data: Powering geofencing, smart city applications, and targeted marketing efforts.
  • Churn Prediction Models: Scoring customer retention risks based on usage behavior and network experience.
  • Network Capacity and Traffic Forecasts: Helping optimize resource allocation and investment planning.
  • Policy Compliance Monitoring: Ensuring real-time validation of internal and external SLAs, privacy agreements, and regulatory requirements.

These data products can be offered via APIs, secure topics, or integrated into partner platforms for direct consumption.

How Each Data Consumer Gains Strategic Value

Real-time data streaming empowers each data consumer within the Telco ecosystem to achieve specific business outcomes, drive operational excellence, and unlock new growth opportunities based on continuous, trusted insights.

Internal Telco Divisions

Real-time insights into network behavior allow proactive incident management and customer support. Marketing teams optimize campaigns based on live subscriber data, while finance teams minimize revenue leakage by tracking billing and usage patterns instantly.

MVNO Partners

Access to live network quality indicators helps MVNOs improve customer satisfaction and loyalty. Real-time fraud monitoring protects against financial losses. Tailored subscriber insights enable MVNOs to offer personalized plans and upsells based on actual usage.

IoT Platform Services

Large-scale telemetry streaming enables better device management, predictive maintenance, and operational automation. Real-time geolocation data improves logistics, fleet management, and smart infrastructure performance. Event-driven alerts help detect and resolve device malfunctions rapidly.

Enterprise Customers

Private 5G networks and managed services depend on live analytics to meet SLA obligations. Enterprises integrate real-time network telemetry into their own systems for smarter decision-making. Data-driven optimizations ensure higher uptime, better resource utilization, and enhanced customer experiences.

Building a Trusted Data Ecosystem for Telcos with Real-Time Streaming and Hybrid Cloud

Real-time data sharing is no longer a luxury for Telcos — it is a competitive necessity. A successful platform must balance openness with control, ensuring that every data exchange respects privacy, governance, and commercial boundaries.

Hybrid cloud architectures play a critical role in this evolution. They enable Telcos to process, govern, and share real-time data across on-premises infrastructure, edge environments, and public clouds seamlessly. By combining the flexibility of cloud-native services with the security and performance of on-premises systems, hybrid cloud ensures that data remains accessible, scalable, cost-efficient and compliant wherever it is needed.

Hybrid 5G Telco Architecture with Data Streaming with AWS Cloud and Confluent Edge and Cloud

By deploying scalable data streaming solutions across a hybrid cloud environment, Telcos enable secure, real-time data sharing with MVNOs, IoT platforms, enterprise customers, and internal business units. This empowers critical use cases such as dynamic quality of service monitoring, real-time fraud detection, customer behavior analytics, predictive maintenance for connected devices, and SLA compliance reporting — all without compromising performance or regulatory requirements.

The future of telecommunications belongs to those who implement real-time data streaming and controlled data sharing — turning raw events into strategic advantage faster, more securely, and more effectively than ever before.

How do you share data in your organization? Do you already leverage data streaming or still operate in batch mode? Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

The post Real-Time Data Sharing in the Telco Industry for MVNO Growth and Beyond with Data Streaming appeared first on Kai Waehner.

]]>
Fraud Detection in Mobility Services (Ride-Hailing, Food Delivery) with Data Streaming using Apache Kafka and Flink https://www.kai-waehner.de/blog/2025/04/28/fraud-detection-in-mobility-services-ride-hailing-food-delivery-with-data-streaming-using-apache-kafka-and-flink/ Mon, 28 Apr 2025 06:29:25 +0000 https://www.kai-waehner.de/?p=7516 Mobility services like Uber, Grab, and FREE NOW (Lyft) rely on real-time data to power seamless trips, deliveries, and payments. But this real-time nature also opens the door to sophisticated fraud schemes—ranging from GPS spoofing to payment abuse and fake accounts. Traditional fraud detection methods fall short in speed and adaptability. By using Apache Kafka and Apache Flink, leading mobility platforms now detect and block fraud as it happens, protecting their revenue, users, and trust. This blog explores how real-time data streaming is transforming fraud prevention across the mobility industry.

The post Fraud Detection in Mobility Services (Ride-Hailing, Food Delivery) with Data Streaming using Apache Kafka and Flink appeared first on Kai Waehner.

]]>
Mobility services like Uber, Grab, FREE NOW (Lyft), and DoorDash are built on real-time data. Every trip, delivery, and payment relies on accurate, instant decision-making. But as these services scale, they become prime targets for sophisticated fraud—GPS spoofing, fake accounts, payment abuse, and more. Traditional, batch-based fraud detection can’t keep up. It reacts too late, misses complex patterns, and creates blind spots that fraudsters exploit. To stop fraud before it happens, mobility platforms need data streaming technologies like Apache Kafka and Apache Flink for fraud detection. This blog explores how leading platforms are using real-time event processing to detect and block fraud as it happens—protecting revenue, user trust, and platform integrity at scale.

Fraud Prevention in Mobility Services with Data Streaming using Apache Kafka and Flink with AI Machine Learning

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

The Business of Mobility Services (Ride-Hailing, Food Delivery, Taxi Aggregators, etc.)

Mobility services have become an essential part of modern urban life. They offer convenience and efficiency through ride-hailing, food delivery, car-sharing, e-scooters, taxi aggregators, and micro-mobility options. Companies such as Uber, Lyft, FREE NOW (formerly MyTaxi; recently acquired by Lyft), Grab, Careem, and DoorDash connect millions of passengers, drivers, restaurants, retailers, and logistics partners to enable seamless transactions through digital platforms.

Taxis and Delivery Services in a Modern Smart City

These platforms operate in highly dynamic environments where real-time data is crucial for pricing, route optimization, customer experience, and fraud detection. However, this very nature of mobility services also makes them prime targets for fraudulent activities. Fraud in this sector can lead to financial losses, reputational damage, and deteriorating customer trust.

To effectively combat fraud, mobility services must rely on real-time data streaming with technologies such as Apache Kafka and Apache Flink. These technologies enable continuous event processing and allow platforms to detect and prevent fraud before transactions are finalized.

Why Fraud is a Major Challenge in Mobility Services

Fraudsters continually exploit weaknesses in digital mobility platforms. Some of the most common fraud types include:

  1. Fake Rides and GPS Spoofing: Drivers manipulate GPS data to simulate trips that never occurred. Passengers use location spoofing to receive cheaper fares or exploit promotions.
  2. Payment Fraud and Stolen Credit Cards: Fraudsters use stolen payment methods to book rides or order food.
  3. Fake Drivers and Passengers: Fraudsters create multiple accounts and pretend to be both the driver and passenger to collect incentives. Some drivers manipulate fares by manually adjusting distances in their favor.
  4. Promo Abuse: Users create multiple fake accounts to exploit referral bonuses and promo discounts.
  5. Account Takeovers and Identity Fraud: Hackers gain access to legitimate accounts, misusing stored payment information. Fraudsters use fake identities to bypass security measures.

Fraud not only impacts revenue but also creates risks for legitimate users and drivers. Without proper fraud prevention measures, ride-hailing and delivery companies could face serious losses, both financially and operationally.

The Unseen Enemy: Core Challenges in Mobility Fraud Detection

Traditional fraud detection relies on batch processing and manual rule-based systems. However, these approaches are no longer effective: modern mobile apps deliver real-time experiences, and fraud schemes have become too fast and too complex for delayed, static rules.

Payment Fraud – The Hidden Enemy in a Digital World

Key challenges in mobility fraud detection include:

  • Fraud occurs in real-time, requiring instant detection and prevention before transactions are completed.
  • Millions of events per second must be processed, requiring scalable and efficient systems.
  • Fraud patterns constantly evolve, making static rule-based approaches ineffective.
  • Platforms operate across hybrid and multi-cloud environments, requiring seamless integration of fraud detection systems.

To overcome these challenges, real-time streaming analytics powered by Apache Kafka and Apache Flink provide an effective solution.

Event-driven Architecture for Mobility Services with Data Streaming using Apache Kafka and Flink

Apache Kafka: The Backbone of Event-Driven Fraud Detection

Kafka serves as the core event streaming platform. It captures and processes real-time data from multiple sources such as:

  • GPS location data
  • Payment transactions
  • User and driver behavior analytics
  • Device fingerprints and network metadata

Kafka provides:

  • High-throughput data streaming, capable of processing millions of events per second to support real-time decision-making.
  • An event-driven architecture that enables decoupled, flexible systems—ideal for scalable and maintainable mobility platforms.
  • Seamless scalability across hybrid and multi-cloud environments to meet growing demand and regional expansion.
  • Always-on reliability, ensuring 24/7 data availability and consistency for mission-critical services such as fraud detection, pricing, and trip orchestration.
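
As a minimal sketch of the ingestion side, the snippet below publishes a GPS ping from a driver app into Kafka with the confluent-kafka Python client. The topic name and payload fields are illustrative assumptions.

```python
import json
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder cluster

def delivery(err, msg):
    # Log failed deliveries instead of silently dropping events.
    if err is not None:
        print(f"Delivery failed for {msg.key()}: {err}")

# Illustrative GPS ping from a driver app; keyed by trip for ordered processing.
ping = {
    "trip_id": "trip-8472",
    "driver_id": "driver-1138",
    "lat": 52.5200,
    "lon": 13.4050,
    "speed_kmh": 42.0,
    "ts": int(time.time() * 1000),
}

producer.produce(
    "mobility.gps.pings",
    key=ping["trip_id"],
    value=json.dumps(ping),
    callback=delivery,
)
producer.flush()
```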

An excellent success story about the transition to data streaming comes from DoorDash: Why DoorDash migrated from Cloud-native Amazon SQS and Kinesis to Apache Kafka and Flink.

Apache Flink enables real-time fraud detection through advanced event correlation and applied AI:

  • Detects anomalies in GPS data, such as sudden jumps, route manipulation, or unrealistic movement patterns.
  • Analyzes historical user behavior to surface signs of account takeovers or other forms of identity misuse.
  • Joins multiple real-time streams—including payment events, location updates, and account interactions—to generate accurate, low-latency fraud scores.
  • Applies machine learning models in-stream, enabling the system to flag and stop suspicious transactions before they are processed.
  • Continuously adapts to new fraud patterns, updating models with fresh data in near real-time to reflect evolving user behavior and emerging threats.

With Kafka and Flink, fraud detection can shift from reactive to proactive to stop fraudulent transactions before they are completed.
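
A deliberately simplified Python sketch of the GPS anomaly idea: compare consecutive pings per trip and flag physically implausible jumps. Real deployments implement this kind of logic in Flink with keyed state and ML-based scoring; the threshold, topics, and fields here are assumptions.

```python
import json
import math
from confluent_kafka import Consumer, Producer

conf = {"bootstrap.servers": "localhost:9092"}  # placeholder cluster
consumer = Consumer({**conf, "group.id": "gps-anomaly", "auto.offset.reset": "latest"})
producer = Producer(conf)
consumer.subscribe(["mobility.gps.pings"])

last_ping = {}  # trip_id -> previous ping (in Flink this would be keyed state)

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two GPS points in kilometers.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    ping = json.loads(msg.value())
    prev = last_ping.get(ping["trip_id"])
    last_ping[ping["trip_id"]] = ping
    if prev is None:
        continue

    # Implied speed between consecutive pings (timestamps in milliseconds).
    hours = max((ping["ts"] - prev["ts"]) / 3_600_000, 1e-6)
    kmh = haversine_km(prev["lat"], prev["lon"], ping["lat"], ping["lon"]) / hours

    if kmh > 250:  # implausible for urban ride-hailing (assumed threshold)
        alert = {"trip_id": ping["trip_id"], "implied_speed_kmh": round(kmh, 1)}
        producer.produce("mobility.fraud.alerts", key=ping["trip_id"], value=json.dumps(alert))
        producer.poll(0)
```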

I already covered various data streaming success stories from financial services companies such as PayPal, Capital One, and ING Bank in a dedicated blog post. There is also a separate case study about “Fraud Prevention in Under 60 Seconds with Apache Kafka: How A Bank in Thailand is Leading the Charge“.

Real-World Fraud Prevention Stories from Mobility Leaders

Fraud is not just a technical issue—it’s a business-critical challenge that impacts trust, revenue, and operational stability in mobility services. The following real-world examples from industry leaders like FREE NOW (Lyft), Grab, and Uber show how data streaming with advanced stream processing and AI are used around the world to detect and stop fraud in real time, at massive scale.

FREE NOW (Lyft): Detecting Fraudulent Trips in Real Time by Analyzing GPS Data of Cars

FREE NOW operates in more than 150 cities across Europe with 48 million users. It integrates multiple mobility services, including taxis, private vehicles, car-sharing, e-scooters, and bikes.

The company was recently acquired by Lyft, the U.S.-based ride-hailing giant known for its focus on multimodal urban transport and strong presence in North America. This acquisition marks Lyft’s strategic entry into the European mobility ecosystem, expanding its footprint beyond the U.S. and Canada.

FREE NOW - former MyTaxi - Company Overview
Source: FREE NOW

Fraud Prevention Approach leveraging Data Streaming (presented at Kafka Summit)

  • Uses Kafka Streams and Kafka Connect to analyze GPS trip data in real-time.
  • Deploys fraud detection models that identify anomalies in trip routes and fare calculations.
  • Operates data streaming on fully managed Confluent Cloud and applications on Kubernetes for scalable fraud detection.
Fraud Prevention in Mobility Services with Data Streaming using Kafka Streams and Connect at FREE NOW
Source: FREE NOW

Example: Detecting Fake Rides

  1. A driver inputs trip details into the app.
  2. Kafka Streams predicts expected trip fare based on distance and duration.
  3. GPS anomalies and unexpected route changes are flagged.
  4. Fraud alerts are triggered for suspicious transactions.

By implementing real-time fraud detection with Kafka and Flink, FREE NOW (Lyft) has significantly reduced fraudulent trips and improved platform security.

Grab: AI-Powered Fraud Detection for Ride-Hailing and Delivery with Data Streaming and AI/ML

Grab is a leading mobility platform in Southeast Asia, handling millions of transactions daily. Fraud accounts for 1.6 percent of total revenue loss in the region.

To address these significant fraud numbers, Grab developed GrabDefence—an AI-powered fraud detection engine that leverages real-time data and machine learning to detect and block suspicious activity across its platform.

Fraud Detection and Presentation with Kafka and AI ML at Grab in Asia
Source: Grab

Fraud Detection Approach

  • Uses Kafka Streams and machine learning for fraud risk scoring.
  • Leverages Flink for feature aggregation and anomaly detection.
  • Detects fraudulent transactions before they are completed.
GrabDefence - Fraud Prevention with Data Streaming and AI / Machine Learning in Grab Mobility Service
Source: Grab

Example: Fake Driver and Passenger Fraud

  1. Fraudsters create accounts as both driver and passenger to claim rewards.
  2. Kafka ingests device fingerprints, payment transactions, and ride data.
  3. Flink aggregates historical fraud behavior and assigns risk scores.
  4. High-risk transactions are blocked instantly.

With GrabDefence built with data streaming, Grab reduced fraud rates to 0.2 percent, well below the industry average. Learn more about GrabDefence in the Kafka Summit talk.
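
To illustrate one such risk signal in isolation, the hedged PyFlink sketch below counts distinct accounts seen per device fingerprint within one-hour windows and surfaces suspicious devices. GrabDefence's real models are far richer; the table layout, names, and threshold are assumptions.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Device/account events ingested via Kafka (assumed layout).
t_env.execute_sql("""
    CREATE TABLE device_events (
        device_fingerprint STRING,
        account_id         STRING,
        event_time         TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'mobility.device.events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Many accounts on one device within an hour is a classic collusion signal.
result = t_env.execute_sql("""
    SELECT device_fingerprint,
           TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start,
           COUNT(DISTINCT account_id) AS accounts_seen
    FROM device_events
    GROUP BY device_fingerprint, TUMBLE(event_time, INTERVAL '1' HOUR)
    HAVING COUNT(DISTINCT account_id) > 3
""")
result.print()
```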

Uber: Project RADAR – AI-Powered Fraud Detection with Human Oversight

Uber processes millions of payments per second globally. Fraud detection is complex due to chargebacks and uncollected payments.

To combat this, Uber launched Project RADAR—a hybrid system that combines machine learning with human reviewers to continuously detect, investigate, and adapt to evolving fraud patterns in near real time. Low latency is not required in this scenario, and humans are in the loop of the business process. Hence, Apache Spark is sufficient for Uber.

Uber Project Radar for Scam Detection with Humans in the Loop
Source: Uber

Fraud Prevention Approach

  • Uses Kafka and Spark for multi-layered fraud detection.
  • Implements machine learning models to detect chargeback fraud.
  • Incorporates human analysts for rule validation.
Uber Project RADAR with Apache Kafka and Spark for Scam Detection with AI and Machine Learning
Source: Uber

Example: Chargeback Fraud Detection

  1. Kafka collects all ride transactions in real time.
  2. Stream processing detects anomalies in payment patterns and disputes.
  3. AI-based fraud scoring identifies high-risk transactions.
  4. Uber’s RADAR system allows human analysts to validate fraud alerts.

Uber’s combination of AI-driven detection and human oversight has significantly reduced chargeback-related fraud.

Fraud in mobility services is a real-time challenge that requires real-time solutions that work 24/7, even at extreme scale for millions of events. Traditional batch processing systems are too slow, and static rule-based approaches cannot keep up with evolving fraud tactics.

By leveraging data streaming with Apache Kafka in conjunction with Kafka Streams or Apache Flink, mobility platforms can:

  • Process millions of events per second to detect fraud in real time.
  • Prevent fraudulent transactions before they occur.
  • Use AI-driven real-time fraud scoring for accurate risk assessment.
  • Adapt dynamically through continuous learning to evolving fraud patterns.

Mobility platforms such as Uber, Grab, and FREE NOW (Lyft) are leading the way in using real-time streaming analytics to protect their platforms from fraud. By implementing similar approaches, other mobility businesses can enhance security, reduce financial losses, and maintain customer trust.

Real-time fraud prevention in mobility services is not an option; it is a necessity. The ability to detect and stop fraud in real time will define the future success of ride-hailing, food delivery, and urban mobility platforms.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And download my free book about data streaming use cases.

The post Fraud Detection in Mobility Services (Ride-Hailing, Food Delivery) with Data Streaming using Apache Kafka and Flink appeared first on Kai Waehner.

]]>
Apache Kafka 4.0: The Business Case for Scaling Data Streaming Enterprise-Wide https://www.kai-waehner.de/blog/2025/04/19/apache-kafka-4-0-the-business-case-for-scaling-data-streaming-enterprise-wide/ Sat, 19 Apr 2025 13:32:55 +0000 https://www.kai-waehner.de/?p=7723 Apache Kafka 4.0 represents a major milestone in the evolution of real-time data infrastructure. Used by over 150,000 organizations worldwide, Kafka has become the de facto standard for data streaming across industries. This article focuses on the business value of Kafka 4.0, highlighting how it enables operational efficiency, faster time-to-market, and architectural flexibility across cloud, on-premise, and edge environments. Rather than detailing technical improvements, it explores Kafka’s strategic role in modern data platforms, the growing data streaming ecosystem, and how enterprises can turn event-driven architecture into competitive advantage. Kafka is no longer just infrastructure—it’s a foundation for digital business

The post Apache Kafka 4.0: The Business Case for Scaling Data Streaming Enterprise-Wide appeared first on Kai Waehner.

]]>
Apache Kafka 4.0 is more than a version bump. It marks a pivotal moment in how modern organizations build, operate, and scale their data infrastructure. While developers and architects may celebrate feature-level improvements, the true value of this release is what it enables at the business level: operational excellence, faster time-to-market, and competitive agility powered by data in motion. Kafka 4.0 represents a maturity milestone in the evolution of the event-driven enterprise.

The Business Case for Data Streaming at Enterprise Scale

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases and business value, including customer stories across all industries.

From Event Hype to Event Infrastructure

Over the last decade, Apache Kafka has evolved from a scalable log for engineers at LinkedIn to the de facto event streaming platform adopted across every industry. Banks, automakers, telcos, logistics firms, and retailers alike rely on Kafka as the nervous system for critical data.

Event-driven Architecture for Data Streaming

Today, over 150,000 organizations globally use Apache Kafka to enable real-time operations, modernize legacy systems, and support digital innovation. Kafka 4.0 moves even deeper into this role as a business-critical backbone. If you want to learn more about use case and industry success stories, download my free ebook and subscribe to my newsletter.

Version 4.0 of Apache Kafka signals readiness for CIOs, CTOs, and enterprise architects who demand:

  • Uninterrupted uptime and failover for global operations
  • Data-driven automation and decision-making at scale
  • Flexible deployment across on-premises, cloud, and edge environments
  • A future-proof foundation for modernization and innovation

Apache Kafka 4.0 doesn’t just scale throughput—it scales business outcomes:

Use Cases for Data Streaming with Apache Kafka by Business Value
Source: Lyndon Hedderly (Confluent)

This post does not cover the technical improvements and new features of the 4.0 release, like ZooKeeper removal, Queues for Kafka, and so on. Those are well-documented elsewhere. Instead, it highlights the strategic business value Kafka 4.0 delivers to modern enterprises.

Kafka 4.0: A Platform Built for Growth

Today’s IT leaders are not just looking at throughput and latency. They are investing in platforms that align with long-term architectural goals and unlock value across the organization.

Apache Kafka 4.0 offers four core advantages for business growth:

1. Open De Facto Standard for Data Streaming

Apache Kafka is the open, vendor-neutral protocol that has become the de facto standard for data streaming across industries. Its wide adoption and strong community ecosystem make it both a reliable choice and a flexible one.

Organizations can choose between open-source Kafka distributions, managed services like Confluent Cloud, or even build their own custom engines using Kafka’s open protocol. This openness enables strategic independence and long-term adaptability—critical factors for any enterprise architect planning a future-proof data infrastructure.

2. Operational Efficiency at Enterprise Scale

Reliability, resilience, and ease of operation are key to any business infrastructure. Kafka 4.0 reduces operational complexity and increases uptime through a simplified architecture. Key components of the platform have been re-engineered to streamline deployment and reduce points of failure, minimizing the effort required to keep systems running smoothly.

Kafka is now easier to manage, scale, and secure—whether deployed in the cloud, on-premises, or at the edge in environments like factories or retail locations. It reduces the need for lengthy maintenance windows, accelerates troubleshooting, and makes system upgrades far less disruptive. As a result, teams can operate with greater efficiency, allowing leaner teams to support larger, more complex workloads with greater confidence and stability.

Storage management has also evolved in the past releases by decoupling compute and storage. This optimization allows organizations to retain large volumes of event data cost-effectively without compromising performance. This extends Kafka’s role from a real-time pipeline to a durable system of record that supports both immediate and long-term data needs.

With fewer manual interventions, less custom integration, and more built-in intelligence, Kafka 4.0 allows engineering teams to focus on delivering new services and capabilities—rather than maintaining infrastructure. This operational maturity translates directly into faster time-to-value and lower total cost of ownership at enterprise scale.

3. Innovation Enablement Through Real-Time Data

Real-time data unlocks entirely new business models: predictive maintenance in manufacturing, personalized digital experiences in retail, and fraud detection in financial services. Kafka 4.0 empowers teams to build applications around streams of events, driving automation and responsiveness across the value chain.

This shift is not just technical—it’s organizational. Kafka decouples producers and consumers of data, enabling individual teams to innovate independently without being held back by rigid system dependencies or central coordination. Whether building with Java, Python, Go, or integrating with SaaS platforms and cloud-native services, teams can choose the tools and technologies that best fit their goals.

This architectural flexibility accelerates development cycles and reduces cross-team friction. As a result, new features and services reach the market faster, experimentation is easier, and the overall organization becomes more agile in responding to customer needs and competitive pressures. Kafka 4.0 turns real-time architecture into a strategic asset for business acceleration.

4. Cloud-Native Flexibility

Kafka 4.0 reinforces Kafka’s role as the backbone of hybrid and multi-cloud strategies. In a data streaming landscape that spans public cloud, private infrastructure, and on-premise environments, Kafka provides the consistency, portability, and control that modern organizations require.

Whether deployed in AWS, Azure, GCP, or edge locations like factories or retail stores, Kafka delivers uniform performance, API compatibility, and integration capabilities. This ensures operational continuity across regions, satisfies data sovereignty and regulatory needs, and reduces latency by keeping data processing close to where it’s generated.

Beyond Kafka brokers, it is the Kafka protocol itself that has become the standard for real-time data streaming—adopted by vendors, platforms, and developers alike. This protocol standardization gives organizations the freedom to integrate with a growing ecosystem of tools, services, and managed offerings that speak Kafka natively, regardless of the underlying engine.

For instance, innovative data streaming platforms built using the Kafka protocol, such as WarpStream, provide a Bring Your Own Cloud (BYOC) model to allow organizations to maintain full control over their data and infrastructure while still benefiting from managed services and platform automation. This flexibility is especially valuable in regulated industries and globally distributed enterprises, where cloud neutrality and deployment independence are strategic priorities.

Kafka 4.0 not only supports cloud-native operations—it strengthens the organization’s ability to evolve, modernize, and scale without vendor lock-in or architectural compromise.

Real-Time as a Business Imperative

Data is no longer static. It is dynamic, fast-moving, and continuous. Businesses that treat data as something to collect and analyze later will fall behind. Kafka enables a shift from data at rest to data in motion.

Kafka 4.0 supports this transformation across all industries. For instance:

  • Automotive: Streaming data from factories, fleets, and connected vehicles
  • Banking: Real-time fraud detection and transaction analytics
  • Telecom: Customer engagement, network monitoring, and monetization
  • Healthcare: Monitoring devices, alerts, and compliance tracking
  • Retail: Dynamic pricing, inventory tracking, and personalized offers

These use cases cannot be solved by daily batch jobs. Kafka 4.0 enables systems—and decision-making—to operate at business speed. “The Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming)” explores this in more detail.

Additionally, Apache Kafka ensures data consistency across real-time streams, batch processes, and request-response APIs—because not all workloads are real-time, and that’s okay.

The Kafka Ecosystem and the Data Streaming Landscape

Running Apache Kafka at enterprise scale requires more than open-source software. Kafka has become the de facto standard for data streaming, but success with Kafka depends on using more than just the core project. Real-time applications demand capabilities like data integration, stream processing, governance, security, and 24/7 operational support.

Today, a rich and rapidly developing data streaming ecosystem has emerged. Organizations can choose from a growing number of platforms and cloud services built on or compatible with the Kafka protocol—ranging from self-managed infrastructure to Bring Your Own Cloud (BYOC) models and fully managed SaaS offerings. These solutions aim to simplify operations, accelerate time-to-market, and reduce risk while maintaining the flexibility and openness that Kafka is known for.

Confluent leads this category as the most complete data streaming platform, but it is part of a broader ecosystem that includes vendors like Amazon MSK, Cloudera, Azure Event Hubs, and emerging players in cloud-native and BYOC deployments. The data streaming landscape explores all the different vendors in this software category:

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

The market is moving toward complete data streaming platforms (DSP)—offering end-to-end capabilities from ingestion to stream processing and governance. Choosing the right solution means evaluating not only performance and compatibility but also how well the platform aligns with your business strategy, security requirements, and deployment preferences.

Kafka is at the center—but the future of data streaming belongs to platforms that turn Kafka 4.0’s architecture into real business value.

The Road Ahead with Apache Kafka 4.0 and Beyond

Apache Kafka 4.0 is a strategic enabler responsible for driving modernization, innovation, and resilience. It directly supports the key transformation goals:

  • Modernization without disruption: Kafka integrates seamlessly with legacy systems and provides a bridge to cloud-native, event-driven architectures.
  • Platform standardization: Kafka becomes a central nervous system across departments and business units, reducing fragmentation and enabling shared services.
  • Faster ROI from digital initiatives: Kafka accelerates the launch and evolution of digital services, helping teams iterate and deliver measurable value quickly.

Kafka 4.0 reduces operational complexity, unlocks developer productivity, and allows organizations to respond in real time to both opportunities and risks. This release marks a significant milestone in the evolution of real-time business architecture.

Kafka is no longer an emerging technology—it is a reliable foundation for companies that treat data as a continuous, strategic asset. Data streaming is now as foundational as databases and APIs. With Kafka 4.0, organizations can build connected products, automate operations, and reinvent the customer experience easier than ever before.

And with innovations on the horizon—such as built-in queueing capabilities, brokerless writes directly to object storage, and expanded transactional guarantees supporting the two-phase commit protocol (2PC)—Kafka continues to push the boundaries of what’s possible in real-time, event-driven architecture.

The future of digital business is real-time. Apache Kafka 4.0 is ready.

Want to learn more about Kafka in the enterprise? Let’s connect and exchange ideas. Subscribe to the Data Streaming Newsletter. Explore the Kafka Use Case Book for real-world stories from industry leaders.

The post Apache Kafka 4.0: The Business Case for Scaling Data Streaming Enterprise-Wide appeared first on Kai Waehner.

]]>
How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time https://www.kai-waehner.de/blog/2025/04/14/how-apache-kafka-and-flink-power-event-driven-agentic-ai-in-real-time/ Mon, 14 Apr 2025 09:09:10 +0000 https://www.kai-waehner.de/?p=7265 Agentic AI marks a major evolution in artificial intelligence—shifting from passive analytics to autonomous, goal-driven systems capable of planning and executing complex tasks in real time. To function effectively, these intelligent agents require immediate access to consistent, trustworthy data. Traditional batch processing architectures fall short of this need, introducing delays, data staleness, and rigid workflows. This blog post explores why event-driven architecture (EDA)—powered by Apache Kafka and Apache Flink—is essential for building scalable, reliable, and adaptive AI systems. It introduces key concepts such as Model Context Protocol (MCP) and Google’s Agent-to-Agent (A2A) protocol, which are redefining interoperability and context management in multi-agent environments. Real-world use cases from finance, healthcare, manufacturing, and more illustrate how Kafka and Flink provide the real-time backbone needed for production-grade Agentic AI. The post also highlights why popular frameworks like LangChain and LlamaIndex must be complemented by robust streaming infrastructure to support stateful, event-driven AI at scale.

The post How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time appeared first on Kai Waehner.

]]>
Artificial Intelligence is evolving beyond passive analytics and reactive automation. Agentic AI represents a new wave of autonomous, goal-driven AI systems that can think, plan, and execute complex workflows without human intervention. However, for these AI agents to be effective, they must operate on real-time, consistent, and trustworthy data—a challenge that traditional batch processing architectures simply cannot meet. This is where Data Streaming with Apache Kafka and Apache Flink, coupled with an event-driven architecture (EDA), form the backbone of Agentic AI. By enabling real-time and continuous decision-making, EDA ensures that AI systems can act instantly and reliably in dynamic, high-speed environments. Emerging standards like the Model Context Protocol (MCP) and Google’s Agent-to-Agent (A2A) protocol are now complementing this foundation, providing structured, interoperable layers for managing context and coordination across intelligent agents—making AI not just event-driven, but also context-aware and collaborative.

Event-Driven Agentic AI with Data Streaming using Apache Kafka and Flink

In this post, I will explore:

  • How Agentic AI works and why it needs real-time data
  • Why event-driven architectures are the best choice for AI automation
  • Key use cases across industries
  • How Kafka and Flink provide the necessary data consistency and real-time intelligence for AI-driven decision-making
  • The role of MCP, A2A, and frameworks like LangChain and LlamaIndex in enabling scalable, context-aware, and collaborative AI systems

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

What is Agentic AI?

Agentic AI refers to AI systems that exhibit autonomous, goal-driven decision-making and execution. Unlike traditional automation tools that follow rigid workflows, Agentic AI can:

  • Understand and interpret natural language instructions
  • Set objectives, create strategies, and prioritize actions
  • Adapt to changing conditions and make real-time decisions
  • Execute multi-step tasks with minimal human supervision
  • Integrate with multiple operational and analytical systems and data sources to complete workflows

Here is an example AI Agent dependency graph from Sean Falconer’s article “Event-Driven AI: Building a Research Assistant with Kafka and Flink”:

Example AI Agent Dependency Graph
Source: Sean Falconer

Instead of merely analyzing data, Agentic AI acts on data, making it invaluable for operational and transactional use cases—far beyond traditional analytics.

However, without real-time, high-integrity data, these systems cannot function effectively. If AI is working with stale, incomplete, or inconsistent information, its decisions become unreliable and even counterproductive. This is where Kafka, Flink, and event-driven architectures become indispensable.

Why Batch Processing Fails for Agentic AI

Traditional AI and analytics systems have relied heavily on batch processing, where data is collected, stored, and processed in predefined intervals. This approach may work for generating historical reports or training machine learning models offline, but it completely breaks down when applied to operational and transactional AI use cases—which are at the core of Agentic AI.

Why Batch Processing Fails for Agentic AI

I recently explored the Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming). Here is why batch processing is fundamentally incompatible with Agentic AI, and the real-world challenges this mismatch creates:

1. Delayed Decision-Making Slows AI Reactions

Agentic AI systems are designed to autonomously respond to real-time changes in the environment, whether it’s optimizing a telecommunications network, detecting fraud in banking, or dynamically adjusting supply chains.

In a batch-driven system, data is processed hours or even days later, making AI responses obsolete before they even reach the decision-making phase. For example:

  • Fraud detection: If a bank processes transactions in nightly batches, fraudulent activities may go unnoticed for hours, leading to financial losses.
  • E-commerce recommendations: If a retailer updates product recommendations only once per day, it fails to capture real-time shifts in customer behavior.
  • Network optimization: If a telecom company analyzes network traffic in batch mode, it cannot prevent congestion or outages before they affect users.

Agentic AI requires instantaneous decision-making based on streaming data, not delayed insights from batch reports.

2. Data Staleness Creates Inaccurate AI Decisions

AI agents must act on fresh, real-world data, but batch processing inherently means working with outdated information. If an AI agent is making decisions based on yesterday’s or last hour’s data, those decisions are no longer reliable.

Consider a self-healing IT infrastructure that uses AI to detect and mitigate outages. If logs and system metrics are processed in batch mode, the AI agent will be acting on old incident reports, missing live system failures that need immediate attention.

In contrast, an event-driven system powered by Kafka and Flink ensures that AI agents receive live system logs as they occur, allowing for proactive self-healing before customers are impacted.
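
To make this concrete, the following minimal sketch shows what such an event-driven agent could look like: a plain Java Kafka consumer subscribes to a live log stream and reacts to error events the moment they arrive. The topic name, consumer group, and the triggerRemediation() placeholder are illustrative assumptions, not a reference implementation.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class SelfHealingAgent {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "self-healing-agent");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("system-logs")); // hypothetical topic carrying live infrastructure logs
            while (true) {
                // Events arrive as they happen - no nightly batch, no stale incident reports.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value().contains("ERROR")) {
                        triggerRemediation(record.value()); // placeholder for the agent's self-healing action
                    }
                }
            }
        }
    }

    private static void triggerRemediation(String logLine) {
        // In a real system this would call an automation API or publish a remediation command event.
        System.out.println("Remediation triggered for: " + logLine);
    }
}
```

Because Kafka tracks offsets per consumer group, additional agent instances can join the same group to scale out processing without any change to the code.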

3. High Latency Kills Operational AI

In industries like finance, healthcare, and manufacturing, even a few seconds of delay can lead to severe consequences. Batch processing introduces significant latency, making real-time automation impossible.

For example:

  • Healthcare monitoring: A real-time AI system should detect abnormal heart rates from a patient’s wearable device and alert doctors immediately. If health data is only processed in hourly batches, a critical deterioration could be missed, leading to life-threatening situations.
  • Automated trading in finance: AI-driven trading systems must respond to market fluctuations within milliseconds. Batch-based analysis would mean losing high-value trading opportunities to faster competitors.

Agentic AI must operate on a live data stream, where every event is processed instantly, allowing decisions to be made in real-time, not retrospectively.

4. Rigid Workflows Increase Complexity and Costs

Batch processing forces businesses to predefine rigid workflows that do not adapt well to changing conditions. In a batch-driven world:

  • Data must be manually scheduled for ingestion.
  • Systems must wait for the entire dataset to be processed before making decisions.
  • Business logic is hard-coded, requiring expensive engineering effort to update workflows.

Agentic AI, on the other hand, is designed for continuous, adaptive decision-making. By leveraging an event-driven architecture, AI agents listen to streams of real-time data, dynamically adjusting workflows on the fly instead of relying on predefined batch jobs.

This flexibility is especially critical in industries with rapidly changing conditions, such as supply chain logistics, cybersecurity, and IoT-based smart cities.

5. Batch Processing Cannot Support Continuous Learning

A key advantage of Agentic AI is its ability to learn from past experiences and self-improve over time. However, this is only possible if AI models are continuously updated with real-time feedback loops.

Batch-driven architectures limit AI’s ability to learn because:

  • Models are retrained infrequently, leading to outdated insights.
  • Feedback loops are slow, preventing AI from adjusting strategies in real time.
  • Drift in data patterns is not immediately detected, causing AI performance degradation.

For instance, in customer service chatbots, an AI-powered agent should adapt to customer sentiment in real time. If a chatbot is trained on stale customer interactions from last month, it won’t understand emerging trends or newly common issues.

By contrast, a real-time data streaming architecture ensures that AI agents continuously receive live customer interactions, retrain in real time, and evolve dynamically.

Agentic AI Requires an Event-Driven Architecture

Agentic AI must act in real time and integrate operational and analytical information. Whether it’s an AI-driven fraud detection system, an autonomous network optimization agent, or a customer service chatbot, acting on outdated information is not an option.

The Event-Driven Approach

An Event-Driven Architecture (EDA) enables continuous processing of real-time data streams, ensuring that AI agents always have the latest information available. By decoupling applications and processing events asynchronously, EDA allows AI to respond dynamically to changes in the environment without being constrained by rigid workflows.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

AI can also be seamlessly integrated into existing business processes leveraging an EDA, bridging modern and legacy technologies without requiring a complete system overhaul. Not every data source may be real-time, but EDA ensures data consistency across all consumers—if an application processes data, it sees exactly what every other application sees. This guarantees synchronized decision-making, even in hybrid environments combining historical data with real-time event streams.

Why Apache Kafka is Essential for Agentic AI

For AI to be truly autonomous and effective, it must operate in real time, adapt to changing conditions, and ensure consistency across all applications. An Event-Driven Architecture (EDA) built with Apache Kafka provides the foundation for this by enabling:

  • Immediate Responsiveness → AI agents receive and act on events as they occur.
  • High Scalability → Components are decoupled and can scale independently.
  • Fault Tolerance → AI processes continue running even if some services fail.
  • Improved Data Consistency → Ensures AI agents are working with accurate, real-time data.

To build truly autonomous AI systems, organizations need a real-time data infrastructure that can process, analyze, and act on events as they happen.

Building Event-Driven Multi-Agents with Data Streaming using Apache Kafka and Flink
Source: Sean Falconer

Apache Kafka: The Real-Time Data Streaming Backbone

Apache Kafka provides a scalable, event-driven messaging infrastructure that ensures AI agents receive a constant, real-time stream of events. By acting as a central nervous system, Kafka enables:

  • Decoupled AI components that communicate through event streams.
  • Efficient data ingestion from multiple sources (IoT devices, applications, databases).
  • Guaranteed event delivery with fault tolerance and durability.
  • High-throughput processing to support real-time AI workloads.
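
As a complementary sketch of the ingestion side, this snippet publishes a single IoT sensor reading to Kafka with idempotence and acks=all enabled, so retries during broker failures do not create duplicates. The topic name, key, and JSON payload are made up for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SensorEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas (durability)
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // retries do not produce duplicates

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by device ID so all events from one device stay ordered within a partition.
            ProducerRecord<String, String> event =
                new ProducerRecord<>("iot-sensor-events", "device-42", "{\"temperature\": 78.3}");
            producer.send(event, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Delivered to %s-%d@%d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```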

Apache Flink complements Kafka by providing stateful stream processing for AI-driven workflows. With Flink, AI agents can:

  • Analyze real-time data streams for anomaly detection, predictions, and decision-making.
  • Perform complex event processing to detect patterns and trigger automated responses.
  • Continuously learn and adapt based on evolving real-time data.
  • Orchestrate multi-agent workflows dynamically.
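
To illustrate the kind of stateful logic Flink layers on top of Kafka, here is a small DataStream sketch that keys card transactions by card ID, counts them in one-minute windows, and flags unusually high velocity, a classic anomaly-detection pattern for fraud scenarios. The topic names, the simple "cardId,amount" record format, and the threshold of ten transactions per minute are assumptions chosen for brevity.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TransactionVelocityJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka as the real-time backbone: the job consumes events as they are produced.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("card-transactions")            // hypothetical topic
                .setGroupId("velocity-check")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "transactions")
           // Each record is assumed to be "cardId,amount"; map it to (cardId, 1) for counting.
           .map(line -> Tuple2.of(line.split(",")[0], 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
           .sum(1)
           // More than 10 transactions per card per minute is treated as suspicious in this sketch.
           .filter(t -> t.f1 > 10)
           .map(t -> "ALERT: high transaction velocity for card " + t.f0)
           .returns(Types.STRING)
           .print();                                       // a real job would sink alerts back to Kafka

        env.execute("transaction-velocity-alerts");
    }
}
```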

Across industries, Agentic AI is redefining how businesses and governments operate. By leveraging event-driven architectures and real-time data streaming, organizations can unlock the full potential of AI-driven automation, improving efficiency, reducing costs, and delivering better experiences.

Here are key use cases across different industries:

Financial Services: Real-Time Fraud Detection and Risk Management

Traditional fraud detection systems rely on batch processing, leading to delayed responses and financial losses.

Agentic AI enables real-time transaction monitoring, detecting anomalies as they occur and blocking fraudulent activities instantly.

AI agents continuously learn from evolving fraud patterns, reducing false positives and improving security. In risk management, AI analyzes market trends, adjusts investment strategies, and automates compliance processes to ensure financial institutions stay ahead of threats and regulatory requirements.

Telecommunications: Autonomous Network Optimization

Telecom networks require constant tuning to maintain service quality, but traditional network management is reactive and expensive.

Agentic AI can proactively monitor network traffic, predict congestion, and automatically reconfigure network resources in real time. AI-powered agents optimize bandwidth allocation, detect outages before they impact customers, and enable self-healing networks, reducing operational costs and improving service reliability.

Retail: AI-Powered Personalization and Dynamic Pricing

Retailers struggle with static recommendation engines that fail to capture real-time customer intent.

Agentic AI analyzes customer interactions, adjusts recommendations dynamically, and personalizes promotions based on live purchasing behavior. AI-driven pricing strategies adapt to supply chain fluctuations, competitor pricing, and demand changes in real time, maximizing revenue while maintaining customer satisfaction.

AI agents also enhance logistics by optimizing inventory management and reducing stock shortages.

Healthcare: Real-Time Patient Monitoring and Predictive Care

Hospitals and healthcare providers require real-time insights to deliver proactive care, but batch processing delays critical decisions.

Agentic AI continuously streams patient vitals from medical devices to detect early signs of deterioration and trigger instant alerts to medical staff. AI-driven predictive analytics optimize hospital resource allocation, improve diagnosis accuracy, and enable remote patient monitoring, reducing emergency incidents and improving patient outcomes.

Gaming: Dynamic Content Generation and Adaptive AI Opponents

Modern games need to provide immersive, evolving experiences, but static game mechanics limit engagement.

Agentic AI enables real-time adaptation of gameplay to generate dynamic environments and personalize challenges based on a player’s behavior. AI-driven opponents can learn and adapt to individual playstyles, keeping games engaging over time. AI agents also manage server performance, detect cheating, and optimize in-game economies for a better gaming experience.

Manufacturing & Automotive: Smart Factories and Autonomous Systems

Manufacturing relies on precision and efficiency, yet traditional production lines struggle with downtime and defects.

Agentic AI monitors production processes in real time to detect quality issues early and adjust machine parameters autonomously. This directly improves Overall Equipment Effectiveness (OEE) by reducing downtime, minimizing defects, and optimizing machine performance to ensure higher productivity and operational efficiency.

In automotive, AI-driven agents analyze real-time sensor data from self-driving cars to make instant navigation decisions, predict maintenance needs, and optimize fleet operations for logistics companies.

Public Sector: AI-Powered Smart Cities and Citizen Services

Governments face challenges in managing infrastructure, public safety, and citizen services efficiently.

Agentic AI can optimize traffic flow by analyzing real-time data from sensors and adjusting signals dynamically. AI-powered public safety systems detect threats from surveillance data and dispatch emergency services instantly. AI-driven chatbots handle citizen inquiries, automate document processing, and improve response times for government services.

The Business Value of Real-Time AI using Autonomous Agents

By leveraging Kafka and Flink in an event-driven AI architecture, organizations can achieve:

  • Better Decision-Making → AI operates on fresh, accurate data.
  • Faster Time-to-Action → AI agents respond to events immediately.
  • Reduced Costs → Less reliance on expensive batch processing and manual intervention by humans.
  • Greater Scalability → AI systems can handle massive workloads in real time.
  • Vendor Independence → Kafka and Flink support open standards and hybrid/multi-cloud deployments, preventing vendor lock-in.

Why LangChain, LlamaIndex, and Similar Frameworks Are Not Enough for Agentic AI in Production

Frameworks like LangChain, LlamaIndex, and others have gained popularity for making it easy to prototype AI agents by chaining prompts, tools, and external APIs. They provide useful abstractions for reasoning steps, retrieval-augmented generation (RAG), and basic tool use—ideal for experimentation and lightweight applications.

However, when building agentic AI for operational, business-critical environments, these frameworks fall short on several fronts:

  • Many frameworks like LangChain are inherently synchronous and follow a request-response model, which limits their ability to handle real-time, event-driven inputs at scale. In contrast, LlamaIndex takes an event-driven approach, using a message broker—including support for Apache Kafka—for inter-agent communication.
  • Debugging, observability, and reproducibility are weak—there’s often no persistent, structured record of agent decisions or tool interactions.
  • State is ephemeral and in-memory, making long-running tasks, retries, or rollback logic difficult to implement reliably.
  • Most Agentic AI frameworks lack support for distributed, fault-tolerant execution and scalable orchestration, which are essential for production systems.

That said, frameworks like LangChain and LlamaIndex can still play a valuable, complementary role when integrated into an event-driven architecture. For example, an agent might use LangChain for planning or decision logic within a single task, while Apache Kafka and Apache Flink handle the real-time flow of events, coordination between agents, persistence, and system-level guarantees.

LangChain and similar toolkits help define how an agent thinks. But to run that thinking at scale, in real time, and with full traceability, you need a robust data streaming foundation. That’s where Kafka and Flink come in.
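
The following hypothetical sketch shows that division of labor as a consume-decide-produce loop, written here with Kafka Streams purely for brevity (the same pattern works with Flink): the streaming layer delivers task events durably and in order, while the planNextAction() placeholder stands in for whatever reasoning component the agent uses, for example LangChain or LlamaIndex behind a service call. Topic names and the payload format are assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class AgentTaskProcessor {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "agent-task-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> tasks = builder.stream("agent-tasks");     // hypothetical input topic
        tasks
            .mapValues(AgentTaskProcessor::planNextAction)                 // delegate reasoning to the framework of choice
            .to("agent-results");                                          // durable, replayable hand-off to the next agent

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder for a call into LangChain, LlamaIndex, or any other planning component.
    private static String planNextAction(String taskEvent) {
        return "{\"task\":" + taskEvent + ",\"action\":\"todo\"}";
    }
}
```

Because every task and result lives in a replayable Kafka topic, the agent's decisions gain exactly the durability, observability, and retry semantics that the frameworks alone lack.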

Model Context Protocol (MCP) and Agent-to-Agent (A2A) for Scalable, Composable Agentic AI Architectures

Model Context Protocol (MCP) is one of the hottest topics in AI right now. Coined by Anthropic, with early support emerging from OpenAI, Google, and other leading AI infrastructure providers, MCP is rapidly becoming a foundational layer for managing context in agentic systems. MCP enables systems to define, manage, and exchange structured context windows—making AI interactions consistent, portable, and state-aware across tools, sessions, and environments.

Google’s recently announced Agent-to-Agent (A2A) protocol adds further momentum to this movement, setting the groundwork for standardized interaction across autonomous agents. These advancements signal a new era of AI interoperability and composability.

Together with Kafka and Flink, MCP and protocols like A2A help bridge the gap between stateless LLM calls and stateful, event-driven agent architectures. Naturally, event-driven architecture is the perfect foundation for all this. The key now is to build enough product functionality and keep pushing the boundaries of innovation.

A dedicated blog post is coming soon to explore how MCP and A2A connect data streaming and request-response APIs in modern AI systems.

Agentic AI is poised to revolutionize industries by enabling fully autonomous, goal-driven AI systems that perceive, decide, and act continuously. But to function reliably in dynamic, production-grade environments, these agents require real-time, event-driven architectures—not outdated, batch-oriented pipelines.

Apache Kafka and Apache Flink form the foundation of this shift. Kafka ensures agents receive reliable, ordered event streams, while Flink provides stateful, low-latency stream processing for real-time reactions and long-lived context management. This architecture enables AI agents to process structured events as they happen, react to changes in the environment, and coordinate with other services or agents through durable, replayable data flows.

If your organization is serious about AI, the path forward is clear:

Move from batch to real-time, from passive analytics to autonomous action, and from isolated prompts to event-driven, context-aware agents—enabled by Kafka and Flink.

As a next step, learn more about “Online Model Training and Model Drift in Machine Learning with Apache Kafka and Flink“.

Let’s connect on LinkedIn and discuss how to implement these ideas in your organization. Stay informed about new developments by subscribing to my newsletter. And make sure to download my free book about data streaming use cases.

The post How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time appeared first on Kai Waehner.

]]>
Shift Left Architecture at Siemens: Real-Time Innovation in Manufacturing and Logistics with Data Streaming https://www.kai-waehner.de/blog/2025/04/11/shift-left-architecture-at-siemens-real-time-innovation-in-manufacturing-and-logistics-with-data-streaming/ Fri, 11 Apr 2025 12:32:50 +0000 https://www.kai-waehner.de/?p=7475 Industrial enterprises face increasing pressure to move faster, automate more, and adapt to constant change—without compromising reliability. Siemens Digital Industries addresses this challenge by combining real-time data streaming, modular design, and Shift Left principles to modernize manufacturing and logistics. This blog outlines how technologies like Apache Kafka, Apache Flink, and Confluent Cloud support scalable, event-driven architectures. A real-world example from Siemens’ Modular Intralogistics Platform illustrates how this approach improves data quality, system responsiveness, and operational agility.

The post Shift Left Architecture at Siemens: Real-Time Innovation in Manufacturing and Logistics with Data Streaming appeared first on Kai Waehner.

]]>
Industrial enterprises are under pressure to modernize. They need to move faster, automate more, and adapt to constant change—without sacrificing reliability or control. Siemens Digital Industries is meeting this challenge head-on by combining software, edge computing, and cloud-native technologies into a new architecture. This blog explores how Siemens is using data streaming, modular design, and Shift Left thinking to enable real-time decision-making, improve data quality, and unlock scalable, reusable data products across manufacturing and logistics operations. A real-world example for industrial IoT, intralogistics and shop floor manufacturing illustrates the architecture and highlights the business value behind this transformation.

Shift Left Architecture at Siemens with Stream Processing using Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including customer stories across all industries.

The Data Streaming Use Case Show: Episode #1 – Manufacturing and Automotive

These Siemens success stories are part of The Data Streaming Use Case Show, a new industry webinar series hosted by me.

In the first episode, we focus on the manufacturing and automotive industries. It features:

  • Experts from Siemens Digital Industries and Siemens Healthineers
  • The founder of IoT Use Case, a content and community platform focused on real-world industrial IoT applications
  • Deep insights into how industrial companies combine OT, IT, cloud, and data streaming with the shift left architecture.

The Data Streaming Industry Use Case Show by Confluent with Host Kai Waehner

The series explores real-world solutions across industries, showing how leaders turn data into action through open architectures and real-time platforms.

Siemens Digital Industries: Company and Vision

Siemens Digital Industries is the technology and software arm of Siemens AG, focused on advancing industrial automation and digitalization. It empowers manufacturers and machine builders to become more agile, efficient, and resilient through intelligent software and integrated systems.

Its business model bridges the physical and digital worlds—combining operational technology (OT) with modern information technology (IT). From programmable logic controllers to industrial IoT, Siemens delivers end-to-end solutions across industries.

Today, the company is transforming itself into a software- and cloud-driven organization, focusing strongly on edge computing, real-time analytics, and data streaming as key enablers of modern manufacturing.

With edge and cloud working in harmony, Siemens helps industrial enterprises break up monoliths and evolve toward modular, flexible architectures. These software-driven approaches make plants and factories more adaptive, intelligent, and autonomous.

Data Streaming at Industrial Companies

In industrial settings, data is continuously generated by machines, production systems, robots, and logistics processes. But traditional batch-oriented IT systems are not designed to handle this in real time.

To make smarter, faster decisions, companies need to process data as it is generated. That’s where data streaming comes in.

Apache Kafka and Apache Flink enable event-driven architectures. These allow industrial data to flow in real time, from edge to cloud, across hybrid environments.

Event-driven Architecture with Data Streaming using Kafka and Flink in Industrial IoT and Manufacturing

Check out my other blogs about use cases and architecture for manufacturing and Industrial IoT powered by data streaming.

Edge and Hybrid Cloud as a Standard

Modern industrial use cases are increasingly hybrid by design. Machines and controllers produce data at the edge. Decisions must be made close to the source. However, cloud platforms offer powerful compute and AI capabilities.

Industrial IoT Data Streaming Everywhere Edge Hybrid Cloud with Apache Kafka and Flink

Siemens leverages edge devices to capture and preprocess data on-site. Data streaming with Confluent provides Siemens a real-time backbone for integrating this data with cloud-based systems, including Snowflake, SAP, Salesforce, and others.

This hybrid architecture supports low latency, high availability, and full control over data processing and analytics workflows.

The Shift Left Architecture for Industrial IoT

In many industrial architectures, Kafka has traditionally been used to ingest data into analytics platforms like Snowflake or Databricks. Processing, transformation, and enrichment happened late in the data pipeline.

ETL and ELT Data Integration to Data Lake Warehouse Lakehouse in Batch

But Siemens is shifting that model.

The Shift Left Architecture moves processing closer to the source, directly into the streaming layer. Instead of waiting to transform data in a data warehouse, Siemens now applies stream processing in real time, using Confluent Cloud and Kafka topics.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

This shift enables faster decision-making, better data quality, and broader reuse of high-quality data across both analytical and operational systems.

For a deeper look at how Shift Left is transforming industrial architectures, read the full article about the Shift Left Architecture with Data Streaming.

Siemens Data Streaming Success Story: Modular Intralogistics Platform

A key example of this new architecture is Siemens’ Modular Intralogistics Platform, used in manufacturing plants for material handling and supply chain optimization. I explored the shift left architecture in our data streaming use case show with Stefan Baer, Senior Key Expert – Data Streaming at Siemens IT.

Traditionally, intralogistics systems were tightly coupled, with rigid integrations between:

  • Enterprise Resource Planning (ERP): Order management, master data
  • Manufacturing Operations Management (MOM): Production scheduling, quality, maintenance
  • Warehouse Execution System (EWM): Inventory, picking, warehouse automation
  • Execution Management System (eMS): Transport control, automated guided vehicle (AGV) orchestration, conveyor logic

The new approach breaks this down into packaged business capabilities—each one modular, orchestrated, and connected through Confluent Cloud.

Key benefits:

  • Real-time orchestration of logistics operations
  • Automated material delivery—no manual reordering required
  • ERP and MOM systems integrated flexibly via Kafka
  • High adaptability through modular components
  • GenAI used for package station load optimization

Stream processing with Apache Flink transforms events in motion. For example, when a production order changes or material shortages occur, the system reacts instantly—adjusting delivery routes, triggering alerts, or rebalancing station loads using AI.
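
A hedged sketch of what such a reaction could look like, expressed as Flink SQL submitted through the Java TableEnvironment: raw stock events are read from a Kafka topic, and any stock level below a threshold is immediately published as a shortage alert that downstream services or agents can act on. Topic names, field names, and the threshold are illustrative assumptions, not Siemens’ actual implementation.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ShortageAlertPipeline {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Source: raw stock events from the shop floor (hypothetical topic and schema).
        tEnv.executeSql(
            "CREATE TABLE material_events (" +
            "  station_id STRING," +
            "  material_id STRING," +
            "  stock_level INT" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'material-events'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'properties.group.id' = 'shortage-alerts'," +
            "  'scan.startup.mode' = 'latest-offset'," +
            "  'format' = 'json'" +
            ")");

        // Sink: curated alert events that other services or AI agents consume immediately.
        tEnv.executeSql(
            "CREATE TABLE shortage_alerts (" +
            "  station_id STRING," +
            "  material_id STRING," +
            "  stock_level INT" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'shortage-alerts'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'format' = 'json'" +
            ")");

        // The filter runs continuously in the streaming layer - shifted left, before any data warehouse.
        tEnv.executeSql(
            "INSERT INTO shortage_alerts " +
            "SELECT station_id, material_id, stock_level " +
            "FROM material_events " +
            "WHERE stock_level < 10");
    }
}
```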

Architecture: Data Products + Shift Left

At the heart of the solution is a combination of data products and stream processing:

  • Kafka Topics serve as real-time interfaces and the persistence layer between business domains.
  • Confluent Cloud hosts the event streaming infrastructure as a fully-managed service with low latency, elasticity, and critical SLAs.
  • Stream processing with serverless Flink logic enriches and transforms data in motion.
  • Snowflake receives curated, ready-to-use data for analytics.
  • Other operational and analytical downstream consumers—such as GenAI modules or shop floor dashboards—access the same consistent data in real time.

Siemens Digital Industries - Modular Intralogistics Platform
Source: Siemens Digital Industries

This reuse of data products ensures consistent semantics, reduces duplication, and simplifies governance.
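
As a small illustration of treating a Kafka topic as a data product interface, the sketch below uses the Kafka AdminClient to create a topic with explicit partitioning, replication, and retention, so new consumers can replay recent history instead of requesting a batch export. The topic name, partition count, and retention period are illustrative assumptions.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateDataProductTopic {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A data product topic: partitioned for scale, replicated for availability
            // (replication factor 3 assumes a three-broker cluster), retained for replay.
            NewTopic ordersTopic = new NewTopic("intralogistics.orders", 6, (short) 3)
                .configs(Map.of(
                    "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000), // keep 7 days of history
                    "cleanup.policy", "delete"));
            admin.createTopics(List.of(ordersTopic)).all().get();
        }
    }
}
```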

By processing data earlier in the pipeline, Siemens improves both data quality and system responsiveness. This model replaces brittle, point-to-point integrations with a more sustainable, scalable platform architecture.

Siemens Shift Left Architecture and Data Products with Data Streaming using Apache Kafka and Flink
Source: Siemens Digital Industries

Business Value of Data Streaming and Shift Left at Siemens Digital Industries

The combination of real-time data streaming, modular data products, and Shift Left design principles unlocks significant value:

  • Faster response to dynamic events in production and logistics
  • Improved operational resilience and agility
  • Higher quality data for both analytics and AI
  • Reuse across multiple consumers (analytics, operations, automation)
  • Lower integration costs and easier scaling

This approach is not just technically superior—it supports measurable business outcomes like shorter lead times, lower stock levels, and increased manufacturing throughput.

Siemens Healthineers: Shift Left with IoT, Data Streaming, AI/ML, Confluent and Snowflake in Manufacturing and Healthcare

In a recent blog post, I explored how Siemens Healthineers uses Apache Kafka and Flink to transform both manufacturing and healthcare with a wide range of data streaming use cases. From predictive maintenance to real-time logistics, their approach is a textbook example of how to modernize complex environments with an event-driven architecture and data streaming, even if they don’t explicitly label it “shift left.”

Siemens Healthineers Data Cloud Technology Stack with Apache Kafka and Snowflake
Source: Siemens Healthineers

Their architecture enables proactive decision-making by pushing real-time insights and automation earlier in the process. Examples include telemetry streaming from medical devices, machine integration with SAP and KUKA robots, and logistics event streaming from SAP for faster packaging and delivery. Each use case shows how real-time data—combined with cloud-native platforms like Confluent and Snowflake—improves efficiency, reliability, and responsiveness.

Just like the intralogistics example from Siemens Digital Industries, Healthineers applies shift-left thinking by enabling teams to act on data sooner, reduce latency, and prevent costly delays. This approach enhances not only operational workflows but also outcomes that matter, like patient care and regulatory compliance.

This is shift left in action: embedding intelligence and quality controls early, where they have the greatest impact.

Rethinking Industrial Data Architectures with Data Streaming and Shift Left Architecture

Siemens Digital Industries is demonstrating what’s possible when you rethink the data architecture beyond just analytics in a data lake.

With data streaming leveraging Confluent Cloud, data products for modular software, and a Shift Left approach, Siemens is transforming traditional factories into intelligent, event-driven operations. A data streaming platform based on Apache Kafka is no longer just an ingestion layer. It is a central nervous system for real-time processing and decision-making.

This is not about chasing trends. It’s about building resilient, scalable, and future-proof industrial systems. And it’s just the beginning.

To learn more, watch the on-demand industry use case show with Siemens Digital Industries and Siemens Healthineers or connect with us to explore what data streaming can do for your organization.

Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. And download my free book about data streaming use cases.

The post Shift Left Architecture at Siemens: Real-Time Innovation in Manufacturing and Logistics with Data Streaming appeared first on Kai Waehner.

]]>