MCP Archives - Kai Waehner

Agentic AI with the Agent2Agent Protocol (A2A) and MCP using Apache Kafka as Event Broker

Kai Waehner — Mon, 26 May 2025 05:32:01 +0000

Agentic AI is gaining traction as a design pattern for building more intelligent, autonomous, and collaborative systems. Unlike traditional task-based automation, agentic AI involves intelligent agents that operate independently, make contextual decisions, and collaborate with other agents or systems—across domains, departments, and even enterprises.

In the enterprise world, agentic AI is more than just a technical concept. It represents a shift in how systems interact, learn, and evolve. But unlocking its full potential requires more than AI models and point-to-point APIs—it demands the right integration backbone.

That’s where Apache Kafka as event broker for true decoupling comes into play together with two emerging AI standards: Google’s Application-to-Application (A2A) Protocol and Antrophic’s Model Context Protocol (MCP) in an enterprise architecture for Agentic AI.

Inspired by my colleague Sean Falconer’s blog post, “Why Google’s Agent2Agent Protocol Needs Apache Kafka”, this blog post explores the Agentic AI adoption in enterprises and how an event-driven architecture with Apache Kafka fits into the AI architecture.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including various AI examples across industries.

Business Value of Agentic AI in the Enterprise

For enterprises, the promise of agentic AI is compelling:

Smarter automation through self-directed, context-aware agents
Improved customer experience with faster and more personalized responses
Operational efficiency by connecting internal and external systems more intelligently
Scalable B2B interactions that span suppliers, partners, and digital ecosystems

But none of this works if systems are coupled by brittle point-to-point APIs, slow batch jobs, or disconnected data pipelines. Autonomous agents need continuous, real-time access to events, shared state, and a common communication fabric that scales across use cases.

Model Context Protocol (MCP) + Agent2Agent (A2A): New Standards for Agentic AI

The Model Context Protocol (MCP) coined by Anthropic offers a standardized, model-agnostic interface for context exchange between AI agents and external systems. Whether the interaction is streaming, batch, or API-based, MCP abstracts how agents retrieve inputs, send outputs, and trigger actions across services. This enables real-time coordination between models and tools—improving autonomy, reusability, and interoperability in distributed AI systems.

Source: Anthropic

Google’s Agent2Agent (A2A) protocol complements this by defining how autonomous software agents can interact with one another in a standard way. A2A enables scalable agent-to-agent collaboration—where agents discover each other, share state, and delegate tasks without predefined integrations. It’s foundational for building open, multi-agent ecosystems that work across departments, companies, and platforms.

Source: Google

Why Apache Kafka Is a Better Fit Than an API (HTTP/REST) for A2A and MCP

Most enterprises today use HTTP-based APIs to connect services—ideal for simple, synchronous request-response interactions.

In contrast, Apache Kafka is a distributed event streaming platform designed for asynchronous, high-throughput, and loosely coupled communication—making it a much better fit for multi-agent (A2A) and agentic AI architectures.

API-Based Integration	Kafka-Based Integration
Synchronous, blocking	Asynchronous, event-driven
Point-to-point coupling	Loose coupling with pub/sub topics
Hard to scale to many agents	Supports multiple consumers natively
No shared memory	Kafka retains and replays event history
Limited observability	Full traceability with schema registry & DLQs

Kafka serves as the decoupling layer. It becomes the place where agents publish their state, subscribe to updates, and communicate changes—independently and asynchronously. This enables multi-agent coordination, resilience, and extensibility.

MCP + Kafka = Open, Flexible Communication

As the adoption of Agentic AI accelerates, there’s a growing need for scalable communication between AI agents, services, and operational systems. The Model-Context Protocol (MCP) is emerging as a standard to structure these interactions—defining how agents access tools, send inputs, and receive results. But a protocol alone doesn’t solve the challenges of integration, scaling, or observability.

This is where Apache Kafka comes in.

By combining MCP with Kafka, agents can interact through a Kafka topic—fully decoupled, asynchronous, and in real time. Instead of direct, synchronous calls between agents and services, all communication happens through Kafka topics, using structured events based on the MCP format.

This model supports a wide range of implementations and tech stacks. For instance:

A Python-based AI agent deployed in a SaaS environment
A Spring Boot Java microservice running inside a transactional core system
A Flink application deployed at the edge performing low-latency stream processing
An API gateway translating HTTP requests into MCP-compliant Kafka events

Regardless of where or how an agent is implemented, it can participate in the same event-driven system. Kafka ensures durability, replayability, and scalability. MCP provides the semantic structure for requests and responses.

The result is a highly flexible, loosely coupled architecture for Agentic AI—one that supports real-time processing, cross-system coordination, and long-term observability. This combination is already being explored in early enterprise projects and will be a key building block for agent-based systems moving into production.

Stream Processing as the Agent’s Companion

Stream processing technologies like Apache Flink or Kafka Streams allow agents to:

Filter, join, and enrich events in motion
Maintain stateful context for decisions (e.g., real-time credit risk)
Trigger new downstream actions based on complex event patterns
Apply AI directly within the stream processing logic, enabling real-time inference and contextual decision-making with embedded models or external calls to a model server, vector database, or any other AI platform

Agents don’t need to manage all logic themselves. The data streaming platform can pre-process information, enforce policies, and even trigger fallback or compensating workflows—making agents simpler and more focused.

Technology Flexibility for Agentic AI Design with Data Contracts

One of the biggest advantages of Kafka-based event-driven and decoupled backend for agentic systems is that agents can be implemented in any stack:

Languages: Python, Java, Go, etc.
Environments: Containers, serverless, JVM apps, SaaS tools
Communication styles: Event streaming, REST APIs, scheduled jobs

The Kafka topic is the stable data contract for quality and policy enforcement. Agents can evolve independently, be deployed incrementally, and interoperate without tight dependencies.

Microservices, Data Products, and Reusability – Agentic AI Is Just One Piece of the Puzzle

To be effective, Agentic AI needs to connect seamlessly with existing operational systems and business workflows.

Kafka topics enable the creation of reusable data products that serve multiple consumers—AI agents, dashboards, services, or external partners. This aligns perfectly with data mesh and microservice principles, where ownership, scalability, and interoperability are key.

A single stream of enriched order events might be consumed via a single data product by:

A fraud detection agent
A real-time alerting system
An agent triggering SAP workflow updates
A lakehouse for reporting and batch analytics

This one-to-many model is the opposite of traditional REST designs and crucial for enabling agentic orchestration at scale.

Agentic Al Needs Integration with Core Enterprise Systems

Agentic AI is not a standalone trend—it’s becoming an integral part of broader enterprise AI strategies. While this post focuses on architectural foundations like Kafka, MCP, and A2A, it’s important to recognize how this infrastructure complements the evolution of major AI platforms.

Leading vendors such as Databricks, Snowflake, and others are building scalable foundations for machine learning, analytics, and generative AI. These platforms often handle model training and serving. But to bring agentic capabilities into production—especially for real-time, autonomous workflows—they must connect with operational, transactional systems and other agents at runtime. (See also: Confluent + Databricks blog series | Apache Kafka + Snowflake blog series)

This is where Kafka as the event broker becomes essential: it links these analytical backends with AI agents, transactional systems, and streaming pipelines across the enterprise.

At the same time, enterprise application vendors are embedding AI assistants and agents directly into their platforms:

SAP Joule / Business AI – Embedded AI for finance, supply chain, and operations
Salesforce Einstein / Copilot Studio – Generative AI for CRM and sales automation
ServiceNow Now Assist – Predictive automation across IT and employee services
Oracle Fusion AI / OCI – ML for ERP, HCM, and procurement
Microsoft Copilot – Integrated AI across Dynamics and Power Platform
IBM watsonx, Adobe Sensei, Infor Coleman AI – Governed, domain-specific AI agents

Each of these solutions benefits from the same architectural foundation: real-time data access, decoupled integration, and standardized agent communication.

Whether deployed internally or sourced from vendors, agents need reliable event-driven infrastructure to coordinate with each other and with backend systems. Apache Kafka provides this core integration layer—supporting a consistent, scalable, and open foundation for agentic AI across the enterprise.

Agentic AI Requires Decoupling – Apache Kafka Supports A2A and MCP as an Event Broker

To deliver on the promise of agentic AI, enterprises must move beyond point-to-point APIs and batch integrations. They need a shared, event-driven foundation that enables agents (and other enterprise software) to work independently and together—with shared context, consistent data, and scalable interactions.

Apache Kafka provides exactly that. Combined with MCP and A2A for standardized Agentic AI communication, Kafka unlocks the flexibility, resilience, and openness needed for next-generation enterprise AI.

It’s not about picking one agent platform—it’s about giving every agent the same, reliable interface to the rest of the world. Kafka is that interface.

The post Agentic AI with the Agent2Agent Protocol (A2A) and MCP using Apache Kafka as Event Broker appeared first on Kai Waehner.

Databricks and Confluent Leading Data and AI Architectures – What About Snowflake, BigQuery, and Friends?

Kai Waehner — Thu, 15 May 2025 09:57:25 +0000

The modern data landscape is shaped by platforms that excel in different but increasingly overlapping domains. Confluent leads in data streaming with enterprise-grade infrastructure for real-time data movement and processing. Databricks and Snowflake dominate the lakehouse and analytics space—each with unique strengths. Databricks is known for scalable AI and machine learning pipelines, while Snowflake stands out for its simplicity, governed data sharing, and performance in cloud-native analytics.

This final blog in the series brings together everything covered so far and highlights how these technologies power real customer innovation. At Erste Bank, Confluent and Databricks are combined to build an event-driven architecture for Generative AI use cases in customer service. At Siemens, Confluent and Snowflake support a shift-left architecture to drive real-time manufacturing insights and medical AI—using streaming data not just for analytics, but also to trigger operational workflows across systems.

Together, these examples show why so many enterprises adopt a multi-platform strategy—with Confluent as the event-driven backbone, and Databricks or Snowflake (or both) as the downstream platforms for analytics, governance, and AI.

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Blog 1: The Past, Present and Future of Confluent (The Kafka Company) and Databricks (The Spark Company)
Blog 2: Confluent Data Streaming Platform vs. Databricks Data Intelligence Platform for Data Integration and Processing
Blog 3: Shift-Left Architecture for AI and Data Warehousing with Confluent and Databricks
Blog 4: Databricks and Confluent in Enterprise Software Environments (with SAP as Example)
Blog 5 (THIS ARTICLE): Databricks and Confluent Leading Data and AI Architectures – and How They Compare to Competitors

Learn how these platforms will affect data use in businesses in future articles. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks and Snowflake.

The Broader Data Streaming and Lakehouse Landscape

The data streaming and lakehouse space continues to expand, with a variety of platforms offering different capabilities for real-time processing, analytics, and storage.

Data Streaming Market

On the data streaming side, Confluent is the leader. Other cloud-native services like Amazon MSK, Azure Event Hubs, and Google Cloud Managed Kafka provide Kafka-compatible offerings, though they vary in protocol support, ecosystem maturity, and operational simplicity. StreamNative, based on Apache Pulsar, competes with the Kafka offerings, while Decodable and DeltaStream leverage Apache Flink for real-time stream processing using a complementary approach. Startups such as AutoMQ and BufStream pitch reimagining Kafka infrastructure for improved scalability and cost-efficiency in cloud-native architectures.

The data streaming landscape is growing year by year. Here is the latest overview of the data streaming market:

Lakehouse Market

In the lakehouse and analytics platform category, Databricks leads with its cloud-native model combining compute and storage, enabling modern lakehouse architectures. Snowflake is another leading cloud data platform, praised for its ease of use, strong ecosystem, and ability to unify diverse analytical workloads. Microsoft Fabric aims to unify data engineering, real-time analytics, and AI on Azure under one platform. Google BigQuery offers a serverless, scalable solution for large-scale analytics, while platforms like Amazon Redshift, ClickHouse, and Athena serve both traditional and high-performance OLAP use cases.

The Forrester Wave for Lakehouses analyzes and explores the vendor options, showing Databricks, Snowflake and Google as the leaders. Unfortunately, it is not allowed to post the Forrester Wave, so you need to download it from a vendor.

Confluent + Databricks

This blog series highlights Databricks and Confluent because they represent a powerful combination at the intersection of data streaming and the lakehouse paradigm. Together, they enable real-time, AI-driven architectures that unify operational and analytical workloads across modern enterprise environments.

Each platform in the data streaming and Lakehouse space has distinct advantages, but none offer the same combination of real-time capabilities, open architecture, and end-to-end integration as Confluent and Databricks.

It’s also worth noting that open source remains a big – if not the biggest – competitor to all of these vendors. Many enterprises still rely on open-source data lakes built on Elastic, legacy Hadoop, or open table formats such as Apache Hudi—favoring flexibility and cost control over fully managed services.

Confluent: The Leading Data Streaming Platform (DSP)

Confluent is the enterprise-standard platform for data streaming, built on Apache Kafka and extended for cloud-native, real-time operations at global scale. The data streaming platform (DSP) delivers a complete and unified platform with multiple deployment options to meet diverse needs and budgets:

Confluent Cloud – Fully managed, serverless Kafka and Flink service across AWS, Azure, and Google Cloud
Confluent Platform – Self-managed software for on-premises, private cloud, or hybrid environments
WarpStream – Kafka-compatible, cloud-native infrastructure optimized for BYOC (Bring Your Own Cloud) using low-cost object storage like S3

Together, these options offer cost efficiency and flexibility across a wide range of streaming workloads:

Small-volume, mission-critical use cases such as payments or fraud detection, where zero data loss, strict SLAs, and low latency are non-negotiable
High-volume, analytics-driven use cases like clickstream processing for real-time personalization and recommendation engines, where throughput and scalability are key

Confluent supports these use cases with:

Cluster Linking for real-time, multi-region and hybrid cloud data movement
100+ prebuilt connectors for seamless integration with enterprise systems and cloud services
Apache Flink for rich stream processing at scale
Governance and observability with Schema Registry, Stream Catalog, role-based access control, and SLAs
Tableflow for native integration with Delta Lake, Apache Iceberg, and modern lakehouse architectures

While other providers offer fragments—such as Amazon MSK for basic Kafka infrastructure or Azure Event Hubs for ingestion—only Confluent delivers a unified, cloud-native data streaming platform with consistent operations, tooling, and security across environments.

Confluent is trusted by over 6,000 enterprises and backed by deep experience in large-scale streaming deployments, hybrid architectures, and Kafka migrations. It combines industry-leading technology with enterprise-grade support, expertise, and consulting services to help organizations turn real-time data into real business outcomes—securely, efficiently, and at any scale.

Databricks: The Leading Lakehouse for AI and Analytics

Databricks is the leading platform for unified analytics, data engineering, and AI—purpose-built to help enterprises turn massive volumes of data into intelligent, real-time decision-making. Positioned as the Data Intelligence Platform, Databricks combines a powerful lakehouse foundation with full-spectrum AI capabilities, making it the platform of choice for modern data teams.

Its core strengths include:

Delta Lake + Unity Catalog – A robust foundation for structured, governed, and versioned data at scale
Apache Spark – Distributed compute engine for ETL, data preparation, and batch/stream processing
MosaicML – End-to-end tooling for efficient model training, fine-tuning, and deployment of custom AI models
AI/ML tools for data scientists, ML engineers, and analysts—integrated across the platform
Native connectors to BI tools (like Power BI, Tableau) and MLOps platforms for model lifecycle management

Databricks directly competes with Snowflake, especially in the enterprise AI and analytics space. While Snowflake shines with simplicity and governed warehousing, Databricks differentiates by offering a more flexible and performant platform for large-scale model training and advanced AI pipelines.

The platform supports:

Batch and (sort of) streaming analytics
ML model training and inference on shared data
GenAI use cases, including RAG (Retrieval-Augmented Generation) with unstructured and structured sources
Data sharing and collaboration across teams and organizations with open formats and native interoperability

Databricks is trusted by thousands of organizations for AI workloads, offering not only powerful infrastructure but also integrated governance, observability, and scalability—whether deployed on a single cloud or across multi-cloud environments.

Combined with Confluent’s real-time data streaming capabilities, Databricks completes the AI-driven enterprise architecture by enabling organizations to analyze, model, and act on high-quality, real-time data at scale.

Stronger Together: A Strategic Alliance for Data and AI with Tableflow and Delta Lake

Confluent and Databricks are not trying to replace each other. Their partnership is strategic and product-driven.

Recent innovation: Tableflow + Delta Lake – this feature enables bi-directional data exchange between Kafka and Delta Lake.

Direction 1: Confluent streams → Tableflow → Delta Lake (via Unity Catalog)
Direction 2: Databricks insights → Tableflow → Kafka → Flink or other operational systems

This simplifies architecture, reduces cost and latency, and removes the need for Spark jobs to manage streaming data.

Source: Confluent

Confluent becomes the operational data backbone for AI and analytics. Databricks becomes the analytics and AI engine fed with data from Confluent.

Where needed, operational or analytical real-time AI predictions can be done within Confluent’s data streaming platform: with embedded or remote model inference, native integration for search with vector databases, and built-in models for common predictive use cases such as forecasting.

Erste Bank: Building a Foundation for GenAI with Confluent and Databricks

Erste Group Bank AG, one of the largest financial services providers in Central and Eastern Europe, is leveraging Confluent and Databricks to transform its customer service operations with Generative AI. Recognizing that successful GenAI initiatives require more than just advanced models, Erste Bank first focused on establishing a solid foundation of real-time, consistent, and high-quality data leveraging data streaming and an event-driven architecture.

Using Confluent, Erste Bank connects real-time streams, batch workloads, and request-response APIs across its legacy and cloud-native systems in a decoupled way but ensuring data consistency through Kafka. This architecture ensures that operational and analytical data — whether from core banking platforms, digital channels, mobile apps, or CRM systems — flows reliably and consistently across the enterprise. By integrating event streams, historical data, and API calls into a unified data pipeline, Confluent enables Erste Bank to create a live, trusted digital representation of customer interactions.

With this real-time foundation in place, Erste Bank leverages Databricks as its AI and analytics platform to build and scale GenAI applications. At the Data in Motion Tour 2024 in Frankfurt, Erste Bank presented a pilot project where customer service chatbots consume contextual data flowing through Confluent into Databricks, enabling intelligent, personalized responses. Once a customer request is processed, the chatbot triggers a transaction back through Kafka into the Salesforce CRM, ensuring seamless, automated follow-up actions.

Source: Erste Group Bank AG

By combining Confluent’s real-time data streaming capabilities with Databricks’ powerful AI infrastructure, Erste Bank is able to:

Deliver highly contextual, real-time customer service interactions
Automate CRM workflows through real-time event-driven architectures
Build a scalable, resilient platform for future AI-driven applications

This architecture positions Erste Bank to continue expanding GenAI use cases across financial services, from customer engagement to operational efficiency, powered by consistent, trusted, and real-time data.

Confluent: The Neutral Streaming Backbone for Any Data Stack

Confluent is not tied to a monolithic compute engine within a cloud provider. This neutrality is a strength:

Bridges operational systems (mainframes, SAP) with modern data platforms (AI, lakehouses, etc.)
An event-driven architecture built with a data streaming platform feeds multiple lakehouses at once
Works across all major cloud providers, including AWS, Azure, and GCP
Operates at the edge, on-prem, in the cloud and in hybrid scenarios
One size doesn’t fit all – follow best practices from microservices architectures and data mesh to tailor your architecture with purpose-built solutions.

The flexibility makes Confluent the best platform for data distribution—enabling decoupled teams to use the tools and platforms best suited to their needs.

Confluent’s Tableflow also supports Apache Iceberg to enable seamless integration from Kafka into lakehouses beyond Delta Lake and Databricks—such as Snowflake, BigQuery, Amazon Athena, and many other data platforms and analytics engines.

Example: A global enterprise uses Confluent as its central nervous system for data streaming. Customer interaction events flow in real time from web and mobile apps into Confluent. These events are then:

Streamed into Databricks once for multiple GenAI and analytics use cases.
Written to an operational PostgreSQL database to update order status and customer profiles
Pushed into an customer-facing analytics engine like StarTree (powered by Apache Pinot) for live dashboards and real-time customer behavior analytics
Shared with Snowflake through a lift-and-shift M&A use case to unify analytics from an acquired company

This setup shows the power of Confluent’s neutrality and flexibility: enabling real-time, multi-directional data sharing across heterogeneous platforms, without coupling compute and storage.

Snowflake: A Cloud-Native Companion to Confluent – Natively Integrated with Apache Iceberg and Polaris Catalog

Snowflake pairs naturally with Confluent to power modern data architectures. As a cloud-native SaaS from the start, Snowflake has earned broad adoption across industries thanks to its scalability, simplicity, and fully managed experience.

Together, Confluent and Snowflake unlock high-impact use cases:

Near real-time ingestion and enrichment: Stream operational data into Snowflake for immediate analytics and action.
Unified operational and analytical workloads: Combine Confluent’s Tableflow with Snowflake’s Apache Iceberg support through its open source Polaris catalog to bridge operational and analytical data layers.
Shift-left data quality: Improve reliability and reduce costs by validating and shaping data upstream, before it hits storage.

With Confluent as the streaming backbone and Snowflake as the analytics engine, enterprises get a cloud-native stack that’s fast, flexible, and built to scale. Many enterprises use Confluent as data ingestion platform for Databricks, Snowflake, and other analytical and operational downstream applications.

Shift Left at Siemens: Real-Time Innovation with Confluent and Snowflake

Siemens is a global technology leader operating across industry, infrastructure, mobility, and healthcare. Its portfolio includes industrial automation, digital twins, smart building systems, and advanced medical technologies—delivered through units like Siemens Digital Industries and Siemens Healthineers.

To accelerate innovation and reduce operational costs, Siemens is embracing a shift-left architecture to enrich data early in the pipeline before it reaches Snowflake. This enables reusable, real-time data products in the data streaming platform leveraging an event-driven architecture for data sharing with analytical and operational systems beyond Snowflake.

Siemens Digital Industries applies this model to optimize manufacturing and intralogistics, using streaming ETL to feed real-time dashboards and trigger actions like automated inventory replenishment—while continuing to use Snowflake for historical analysis, reporting, and long-term data modeling.

Source: Siemens Digital Industries

Siemens Healthineers embeds AI directly in the stream processor to detect anomalies in medical equipment telemetry, improving response time and avoiding costly equipment failures—while leveraging Snowflake to power centralized analytics, compliance reporting, and cross-device trend analysis.

Source: Siemens Healthineers

These success stories are part of The Data Streaming Use Case Show, my new industry webinar series. Learn more about Siemens’ usage of Confluent and Snowflake and watch the video recording about “shift left”.

Open Outlook: Agentic AI with Model-Context Protocol (MCP) and Agent2Agent Protocol (A2A)

While data and AI platforms like Databricks and Snowflake play a key role, some Agentic AI projects will likely rely on emerging, independent SaaS platforms and specialized tools. Flexibility and open standards are key for future success.

What better way to close a blog series on Confluent and Databricks (and Snowflake) than by looking ahead to one of the most exciting frontiers in enterprise AI: Agentic AI.

As enterprise AI matures, there is growing interest in bi-directional interfaces between operational systems and AI agents. Google’s A2A (Agent-to-Agent) architecture reinforces this shift—highlighting how intelligent agents can autonomously communicate, coordinate, and act across distributed systems.

Confluent + Databricks is an ideal combination to support these emerging Agentic AI patterns, where event-driven agents continuously learn from and act upon streaming data. Models can be embedded directly in Flink for low-latency applications or hosted and orchestrated in Databricks for more complex inference workflows.

The Model-Context-Protocol (MCP) is gaining traction as a design blueprint for standardized interaction between services, models, and event streams. In this context, Confluent and Databricks are well positioned to lead:

Confluent: Event-driven delivery of context, inputs, and actions
Databricks: Model hosting, training, inference, and orchestration
Jointly: Closed feedback loops between business systems and AI agents

Together with protocols like A2A and MCP, this architecture will shape the next generation of intelligent, real-time enterprise applications.

Confluent + Databricks: The Future-Proof Data Stack for AI and Analytics

Databricks and Confluent are not just partners. They are leaders in their respective domains. Together, they enable real-time, intelligent data architectures that support operational excellence and AI innovation.

Other AI and data platforms are part of the landscape, and many bring valuable capabilities. As explored in this blog series, the true decoupling using an event-driven architecture with Apache Kafka allows using any kind of combination of vendors and cloud services. I see many enterprises using Databricks and Snowflake integrated to Confluent. However, the alignment between Confluent and Databricks stands out due to its combination of strengths:

Confluent’s category leadership in data streaming, powering thousands of production deployments across industries
Databricks’ strong position in the lakehouse and AI space, with broad enterprise adoption for analytics and machine learning
Shared product vision and growing engineering and go-to-market alignment across technical and field organizations

For enterprises shaping a long-term data and AI strategy, this combination offers a proven, scalable foundation—bridging real-time operations with analytical depth, without forcing trade-offs between speed, flexibility, or future-readiness.

Stay tuned for deep dives into how these platforms are shaping the future of data-driven enterprises. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks and Snowflake.

The post Databricks and Confluent Leading Data and AI Architectures – What About Snowflake, BigQuery, and Friends? appeared first on Kai Waehner.

How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time

Kai Waehner — Mon, 14 Apr 2025 09:09:10 +0000

Artificial Intelligence is evolving beyond passive analytics and reactive automation. Agentic AI represents a new wave of autonomous, goal-driven AI systems that can think, plan, and execute complex workflows without human intervention. However, for these AI agents to be effective, they must operate on real-time, consistent, and trustworthy data—a challenge that traditional batch processing architectures simply cannot meet. This is where Data Streaming with Apache Kafka and Apache Flink, coupled with an event-driven architecture (EDA), form the backbone of Agentic AI. By enabling real-time and continuous decision-making, EDA ensures that AI systems can act instantly and reliably in dynamic, high-speed environments. Emerging standards like the Model Context Protocol (MCP) and Google’s Agent-to-Agent (A2A) protocol are now complementing this foundation, providing structured, interoperable layers for managing context and coordination across intelligent agents—making AI not just event-driven, but also context-aware and collaborative.

In this post, I will explore:

How Agentic AI works and why it needs real-time data
Why event-driven architectures are the best choice for AI automation
Key use cases across industries
How Kafka and Flink provide the necessary data consistency and real-time intelligence for AI-driven decision-making
The role of MCP, A2A, and frameworks like LangChain and LlamaIndex in enabling scalable, context-aware, and collaborative AI systems

What is Agentic AI?

Agentic AI refers to AI systems that exhibit autonomous, goal-driven decision-making and execution. Unlike traditional automation tools that follow rigid workflows, Agentic AI can:

Understand and interpret natural language instructions
Set objectives, create strategies, and prioritize actions
Adapt to changing conditions and make real-time decisions
Execute multi-step tasks with minimal human supervision
Integrate with multiple operational and analytical systems and data sources to complete workflows

Here is an example AI Agent dependency graph from Sean Falconer’s article “Event-Driven AI: Building a Research Assistant with Kafka and Flink“:

Source: Sean Falconer

Instead of merely analyzing data, Agentic AI acts on data, making it invaluable for operational and transactional use cases—far beyond traditional analytics.

However, without real-time, high-integrity data, these systems cannot function effectively. If AI is working with stale, incomplete, or inconsistent information, its decisions become unreliable and even counterproductive. This is where Kafka, Flink, and event-driven architectures become indispensable.

Why Batch Processing Fails for Agentic AI

Traditional AI and analytics systems have relied heavily on batch processing, where data is collected, stored, and processed in predefined intervals. This approach may work for generating historical reports or training machine learning models offline, but it completely breaks down when applied to operational and transactional AI use cases—which are at the core of Agentic AI.

I recently explored the Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming). And here’s why batch processing is fundamentally incompatible with Agentic AI and the real-world challenges it creates:

1. Delayed Decision-Making Slows AI Reactions

Agentic AI systems are designed to autonomously respond to real-time changes in the environment, whether it’s optimizing a telecommunications network, detecting fraud in banking, or dynamically adjusting supply chains.

In a batch-driven system, data is processed hours or even days later, making AI responses obsolete before they even reach the decision-making phase. For example:

Fraud detection: If a bank processes transactions in nightly batches, fraudulent activities may go unnoticed for hours, leading to financial losses.
E-commerce recommendations: If a retailer updates product recommendations only once per day, it fails to capture real-time shifts in customer behavior.
Network optimization: If a telecom company analyzes network traffic in batch mode, it cannot prevent congestion or outages before it affects users.

Agentic AI requires instantaneous decision-making based on streaming data, not delayed insights from batch reports.

2. Data Staleness Creates Inaccurate AI Decisions

AI agents must act on fresh, real-world data, but batch processing inherently means working with outdated information. If an AI agent is making decisions based on yesterday’s or last hour’s data, those decisions are no longer reliable.

Consider a self-healing IT infrastructure that uses AI to detect and mitigate outages. If logs and system metrics are processed in batch mode, the AI agent will be acting on old incident reports, missing live system failures that need immediate attention.

In contrast, an event-driven system powered by Kafka and Flink ensures that AI agents receive live system logs as they occur, allowing for proactive self-healing before customers are impacted.

3. High Latency Kills Operational AI

In industries like finance, healthcare, and manufacturing, even a few seconds of delay can lead to severe consequences. Batch processing introduces significant latency, making real-time automation impossible.

For example:

Healthcare monitoring: A real-time AI system should detect abnormal heart rates from a patient’s wearable device and alert doctors immediately. If health data is only processed in hourly batches, a critical deterioration could be missed, leading to life-threatening situations.
Automated trading in finance: AI-driven trading systems must respond to market fluctuations within milliseconds. Batch-based analysis would mean losing high-value trading opportunities to faster competitors.

Agentic AI must operate on a live data stream, where every event is processed instantly, allowing decisions to be made in real-time, not retrospectively.

4. Rigid Workflows Increase Complexity and Costs

Batch processing forces businesses to predefine rigid workflows that do not adapt well to changing conditions. In a batch-driven world:

Data must be manually scheduled for ingestion.
Systems must wait for the entire dataset to be processed before making decisions.
Business logic is hard-coded, requiring expensive engineering effort to update workflows.

Agentic AI, on the other hand, is designed for continuous, adaptive decision-making. By leveraging an event-driven architecture, AI agents listen to streams of real-time data, dynamically adjusting workflows on the fly instead of relying on predefined batch jobs.

This flexibility is especially critical in industries with rapidly changing conditions, such as supply chain logistics, cybersecurity, and IoT-based smart cities.

5. Batch Processing Cannot Support Continuous Learning

A key advantage of Agentic AI is its ability to learn from past experiences and self-improve over time. However, this is only possible if AI models are continuously updated with real-time feedback loops.

Batch-driven architectures limit AI’s ability to learn because:

Models are retrained infrequently, leading to outdated insights.
Feedback loops are slow, preventing AI from adjusting strategies in real time.
Drift in data patterns is not immediately detected, causing AI performance degradation.

For instance, in customer service chatbots, an AI-powered agent should adapt to customer sentiment in real time. If a chatbot is trained on stale customer interactions from last month, it won’t understand emerging trends or newly common issues.

By contrast, a real-time data streaming architecture ensures that AI agents continuously receive live customer interactions, retrain in real time, and evolve dynamically.

Agentic AI Requires an Event-Driven Architecture

Agentic AI must act in real time and integrate operational and analytical information. Whether it’s an AI-driven fraud detection system, an autonomous network optimization agent, or a customer service chatbot, acting on outdated information is not an option.

The Event-Driven Approach

An Event-Driven Architecture (EDA) enables continuous processing of real-time data streams, ensuring that AI agents always have the latest information available. By decoupling applications and processing events asynchronously, EDA allows AI to respond dynamically to changes in the environment without being constrained by rigid workflows.

AI can also be seamlessly integrated into existing business processes leveraging an EDA, bridging modern and legacy technologies without requiring a complete system overhaul. Not every data source may be real-time, but EDA ensures data consistency across all consumers—if an application processes data, it sees exactly what every other application sees. This guarantees synchronized decision-making, even in hybrid environments combining historical data with real-time event streams.

Why Apache Kafka is Essential for Agentic AI

For AI to be truly autonomous and effective, it must operate in real time, adapt to changing conditions, and ensure consistency across all applications. An Event-Driven Architecture (EDA) built with Apache Kafka provides the foundation for this by enabling:

Immediate Responsiveness → AI agents receive and act on events as they occur.
High Scalability → Components are decoupled and can scale independently.
Fault Tolerance → AI processes continue running even if some services fail.
Improved Data Consistency → Ensures AI agents are working with accurate, real-time data.

Building Agentic AI with Apache Kafka and Flink

To build truly autonomous AI systems, organizations need a real-time data infrastructure that can process, analyze, and act on events as they happen.

Source: Sean Falconer

Apache Kafka: The Real-Time Data Streaming Backbone

Apache Kafka provides a scalable, event-driven messaging infrastructure that ensures AI agents receive a constant, real-time stream of events. By acting as a central nervous system, Kafka enables:

Decoupled AI components that communicate through event streams.
Efficient data ingestion from multiple sources (IoT devices, applications, databases).
Guaranteed event delivery with fault tolerance and durability.
High-throughput processing to support real-time AI workloads.

Apache Flink: The Real-Time Stream Processing Engine

Apache Flink complements Kafka by providing stateful stream processing for AI-driven workflows. With Flink, AI agents can:

Analyze real-time data streams for anomaly detection, predictions, and decision-making.
Perform complex event processing to detect patterns and trigger automated responses.
Continuously learn and adapt based on evolving real-time data.
Orchestrate multi-agent workflows dynamically.

Use Cases of Agentic AI with Kafka and Flink Across Industries

Across industries, Agentic AI is redefining how businesses and governments operate. By leveraging event-driven architectures and real-time data streaming, organizations can unlock the full potential of AI-driven automation, improving efficiency, reducing costs, and delivering better experiences.

Here are key use cases across different industries:

Financial Services: Real-Time Fraud Detection and Risk Management

Traditional fraud detection systems rely on batch processing, leading to delayed responses and financial losses.

Agentic AI enables real-time transaction monitoring, detecting anomalies as they occur and blocking fraudulent activities instantly.

AI agents continuously learn from evolving fraud patterns, reducing false positives and improving security. In risk management, AI analyzes market trends, adjusts investment strategies, and automates compliance processes to ensure financial institutions stay ahead of threats and regulatory requirements.

Telecommunications: Autonomous Network Optimization

Telecom networks require constant tuning to maintain service quality, but traditional network management is reactive and expensive.

Agentic AI can proactively monitor network traffic, predict congestion, and automatically reconfigure network resources in real time. AI-powered agents optimize bandwidth allocation, detect outages before they impact customers, and enable self-healing networks, reducing operational costs and improving service reliability.

Retail: AI-Powered Personalization and Dynamic Pricing

Retailers struggle with static recommendation engines that fail to capture real-time customer intent.

Agentic AI analyzes customer interactions, adjusts recommendations dynamically, and personalizes promotions based on live purchasing behavior. AI-driven pricing strategies adapt to supply chain fluctuations, competitor pricing, and demand changes in real time, maximizing revenue while maintaining customer satisfaction.

AI agents also enhance logistics by optimizing inventory management and reducing stock shortages.

Healthcare: Real-Time Patient Monitoring and Predictive Care

Hospitals and healthcare providers require real-time insights to deliver proactive care, but batch processing delays critical decisions.

Agentic AI continuously streams patient vitals from medical devices to detect early signs of deterioration and triggering instant alerts to medical staff. AI-driven predictive analytics optimize hospital resource allocation, improve diagnosis accuracy, and enable remote patient monitoring, reducing emergency incidents and improving patient outcomes.

Gaming: Dynamic Content Generation and Adaptive AI Opponents

Modern games need to provide immersive, evolving experiences, but static game mechanics limit engagement.

Agentic AI enables real-time adaptation of gameplay to generate dynamic environments and personalizing challenges based on a player’s behavior. AI-driven opponents can learn and adapt to individual playstyles, keeping games engaging over time. AI agents also manage server performance, detect cheating, and optimize in-game economies for a better gaming experience.

Manufacturing & Automotive: Smart Factories and Autonomous Systems

Manufacturing relies on precision and efficiency, yet traditional production lines struggle with downtime and defects.

Agentic AI monitors production processes in real time to detect quality issues early and adjusting machine parameters autonomously. This directly improves Overall Equipment Effectiveness (OEE) by reducing downtime, minimizing defects, and optimizing machine performance to ensure higher productivity and operational efficiency to ensure higher productivity and operational efficiency.

In automotive, AI-driven agents analyze real-time sensor data from self-driving cars to make instant navigation decisions, predict maintenance needs, and optimize fleet operations for logistics companies.

Public Sector: AI-Powered Smart Cities and Citizen Services

Governments face challenges in managing infrastructure, public safety, and citizen services efficiently.

Agentic AI can optimize traffic flow by analyzing real-time data from sensors and adjusting signals dynamically. AI-powered public safety systems detect threats from surveillance data and dispatch emergency services instantly. AI-driven chatbots handle citizen inquiries, automate document processing, and improve response times for government services.

The Business Value of Real-Time AI using Autonomous Agents

By leveraging Kafka and Flink in an event-driven AI architecture, organizations can achieve:

Better Decision-Making → AI operates on fresh, accurate data.
Faster Time-to-Action → AI agents respond to events immediately.
Reduced Costs → Less reliance on expensive batch processing and manual intervention by humans.
Greater Scalability → AI systems can handle massive workloads in real time.
Vendor Independence → Kafka and Flink support open standards and hybrid/multi-cloud deployments, preventing vendor lock-in.

Why LangChain, LlamaIndex, and Similar Frameworks Are Not Enough for Agentic AI in Production

Frameworks like LangChain, LlamaIndex, and others have gained popularity for making it easy to prototype AI agents by chaining prompts, tools, and external APIs. They provide useful abstractions for reasoning steps, retrieval-augmented generation (RAG), and basic tool use—ideal for experimentation and lightweight applications.

However, when building agentic AI for operational, business-critical environments, these frameworks fall short on several fronts:

Many frameworks like LangChain are inherently synchronous and follows a request-response model, which limits its ability to handle real-time, event-driven inputs at scale. In contrast, LlamaIndex takes an event-driven approach, using a message broker—including support for Apache Kafka—for inter-agent communication.
Debugging, observability, and reproducibility are weak—there’s often no persistent, structured record of agent decisions or tool interactions.
State is ephemeral and in-memory, making long-running tasks, retries, or rollback logic difficult to implement reliably.
Most Agentic AI frameworks lack support for distributed, fault-tolerant execution and scalable orchestration, which are essential for production systems.

That said, these frameworks like LangChain and Llamaindex can still play a valuable, complementary role when integrated into an event-driven architecture. For example, an agent might use LangChain for planning or decision logic within a single task, while Apache Kafka and Apache Flink handle the real-time flow of events, coordination between agents, persistence, and system-level guarantees.

LangChain and similar toolkits help define how an agent thinks. But to run that thinking at scale, in real time, and with full traceability, you need a robust data streaming foundation. That’s where Kafka and Flink come in.

Model Context Protocol (MCP) and Agent-to-Agent (A2A) for Scalable, Composable Agentic AI Architectures

Model Context Protocol (MCP) is one of the hottest topics in AI right now. Coined by Anthropic, with early support emerging from OpenAI, Google, and other leading AI infrastructure providers, MCP is rapidly becoming a foundational layer for managing context in agentic systems. MCP enables systems to define, manage, and exchange structured context windows—making AI interactions consistent, portable, and state-aware across tools, sessions, and environments.

Google’s recently announced Agent-to-Agent (A2A) protocol adds further momentum to this movement, setting the groundwork for standardized interaction across autonomous agents. These advancements signal a new era of AI interoperability and composability.

Together with Kafka and Flink, MCP and protocols like A2A help bridge the gap between stateless LLM calls and stateful, event-driven agent architectures. Naturally, event-driven architecture is the perfect foundation for all this. The key now is to build enough product functionality and keep pushing the boundaries of innovation.

A dedicated blog post is coming soon to explore how MCP and A2A connect data streaming and request-response APIs in modern AI systems.

The Future of Agentic AI is Event-Driven with Data Streaming using Kafka and Flink

Agentic AI is poised to revolutionize industries by enabling fully autonomous, goal-driven AI systems that perceive, decide, and act continuously. But to function reliably in dynamic, production-grade environments, these agents require real-time, event-driven architectures—not outdated, batch-oriented pipelines.

Apache Kafka and Apache Flink form the foundation of this shift. Kafka ensures agents receive reliable, ordered event streams, while Flink provides stateful, low-latency stream processing for real-time reactions and long-lived context management. This architecture enables AI agents to process structured events as they happen, react to changes in the environment, and coordinate with other services or agents through durable, replayable data flows.

If your organization is serious about AI, the path forward is clear:

Move from batch to real-time, from passive analytics to autonomous action, and from isolated prompts to event-driven, context-aware agents—enabled by Kafka and Flink.

As a next step, learn more about “Online Model Training and Model Drift in Machine Learning with Apache Kafka and Flink“.

Let’s connect on LinkedIn and discuss how to implement these ideas in your organization. Stay informed about new developments by subscribing to my newsletter. And make sure to download my free book about data streaming use cases.

The post How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time appeared first on Kai Waehner.