Middleware Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/middleware/

Modernizing OT Middleware: The Shift to Open Industrial IoT Architectures with Data Streaming
https://www.kai-waehner.de/blog/2025/03/17/modernizing-ot-middleware-the-shift-to-open-industrial-iot-architectures-with-data-streaming/
Mon, 17 Mar 2025

Legacy OT middleware is struggling to keep up with real-time, scalable, and cloud-native demands. As industries shift toward event-driven architectures, companies are replacing vendor-locked, polling-based systems with Apache Kafka, MQTT, and OPC-UA for seamless OT-IT integration. Kafka serves as the central event backbone, MQTT enables lightweight device communication, and OPC-UA ensures secure industrial data exchange. This approach enhances real-time processing, predictive analytics, and AI-driven automation, reducing costs and unlocking scalable, future-proof architectures.

Operational Technology (OT) has traditionally relied on legacy middleware to connect industrial systems, manage data flows, and integrate with enterprise IT. However, these monolithic, proprietary, and expensive middleware solutions struggle to keep up with real-time, scalable, and cloud-native architectures.

Just as mainframe offloading modernized enterprise IT, offloading and replacing legacy OT middleware is the next wave of digital transformation. Companies are shifting from vendor-locked, heavyweight OT middleware to real-time, event-driven architectures using Apache Kafka and Apache Flink—enabling cost efficiency, agility, and seamless edge-to-cloud integration.

This blog explores why and how organizations are replacing traditional OT middleware with data streaming, the benefits of this shift, and architectural patterns for hybrid and edge deployments.

Replacing OT Middleware with Data Streaming using Kafka and Flink for Cloud-Native Industrial IoT with MQTT and OPC-UA

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including architectures and customer stories for hybrid IT/OT integration scenarios.

Why Replace Legacy OT Middleware?

Industrial environments have long relied on OT middleware like OSIsoft PI, proprietary SCADA systems, and industry-specific data buses. These solutions were designed for polling-based communication, siloed data storage, and batch integration. But today’s real-time, AI-driven, and cloud-native use cases demand more.

Challenges: Proprietary, Monolithic, Expensive

  • High Costs – Licensing, maintenance, and scaling expenses grow exponentially.
  • Proprietary & Rigid – Vendor lock-in restricts flexibility and data sharing.
  • Batch & Polling-Based – Limited ability to process and act on real-time events.
  • Complex Integration – Difficult to connect with cloud and modern IT systems.
  • Limited Scalability – Not built for the massive data volumes of IoT and edge computing.

Just as PLCs are transitioning to virtual PLCs, eliminating hardware constraints and enabling software-defined industrial control, OT middleware is undergoing a similar shift. Moving from monolithic, proprietary middleware to event-driven, streaming architectures with Kafka and Flink allows organizations to scale dynamically, integrate seamlessly with IT, and process industrial data in real time—without vendor lock-in or infrastructure bottlenecks.

Data streaming is NOT a direct replacement for OT middleware, but it serves as the foundation for modernizing industrial data architectures. With Kafka and Flink, enterprises can offload or replace OT middleware to achieve real-time processing, edge-to-cloud integration, and open interoperability.

Event-driven Architecture with Data Streaming using Kafka and Flink in Industrial IoT and Manufacturing

While Kafka and Flink provide real-time, scalable, and event-driven capabilities, last-mile integration with PLCs, sensors, and industrial equipment still requires OT-specific SDKs, open interfaces, or lightweight middleware. This includes support for MQTT, OPC UA or open-source solutions like Apache PLC4X to ensure seamless connectivity with OT systems.
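As a rough illustration of such last-mile connectivity, here is a minimal sketch using Apache PLC4X to poll a single value from a PLC. The driver URL, field address, and field name are hypothetical, and the PLC4X request-builder API has shifted slightly between versions, so treat this as a conceptual sketch rather than a drop-in implementation.

```java
import org.apache.plc4x.java.PlcDriverManager;
import org.apache.plc4x.java.api.PlcConnection;
import org.apache.plc4x.java.api.messages.PlcReadRequest;
import org.apache.plc4x.java.api.messages.PlcReadResponse;

public class Plc4xReadSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical Siemens S7 connection string; PLC4X also supports Modbus, OPC UA, and other drivers.
        try (PlcConnection plc = new PlcDriverManager().getConnection("s7://192.168.0.10")) {

            // Read one (hypothetical) field address from the PLC.
            PlcReadRequest request = plc.readRequestBuilder()
                    .addItem("motor-current", "%DB1.DBW100:INT")
                    .build();

            PlcReadResponse response = request.execute().get();

            // The value could now be normalized and handed to a Kafka producer (see the sketch further below).
            System.out.println("motor-current = " + response.getObject("motor-current"));
        }
    }
}
```

In practice, such polling loops, schema normalization, and buffering are exactly what lightweight OT middleware or connectors handle so that Kafka receives clean, well-structured events.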

Apache Kafka: The Backbone of Real-Time OT Data Streaming

Kafka acts as the central nervous system for industrial data to ensure low-latency, scalable, and fault-tolerant event streaming between OT and IT systems.

  • Aggregates and normalizes OT data from sensors, PLCs, SCADA, and edge devices.
  • Bridges OT and IT by integrating with ERP, MES, cloud analytics, and AI/ML platforms.
  • Operates seamlessly in hybrid, multi-cloud, and edge environments, ensuring real-time data flow.
  • Works with open OT standards like MQTT and OPC UA, reducing reliance on proprietary middleware solutions.

And just to be clear: Apache Kafka and similar technologies support “IT real-time” (meaning milliseconds of latency and sometimes latency spikes). This is NOT about hard real-time in the OT world for embedded systems or safety-critical applications.
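To make the role of Kafka as the ingestion backbone more concrete, here is a minimal, hedged producer sketch that publishes a normalized sensor reading as JSON. The broker address, topic name, and payload structure are hypothetical placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SensorTelemetryProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by machine ID so all events of one machine land in the same partition (per-machine ordering).
            String key = "press-01";
            String value = "{\"machine\":\"press-01\",\"temperature\":81.5,\"timestamp\":1742212800000}";
            producer.send(new ProducerRecord<>("plant1.sensor.telemetry", key, value));
            producer.flush();
        }
    }
}
```

Keying by machine ID is a common choice because it preserves per-machine ordering while still scaling across partitions; in real deployments, a schema registry with Avro or Protobuf would typically replace the raw JSON string.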

Flink powers real-time analytics, complex event processing, and anomaly detection for streaming industrial data.

Condition Monitoring and Predictive Maintenance with Data Streaming using Apache Kafka and Flink

By leveraging Kafka and Flink, enterprises can process OT and IT data only once, ensuring a real-time, unified data architecture that eliminates redundant processing across separate systems. This approach enhances operational efficiency, reduces costs, and accelerates digital transformation while still integrating seamlessly with existing industrial protocols and interfaces.
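The following hedged Flink DataStream sketch illustrates the “process once” idea: it consumes the raw telemetry topic, filters readings above a threshold, and writes the result to a curated alert topic that both OT dashboards and IT analytics can consume. Topic names, the threshold, and the naive JSON parsing are illustrative assumptions only.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TemperatureAlertJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")            // hypothetical broker
                .setTopics("plant1.sensor.telemetry")             // raw OT telemetry
                .setGroupId("temperature-alerts")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("plant1.sensor.alerts")         // curated stream for OT and IT consumers
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        DataStream<String> telemetry =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "sensor-telemetry");

        // Naive threshold check on the raw JSON string; a real job would deserialize into typed events.
        telemetry.filter(json -> json.contains("\"temperature\"") && extractTemperature(json) > 90.0)
                 .sinkTo(sink);

        env.execute("temperature-alerts");
    }

    private static double extractTemperature(String json) {
        // Simplified parsing for the sketch; use a proper JSON library in practice.
        String marker = "\"temperature\":";
        int start = json.indexOf(marker) + marker.length();
        int end = json.indexOf(',', start);
        return Double.parseDouble(json.substring(start, end == -1 ? json.indexOf('}', start) : end));
    }
}
```

Because the alert topic is just another Kafka topic, any number of OT and IT consumers can subscribe to it independently without the data being processed a second time.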

Unifying Operational (OT) and Analytical (IT) Workloads

As industries modernize, a shift-left architecture approach ensures that operational data is not just consumed for real-time operational OT workloads but is also made available for transactional and analytical IT use cases—without unnecessary duplication or transformation overhead.

The Shift-Left Architecture: Bringing Advanced Analytics Closer to Industrial IoT

In traditional architectures, OT data is first collected, processed, and stored in proprietary or siloed middleware systems before being moved later to IT systems for analysis. This delayed, multi-step process leads to inefficiencies, including:

  • High latency between data collection and actionable insights.
  • Redundant data storage and transformations, increasing complexity and cost.
  • Disjointed AI/ML pipelines, where models are trained on outdated, pre-processed data rather than real-time information.

A shift-left approach eliminates these inefficiencies by bringing analytics, AI/ML, and data science closer to the raw, real-time data streams from the OT environments.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Instead of waiting for batch pipelines to extract and move data for analysis, a modern architecture integrates real-time streaming with open table formats to ensure immediate usability across both operational and analytical workloads.

Open Table Format with Apache Iceberg / Delta Lake for Unified Workloads and Single Storage Layer

By integrating open table formats like Apache Iceberg and Delta Lake, organizations can:

  • Unify operational and analytical workloads to enable both real-time data streaming and batch analytics in a single architecture.
  • Eliminate data silos, ensuring that OT and IT teams access the same high-quality, time-series data without duplication.
  • Ensure schema evolution and ACID transactions to enable robust and flexible long-term data storage and retrieval.
  • Enable real-time and historical analytics, allowing engineers, business users, and AI/ML models to query both fresh and historical data efficiently.
  • Reduce the need for complex ETL pipelines, as data is written once and made available for multiple workloads simultaneously. There is also no need for the anti-pattern of Reverse ETL.

The Result: An Open, Cloud-Native, Future-Proof Data Historian for Industrial IoT

This open, hybrid OT/IT architecture allows organizations to maintain real-time industrial automation and monitoring with Kafka and Flink, while ensuring structured, queryable, and analytics-ready data with Iceberg or Delta Lake. The shift-left approach ensures that data streams remain useful beyond their initial OT function, powering AI-driven automation, predictive maintenance, and business intelligence in near real-time rather than relying on outdated and inconsistent batch processes.

Open and Cloud Native Data Historian in Industrial IoT and Manufacturing with Data Streaming using Apache Kafka and Flink

By adopting this unified, streaming-first architecture to build an open and cloud-native data historian, organizations can:

  • Process data once and make it available for both real-time decisions and long-term analytics.
  • Reduce costs and complexity by eliminating unnecessary data duplication and movement.
  • Improve AI/ML effectiveness by feeding models with real-time, high-fidelity OT data.
  • Ensure compliance and historical traceability without compromising real-time performance.

This approach future-proofs industrial data infrastructures, allowing enterprises to seamlessly integrate IT and OT, while supporting cloud, edge, and hybrid environments for maximum scalability and resilience.

Key Benefits of Offloading OT Middleware to Data Streaming

  • Lower Costs – Reduce licensing fees and maintenance overhead.
  • Real-Time Insights – No more waiting for batch updates; analyze events as they happen.
  • One Unified Data Pipeline – Process data once and make it available for both OT and IT use cases.
  • Edge and Hybrid Cloud Flexibility – Run analytics at the edge, on-premise, or in the cloud.
  • Open Standards & Interoperability – Support MQTT, OPC UA, REST/HTTP, Kafka, and Flink, avoiding vendor lock-in.
  • Scalability & Reliability – Handle massive sensor and machine data streams continuously without performance degradation.

A Step-by-Step Approach: Offloading vs. Replacing OT Middleware with Data Streaming

Companies transitioning from legacy OT middleware have several strategies for leveraging data streaming as an integration and migration platform:

  1. Hybrid Data Processing
  2. Lift-and-Shift
  3. Full OT Middleware Replacement

1. Hybrid Data Streaming: Process Once for OT and IT

Why?

Traditional OT architectures often duplicate data processing across multiple siloed systems, leading to higher costs, slower insights, and operational inefficiencies. Many enterprises still process data inside expensive legacy OT middleware, only to extract and reprocess it again for IT, analytics, and cloud applications.

A hybrid approach using Kafka and Flink enables organizations to offload processing from legacy middleware while ensuring real-time, scalable, and cost-efficient data streaming across OT, IT, cloud, and edge environments.

Offloading from OT Middleware like OSIsoft PI to Data Streaming with Kafka and Flink

How?

Connect to the existing OT middleware via:

  • A Kafka Connector (if available).
  • HTTP APIs, OPC UA, or MQTT for data extraction.
  • Custom integrations for proprietary OT protocols.
  • Lightweight edge processing to pre-filter data before ingestion.

Use Kafka for real-time ingestion, ensuring all OT data is available in a scalable, event-driven pipeline.

Process data once with Flink to:

  • Apply real-time transformations, aggregations, and filtering at scale.
  • Perform predictive analytics and anomaly detection before storing or forwarding data.
  • Enrich OT data with IT context (e.g., adding metadata from ERP or MES).

Distribute processed data to the right destinations, such as:

  • Time-series databases for historical analysis and monitoring.
  • Enterprise IT systems (ERP, MES, CMMS, BI tools) for decision-making.
  • Cloud analytics and AI platforms for advanced insights.
  • Edge and on-prem applications that need real-time operational intelligence.

Result?

  • Eliminate redundant processing across OT and IT, reducing costs.
  • Real-time data availability for analytics, automation, and AI-driven decision-making.
  • Unified, event-driven architecture that integrates seamlessly with on-premise, edge, hybrid, and cloud environments.
  • Flexibility to migrate OT workloads over time, without disrupting current operations.

By offloading costly data processing from legacy OT middleware, enterprises can modernize their industrial data infrastructure while maintaining interoperability, efficiency, and scalability.

2. Lift-and-Shift: Reduce Costs While Keeping Existing OT Integrations

Why?

Many enterprises rely on legacy OT middleware like OSIsoft PI, proprietary SCADA systems, or industry-specific data hubs for storing and processing industrial data. However, these solutions come with high licensing costs, limited scalability, and an inflexible architecture.

A lift-and-shift approach provides an immediate cost reduction by offloading data ingestion and storage to Apache Kafka while keeping existing integrations intact. This allows organizations to modernize their infrastructure without disrupting current operations.
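As a minimal sketch of what “keeping existing integrations intact” can look like on the consuming side, an existing IT application can subscribe to the offloaded Kafka topic instead of polling the legacy middleware. Broker, topic, and group names are hypothetical.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ExistingItIntegrationConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // hypothetical broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mes-integration");           // one group per downstream system
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("plant1.sensor.telemetry"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Hand the event to the existing IT system (e.g., write it into the MES database)
                    // instead of pulling it from the legacy OT middleware.
                    System.out.printf("machine=%s payload=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```

Because each downstream system uses its own consumer group, ERP, MES, and analytics consumers read the same data independently, which is how the redundant processing of the legacy setup is avoided.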

How?

Use the Strangler Fig design pattern as a gradual modernization approach where new systems incrementally replace legacy components, reducing risk and ensuring a seamless transition:

Strangler Fig Pattern to Integrate, Migrate, Replace

“The most important reason to consider a strangler fig application over a cut-over rewrite is reduced risk.” (Martin Fowler)

Replace expensive OT middleware for ingestion and storage:

  • Deploy Kafka as a scalable, real-time event backbone to collect and distribute data.
  • Offload sensor, PLC, and SCADA data from OSIsoft PI, legacy brokers, or proprietary middleware.
  • Maintain the connectivity with existing OT applications to prevent workflow disruption.

Streamline OT data processing:

  • Store and distribute data in Kafka instead of proprietary, high-cost middleware storage.
  • Leverage schema-based data governance to ensure compatibility across IT and OT systems.
  • Reduce data duplication by ingesting once and distributing to all required systems.

Maintain existing IT and analytics integrations:

  • Keep connections to ERP, MES, and BI platforms via Kafka connectors.
  • Continue using existing dashboards and reports while transitioning to modern analytics platforms.
  • Avoid vendor lock-in and enable future migration to cloud or hybrid solutions.

Result?

  • Immediate cost savings by reducing reliance on expensive middleware storage and licensing fees.
  • No disruption to existing workflows, ensuring continued operational efficiency.
  • Scalable, future-ready architecture with the flexibility to expand to edge, cloud, or hybrid environments over time.
  • Real-time data streaming capabilities, paving the way for predictive analytics, AI-driven automation, and IoT-driven optimizations.

A lift-and-shift approach serves as a stepping stone toward full OT modernization, allowing enterprises to gradually transition to a fully event-driven, real-time architecture.

3. Full OT Middleware Replacement: Cloud-Native, Scalable, and Future-Proof

Why?

Legacy OT middleware systems were designed for on-premise, batch-based, and proprietary environments, making them expensive, inflexible, and difficult to scale. As industries embrace cloud-native architectures, edge computing, and real-time analytics, replacing traditional OT middleware with event-driven streaming platforms enables greater flexibility, cost efficiency, and real-time operational intelligence.

A full OT middleware replacement eliminates vendor lock-in, outdated integration methods, and high-maintenance costs while enabling scalable, event-driven data processing that works across edge, on-premise, and cloud environments.

How?

Use Kafka and Flink as the Core Data Streaming Platform

  • Kafka replaces legacy data brokers and middleware storage by handling high-throughput event ingestion and real-time data distribution.
  • Flink provides advanced real-time analytics, anomaly detection, and predictive maintenance capabilities.
  • Process OT and IT data in real-time, eliminating batch-based limitations.

Replace Proprietary Connectors with Lightweight, Open Standards

  • Deploy MQTT or OPC UA gateways to enable seamless communication with sensors, PLCs, SCADA, and industrial controllers.
  • Eliminate complex, costly middleware like OSIsoft PI with low-latency, open-source integration.
  • Leverage Apache PLC4X for industrial protocol connectivity, avoiding proprietary vendor constraints.

Adopt a Cloud-Native, Hybrid, or On-Premise Storage Strategy

  • Store time-series data in scalable, purpose-built databases like InfluxDB or TimescaleDB.
  • Enable real-time query capabilities for monitoring, analytics, and AI-driven automation.
  • Ensure data availability across on-premise infrastructure, hybrid cloud, and multi-cloud deployments.

Journey from Legacy OT Middleware to Hybrid Cloud

Modernize IT and Business Integrations

  • Enable seamless OT-to-IT integration with ERP, MES, BI, and AI/ML platforms.
  • Stream data directly into cloud-based analytics services, digital twins, and AI models.
  • Build real-time dashboards and event-driven applications for operators, engineers, and business stakeholders.

OT Middleware Integration, Offloading and Replacement with Data Streaming for IoT and IT/OT

Result?

  • Fully event-driven and cloud-native OT architecture that eliminates legacy bottlenecks.
  • Real-time data streaming and processing across all industrial environments.
  • Scalability for high-throughput workloads, supporting edge, hybrid, and multi-cloud use cases.
  • Lower operational costs and reduced maintenance overhead by replacing proprietary, heavyweight OT middleware.
  • Future-ready, open, and extensible architecture built on Kafka, Flink, and industry-standard protocols.

By fully replacing OT middleware, organizations gain real-time visibility, predictive analytics, and scalable industrial automation, unlocking new business value while ensuring seamless IT/OT integration.

Helin is an excellent example of a cloud-native IT/OT data solution powered by Kafka and Flink, focused on real-time data integration and analytics, particularly in the context of industrial and operational environments. Its industry focus is on the maritime and energy sectors, but the approach is relevant across all IIoT industries.

Why This Matters: The Future of OT is Real-Time & Open for Data Sharing

The next generation of OT architectures is being built on open standards, real-time streaming, and hybrid cloud.

  • Most new industrial sensors, machines, and control systems are now designed with Kafka, MQTT, and OPC UA compatibility.
  • Modern IT architectures demand event-driven data pipelines for AI, analytics, and automation.
  • Edge and hybrid computing require scalable, fault-tolerant, real-time processing.

Industrial IoT Data Streaming Everywhere Edge Hybrid Cloud with Apache Kafka and Flink

Use Kafka Cluster Linking for seamless bi-directional data replication and command & control, ensuring low-latency, high-availability data synchronization across on-premise, edge, and cloud environments.

Enable multi-region and hybrid edge-to-cloud architectures with real-time data mirroring, allowing organizations to maintain data consistency across global deployments while ensuring business continuity and failover capabilities.

It’s Time to Move Beyond Legacy OT Middleware to Open Standards like MQTT, OPC-UA, Kafka

The days of expensive, proprietary, and rigid OT middleware are numbered (at least for new deployments). Industrial enterprises need real-time, scalable, and open architectures to meet the growing demands of automation, predictive maintenance, and industrial IoT. By embracing open IoT and data streaming technologies, companies can seamlessly bridge the gap between Operational Technology (OT) and IT, ensuring efficient, event-driven communication across industrial systems.

MQTT, OPC-UA, and Apache Kafka are a match made in heaven for industrial IoT (a minimal bridge sketch follows the list below):

  • MQTT enables lightweight, publish-subscribe messaging for industrial sensors and edge devices.
  • OPC-UA provides secure, interoperable communication between industrial control systems and modern applications.
  • Kafka acts as the high-performance event backbone, allowing data from OT systems to be streamed, processed, and analyzed in real time.
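To show how lightweight this combination can be, here is a hedged sketch of a minimal MQTT-to-Kafka bridge using the Eclipse Paho client together with the Kafka producer API. In production, a Kafka Connect MQTT connector or an MQTT broker with a native Kafka bridge is usually the better choice; the broker addresses and topic names below are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class MqttToKafkaBridge {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical Kafka broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Hypothetical MQTT broker at the edge; sensors publish to plant1/<machine>/temperature.
        MqttClient mqtt = new MqttClient("tcp://edge-broker.local:1883",
                MqttClient.generateClientId(), new MemoryPersistence());
        mqtt.connect();

        // Forward every MQTT message into a Kafka topic, keyed by the original MQTT topic.
        mqtt.subscribe("plant1/+/temperature", 1, (topic, message) -> {
            String payload = new String(message.getPayload(), StandardCharsets.UTF_8);
            producer.send(new ProducerRecord<>("plant1.temperature", topic, payload));
        });
    }
}
```

Keying the Kafka record by the MQTT topic keeps per-device ordering intact and makes it easy to route or fan out the data again on the IT side.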

Whether lifting and shifting, optimizing hybrid processing, or fully replacing legacy middleware, data streaming is the foundation for the next generation of OT and IT integration. With Kafka at the core, enterprises can decouple systems, enhance scalability, and unlock real-time analytics across the entire industrial landscape.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And make sure to download my free book about data streaming use cases and industry success stories.

Industrial IoT Middleware for Edge and Cloud OT/IT Bridge powered by Apache Kafka and Flink
https://www.kai-waehner.de/blog/2024/09/20/industrial-iot-middleware-for-edge-and-cloud-ot-it-bridge-powered-by-apache-kafka-and-flink/
Fri, 20 Sep 2024

As industries continue to adopt digital transformation, the convergence of Operational Technology (OT) and Information Technology (IT) has become essential. The OT/IT Bridge is a key concept in industrial automation to connect real-time operational processes with business-oriented IT systems ensuring seamless data flow and coordination. This integration plays a critical role in the Industrial Internet of Things (IIoT). It enables industries to monitor, control, and optimize their operations through real-time data synchronization and improve the Overall Equipment Effectiveness (OEE). By leveraging IIoT middleware and data streaming technologies like Apache Kafka and Flink, businesses can achieve a unified approach to managing both production processes and higher-level business operations to drive greater efficiency, predictive maintenance, and streamlined decision-making.

Industrial IoT Middleware OT IT Bridge between Edge and Cloud with Apache Kafka and Flink

Industrial Automation – The OT/IT Bridge

An OT/IT Bridge in industrial automation refers to the integration between Operational Technology (OT) systems, which manage real-time industrial processes, and Information Technology (IT) systems, which handle data, business operations, and analytics. This bridge is crucial for modern Industrial IoT (IIoT) environments, as it enables seamless data flow between machines, sensors, and industrial control systems (PLC, SCADA) on the OT side, and business management applications (ERP, MES) on the IT side.

The OT/IT Bridge facilitates real-time data synchronization. It allows industries to monitor and control their operations more efficiently, implement condition monitoring/predictive maintenance, and perform advanced analytics. The OT/IT bridge helps overcome the traditional siloing of OT and IT systems by integrating real-time data from production environments with business decision-making tools. Data Streaming frameworks like Kafka and Flink, often combined with specialized platforms for the last-mile IoT integration, act as intermediaries to ensure data consistency, interoperability, and secure communication across both domains.

This bridge enhances overall productivity and improves the OEE by providing actionable insights that help optimize performance and reduce downtime across industrial processes.

OT/IT Hierarchy – Different Layers based on ISA-95 and the Purdue Model

The OT/IT Levels 0-5 framework is commonly used to describe the different layers in industrial automation and control systems, often following the ISA-95 or Purdue model:

  • Level 0: Physical Process: This is the most basic level, consisting of the physical machinery, equipment, sensors, actuators, and production processes. It represents the actual processes being monitored or controlled in a factory or industrial environment.
  • Level 1: Sensing and Actuation: At this level, sensors, actuators, and field devices gather data from the physical processes. This includes things like temperature sensors, pressure gauges, motors, and valves that interact directly with the equipment at Level 0.
  • Level 2: Control Systems: Level 2 includes real-time control systems such as Programmable Logic Controllers (PLCs) and Distributed Control Systems (DCS). These systems interpret the data from Level 1 and make real-time decisions to control the physical processes.
  • Level 3: Manufacturing Operations Management (MOM): This level manages and monitors production workflows. It includes systems like Manufacturing Execution Systems (MES), which ensure that production runs smoothly and aligns with the business’s operational goals. It bridges the gap between the physical operations and higher-level business planning.
  • Level 4: Business Planning and Logistics: This is the IT layer that includes systems for business management, enterprise resource planning (ERP), and supply chain management (SCM). These systems handle business logistics such as order processing, materials procurement, and long-term planning.
  • Level 5: Enterprise Integration: This level encompasses corporate-wide IT functions such as financial systems, HR, sales, and overall business strategy. It ensures the alignment of all operations with the broader business goals.

In summary, Levels 0-2 focus on the OT (Operational Technology) side—real-time control and monitoring of industrial processes, while Levels 3-5 focus on the IT (Information Technology) side—managing data, logistics, and business operations.

While the modern, cloud-native IIoT world is not strictly hierarchical anymore (e.g., there is also a lot of edge computing, such as sensor analytics), these layers are still often used to separate functions and responsibilities. Industrial IoT data platforms, including the data streaming platform, often connect to several of these layers in a decoupled hub-and-spoke architecture.

Industrial IoT Middleware

Industrial IoT (IIoT) Middleware is a specialized software infrastructure designed to manage and facilitate the flow of data between connected industrial devices and enterprise systems. It acts as a mediator that connects various industrial assets, such as machines, sensors, and controllers, with IT applications and services such as MES or ERP, often in a cloud or on-premises environment.

This middleware provides a unified interface for managing the complexities of data integration, protocol translation, and device communication to enable seamless interoperability among heterogeneous systems. It often includes features like real-time data processing, event management, scalability to handle large volumes of data, and robust security mechanisms to protect sensitive industrial operations.

In essence, IIoT Middleware is critical for enabling the smart factory concept, where connected devices and systems can communicate effectively, allowing for automated decision-making, predictive maintenance, and optimized production processes in industrial settings.

By providing these services, IIoT Middleware enables industrial organizations to optimize operations, enhance Overall Equipment Effectiveness (OEE), and improve system efficiency through seamless integration and real-time data analytics.

Relevant Industries for IIoT Middleware

Industrial IoT Middleware is essential across various industries that rely on connected equipment, sensors or vehicles and data-driven processes to optimize operations. Some key industries where IIoT Middleware is particularly needed include:

  • Manufacturing: For smart factories, IIoT Middleware enables real-time monitoring of production lines, predictive maintenance, and automation of manufacturing processes. It supports Industry 4.0 initiatives by integrating machines, robotics, and enterprise systems.
  • Energy and Utilities: IIoT Middleware is used to manage data from smart grids, power plants, and renewable energy sources. It helps in optimizing energy distribution, monitoring infrastructure health, and improving operational efficiency.
  • Oil and Gas: In this industry, IIoT Middleware facilitates the remote monitoring of pipelines, drilling rigs, and refineries. It enables predictive maintenance, safety monitoring, and optimization of extraction and refining processes.
  • Transportation and Logistics: IIoT Middleware is critical for managing fleet operations, tracking shipments, and monitoring transportation infrastructure. It supports real-time data analysis for route optimization, fuel efficiency, and supply chain management.
  • Healthcare: In healthcare, IIoT Middleware connects medical devices, patient monitoring systems, and healthcare IT systems. It enables real-time monitoring of patient vitals, predictive diagnostics, and efficient management of medical equipment.
  • Agriculture: IIoT Middleware is used in precision agriculture to connect sensors, drones, and farm equipment. It helps in monitoring soil conditions, weather patterns, and crop health, leading to optimized farming practices and resource management.
  • Aerospace and Defense: IIoT Middleware supports the monitoring and maintenance of aircraft, drones, and defense systems. It ensures the reliability and safety of critical operations by integrating real-time data from various sources.
  • Automotive: In the automotive industry, IIoT Middleware connects smart vehicles, assembly lines, and supply chains. It enables connected car services, autonomous driving, and the optimization of manufacturing processes.
  • Building Management: For smart buildings and infrastructure, IIoT Middleware integrates systems like HVAC, lighting, and security. It enables real-time monitoring and control, energy efficiency, and enhanced occupant comfort.
  • Pharmaceuticals: In pharmaceuticals, IIoT Middleware helps monitor production processes, maintain regulatory compliance, and ensure the integrity of the supply chain.

These industries benefit from IIoT Middleware by gaining better visibility into their operations. The digitalization of shop floor and business processes improves decision-making and drives efficiency through automation and real-time data analysis.

Industrial IoT Middleware Layers in OT/IT

While modern, cloud-native IoT architectures don’t always use a hierarchical model anymore, Industrial IoT (IIoT) middleware typically operates at Level 3 (Manufacturing Operations Management) and Level 2 (Control Systems) in the OT/IT hierarchy.

At Level 3, IIoT middleware integrates data from control systems, sensors, and other devices, coordinates operations, and connects these systems to higher-level IT layers such as MES and ERP. At Level 2, the middleware handles real-time data exchange between industrial control systems (like PLCs) and IT infrastructure, ensuring data flow and interoperability between the OT and IT layers.

This middleware acts as a bridge between the operational technology (OT) at Levels 0-2 and the business-oriented IT systems at Levels 4-5.

Edge and Cloud Vendors for Industrial IoT

The industrial IoT space provides many solutions from various software vendors. Let’s explore the different options and their trade-offs.

Traditional “Legacy” Solutions

Traditional Industrial IoT (IIoT) solutions are often characterized by proprietary, monolithic architectures that can be inflexible and expensive to implement and maintain. These traditional platforms, offered by established industrial vendors like PTC ThingWorx, Siemens MindSphere, GE Predix, and OSIsoft PI, are typically designed to meet specific industry needs but may lack the scalability, flexibility, and cost-efficiency required for modern industrial applications. However, while these solutions are often called “legacy”, they do a solid job of integrating with proprietary PLCs, SCADA systems, and data historians. They still operate the shop floor in most factories worldwide.

Emerging Cloud Solutions

In contrast to legacy systems, emerging cloud-based IIoT solutions offer elastic, scalable, and (hopefully) cost-efficient alternatives that are fully managed by cloud service providers. These platforms, such as AWS IoT Core, enable industrial organizations to quickly deploy and scale IoT applications while benefiting from the cloud’s inherent flexibility, reduced operational overhead, and integration with other cloud services.

However, emerging cloud solutions for IIoT can face challenges:

  • Latency and real-time processing limitations, making them less suitable for time-sensitive industrial applications.
  • High network transfer cost from the edge to the cloud.
  • Security and compliance concerns arise when transferring sensitive operational data to the cloud, particularly in regulated industries.
  • Dependence on reliable internet connectivity, which can be a significant drawback in remote or unstable environments.
  • Very limited connectivity to proprietary (legacy) protocols such as Siemens S7 or Modbus.

The IIoT Enterprise Architecture is a Mix of Vendors and Platforms

There is no black-and-white comparison of the different solutions. The current IIoT landscape in real-world deployments features a mix of traditional industrial vendors and new cloud-native solutions. Companies like Schneider Electric (with EcoStruxure) still provide robust industrial platforms, while newer entrants like AWS IoT Core are gaining traction due to their modern, cloud-centric approaches. The shift towards cloud solutions reflects the growing demand for more agile and scalable IIoT infrastructures.

The reality in the industrial space is that:

  • OT/IT is usually hybrid edge to cloud, not just cloud
  • Most cloud-only solutions do not provide the right security, SLAs, latency, cost
  • IoT is a complex space. “Just” an OPC-UA or MQTT connector is not sufficient in most scenarios.

Data streaming with Apache Kafka and Flink is a powerful approach that enables the continuous flow and processing of real-time data across various systems. However, to be clear: Data streaming is NOT a silver bullet. It is complementary to other IoT middleware. And some modern, cloud-native industrial software is built on top of data streaming technologies like Kafka and Flink under the hood.

In the context of Industrial IoT, data streaming plays a crucial role by seamlessly integrating and processing data from numerous IoT devices, equipment, PLCs, MES and ERP in real-time. This capability enhances decision-making processes and operational efficiency by providing continuous insights, allowing industries to optimize their operations and respond proactively to changing conditions. The last-mile integration is usually done by complementary IIoT technologies providing sophisticated connectivity to OPC-UA, MQTT and proprietary legacy protocols like S7 or Modbus.

In data center and cloud settings, Kafka and Flink are used to provide continuous processing and data consistency across IT applications including sales and marketing, B2B communication with partners, and eCommerce. Data streaming facilitates data integration, processing and analytics to enhance the efficiency and responsiveness of IT operations and business; no matter if data sources or sinks are real-time, batch or request-response APIs.

Apache Kafka as an OT/IT Bridge

Kafka serves as a critical bridge between Operational Technology (OT) and Information Technology (IT) by enabling real-time data synchronization at scale. This integration ensures data consistency across different systems, supporting seamless communication and coordination between industrial operations and business systems.

At the edge of operational technology, Kafka and Flink provide a robust backbone for use cases such as condition monitoring and predictive maintenance. By processing data locally and in real-time, these technologies improve the Overall Equipment Effectiveness (OEE), and support advanced analytics and decision-making directly within industrial environments.

IoT Success Story: Industrial Edge Intelligence with Helin and Confluent

Helin is a company that specializes in providing advanced data solutions focused on real-time data integration and analytics, particularly in the context of industrial and operational environments. Its industry focus is on the maritime and energy sectors, but this is relevant across all IIoT industries.

Helin presented its Industrial Edge Intelligence Platform at Confluent’s Data in Motion Tour in Utrecht, Netherlands, in 2024. The IIoT platform includes capabilities for data streaming, processing, and visualization to help organizations leverage their data more effectively for decision-making and operational improvements.

Helin - Industrial IoT Edge Intelligence Platform
Source: Helin

Helin’s platform bridges the OT and IT worlds by seamlessly integrating industrial edge analytics with multi-tenant cloud solutions:

Helin - Edge to Cloud IIoT Architecture
Source: Helin

The above architecture diagram shows how Helin maps to the OT/IT hierarchy:

  • OT – Levels 0, 1, 2, 3
    • 1: Sensors, Actuators, Field Devices
    • 2: Remote I/O
    • 3: Controller
  • DMZ / Gateway – Level 3.5
  • BIZ (= IT) – Levels 4, 5
    • 4: OT Applications (MES, SCADA, etc.)
    • 5 (outside of Helin): IT Applications (ERP, CRM, DWH, etc.)

The strategy and value of Helin’s IoT platform is relevant for most industrial organizations: making dumb assets smart by extracting data in real time and utilizing AI to transform it into significant business value and actionable insights for the maritime and energy sectors.

Business Value: Fuel Reduction, Increased Revenue, Saving Human Lives

Helin presented three success stories with huge business value:

  • 8% fuel reduction: Helin’s platform reduced fuel consumption for Boskalis by 8% by delivering real-time insights to vessel operators offshore.
  • 20% revenue increase: Sunrock increased the revenue of its solar parks by 20% by optimizing their assets with the platform.
  • Saving human lives: Optimization of drilling operations while increasing the safety of the crew on oil rigs by reducing human errors.

Why does the Helin IoT Platform use Kafka? Helin brought up a few powerful arguments:

  • Flexibility towards the integration between the edge and the cloud
  • Different data streams at different velocities
    • Slow cold storage data
    • Real-time streams for analytics
    • Database endpoint for visualization
  • Multi-cloud with a standardized streaming protocol
    • Reduced code overhead by not having to build adapters
    • Open platform so that customers can land their data anywhere
    • Failover baked in

Helin’s Data Streaming Journey from Self-Managed Kafka to Serverless Confluent Cloud

Helin started with self-managed Kafka and cumbersome Python scripts…

Self-Managed Apache Kafka
Source: Helin

… and transitioned to fully managed Kafka in Confluent Cloud:

Fully Managed Apache Kafka and Flink Confluent Cloud
Source: Helin

As a next step, Helin is migrating from cumbersome and unreliable Python mappings to Apache Flink for scalable and reliable data processing.

Please note that the last-mile IoT connectivity at the edge (SCADA, PLC, etc.) is implemented with technologies like OPC-UA, MQTT or custom integrations. You can see a common best practice: Choose and combine the right tools for the job.

Data streaming plays a crucial role in bridging OT and IT in industrial automation. By enabling continuous data flow between the edge and the cloud, Kafka and Flink ensure that both operational data from sensors and machinery and IT applications like ERP and MES remain synchronized in real-time. Additionally, data consistency with non-real-time systems like a legacy batch system or a cloud-native data lakehouse is guaranteed out-of-the-box.

The real-time integration powered by Kafka and Flink improves the Overall Equipment Effectiveness (OEE) and enables specific use cases such as predictive maintenance and condition monitoring. As industries increasingly adopt edge computing alongside cloud solutions, these data streaming tools provide the scalability, flexibility, and low-latency performance needed to drive Industrial IoT initiatives forward.

Helin’s Industrial Edge Intelligence platform is an excellent example of IIoT middleware. It leverages Apache Kafka and Flink to integrate real-time data from industrial assets, enabling predictive analytics and operational optimization. By using this platform, companies like Boskalis achieved 8% fuel savings, and Sunrock increased revenue by 20%. These real-world scenarios demonstrate the platform’s ability to drive significant business value through real-time insights and decision-making in industrial projects.

What does your OT/IT integration look like today? Do you plan to optimize the infrastructure with data streaming? What does the hybrid architecture look like? What are the use cases? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The Past, Present and Future of Stream Processing
https://www.kai-waehner.de/blog/2024/03/20/the-past-present-and-future-of-stream-processing/
Wed, 20 Mar 2024

Stream processing has existed for decades. However, it has really taken off in the 2020s thanks to the adoption of open source frameworks like Apache Kafka and Flink. Fully managed cloud services make it easy to configure and deploy stream processing in a cloud-native way, even without the need to write any code. This blog post explores the past, present, and future of stream processing. The discussion includes various technologies and cloud services, low-code/no-code trade-offs, an outlook into the support of machine learning and GenAI, streaming databases, and the integration between data streaming and data lakes with Apache Iceberg.

The Past Present and Future of Stream Processing

In December 2023, the research company Forrester recognized data streaming as a new software category, not just yet another integration or data platform, by publishing “The Forrester Wave™: Streaming Data Platforms, Q4 2023”. Get free access to the report here. The leaders are Microsoft, Google, and Confluent, followed by Oracle, Amazon, Cloudera, and a few others. It is a great time to review the past, present, and future of stream processing as a key component in a data streaming architecture.

The Past of Stream Processing: The Move from Batch to Real-Time

The evolution of stream processing began as industries sought more timely insights from their data. Initially, batch processing was the norm. Data was collected over a period, stored, and processed at intervals. This method, while effective for historical analysis, proved inefficient for real-time decision-making.

In parallel to batch processing, message queues were created to provide real-time communication for transactional data. Message brokers like IBM MQ or TIBCO EMS were a common way to decouple applications. Applications send and receive data in an event-driven architecture without worrying about whether the recipient is ready, how to handle backpressure, and so on. The stream processing journey began.

Slow Batch Processing vs Real-Time Stream Processing

Stream Processing is a Journey Over Decades…

… and we are still in a very early stage at most enterprises. Here is an excellent timeline of TimePlus about the journey of stream processing open source frameworks, proprietary platforms and SaaS cloud services:

30 Year Journey Into Streaming Analytics with Open Source Frameworks Proprietary Products and Cloud
Source: TimePlus

The stream processing journey started decades ago with research and first purpose-built proprietary products for specific use cases like stock trading.

Open source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Today, most enterprises have at least started to understand the value of stream processing for analytical and transactional use cases across industries. The cloud is a fundamental change, as you can start streaming and processing data with a button click, leveraging fully managed SaaS and simple UIs (if you don’t want to operate infrastructure or write low-level source code).

TIBCO StreamBase, Software AG Apama, IBM Streams

The advent of message queue technologies like IBM MQ and TIBCO EMS moved many critical applications to real-time message brokers. Real-time messaging enabled the consumption of data in real-time to store it in a database, mainframe, or application for further processing.

However, only the true stream processing capabilities included in tools like TIBCO StreamBase, Software AG Apama, or IBM (InfoSphere) Streams marked a significant shift towards real-time data processing. These products enabled businesses to react to information as it arrived by processing and correlating the data in motion.

Visual coding in tools like StreamBase or Apama represents an innovative approach to developing stream processing solutions. These tools provide a graphical interface that allows developers and analysts to design, build, and test applications by connecting various components and logic blocks visually, rather than writing code manually. Under the hood, the code generation worked with a Streaming SQL language.

Here is a screenshot of the TIBCO StreamBase IDE for visual drag & drop of streaming pipelines:

TIBCO StreamBase IDE
TIBCO StreamBase IDE

Some drawbacks of these early stream processing solutions include high cost, vendor lock-in, no flexibility regarding tools or APIs, and missing communities. These platforms are monolithic and were built far before cloud-native elasticity and scalability became a requirement for most RFIs and RFPs when evaluating vendors.

Open Source Event Streaming with Apache Kafka

The truly significant change for stream processing came with the introduction of Apache Kafka, a distributed streaming platform that allowed for high-throughput, fault-tolerant handling of real-time data feeds. Kafka, alongside other technologies like Apache Flink, revolutionized the landscape by providing the tools necessary to move from batch to real-time stream processing seamlessly.

The adoption of open source technologies changed all industries. Openness, flexibility, and community-driven development enabled easier influence on the features and faster innovation.

Over 100,000 organizations use Apache Kafka. The massive adoption came from a unique combination of capabilities: messaging, storage, data integration, and stream processing, all in one scalable and distributed infrastructure.

Various open source stream processing engines emerged. Kafka Streams was added to the Apache Kafka project. Other examples include Apache Storm, Spark Streaming, and Apache Flink.

The Present of Stream Processing: Architectural Evolution and Mass Adoption

The fundamental change to processing data in motion has enabled the development of data products and data mesh. Decentralizing data ownership and management with domain-driven design and technology-independent microservices promotes a more collaborative and flexible approach to data architecture. Each business unit can choose its own technology, API, cloud service, and communication paradigm like real-time, batch, or request-response.

From Lambda Architecture to Kappa Architecture

Today, stream processing is at the heart of modern data architecture, thanks in part to the emergence of the Kappa architecture. This model simplifies the traditional Lambda Architecture by using a single stream processing system to handle both real-time and historical data analysis, reducing complexity and increasing efficiency.

Lambda architecture with separate real-time and batch layers:

Lambda Architecture for Batch and Real Time Data

Kappa architecture with a single pipeline for real-time and batch processing:

Kappa Architecture as a Unified Data Streaming Pipeline for Batch and Real-Time Events

For more details about the pros and cons of Kappa vs. Lambda, check out my “Kappa Architecture is Mainstream Replacing Lambda“. It explores case studies from Uber, Twitter, Disney and Shopify.

Apache Kafka has become synonymous with building scalable and fault-tolerant streaming data pipelines. Because Kafka facilitates true decoupling of domains and applications, it is integral to microservices and data mesh architectures.

Plenty of stream processing frameworks, products, and cloud services emerged in the past years. This includes open source frameworks like Kafka Streams, Apache Storm, Samza, Flume, Apex, Flink, Spark Streaming, and cloud services like Amazon Kinesis, Google Cloud Dataflow, Azure Stream Analytics. The “Data Streaming Landscape 2024” gives an overview of relevant technologies and vendors.

Apache Flink seems to be becoming the de facto standard for many enterprises (and vendors). Its adoption today looks like Kafka’s did four years ago:

The Rise of Open Source Streaming Processing with Apache Kafka and Apache Flink
Source: Confluent

This does not mean other frameworks and solutions are bad. For instance, Kafka Streams is complementary to Apache Flink, as it suits different use cases.

No matter what technology enterprises choose, the mass adoption of stream processing is in progress right now. This includes modernizing existing batch processes AND building innovative new business models that only work in real time. As a concrete example, think about ride-hailing apps like Uber, Lyft, FREENOW, Grab. They are only possible because events are processed and correlated in real-time. Otherwise, everyone would still prefer a traditional taxi.

Stateless and Stateful Stream Processing

In data streaming, stateless and stateful stream processing are two approaches that define how data is handled and processed over time:

Stateless and Stateful Stream Processing for Fraud Detection

The choice between stateless and stateful processing depends on the specific requirements of the application, including the nature of the data, the complexity of the processing needed, and the performance and scalability requirements.

Stateless Stream Processing

Stateless Stream Processing refers to the handling of each data point or event independently from others. In this model, the processing of an event does not depend on the outcomes of previous events or require keeping track of the state between events. Each event is processed based on the information it contains, without the need for historical context or future data points. This approach is simpler and can be highly efficient for tasks that don’t require knowledge beyond the current event being processed.

The implementation could be a stream processor (like Kafka Streams or Flink), functionality in a connector (like Kafka Connect Single Message Transforms), or a WebAssembly (WASM) module embedded into a streaming platform.

Stateful Stream Processing

Stateful Stream Processing involves keeping track of information (state) across multiple events to perform computations that depend on data beyond the current event. This model allows for more complex operations like windowing (aggregating events over a specific time frame), joining streams of data based on keys, and tracking sequences of events or patterns over time. Stateful processing is essential for scenarios where the outcome depends on accumulated knowledge or trends derived from a series of data points, not just on a single input.

The implementation is much more complex and challenging than stateless stream processing. A dedicated stream processing implementation is required. Dedicated distributed engines (like Apache Flink) handle stateful computations, memory usage, and scalability better than Kafka-native tools like Kafka Streams or KSQL (because the latter are bound to Kafka topics).
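To make the difference tangible, here is a hedged Kafka Streams sketch with one stateless operation (a per-event filter) and one stateful operation (a windowed count per key), in the spirit of the fraud detection example above. Topic names, thresholds, and the window size are hypothetical.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class PaymentMonitoringTopology {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, Double> payments =
                builder.stream("payments", Consumed.with(Serdes.String(), Serdes.Double()));

        // Stateless: each event is evaluated on its own; no state is kept between events.
        payments.filter((account, amount) -> amount > 10_000)
                .to("large-payments", Produced.with(Serdes.String(), Serdes.Double()));

        // Stateful: count payments per account in 5-minute windows; this requires a state store.
        payments.groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
                .count()
                .toStream()
                .filter((windowedAccount, count) -> count > 3)
                .map((windowedAccount, count) -> KeyValue.pair(windowedAccount.key(), count))
                .to("suspicious-accounts", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-monitoring");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        new KafkaStreams(builder.build(), props).start();
    }
}
```

The same logic could be expressed in Flink, which typically handles very large state and cross-topic joins more gracefully, as described above.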

Low Code, No Code, AND A Lot of Code!

No-code and low-code tools are software platforms that enable users to develop applications quickly and with minimal coding knowledge. These tools provide graphical user interfaces with drag-and-drop capabilities, allowing users to assemble and configure applications visually rather than writing extensive lines of code.

Common features and benefits of visual coding:

  • Rapid Development: Both types of platforms significantly reduce development time, enabling faster delivery of applications.
  • User-Friendly Interface: The graphical interface and drag-and-drop functionality make it easy for users to design, build, and iterate on applications.
  • Cost Reduction: By enabling quicker development with fewer resources, these platforms can lower the cost of software creation and maintenance.
  • Accessibility: They make application development accessible to a broader range of people, reducing the dependency on skilled developers for every project.

So much for the theory.

Disadvantages of Visual Coding Tools

Actually, StreamBase, Apama, et al., had great visual coding offerings. However, no-code / low-code tools have many drawbacks and disadvantages, too:

  1. Limited Customization and Flexibility: While these platforms can speed up development for standard applications, they may lack the flexibility needed for highly customized solutions. Developers might find it challenging to implement specific functionalities that aren’t supported out of the box.
  2. Dependency on Vendors: Using no-code/low-code platforms often means relying on third-party vendors for the platform’s stability, updates, and security. This dependency can lead to potential issues if the vendor cannot maintain the platform or goes out of business. And often the platform team is the bottleneck for implementing new business or integration logic.
  3. Performance Concerns: Applications built with no-code/low-code platforms may not be as optimized as those developed with traditional coding, potentially leading to lower performance or inefficiencies, especially for complex applications.
  4. Scalability Issues: As businesses grow, applications might need to scale up to support increased loads. No-code/low-code platforms might not always support this level of scalability or might require significant workarounds, affecting performance and user experience.
  5. Over-reliance on Non-Technical Users: While empowering citizen developers is a key advantage of these platforms, it can also lead to governance challenges. Without proper oversight, non-technical users might create inefficient workflows or data structures, leading to technical debt and maintenance issues.
  6. Cost Over Time: Initially, no-code/low-code platforms can reduce development costs. However, as applications grow and evolve, the ongoing subscription costs or fees for additional features and scalability can become significant.

Flexibility is King: Stream Processing for Everyone!

Microservices, domain-driven design, data mesh… all these modern design approaches taught us to build flexible enterprise architectures. Each business unit and persona should be able to choose its own technology, API, or SaaS, no matter whether the communication is real-time, near real-time, batch, or request-response.

Apache Kafka provides true decoupling out of the box. Therefore, low-code or no-code tools are one option. However, a data scientist, data engineer, software developer, or citizen integrator can also choose their own technology for stream processing.

The past, present, and future of stream processing show different frameworks, visual coding tools, and even applied generative AI. One solution does NOT replace but complements the other alternatives:

Stream Processing with Clients Low Code No Code Tools or GenAI

The Future of Stream Processing: Serverless SaaS, GenAI and Streaming Databases

Stream processing is set to grow exponentially in the future, thanks to advancements in cloud computing, SaaS, and AI. Let’s explore the future of stream processing and look at the expected short, mid and long-term developments.

SHORT TERM: Fully Managed Serverless SaaS for Stream Processing

The cloud’s scalability and flexibility offer an ideal environment for stream processing applications, reducing the overhead and resources required for on-premise solutions. As SaaS models continue to evolve, stream processing capabilities will become more accessible to a broader range of businesses, democratizing real-time data analytics.

For instance, look at the serverless Flink Actions in Confluent Cloud. You can configure and deploy stream processing for use cases like deduplication or masking without any code:

Serverless Flink Actions in Confluent Cloud
Source: Confluent

MID TERM: Automated Tooling and the Help of GenAI

Integrating AI and machine learning with stream processing will enable more sophisticated predictive analytics. This opens new frontiers for automated decision-making and intelligent applications while continuously processing incoming event streams. The full potential of embedding AI into stream processing has to be learned and implemented in the upcoming years.

For instance, automated data profiling is one use case that GenAI can support significantly. Software tools analyze and understand the quality, structure, and content of a dataset without manual intervention as the events flow through the data pipeline in real time. This process typically involves examining the data to identify patterns, anomalies, missing values, and inconsistencies. A perfect fit for stream processing!

Automated data profiling in the stream processor can provide insights into data types, frequency distributions, relationships between columns, and other metadata information crucial for data quality assessment, governance, and preparation for further analysis or processing.
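
As a simple, rules-based illustration of data quality checks in motion, the following Kafka Streams sketch routes incomplete or implausible events into a quarantine topic while clean events continue downstream. The topic and field names (sensor-readings, deviceId, temperature) are hypothetical; GenAI could help generate and continuously refine such validation rules from the observed data profile.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StreamingDataQualityCheck {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streaming-data-quality-check");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> readings = builder.stream("sensor-readings");

        // Quarantine incomplete or implausible events, forward clean events downstream -
        // all while the data is in motion, before it ever lands in a lake or warehouse.
        readings.filter((key, value) -> !isValid(value)).to("sensor-readings-quarantine");
        readings.filter((key, value) -> isValid(value)).to("sensor-readings-curated");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    private static boolean isValid(String json) {
        try {
            JsonNode event = MAPPER.readTree(json);
            return event.hasNonNull("deviceId")
                    && event.hasNonNull("temperature")
                    && event.get("temperature").asDouble() > -100.0; // basic plausibility check
        } catch (Exception e) {
            return false; // unparsable payloads count as inconsistent data
        }
    }
}
```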

MID TERM: Streaming Storage and Analytics with Apache Iceberg

Apache Iceberg is an open-source table format for huge analytic datasets that provides powerful capabilities in managing large-scale data in data lakes. Its integration with streaming data sources like Apache Kafka and analytics platforms, such as Snowflake, Starburst, Dremio, AWS Athena or Databricks, can significantly enhance data management and analytics workflows.

Integration between Streaming Data from Kafka and Analytics on Databricks or Snowflake using Apache Iceberg

Supporting the Apache Iceberg table format might be a crucial strategic move by streaming and analytics frameworks, vendors and cloud services. Here are some key benefits from the enterprise architecture perspective:

  • Unified Batch and Stream Processing: Iceberg tables can serve as a bridge between streaming data ingestion from Kafka and downstream analytic processing. By treating streaming data as an extension of a batch-based table, Iceberg enables a seamless transition from real-time to batch analytics, allowing organizations to analyze data with minimal latency.
  • Schema Evolution: Iceberg supports schema evolution without breaking downstream systems. This is useful when dealing with streaming data from Kafka, where the schema might evolve. Consumers can continue reading data using the schema they understand, ensuring compatibility and reducing the need for data pipeline modifications.
  • Time Travel and Snapshot Isolation: Iceberg’s time travel feature allows analytics on data as it looked at any point in time, providing snapshot isolation for consistent reads. This is crucial for reproducible reporting and debugging, especially when dealing with continuously updating streaming data from Kafka.
  • Cross-Platform Compatibility: Iceberg provides a unified data layer accessible by different compute engines, including those used by Databricks and Snowflake. This enables organizations to maintain a single copy of their data that is queryable across different platforms, facilitating a multi-tool analytics ecosystem without data silos.

LONG TERM: Transactional + Analytics = Streaming Database?

Streaming databases, like RisingWave or Materialize, are designed to handle real-time data processing and analytics. This offers a way to manage and query data that is continuously generated from sources like IoT devices, online transactions, and application logs. Traditional databases are optimized for static data stored on disk. Streaming databases, in contrast, are built to process and analyze data in motion. They provide insights almost instantaneously as the data flows through the system.

Streaming databases offer the ability to perform complex queries and analytics on streaming data, further empowering organizations to harness real-time insights.
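
As a rough sketch of the developer experience, assume a streaming database that speaks the PostgreSQL wire protocol (as Materialize and RisingWave do) and that already has a Kafka-backed source relation named trades attached. The connection URL, credentials, and the product-specific DDL for attaching the Kafka source are hypothetical; the point is that the materialized view is maintained incrementally as events arrive and can be queried with plain JDBC.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingDatabaseSketch {
    public static void main(String[] args) throws Exception {
        // Requires a PostgreSQL JDBC driver on the classpath; the URL and credentials
        // are placeholders and differ per streaming database product.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/streaming", "app_user", "secret");
             Statement stmt = conn.createStatement()) {

            // The view is maintained incrementally as new events arrive from the
            // (already attached) Kafka-backed source relation 'trades'.
            stmt.execute(
                "CREATE MATERIALIZED VIEW volume_per_symbol AS " +
                "SELECT symbol, SUM(quantity * price) AS traded_volume " +
                "FROM trades GROUP BY symbol");

            // Querying the view returns the latest, continuously updated result.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT symbol, traded_volume FROM volume_per_symbol ORDER BY traded_volume DESC")) {
                while (rs.next()) {
                    System.out.printf("%s -> %.2f%n",
                            rs.getString("symbol"), rs.getDouble("traded_volume"));
                }
            }
        }
    }
}
```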

The ongoing innovation in streaming databases will probably lead to more advanced, efficient, and user-friendly solutions, facilitating broader adoption and more creative applications of stream processing technologies.

Having said this, we are still at a very early stage. It is not yet clear when you really need a streaming database instead of a mature and scalable stream processor like Apache Flink. The future will show us, and competition is great for innovation.

The Future of Stream Processing is Open Source and Cloud

The journey from batch to real-time processing has transformed how businesses interact with their data. The continued evolution couples technologies like Apache Kafka, Kafka Streams, and Apache Flink with the growth of cloud computing and SaaS. Stream processing will redefine the future of data analytics and decision-making.

As we look ahead, the future possibilities for stream processing are boundless, promising more agile, intelligent, and real-time insights into the ever-increasing streams of data.

If you want to learn more, listen to the following on-demand webinar about the past, present, and future of stream processing, where I am joined by two streaming industry veterans: Richard Tibbets (founder of StreamBase) and Michael Benjamin (TimePlus). I had the pleasure of working with them for a few years at TIBCO, where we deployed StreamBase at many financial services companies for stock trading and similar use cases:

What does your stream processing journey look like? In which decade did you join? Or are you just getting started with the latest open-source frameworks or cloud services? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Past, Present and Future of Stream Processing appeared first on Kai Waehner.

]]>
How Lufthansa uses Apache Kafka for Middleware and Analytics https://www.kai-waehner.de/blog/2023/09/24/how-lufthansa-uses-apache-kafka-for-middleware-and-analytics/ Sun, 24 Sep 2023 16:30:22 +0000 https://www.kai-waehner.de/?p=5640 Aviation and travel are notoriously vulnerable to social, economic, and political events, as well as the ever-changing expectations of consumers. The coronavirus was just a piece of the challenge. This post explores how Lufthansa leverages data streaming powered by Apache Kafka as cloud-native middleware for mission-critical data integration projects and as data fabric for AI/machine learning scenarios such as real-time predictions in fleet management. An interactive conversation with Lufthansa as an on-demand video is added at the end as a highlight if you want to learn more.

The post How Lufthansa uses Apache Kafka for Middleware and Analytics appeared first on Kai Waehner.

]]>
Aviation and travel are notoriously vulnerable to social, economic, and political events, as well as the ever-changing expectations of consumers. The coronavirus was just a piece of the challenge. This post explores how Lufthansa leverages data streaming powered by Apache Kafka as cloud-native middleware for mission-critical data integration projects and as data fabric for AI/machine learning scenarios such as real-time predictions in fleet management. An interactive conversation with Lufthansa as an on-demand video is added at the end as a highlight if you want to learn more.

Data Streaming with Apache Kafka at Airlines - Lufthansa Case Study

Data streaming in the aviation industry

The future business of airlines and airports will be digitally integrated into the ecosystem of partners and suppliers. Companies will provide more personalized customer experiences and be enabled by a new suite of the latest technologies, including automation, robotics, and biometrics.

The entire aviation industry leverages data streaming powered by Apache Kafka already. This includes airlines, airports, global distribution systems (GDS), aircraft manufacturers, travel agencies, etc. Why? Because real-time data beats slow data across almost all use cases.

Real-time data streaming in aviation and airline industry

Learn more in my blog about “Apache Kafka in the Airline, Aviation and Travel Industry” covering companies like Singapore Airlines, Air France, and Amadeus.

This article focuses on data streaming in critical Lufthansa projects. Lufthansa is a major German airline and one of the largest in Europe. It is known for its extensive network of domestic and international flights. Lufthansa offers services ranging from passenger transportation to cargo logistics and is a member of the Star Alliance, one of the world’s largest airline alliances.

Apache Kafka as next-generation middleware replacing ETL, ESB, and iPaaS

Typically, an enterprise service bus (ESB) or other integration solutions like extract-transform-load (ETL) tools have been used to try to decouple systems. However, the sheer number of connectors, as well as the requirement that applications publish and subscribe to the data at the same time, means that systems are always intertwined. As a result, development projects depend on other systems, and nothing can be truly decoupled.

Many enterprises leverage the ecosystem of Apache Kafka for successful integration of different legacy and modern applications. Data streaming differs but also complements existing integration solutions like ESB or ETL tools. Apache Kafka is unique because it combines the following characteristics into a single middleware platform:

  • Real-time messaging at any scale
  • Event store for true decoupling, backpressure handling, and replayability of historical events
  • Data integration eliminating the need for additional integration tools
  • Stream processing for stateless and stateful data correlation of real-time and historical data

Event Streaming and Event Driven Architecture for a Smart City with Apache Kafka

“Apache Kafka vs. Enterprise Service Bus (ESB) – Friends, Enemies or Frenemies?” explores how data streaming with Kafka complements legacy middleware. If your workloads run mostly in the public cloud, you need to understand the difference between Integration Platform as a Service (iPaaS) and data streaming powered by fully-managed Kafka infrastructure.

Lufthansa uses Apache Kafka as cloud-native middleware for mission-critical integrations

Lufthansa leverages data streaming with Confluent as cloud-native middleware for its strategic integration project KUSCO (Kafka Unified Streaming Cloud Operations).

The team discussed the benefits of using Apache Kafka instead of traditional messaging queues (TIBCO EMS, IBM MQ) for data processing. My two favorite statements:

  • “Scaling Kafka is really inexpensive”
  • “Kafka adopted and integrated within 3 months”

Lufthansa’s Kafka architecture does not have any surprises. A key lesson learned from many companies: The real added value is created when you leverage Kafka not just for messaging, but its entire ecosystem, including different clients/proxies, connectors, stream processing, and data governance.

The result at Lufthansa: A better, cheaper, and faster infrastructure for real-time data processing at scale.

Watch the full talk from Marcos Carballeira Rodríguez from Lufthansa Group recorded at the Confluent Streaming Days 2020 to see all the architectures and quotes from Lufthansa. More and more projects are onboarded on the KUSCO platform. Here are a few statistics on the adoption of the KUSCO project from 2022 to 2023, presented by System Architect Krzysztof Torunski of Lufthansa Group:

Lufthansa KUSCO - cloud-native middleware platform using Apache Kafka and Confluent

I see this typical pattern in customers across industries: The first use case is the hardest to get live. Afterward, new business units tap into the data feeds and build their projects. It has never been easier to access data feeds in real-time and with good data quality at any scale. Just build a downstream application (with your favorite programming language, tool, or SaaS) and start innovating.

Apache Kafka for analytics and AI/machine learning

Apache Kafka serves thousands of enterprises as the mission-critical and scalable real-time data fabric for machine learning infrastructures. The evolution of Generative AI (GenAI) with large language models (LLM) like ChatGPT changed how people think about intelligent software and automation. In various blog posts, I explored the relationship between data streaming with the Kafka ecosystem and AI/machine learning.

Kafka Machine Learning Architecture for GenAI

My latest article shows the enormous opportunities and some early adopters combining Kafka and GenAI beyond the buzz.

Lufthansa uses Apache Kafka with AI/machine learning for real-time predictions

Lufthansa leverages the KUSCO platform to build new analytics use cases with real-time data for critical workloads. In the webinar, we learned about the following two projects from Lufthansa Group’s Domain Architect Sebastian Weber: anomaly detection for alerts and fleet management for aircraft operations.

Anomaly detection with Apache Kafka and ksqlDB

Data is fed into the streaming platform from various data sources. Lufthansa consolidates and aggregates the data with stream processing before the analytics applications do real-time alerting.

Anomaly Detection with Apache Kafka and Machine Learning at Lufthansa

Machine learning and Apache Kafka for real-time fleet management

Lufthansa leverages the streaming platform as data fabric for data ingestion, data processing, and model scoring.

Machine Learning and Stream Processing for Real Time Fleet Management at Lufthansa

Embedding analytic models into a Kafka application is a standard best practice. While the data lake or lakehouse (that receives data via Kafka) trains the model in batch, many use cases require real-time model scoring and predictions at scale with critical SLAs and low latency. That’s exactly the sweet spot of the Kafka ecosystem.

You can either directly embed a model into the Kafka app or leverage a model server that supports streaming interfaces. I blogged about the trade-offs and use cases: “Streaming Machine Learning with Kafka-native Model Deployment”.
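
Here is a minimal sketch of the embedded-model variant with Kafka Streams: the model is loaded once at startup and scored on every event with low latency. The fleet-telemetry topic, the JSON payload, and the AnomalyModel interface are hypothetical placeholders; in practice, the model would be exported from the training pipeline (e.g., as ONNX or PMML) and wrapped behind such an interface.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class FleetModelScoringApp {

    /** Hypothetical abstraction around a model exported from the training pipeline. */
    interface AnomalyModel {
        double score(String telemetryJson);
    }

    public static void main(String[] args) {
        // Load the trained model once at startup (placeholder implementation).
        AnomalyModel model = telemetryJson -> telemetryJson.contains("\"engineTemp\":") ? 0.1 : 0.9;

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fleet-model-scoring");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Score every incoming telemetry event with the embedded model and write
        // the predictions to an output topic consumed by alerting and dashboards.
        builder.<String, String>stream("fleet-telemetry")
               .mapValues(telemetryJson -> "{\"anomalyScore\":" + model.score(telemetryJson) + "}")
               .to("fleet-anomaly-scores");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```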

Interactive conversation with Lufthansa

Here is an on-demand video of my conversation with Lufthansa. We talk about use cases for data streaming in the aviation industry and how Lufthansa leverages Apache Kafka as cloud-native middleware and as the data fabric for analytics and machine learning:

Data Streaming at Lufthansa Video Recording

Data streaming as cloud-native middleware and for mission-critical analytics

Lufthansa showed us how you can innovate in the airline industry with a fast time-to-market while still integrating with traditional technologies. The two projects show very different challenges and use cases solved with data streaming powered by the Apache Kafka ecosystem.

The aviation industry is changing rapidly. A good customer experience, valuable loyalty platforms, and competitive pricing (or better hard and soft products) require digitalization of the end-to-end supply chain. This includes topics like Industrial IoT (e.g., predictive maintenance), B2B communication with partners (like GDS, airports, and retailers), and customer 360 (including great mobile apps and omnichannel experiences).

How do you leverage data streaming with Apache Kafka in your projects and enterprise architecture? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post How Lufthansa uses Apache Kafka for Middleware and Analytics appeared first on Kai Waehner.

]]>
Apache Kafka as Data Hub for Crypto, DeFi, NFT, Metaverse – Beyond the Buzz https://www.kai-waehner.de/blog/2022/02/04/apache-kafka-as-data-hub-for-crypto-defi-nft-metaverse-beyond-the-buzz/ Fri, 04 Feb 2022 14:29:19 +0000 https://www.kai-waehner.de/?p=4129 Decentralized finance with crypto and NFTs is a huge topic these days. It becomes a powerful combination with the coming metaverse platforms across industries. This blog post explores the relationship between crypto technologies and modern enterprise architecture. I discuss how event streaming and Apache Kafka help build innovation and scalable real-time applications of a future metaverse. Let's skip the buzz (and NFT bubble) and instead review existing real-world deployments in the crypto and blockchain world powered by Kafka and its ecosystem.

The post Apache Kafka as Data Hub for Crypto, DeFi, NFT, Metaverse – Beyond the Buzz appeared first on Kai Waehner.

]]>
Decentralized finance with crypto and NFTs is a huge topic these days. It becomes a powerful combination with the coming metaverse platforms from social networks, cloud providers, gaming vendors, sports leagues, and fashion retailers. This blog post explores the relationship between crypto technologies like Ethereum, blockchain, NFTs, and modern enterprise architecture. I discuss how event streaming and Apache Kafka help build innovation and scalable real-time applications of a future metaverse. Let’s skip the buzz (and NFT bubble) and instead review practical examples and existing real-world deployments in the crypto and blockchain world powered by Kafka and its ecosystem.

Event Streaming as the Data Hub for Crypto NFT and Metaverse

What are crypto, NFT, DeFi, blockchain, smart contracts, metaverse?

I assume that most readers of this blog post have a basic understanding of the crypto market and event streaming with Apache Kafka. The target audience should be interested in the relationship between crypto technologies and a modern enterprise architecture powered by event streaming. Nevertheless, let’s explain each buzzword in a few words to have the same understanding:

  • Blockchain: Foundation for cryptocurrencies and decentralized applications (dApp) powered by digital distributed ledgers with immutable records
  • Smart contracts: dApps running on a blockchain like Ethereum
  • DeFi (Decentralized Finance): Group of dApps to provide financial services without intermediaries
  • Cryptocurrency (or crypto): Digital currency that works as an exchange through a computer network that is not reliant on any central authority, such as a government or bank
  • Crypto coin: Native coin of a blockchain to trade currency or store value (e.g., Bitcoin in the Bitcoin blockchain or Ether for the Ethereum platform)
  • Crypto token: Similar to a coin, but uses another coin’s blockchain to provide digital assets – the functionality depends on the project (e.g., developers built plenty of tokens for the Ethereum smart contract platform)
  • NFT (Non-fungible Token): Non-interchangeable, uniquely identifiable unit of data stored on a blockchain (that’s different from Bitcoin where you can “replace” one Bitcoin with another one), covering use cases such as identity, arts, gaming, collectibles, sports, media, etc.
  • Metaverse: A network of 3D virtual worlds focused on social connections. This is not just about Meta (former Facebook). Many platforms and gaming vendors are creating their metaverses these days. Hopefully, open protocols enable interoperability between different metaverses, platforms, APIs, and AR/VR technologies. Crypto and NFTs will probably be critical factors in the metaverse.

Cryptocurrency and DeFi marketplaces and brokers are used for trading between fiat and crypto or between two cryptocurrencies. Other use cases include long-term investments and staking. The latter compensates you for locking your assets in a Proof-of-Stake consensus network, which most modern blockchains use instead of the resource-hungry Proof-of-Work used in Bitcoin. Some solutions focus on providing services on top of crypto (monitoring, analytics, etc.).

Don’t worry if you don’t know what all these terms mean. A scalable real-time data hub is required to integrate crypto and non-crypto technologies to build innovative solutions for crypto and metaverse.

What’s my relation to crypto, blockchain, and Kafka?

It might help to share my background with blockchain and cryptocurrencies before I explore the actual topic about its relation to event streaming and Apache Kafka.

I worked with blockchain technologies 5+ years ago at TIBCO. I implemented and deployed smart contracts with Solidity (the smart contract programming language for Ethereum). I integrated blockchains such as Hyperledger and Ethereum with ESB middleware. And yes, I bought some Bitcoin for ~500 dollars. Unfortunately, I sold them afterward for ~1000 dollars as I was curious about the technology, not the investment part. 🙂

Middleware in real-time is vital for integration

I gave talks at international conferences and published a few articles about blockchain. For instance, “Blockchain – The Next Big Thing for Middleware” at InfoQ in 2016. The article is still pretty accurate from a conceptual perspective. The technologies, solutions, and vendors developed, though.

I thought about joining a blockchain startup. Coincidentally, one company I was talking to was building a “next-generation blockchain platform for transactions in real-time at scale”, powered by Apache Kafka. No joke.

However, I joined Confluent in 2017. I thought that processing data in motion at any scale for transactional and analytics workloads is the more significant paradigm shift. “Why I Move (Back) to Open Source for Messaging, Integration and Stream Processing” describes my decision in 2017. I think I was right. 🙂 Today, most enterprises leverage Kafka as an alternative middleware for MQ, ETL, and ESB tools or implement cloud-native iPaaS with serverless Kafka.

Blockchain, crypto, and NFT are here to stay (for some use cases)

Blockchain and crypto are here to stay, but they are not needed for every problem. Blockchain is a niche technology. For that niche, it is excellent. TL;DR: You only need a blockchain in untrusted environments. The famous example of supply chain management is valid. Cryptocurrencies and smart contracts are also here to stay. Partly for investment, partly for building innovative new applications.

Today, I work with customers across the globe. Crypto marketplaces, blockchain monitoring infrastructure, and custodian banking platforms are built on Kafka for good reasons (scale, reliability, real-time). The key to success for most customers is integrating crypto and blockchain platforms and the rest of the IT infrastructure, like business applications, databases, and data lakes.

Trust is important, and trustworthy marketplaces (= banks?) are needed (for some use cases)

Privately, I own several cryptocurrencies. I am not a day-trader. My strategy is a long-term investment (but only a fraction of my total investment; only the money I am okay to lose 100%). I invest in several coins and platforms, including Bitcoin, Ethereum, Solana, Polkadot, Chainlink, and a few even more risky ones. I firmly believe that crypto and NFTs are a game-changer for some use cases like gaming and metaverse, but I also think paying hundreds of thousands of dollars for a digital ape is insane (and just an investment bubble).

I sold my Bitcoins in 2016 because of the missing trustworthy marketplaces. I don’t care too much about the decentralization of my long-term investment. I do not want to manage my own cold storage, write a long and complex recovery code on paper, and put it into my safe. I want a secure, trustworthy custodian that takes over this burden.

For that reason, I use compliant German banks for crypto investments. If a coin is unavailable, I go to an international marketplace that feels trustworthy. For instance, the exchange of crypto.com and the NFT marketplace OpenSea recently did a great job letting their insurance pay for a hack and loss of customer coins and NFTs, respectively. That’s what I expect as a customer and why I am happy to pay a small fee for buying and selling cryptocurrencies or NFTs on such a platform.

“The False Promise of Web3” is a great read to understand why many crypto and blockchain discussions are not indeed about decentralization. As the article says, “the advertised decentralization of power out of the hands of a few has, in fact, been a re-centralization of power into the hands of fewer”. I am a firm believer in the metaverse, crypto, NFT, DeFi, and blockchain. However, I am fine if some use cases are centralized and provide regulation, compliance, security, and other guarantees.

With this long background story, let’s explore the current crypto and blockchain world and how this relates to event streaming and Apache Kafka.

Use cases for the metaverse and non-fungible tokens (NFT)

Let’s start with some history:

  • 1995: Amazon was just an online shop for books. I bought my books in my local store.
  • 2005: Netflix was just a DVD-in-mail service. Technology changed over time, but I also rented HD DVDs, Blu-rays, and other media in a similar way.
  • 2022: I will not use Zuckerberg’s Metaverse. Never. It will fail like Second Life (which was launched in 2003). And I don’t get the noise around NFT. It is just a jpeg file that you can right-click and save for free.

Well, we are still in the early stages of cryptocurrencies, and in the very early stages of Metaverse, DeFi, and NFT use cases and business models. However, this is not just hype like the Dot-com bubble in the early 2000s. Tech companies have exciting business models. And software is eating every industry today. Profit margins are enormous, too.

Let’s explore a few use cases where Metaverse, DeFi, and NFTs make sense:

Billions of players and massive revenues in the gaming industry

The gaming industry is already bigger than all other media categories combined, and this is still just the beginning of a new era. Millions of new players join the gaming community every month across the globe.

Connectivity improves, and cheap smartphones are sold in less wealthy countries. New business models like “play to earn” change how the next generation of gamers plays a game. More scalable and low-latency technologies like 5G enable new use cases. Blockchain and NFT (Non-Fungible Token) are changing the monetization and collection market forever.

NFTs for identity, collections, ticketing, swaggering, and tangible things

Let’s forget that Justin Bieber recently purchased a Bored Ape Yacht Club (BAYC) NFT for $1.29 million. That’s insane and likely a bubble. However, many use cases make a lot of sense for NFTs, not just in a (future) virtual metaverse but also in the real world. Let’s look at a few examples:

  • Sports and other professional events: Ticketing for controlled pricing, avoiding fraud, etc.
  • Luxury goods: Transparency, traceability, and security against tampering; the handbag, the watch, the necklace: for such luxury goods, NFTs can be used as certificates of authenticity and ownership.
  • Arts: NFTs can be created in such a way that royalty and license fees are donated every time they are resold
  • Carmakers: The first manufacturers link new cars with NFTs. Initially for marketing, but also to increase their resale value in the medium term. With the help of the blockchain, I can prove the number of previous owners of a car and provide reliable information about the repair and accident history.
  • Tourism: Proof of attendance. Climbed Mount Everest? Hiked the Grand Canyon? Got married in Hawaii? With a badge on the blockchain, this can be proven once and for all.
  • Non-Profit: The technology is ideal for fundraising in charities – not only because it is transparent and decentralized. Every NFT can be donated or auctioned off for a good cause so that this new way of creating value generates new funds for the benefit of charitable projects. Vodafone has demonstrated this by selling the world’s first SMS.
  • Swaggering: Twitter already allows you to connect your account to your crypto wallet to set up an NFT as your profile picture. Steph Curry from the NBA team Golden State Warriors presents his 55 ETH (~ USD 180,000) Bored Ape Yacht Club NFT as his Twitter profile picture at the time of writing this blog.

With this in mind, I can think about plenty of other significant use cases for NFTs.

Metaverse for new customer experiences

If I think about the global metaverse (not just the Zuckerberg one), I see so many use cases that even I could imagine using:

  • How many (real) dollars would you pay if your (virtual) house is next to your favorite actor or musician in a virtual world if you can speak with its digital twin (trained by the actual human) every day?
  • How many (real) dollars would you pay if this actor visits his house once a week so that virtual neighbors or lottery winners can talk to the natural person via virtual reality?
  • How many (real) dollars would you pay if you could bring your Fortnite shotgun (from a game from another gaming vendor) to this meeting and use it in the clay pigeon shooting competition with the neighbor?
  • Give me a day, and I will add 1000 other items to this wish list…

I think you get the point. NFTs and metaverse make sense for many use cases. This statement is valid from the perspective of a great customer experience and to build innovative business models (with ridiculous profit margins).

So, finally, we come to the point of talking about the relation to event streaming and Apache Kafka.

Kafka inside a crypto platform

First, let’s understand how to qualify if you need a truly distributed, decentralized ledger or blockchain. Kafka is sufficient most times.

Kafka is NOT a blockchain, but a distributed ledger for crypto

Kafka is not a blockchain, but a distributed commit log. Many concepts and foundations of Kafka are very similar to a blockchain. It provides many characteristics required for real-world “enterprise blockchain” projects:

  • Real-Time
  • High Throughput
  • Decentralized database
  • Distributed log of records
  • Immutable log
  • Replication
  • High availability
  • Decoupling of applications/clients
  • Role-based access control to data

I explored this in more detail in my post “Apache Kafka and Blockchain – Comparison and a Kafka-native Implementation“.

Do you need a blockchain? Or just Kafka and crypto integration?

A blockchain increases the complexity significantly compared to traditional IT projects. Do you need a blockchain or distributed ledger (DLT) at all? Qualify out early and choose the right tool for the job!

Use Kafka for

  • Enterprise infrastructure
  • Real-time data hub for transactional and analytical workloads
  • Open, scalable, real-time data integration and processing
  • True decoupling between applications and databases with backpressure handling and data replayability
  • Flexible architectures for many use cases
  • Encrypted payloads

Use a real blockchain / DLTs like Hyperledger, Ethereum, Cardano, Solana, et al. for

  • Deployment over various independent organizations (where participants verify the distributed ledger contents themselves)
  • Specific use cases
  • Server-side managed and controlled by multiple organizations
  • Scenarios where the business value overturns the added complexity and project risk

Use Kafka and Blockchain together to combine the benefits of both for

  • Blockchain for secure communication over various independent organizations
  • Reliable data processing at scale in real-time with Kafka as a side chain or off-chain from a blockchain
  • Integration between blockchain / DLT technologies and the rest of the enterprise, including CRM, big data analytics, and any other custom business applications

The last section shows that Kafka and blockchain, respectively crypto are complementary. For instance, many enterprises use Kafka as the data hub between crypto APIs and enterprise software.

It is pretty straightforward to build a metaverse without a blockchain (if you don’t want or need to offer true decentralization). Look at this augmented reality demo powered by Apache Kafka to understand how the metaverse is built with modern technologies.

Kafka as a component of blockchain vs. Kafka as middleware for blockchain integration

Some powerful DLTs or blockchains are built on top of Kafka. See the example of R3’s Corda in the next section.

Kafka is used to implement a side chain or off-chain platform in some other use cases, as the original blockchain does not scale well enough (blockchain data is known as on-chain data). Bitcoin is not the only blockchain with the problem of processing only single-digit (!) transactions per second. Most modern blockchain solutions cannot scale even close to the workloads Kafka processes in real-time.

Having said this, more interestingly, I see more and more companies using Kafka within their crypto trading platforms, market exchanges, and token trading marketplaces to integrate between the crypto and the traditional IT world.

Here are both options:

Apache Kafka and Blockchain - DLT - Use Cases and Architectures

R3 Corda – A distributed ledger for banking and the finance industry powered by Kafka

R3’s Corda is a scalable, permissioned peer-to-peer (P2P) distributed ledger technology (DLT) platform. It enables the building of applications that foster and deliver digital trust between parties in regulated markets.

Corda is designed for the banking and financial industry. The primary focus is on financial services transactions. The architectural designs are simple when compared to true blockchains. Evaluate requirements such as time to market, flexibility, and use case (in)dependence to decide if Corda is sufficient or not.

Corda’s architectural history looks like many enterprise architectures: A messaging system (in this case, RabbitMQ) was introduced years ago to provide a real-time infrastructure. Unfortunately, the messaging solution does not scale as needed. It does not provide all essential features like data integration, data processing, or storage for true decoupling, backpressure handling, or replayability of events.

Therefore, Corda 5 replaces RabbitMQ and migrates to Kafka.

R3 Corda DLT Blockchain powered by Apache Kafka

Here are a few reasons for the need to migrate R3’s Corda from RabbitMQ to Kafka:

  • High availability for critical services
  • A cost-effective way to scale (horizontally) to deal with bursty and high-volume throughputs
  • Fully redundant, worker-based architecture
  • True decoupling and backpressure handling to facilitate communication between the node services, including the process engine, database integration, crypto integration, RPC service (HTTP), monitoring, and others
  • Compacted topics (logs) as the mechanism to store and retrieve the most recent states
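
The last point is easy to illustrate: a log-compacted topic retains (at least) the latest record per key, which makes it a natural store to retrieve the most recent state of an entity. A small sketch using the Kafka AdminClient follows; the topic name and sizing are hypothetical and not taken from Corda:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedStateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps the latest record per key instead of
            // deleting old segments purely by time or size - consumers can replay
            // the topic from the beginning to rebuild the current state per key.
            NewTopic stateTopic = new NewTopic("node-states", 3, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));

            admin.createTopics(List.of(stateTopic)).all().get();
        }
    }
}
```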

Kafka in the crypto enterprise architecture

While some vendors use Kafka within a DLT or blockchain, the more prevalent use cases leverage Kafka as the scalable real-time data hub between cryptocurrencies or blockchains and enterprise applications. Let’s explore a few use cases and real-world examples for that.

Example crypto architecture: Kafka as data hub in the metaverse

My recent post about live commerce powered by event streaming and Kafka transforming the retail metaverse shows how the retail and gaming industry connects virtual and physical things. The retail business process and customer communication happen in real-time, no matter if you want to sell clothes, a smartphone, or a blockchain-based NFT token for your collectible or video game.

The following architecture shows what an NFT sales play could look like by integrating and orchestrating the information flow between various crypto and non-crypto applications in real-time at any scale:

Event Streaming with Apache Kafka as Data Hub for Crypto Blockchain and Metaverse

 

Kafka’s role as the data hub in crypto trading, marketplaces, and the metaverse

Let’s now explore the combination of Kafka and blockchains, respectively cryptocurrencies and decentralized finance (DeFi).

Once again, Kafka is neither the blockchain nor the cryptocurrency. The blockchain is the platform behind a cryptocurrency like Bitcoin or a smart contract platform like Ethereum, where people build new decentralized applications (dApps) like NFTs for the gaming or art industry. Kafka is the data hub in between that connects these blockchains with the oracles (= the non-blockchain apps = traditional IT infrastructure) like the CRM system, data lake, data warehouse, business applications, and so on.

Let’s look at an example and explore a few technical use cases where Kafka helps:

Kafka for data processing in the crypto world

 

A Bitcoin transaction is executed from the mobile wallet. A real-time application monitors the data off-chain, correlates it, shows it in a dashboard, and sends push notifications. Another completely independent department replays historical events from the Kafka log in a batch process for a compliance check with dedicated analytics tools.

The Kafka ecosystem provides so many capabilities to use the data from blockchains and the crypto world with other data from traditional IT.

Holistic view in a data mesh across typical enterprise IT and blockchains

  • Measuring the health of blockchain infrastructure, cryptocurrencies, and dApps to avoid downtime, secure the infrastructure, and make the blockchain data accessible.
  • Kafka provides an agentless and scalable way to present that data to the parties involved and ensure that the relevant information is exposed to the right teams before a node is lost. This is relevant for innovative Web3 IoT projects like Helium or simpler closed distributed ledgers (DLT) like R3 Corda.
  • Stream processing via Kafka Streams or ksqlDB to interpret the data to get meaningful information
  • Processors that focus on helpful block metrics – with information related to average gas price (gas refers to the cost necessary to perform a transaction on the network), number of successful or failed transactions, and transaction fees
  • Monitoring blockchain infrastructure and telemetry log events in real-time
  • Regulatory monitoring and compliance
  • Real-time cybersecurity (fraud, situational awareness, threat intelligence)

Continuous data integration at any scale in real-time

  • Integration + true decoupling + backpressure handling + replayability of events
  • Kafka as Oracle integration point (e.g. Chainlink -> Kafka -> Rest of the IT infrastructure)
  • Kafka Connect connectors that incorporate blockchain client APIs like OpenEthereum (leveraging the same concept/architecture for all clients and blockchain protocols) – see the sketch after this list
  • Backpressure handling via throttling and nonce management over the Kafka backbone to stream transactions into the chain
  • Processing multiple chains at the same time (e.g., monitoring and correlating transactions on Ethereum, Solana, and Cardano blockchains in parallel)
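
As a sketch of this integration pattern in plain client code (a Kafka Connect connector would follow the same idea), the following uses the Web3j Ethereum client library to subscribe to newly mined blocks and forward them to a Kafka topic. The node endpoint and topic name are hypothetical, and schemas, error handling, and backpressure tuning are omitted:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.web3j.protocol.Web3j;
import org.web3j.protocol.http.HttpService;

import java.util.Properties;

public class EthereumBlockIngestor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        Web3j web3 = Web3j.build(new HttpService("https://ethereum-node.example.com"));

        // Subscribe to newly mined blocks and publish block number and hash to Kafka,
        // where downstream consumers (dashboards, fraud detection, compliance checks)
        // process or replay them independently.
        web3.blockFlowable(false).subscribe(ethBlock -> {
            String hash = ethBlock.getBlock().getHash();
            String value = "{\"number\":" + ethBlock.getBlock().getNumber() + ",\"hash\":\"" + hash + "\"}";
            producer.send(new ProducerRecord<>("ethereum-blocks", hash, value));
        });

        // Keep the ingestor running (a real application would manage the lifecycle properly).
        try {
            Thread.currentThread().join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            producer.close();
            web3.shutdown();
        }
    }
}
```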

Continuous stateless or stateful data processing

  • Stream processing with Kafka Streams or ksqlDB enables real-time data processing in DeFi / trading / NFT / marketplaces
  • Most blockchain and crypto use cases require more than just data ingestion into a database or data lake – continuous stream processing adds enormous value to many problems
  • Aggregate chain data (like Bitcoin or Ethereum), for instance, smart contract states or price feeds like the price of cryptos against USD
  • Specialized ‘processors’ that take advantage of Kafka Streams’ utilities to perform aggregations, reductions, filtering, and other practical stateless or stateful operations

Building new business models and solutions on top of crypto and blockchain infrastructure

  • Custody for crypto investments in a fully integrated, end-to-end solution
  • Deployment and management of smart contracts via a blockchain API and user interface
  • Customer 360 and loyalty platforms, for instance, NFT integration into retail, gaming, social media for new customer experiences by sending a context-specific AirDrop to a wallet of the customer
  • These are just a few examples – the list goes on and on…

The following section shows a few real-world examples. Some are relatively simple monitoring tools. Others are complex and powerful banking platforms.

Real-world examples of Kafka in the crypto and DeFi world

I have already explored how some blockchain and crypto solutions (like R3’s Corda) use event streaming with Kafka under the hood of their platform. In contrast, the following focuses on several public real-world solutions that leverage Kafka as the data hub between blockchains / crypto / NFT markets and new business applications:

  • TokenAnalyst: Visualization of crypto markets
  • EthVM: Blockchain explorer and analytics engine
  • Kaleido: REST API Gateway for blockchain and smart contracts
  • Nash: Cloud-native trading platform for cryptocurrencies
  • Swisscom’s Custodigit: Crypto banking platform
  • Chainlink: Oracle network for connecting smart contracts from blockchains to the real world

TokenAnalyst – Visualization of crypto markets

TokenAnalyst is an analytics tool to visualize and analyze the crypto market. TokenAnalyst is an excellent example that leverages the Kafka stack (Connect, Streams, ksqlDB, Schema Registry) to integrate blockchain data from Bitcoin and Ethereum with their analytics tools.

Kafka Connect helps with integrating databases and data lakes. The integration with Ethereum and other cryptocurrencies is implemented via a combination of the official crypto APIs and the Kafka producer client API.

Kafka Streams provides a stateful streaming application to prevent invalid blocks in downstream aggregate calculations. For example, TokenAnalyst developed a block confirmer component that resolves reorganization scenarios by temporarily holding back blocks and only propagating them once a threshold of confirmations (child blocks mined on top of that block) is reached.

TokenAnalyst - Kafka Connect and Kafka Streams for Blockchain and Ethereum Integration

EthVM – A blockchain explorer and analytics engine

The beauty of public, decentralized blockchains like Bitcoin and Ethereum is transparency. The tamper-proof log enables Blockchain explorers to monitor and analyze all transactions.

EthVM is an open-source Ethereum blockchain data processing and analytics engine powered by Apache Kafka. The tool enables blockchain auditing and decision-making. EthVM verifies the execution of transactions and smart contracts, checks balances, and monitors gas prices. The infrastructure is built with Kafka Connect, Kafka Streams, and Schema Registry. A client-side visual block explorer is included, too.

EthVM - A Kafka based Crpyto and Blockchain Explorer

Kaleido – A Kafka-native Gateway for crypto and smart contracts

Kaleido provides enterprise-grade blockchain APIs to deploy and manage smart contracts, send Ethereum transactions, and query blockchain data. It hides the blockchain complexities of Ethereum transaction submission, thick Web3 client libraries, nonce management, RLP encoding, transaction signing, and smart contract management.

Kaleido offers REST APIs for on-chain logic and data. It is backed by a fully-managed high throughput Apache Kafka infrastructure.

Kaleido - REST API for Crypto like Ethereum powered by Apache Kafka

One exciting aspect of the above architecture: Kaleido also provides a native, direct Kafka connection from the client side besides the API (= HTTP) gateway. This is a clear trend I have discussed in earlier posts.

Nash – Cloud-native trading platform for cryptocurrencies

Nash is an excellent example of a modern trading platform for cryptocurrencies using blockchain under the hood. The heart of Nash’s platform leverages Apache Kafka. The following quote from their community page says:

“Nash is using Confluent Cloud, google cloud platform to deliver and manage its services. Kubernetes and apache Kafka technologies will help it scale faster, maintain top-notch records, give real-time services which are even hard to imagine today.”

Nash - Finserv Cryptocurrency Blockchain exchange and wallet leveraging Apache Kafka and Confluent Cloud

Nash provides the speed and convenience of traditional exchanges and the security of non-custodial approaches. Customers can invest in, make payments, and trade Bitcoin, Ethereum, NEO, and other digital assets. The exchange is the first of its kind, offering non-custodial cross-chain trading with the full power of a real order book. The distributed, immutable commit log of Kafka enables deterministic replayability in its exact order.

Swisscom’s Custodigit – A crypto banking platform powered by Kafka Streams

Custodigit is a modern banking platform for digital assets and cryptocurrencies. It provides crucial features and guarantees for seriously regulated crypto investments:

  • Secure storage of wallets
  • Sending and receiving on the blockchain
  • Trading via brokers and exchanges
  • Regulated environment (a key aspect and no surprise as this product is coming from the Swiss – a very regulated market)

Kafka is the central nervous system of Custodigit’s microservice architecture and stateful Kafka Streams applications. Use cases include workflow orchestration with the “distributed saga” design pattern for the choreography between microservices. Kafka Streams was selected because of:

  • lean, decoupled microservices
  • metadata management in Kafka
  • unified data structure across microservices
  • transaction API (aka exactly-once semantics)
  • scalability and reliability
  • real-time processing at scale
  • a higher-level domain-specific language for stream processing
  • long-running stateful processes

Architecture diagrams are only available in German, unfortunately. But I think you get the points:

  • Custodigit microservice architecture – some microservices integrate with brokers and stock markets, others with blockchain and crypto:

Custodigit microservice architecture

  • Custodigit Saga pattern for stateful orchestration – stateless business logic is truly decoupled, while the saga orchestrator keeps the state for choreography between the other services:

Custodigit Saga pattern for stateful orchestration

 

Chainlink – Oracle network for connecting smart contracts to the real world

Chainlink is the industry standard oracle network for connecting smart contracts to the real world. “With Chainlink, developers can build hybrid smart contracts that combine on-chain code with an extensive collection of secure off-chain services powered by Decentralized Oracle Networks. Managed by a global, decentralized community of hundreds of thousands of people, Chainlink introduces a fairer contract model. Its network currently secures billions of dollars in value for smart contracts across decentralized finance (DeFi), insurance, and gaming ecosystems, among others. The full vision of the Chainlink Network can be found in the Chainlink 2.0 white paper.”

Unfortunately, I could not find any public blog post or conference talks about Chainlink’s architecture. Hence, I can only let Chainlink’s job offering speak about their impressive Kafka usage for real-time observability at scale in a critical, transactional financial environment.

Chainlink is transitioning from traditional time series-based monitoring toward an event-driven architecture and alerting approach.

Chainlink Job Role - Blockchain Oracle Integration powered by Apache Kafka

This job offer sounds very interesting, doesn’t it? And it is a colossal task to solve cybersecurity challenges in this industry. If you look for a blockchain-based Kafka role, this might be for you.

Serverless Kafka enables focusing on the business logic in your crypto data hub infrastructure!

This article explored practical use cases for crypto marketplaces and the coming metaverse. Many enterprise architectures already leverage Apache Kafka and its ecosystem to build a scalable real-time data hub for crypto and non-crypto technologies.

This combination is the foundation for a metaverse ecosystem and innovative new applications, customer experiences, and business models. Don’t fear the metaverse. This discussion is not just about Meta (former Facebook), but about interoperability between many ecosystems to provide fantastic new user experiences (of course, with its drawbacks and risks, too).

A clear trend across all these fancy topics and buzzwords is the usage of serverless cloud offerings. This way, project teams can spend their time on the business logic instead of operating the infrastructure. Check out my articles about “serverless Kafka and its relation to cloud-native data lakes and lake houses” and my “comparison of Kafka offerings on the market” to learn more.

How do you use Apache Kafka with cryptocurrencies, blockchain, or DeFi applications? Do you deploy in the public cloud and leverage a serverless Kafka SaaS offering? What other technologies do you combine with Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka as Data Hub for Crypto, DeFi, NFT, Metaverse – Beyond the Buzz appeared first on Kai Waehner.

]]>
When to use Apache Camel vs. Apache Kafka? https://www.kai-waehner.de/blog/2022/01/28/when-to-use-apache-camel-vs-apache-kafka-for-etl-application-integration-event-streaming/ Fri, 28 Jan 2022 06:31:02 +0000 https://www.kai-waehner.de/?p=4161 Should I use Apache Camel or Apache Kafka for my next integration project? The question is very valid and comes up regularly. This blog post explores both open-source frameworks and explains the difference between application integration and event streaming. The comparison discusses when to use Kafka or Camel, when to combine them, when not to use them at all. A decision tree shows how you can quickly qualify out one for the other.

The post When to use Apache Camel vs. Apache Kafka? appeared first on Kai Waehner.

]]>
Should I use Apache Camel or Apache Kafka for my next integration project? The question is very valid and comes up regularly. This blog post explores both open-source frameworks and explains the difference between application integration and event streaming. The comparison discusses when to use Kafka or Camel, when to combine them, when not to use them at all. A decision tree shows how you can quickly qualify out one for the other.

Apache Camel vs Apache Kafka Comparison

 

The history of application integration and event streaming

My personal history and experience in application integration and event streaming are the following. It shows my background and how I see the integration and data streaming markets.

A discussion that started over a decade ago…

With my background of work in the last decade at Talend, TIBCO, and Confluent, the comparison between Camel and Kafka is very exciting as I have spent a lot of time with both open-source frameworks:

Apache Camel powered Talend ESB. Talend had a visual coding tool to design Camel routes with code generation. Unfortunately, the tool’s primary focus was Talend Data Integration (ETL and batch). The Camel-powered ESB code was integrated, but it was neither perfect nor complete.

TIBCO BusinessWorks competed with Talend ESB while TIBCO StreamBase competed with other stream processing solutions. The Kafka ecosystem came up more and more in conversations with customers.

CamelOne Kai Waehner Conference Speaker Apache Camel Open Source

I posted about “When to use Apache Camel” in 2011 already. In 2012, I did my first talk at an international software conference in the US. The name of the conference? CamelOne! A forum only about Apache Camel. What an exciting time. Claus Ibsen, THE Camel guy, wrote an excellent summary of CamelOne 2012 in Boston.

In my conference summary, I talked about my two talks. One of them covered a comparison between Apache Camel, Spring Integration, and Mulesoft ESB. The presentation has over 35000 views, and the number still goes up today.

 

… from application integration to event streaming

Over time, the buzzword “big data” came up more and more. I spent some time at Talend and TIBCO to learn new programming concepts such as Map-Reduce and Shuffling, mainly powered by Apache Hadoop and Apache Spark. The big data ecosystem snowballed with tens of frameworks such as Hive, HBase, Pig, and many more.

However, people soon realized that real-time data beats slow data in almost all use cases. The Lambda architecture was invented to separate real-time workloads from batch workloads. Event Streaming was born. Apache Kafka became the de facto standard for data streaming. Like CamelOne a decade ago, Kafka Summit is the one-stop-show for Kafka use cases, architectures, and success stories. Contrary to the small CamelOne, Kafka Summit is a global event with events across the globe, plus online events.

Data in motion with the Kappa architecture replacing Lambda

In 2014, a guy called Jay Kreps (few people knew him) was already questioning the Lambda architecture. Instead, he proposed to provide a single real-time layer to provide data for real-time and batch consumers. The Kappa architecture was born. Today, the Kappa architecture is mainstream, replacing Lambda. Various vendors adopt Kafka in the meantime.

Kappa Architecture with one Pipeline for Real Time and Batch

Confluent became the clear leader in the event streaming software category. Confluent Platform is powered by Apache Kafka. The focus is on event streaming. That’s different from most other vendors like Cloudera; they focus on 10-20 frameworks or products and try to combine and integrate them somehow. Today, Confluent Cloud is a complete game-changer providing Apache Kafka and its ecosystem for application integration and stream processing as a serverless cloud offering.

This is where we are today in 2022. Application integration (= Camel) and event streaming (= Kafka) play a critical role in every modern enterprise architecture. Open-source is widely adopted and usually preferred compared to proprietary solutions for various reasons, including avoiding vendor lock-in. That’s true for self-managed and serverless cloud offerings.

Hence, the question arises: Should I use Apache Camel for application integration or Apache Kafka for event streaming? Or both? Or does one solve the other, too? These questions will be answered in the following sections, concluding with a decision tree to help you make the right choice for your project.

Let’s look at the similarities between Camel and Kafka, when to use which framework, when and how to combine them, and when not to use them at all.

Features in Apache Camel AND Apache Kafka

Camel and Kafka have many positive and negative characteristics in common. Hence, it is no surprise that people compare the two frameworks:

  • Open source under Apache 2.0 license
  • Vibrant community and adoption in the industry
  • Mature framework with deployments in enterprises across the globe
  • Fixing point-to-point spaghetti architectures with a central integration backbone
  • Open architecture and extensibility with custom functions and connectors
  • Small and big deployments possible, plus single-node deployments for non-mission-critical use cases
  • Re-engineered and optimized for cloud-native deployments (container, Kubernetes, cloud)
  • Connectivity to any technology, API, communication paradigm, and SaaS
  • Transformation of any data types and formats
  • Processes transactional and analytical workloads
  • Domain-specific language (DSL) for message-at-a-time processing, with similar logic such as aggregation, filtering, and conditional processing
  • Relatively complex frameworks because of their robust feature sets, hence not suitable for solving only a minor problem
  • Not a replacement of a database, data warehouse, or data lake

Beyond the similarities, Kafka and Camel have very different sweet spots built to solve distinct problems. Hence, comparing these two tools is a bit like comparing apples and oranges. Some minor projects might use either one to solve the problem, but critical enterprise projects reveal the differences more quickly.

When to use Apache Camel?

The mission of Camel

Apache Camel is an integration framework. It solves a particular problem: Data integration between different applications, APIs, protocols, and communication paradigms. This concept is often called application integration or enterprise integration. Camel implements the famous Enterprise Integration Patterns (EIP). EIPs are based on messaging principles.
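
To make the EIP idea concrete, here is a minimal sketch of a Camel route using the Java DSL (assuming camel-core, camel-file, camel-xpath, and camel-jms are on the classpath with a configured JMS connection factory). The directory, queue names, and the priority attribute are hypothetical and only illustrate the Content-Based Router pattern:

    import org.apache.camel.builder.RouteBuilder;

    public class OrderRoute extends RouteBuilder {
        @Override
        public void configure() {
            // Poll XML order files and route them by content (EIP: Content-Based Router)
            from("file:orders/in?noop=true")
                .choice()
                    .when(xpath("/order/@priority = 'high'"))
                        .to("jms:queue:highPriorityOrders")
                    .otherwise()
                        .to("jms:queue:normalOrders");
        }
    }

The same routing DSL works with Camel’s many other components and EIPs.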

Camel’s strengths

  • Event-based backbone based on well-known and adopted EIP concepts
  • Connectivity to almost any API
  • Integration, processing, and routing of information with an intuitive domain-specific language (DSL) focused on integration; composability in code provides finer-grained control over conditional logic and transformations/reformatting
  • Powerful routing capabilities with many built-in EIPs
  • Many deployment options (standalone, web container, application server, Spring, OSGi, Kubernetes via the Camel K sub-project) – okay, I guess some options are not relevant in this decade anymore 🙂
  • Lightweight alternative to proprietary ETL and ESB tools

Camel’s weaknesses

  • Only a “routing machine”, i.e., not built for long-term storage (an additional cache or storage system is needed); for that reason, Camel is not the right choice for a central nervous system like Kafka
  • No stream processing (like you know it from Kafka Streams or Apache Flink)
  • Limited scalability, not built for massive volumes of data
  • No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
  • No serverless cloud offering; hence, it also does not compete with other iPaaS offerings
  • Red Hat is the only vendor supporting it
  • Built to be deployed in a single data center or cloud region, not across hybrid or multi-cloud scenarios

The evolution of Apache Camel

Camel is widely adopted and has a strong community. Unfortunately, from a vendor and support perspective, the offerings have declined in the last few years. One of the most significant pain points: I still don’t see a serverless cloud offering anywhere today:

The Evolution of Apache Camel 2

Camel TL;DR

Camel is an application integration framework to connect different applications and interfaces. Camel is NOT built for processing data in motion continuously, i.e., stream processing. Hence, it should be compared to ETL and ESB tools, not data streaming technologies like Kafka, Kinesis, or Flink. If you look for a serverless cloud offering, you are out of luck. If you look for vendor support, Red Hat is the only option.

When to use Apache Kafka?

The mission of Kafka

Real-time data beats slow data at any scale. The event streaming platform enables processing data in motion. Kafka is the de facto standard for event streaming, including messaging, data integration, stream processing, and storage. Kafka provides all capabilities in one infrastructure at scale. It is reliable and allows processing analytical and transactional workloads.
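
As a minimal sketch of what processing data in motion looks like from a client’s perspective, the following Java snippet publishes an event with the plain Kafka producer API. The broker address, topic name, key, and payload are assumptions for illustration only:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class PaymentProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish an event to the "payments" topic; any number of consumers can process it in real time
                producer.send(new ProducerRecord<>("payments", "order-4711", "{\"amount\": 99.95}"));
            }
        }
    }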

Kafka’s strengths

  • Event-based streaming platform
  • A unique combination of pub/sub messaging, data processing, data integration, and storage in a single framework
  • Built for massive volumes of data and extreme scale from the beginning, so a single framework can be used for transactional (low volume) and analytical (high volume) use cases
  • True decoupling between producers and consumers because of its storage component makes it the de facto standard for microservice architectures
  • Guaranteed ordering of events in the distributed commit log
  • Distributed data processing with fault-tolerance and recoverability built-in
  • Replayability of events
  • The de facto standard for event streaming
  • Built with hybrid and multi-cloud data replication in mind (with included tools like MirrorMaker and separate, more advanced, and more straightforward tools like Confluent Cluster Linking)
  • Support from many vendors, including Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and many more
  • Paradigm shift: Built to process data in motion end-to-end from source to one or more sinks

Kafka’s weaknesses

  • Paradigm shift: Enterprises need to learn and understand the added value of event streaming, a new software category that enables new use cases but also requires different design patterns and operations approaches
  • No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
  • Limited out-of-the-box routing capabilities (Kafka Connect SMT or Kafka Streams / ksqlDB app do the job very well, but not as simple as Camel)
  • Complex operations (if you run it by yourself instead of using 3rd party tools or even better a serverless cloud offering)

The evolution of Apache Kafka

Kafka was built at LinkedIn to process high volumes of data, as no other open-source framework could do this. Kafka found quick adoption after LinkedIn open-sourced it. Several vendors adopted Kafka and added it to their product portfolio. Some vendors just added Kafka for the sake of having it. Others innovated and used additional tools to make Kafka cloud-native for the next generation of event streaming. Kafka as a serverless cloud offering is a critical piece of many modern enterprise architectures today:

The Evolution of Apache Kafka 2

Kafka TL;DR

Kafka is an event streaming platform to process data in motion continuously. If you “just” need an integration framework to route data from a source to one or more sinks (= ETL / ESB), then Camel can be used, too. However, Kafka kills two birds with one stone (= integrating data AND processing it in motion where needed).

Plenty of Kafka offerings are available on the market. Check out the Apache Kafka landscape and comparison to understand the differences between offerings from Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and others.

Decision tree – Camel or Kafka?

The above sections explored when to use Camel and Kafka. So far, so good. Nevertheless, both frameworks overlap in their capabilities. Let’s get some help to choose the right one in that case.

Qualify out – the easiest way to start an evaluation!

The easiest way to decide on a specific option is to qualify out other frameworks that cannot fulfill the requirements.

Therefore, do you need

  • Big data processing?
  • A storage component for true decoupling and replayability of events?
  • Stateless or stateful stream processing?
  • A serverless cloud offering?

The above section discussed these differentiators of Kafka. In all these cases, you can qualify out Camel. It does not fulfill these requirements. This list of requirements is not necessarily complete. And you might also find a few aspects to qualify out Kafka from the beginning. Hence, you could also start from the Camel perspective and ask yourself: When should I not use Kafka? But I think it is easier the other way round.

Qualifying out solutions because of their limitations makes the decision tree and evaluation process much easier from the beginning.

Decision Tree for Camel and Kafka

Here is my decision tree to find out if Camel or Kafka is the right choice and what vendors you could evaluate:

Decision Tree Apache Camel vs Apache Kafka Comparison

When to use Camel and Kafka together?

It is possible to use Camel and Kafka together in a single integration architecture. Should you do that? Two options exist. One makes more sense than the other:

Kafka for event streaming and Camel for ETL

Camel and Kafka integrate well with each other. The native Kafka component of Camel is the best integration point and serves as a bridge between both environments:

Apache Camel and Apache Kafka in the Enterprise Architecture

The above architecture shows how Camel and Kafka live next to each other. Camel is used in a business domain for application integration. Kafka is the central nervous system between the Camel integration application and many other applications. I also added Kong as an API gateway to clarify that neither Camel nor Kafka is a silver bullet for every problem.
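
As a minimal sketch of such a bridge, a Camel route can forward events from its domain into the Kafka backbone via Camel’s native Kafka component. The JMS queue name, topic name, and broker address below are assumptions for illustration:

    import org.apache.camel.builder.RouteBuilder;

    public class LegacyToKafkaBridge extends RouteBuilder {
        @Override
        public void configure() {
            // Bridge the legacy messaging world of this business domain into the Kafka backbone
            from("jms:queue:legacyOrders")                     // hypothetical legacy queue
                .convertBodyTo(String.class)
                .to("kafka:orders?brokers=localhost:9092");     // Camel's Kafka component
        }
    }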

Once again, the vast advantage of Kafka as the central integration layer is its unique combination of characteristics within a single infrastructure, including:

  • Real-time messaging at any scale
  • Storage for true decoupling between different applications and communication paradigms
  • Built-in backpressure handling and replayability of events
  • Data integration
  • Stream processing

Real-time data replication across hybrid and multi-cloud environments is not shown in the above picture but is also part of the enterprise architecture out-of-the-box, leveraging the Kafka protocol.

With true decoupling within a modern microservice architecture, each business team can decide whether they need application integration (using Camel) or event streaming (using Kafka). Often, both could be used. Additional questions around single vs. multiple frameworks and APIs, vendor support, scalability needs, and other characteristics need to be evaluated to make the right choice for your business problem.

Camel connectors embedded into Kafka Connect

There is another way to combine Kafka and Camel: The “Camel Kafka Connector” sub-project of Apache Camel. Don’t get confused. This feature is not the Kafka component (= connector) of Camel! Instead, it is a relatively new initiative to deploy Camel components into the Kafka Connect infrastructure.

The obvious benefit: This way, you get hundreds of new connectors “for free” within the Kafka ecosystem. This capability sounds excellent. And it is!

However, consider the total cost of ownership and the overall efforts using this approach. Application integration is one of the most challenging problems in computer science – especially if you talk about transactional data sets that require zero data loss, exactly-once semantics, and no downtime. The more components you combine in the end-to-end data flow, the harder it gets to keep your performance and reliability SLAs.

Hence, using Camel components within Kafka Connect has a considerable disadvantage: Combining two frameworks with complexities and different design concepts. Just a few examples:

  • Kafka world: Partitions, Offsets, Leader and Follower, Key/Value/Header, connectors (based on Kafka Connect), Bootstrap Server, ConsumerRecord, Retention Time, etc.
  • Camel world: Routes, RouteBuilder, CamelContext, Exchange, Processor, components (Camel connectors), Endpoints, Type Converters, Registry, etc.

Please think twice before mixing two integration tools that are powerful but complex on their own. Getting this running is just one piece of the puzzle (the simple part). Don’t forget end-to-end testing, resiliency, SLAs, support across technologies and APIs. Even buying support for Camel and Kafka from Red Hat (i.e., a single vendor) does not improve this approach.

It is likely better to take the business logic and API calls out of the Camel component and copy them into a Kafka Connect connector template to run the integration natively with only Kafka code. This workaround allows a clean architecture, end-to-end integration with a single framework, a single vendor behind it, and much easier testing / debugging / monitoring.

TL;DR: I recommend only using the “Camel Kafka Connector” sub-project if the following options do not work:

  • Use only Apache Camel for application integration
  • Leverage Apache Kafka for event streaming and application integration
  • Choose separate deployments of Camel and Kafka and use the Camel-Kafka-Bridge

When NOT to use Camel or Kafka at all?

Once again, the easiest way for your evaluation to start is qualifying out tools that do not work to solve the problem.

Both Camel and Kafka are NOT built for the following scenarios:

  • A proxy for millions of clients (like mobile apps) – but native proxies (like a REST or MQTT Proxy for Kafka) exist for some use cases.
  • An API Management platform – but these tools are usually complementary and used to create life cycle management or monetize APIs deployed with Camel or Kafka.
  • A database for complex queries and batch analytics workloads
  • An IoT platform with features such as device management – but direct native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the right approach for (some) use cases.
  • A technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different kind of software than Camel or Kafka!

I wrote a very detailed post about this topic from a Kafka perspective. It maps almost 1:1 to the Camel world, too (and any related technology such as Flink, Spark, Pulsar, etc.): “When NOT to use Apache Kafka?”

Apache Camel vs. Apache Kafka – Who is the winner?

Simple answer: Both!

When you compare apples and oranges, you can still be happy when you are hungry, as both are good to eat. The same is true for Camel and Kafka: both can do application integration, but they serve very different needs.

Many integration scenarios can use Camel or Kafka.

Camel is the right tool if you need to integrate data within an application context or business unit (with no need for stream processing, true decoupling, replayability, large scale, replication across data centers or cloud regions).

Kafka is the central event-based nervous system across business units, regions, and hybrid clouds. Kafka is all about event streaming. Application integration is just a piece of this puzzle. On the other hand, I have seen plenty of integration projects powered by Apache Kafka. It often replaces other middleware. That’s true for ETL/ESB legacy modernization and for discussions about using a cloud-native iPaaS.

Do you use Camel or Kafka today? What use cases? How do you decide which one to choose? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to use Apache Camel vs. Apache Kafka? appeared first on Kai Waehner.

Streaming Data Exchange with Kafka and a Data Mesh in Motion https://www.kai-waehner.de/blog/2021/11/14/streaming-data-exchange-data-mesh-apache-kafka-in-motion/ Sun, 14 Nov 2021 13:25:45 +0000 https://www.kai-waehner.de/?p=3412 Data Mesh is a new architecture paradigm that gets a lot of buzz these days. This blog post looks deeper into this principle to explore why no single technology is the perfect fit to build a Data Mesh. Examples show why an open and scalable decentralized real-time platform like Apache Kafka is often the heart of the Data Mesh infrastructure, complemented by many other data platforms to solve business problems.

The post Streaming Data Exchange with Kafka and a Data Mesh in Motion appeared first on Kai Waehner.

Data Mesh is a new architecture paradigm that gets a lot of buzz these days. Every data and platform vendor describes how to build the best Data Mesh with their platform. The Data Mesh story includes cloud providers like AWS, data analytics vendors like Databricks and Snowflake, and Event Streaming solutions like Confluent. This blog post looks deeper into this principle to explore why no single technology is the perfect fit to build a Data Mesh. Examples show why an open and scalable decentralized real-time platform like Apache Kafka is often the heart of the Data Mesh infrastructure, complemented by many other data platforms, to solve business problems.

Streaming Data Exchange with Apache Kafka and Data Mesh in Motion

Data at Rest vs. Data in Motion

Before we get into the Data Mesh discussion, it is crucial to clarify the difference and relevance of Data at Rest and Data in Motion:

  • Data at Rest: Data is ingested and stored in a storage system (database, data warehouse, data lake). Business logic and queries execute against the storage. Everyday use cases include reporting with business intelligence tools, model training in machine learning, and complex batch analytics like shuffle or MapReduce jobs. As the data is at rest, the processing happens too late for real-time use cases.
  • Data in Motion: Data is processed and correlated continuously while new events are fed into the platform. Business logic and queries execute in real-time. Common real-time use cases include inventory management, order processing, fraud detection, predictive maintenance, and many other use cases.

Real-time Data Beats Slow Data

Real-time data beats slow data in almost all use cases across industries. Hence, ask yourself or your business team how they want or need to consume and process the data in the next project. Data at Rest and Data in Motion have trade-offs. Therefore, both concepts are complementary. For this reason, modern cloud infrastructures leverage both in their architecture. Serverless Event Streaming with Kafka combined with the AWS Lakehouse is a great resource to learn more.

However, while connecting a batch system to a real-time nervous system is possible, the other way round – connecting a real-time consumer to batch storage – is not possible. The Kappa vs. Lambda Architecture discussion gives more insights into this.

Kafka is a database. So, it is also possible to use it for data at rest. For instance, the replayability of historical events in guaranteed ordering is essential and helpful for many use cases. However, long-term storage in Kafka has several limitations, like limited query capabilities. Hence, for many use cases, event streaming and other storage systems are complementary, not competitive.
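
To illustrate the replayability of historical events mentioned above, here is a minimal sketch of a Java consumer that re-reads the retained history of a topic from the beginning. The broker address, topic, and partition are assumptions for illustration:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReplayConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
            props.put("group.id", "replay-demo");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition partition = new TopicPartition("orders", 0); // hypothetical topic
                consumer.assign(List.of(partition));
                consumer.seekToBeginning(List.of(partition));               // replay all retained events in order
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                    System.out.println(record.offset() + ": " + record.value());
                }
            }
        }
    }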

Data Mesh – An Architecture Paradigm

Data mesh is an implementation pattern (not unlike microservices or domain-driven design) but applied to data. Thoughtworks coined the term. You will find tons of resources on the web. Zhamak Dehghani gave a great introduction about “How to build the Data Mesh Foundation and its Relation to Event Streaming” at the Kafka Summit Europe 2021.

Domain-driven + Microservices + Event Streaming

Data Mesh is not an entirely new paradigm. It has several historical influences:

Data Mesh Architecture with Microservices Domain-driven Design Data Marts and Event Streaming

The architectural paradigm unlocks analytical data at scale, rapidly unlocking access to an ever-growing number of distributed domain data sets for a proliferation of consumption scenarios such as machine learning, analytics, or data-intensive applications across the organization. A data mesh addresses the common failure modes of the traditional centralized data lake or data platform architecture.

Data Mesh is a Logical View, not Physical!

Data mesh shifts to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create a self-serve data infrastructure, treating data as a product, and implementing open standardization to enable an ecosystem of interoperable distributed data products.

Here is an example of a Data Mesh:

Data Mesh with Apache Kafka

TL;DR: Data Mesh combines existing paradigms, including Domain-driven Design, Data Marts, Microservices, and Event Streaming.

Data as the Product

However, the differentiating aspect focuses on product thinking (“Microservice for Data”) with data as a first-class product. Data products are a perfect fit for Event Streaming with Data in Motion to build innovative new real-time use cases.

Data Product - The Domain Driven Microservice for Data

A Data Mesh with Event Streaming

Why is Event Streaming a good fit for data mesh?

Streams are real-time, so you can propagate data throughout the mesh immediately, as soon as new information is available. Streams are also persisted and replayable, so they let you capture both real-time AND historical data with one infrastructure. And because they are immutable, they make for a great source of record, which is helpful for governance.

Data in Motion is crucial for most innovative use cases. And as discussed before, real-time data beats slow data in almost all scenarios. Hence, it makes sense that the heart of a Data Mesh architecture is an Event Streaming platform. It provides true decoupling, scalable real-time data processing, and highly reliable operations across the edge, data center, and multi-cloud.

Kafka Streaming API – The De Facto Standard for Data in Motion

The Kafka API is the de facto standard for Event Streaming. I won’t explore this discussion again and again. Here are a few references before we move to the “Kafka + Data Mesh” content…

A Kafka-powered Data Mesh

I highly recommend watching Ben Stopford’s and Michael Noll’s talk about “Apache Kafka and the Data Mesh“. Several of the screenshots in this post are from that presentation, too. Kudos to my two colleagues! The talk explores the key concepts of a Data Mesh and how they are related to Event Streaming:

  • Domain-driven Decentralization
  • Data as a Self-serve Product
  • First-class Data Platform
  • Federated Governance

Let’s now explore how Event Streaming with Kafka fits into the Data Mesh architecture and how other solutions like a database or data lake complement it.

Data Exchange for Input and Output within a Data Mesh using Kafka

Data product, a “microservice for the data world”:

  • A node on the data mesh, situated within a domain.
  • Produces and possibly consumes high-quality data within the mesh.
  • Encapsulates all the elements required for its function, namely data plus code plus infrastructure.

A Data Mesh is not just one Technology!

The heart of a Data Mesh infrastructure must be real-time, decoupled, reliable, and scalable. Kafka is a modern cloud-native enterprise integration platform (also often called iPaaS today). Therefore, Kafka provides all the capabilities for the foundation of a Data Mesh.

However, not all components can or should be Kafka-based. Choose the right tool for a problem. Let’s explore in the following subsections how Kafka-native technologies and other solutions are used in a Data Mesh together.

Stream Processing within the Data Product with Kafka Streams and ksqlDB

An event-based data product aggregates and correlates information from one or more data sources in real-time. Stateless and stateful stream processing is implemented with Kafka-native tools such as Kafka Streams or ksqlDB:

Event Streaming within the Data Product with Stream Processing Kafka Streams and ksqlDB
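
As a minimal sketch of such a data product, the following Kafka Streams application consumes an input topic, applies a stateless filter, and publishes the curated result to an output topic (the data product’s output port). The application id, topic names, and filter condition are hypothetical:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class VehicleDataProduct {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "vehicle-data-product"); // hypothetical id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumption: local broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> telemetry = builder.stream("vehicle-telemetry"); // hypothetical input topic
            telemetry
                .filter((vehicleId, payload) -> payload != null && payload.contains("\"speed\"")) // stateless filtering
                .to("curated-vehicle-positions");                                    // output port of the data product

            new KafkaStreams(builder.build(), props).start();
        }
    }

ksqlDB can express the same logic as a SQL statement instead of Java code.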

Variety of Protocols and Communication Paradigms within the Data Product – HTTP, gRPC, MQTT, and more

Obviously, not every application uses just Event Streaming as a technology and communication paradigm. The above picture shows how one consumer application could also use a request/response technology like HTTP or gRPC to do a pull query. In contrast, another application continuously consumes the streaming push query with a native Kafka consumer in any programming language, such as Java, Scala, C, C++, Python, Go, etc.

The data product often includes complementary technologies. For instance, if you built a connected car infrastructure, you likely use MQTT for the last-mile integration, ingest the data into Kafka, and do further processing with Event Streaming. The “Kafka + MQTT Blog Series” is an excellent example from the IoT space to learn about building a data product with complementary technologies.

Variety of Solutions within the Data Product – Event Streaming, Data Warehouse, Data Lake, and more

The beauty of microservice architectures is that every application can choose the right technologies. An application might or might not include databases, analytics tools, or other complementary components. The input and output data ports of the data product should be independent of the chosen solutions:

Data Stores within the Data Product with Snowflake MongoDB Oracle et al

Kafka Connect is the right Kafka-native technology to connect other technologies and communication paradigms with the Event Streaming platform. Evaluate if you need another integration middleware (like an ETL or ESB) or if the Kafka infrastructure is the better enterprise integration platform (iPaaS) for your data product within the data mesh.

A Global Streaming Data Exchange

The Data Mesh concept is relevant for global deployments, not just within a single project or region. Multiple Kafka clusters are the norm, not an exception. I wrote about customers using Event Streaming with Kafka in global architectures a long time ago.

Various architectures exist to deploy Kafka across data centers and multiple clouds. Some use cases require low latency and deploy some Kafka instances at the edge or in a 5G zone. Other use cases replicate data between regions, countries, or continents across the globe for disaster recovery, aggregation, or analytics use cases.

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

Hybrid Cloud Streaming Data Mesh powered by Apache Kafka and Cluster Linking

This example shows all the characteristics discussed in the above sections for a Data Mesh:

  • Decentralized real-time infrastructure across domains and infrastructures
  • True decoupling between domains within and between the clouds
  • Several communication paradigms, including data streaming, RPC, and batch
  • Data integration with legacy and cloud-native technologies
  • Continuous stream processing where it adds value, and batch processing in some analytics sinks

Example: A Streaming Data Exchange across Domains in the Automotive Industry

The following example from the automotive industry shows how independent stakeholders (= domains in different enterprises) use a cross-company streaming data exchange:

Streaming Data Exchange with Data Mesh in Motion using Apache Kafka and Cluster Linking

Innovation does not stop at a company’s own borders. Streaming replication is relevant for all use cases where real-time is better than slow data (valid for most scenarios). A few examples:

  • End-to-end supply chain optimization from suppliers to the OEM to the intermediary to the aftersales
  • Track and trace across countries
  • Integration of 3rd party add-on services to the own digital product
  • Open APIs for embedding and combining external services to build a new product

I could go on and on with the list. Many data products need to be accessible by 3rd parties in real-time at scale. Some API gateway or API management tool comes into play in such a situation.

A real-world example of a streaming data exchange powered by Kafka is the mobility service Here Technologies. They expose the Kafka API to directly consume streaming data from their mapping services (as an alternative option to their HTTP API):

Here Technologies Mobility Service Apache Kafka Open API

However, even if all collaborating partners use Kafka under the hood in their architecture, exposing the Kafka API directly to the outside world does not always make sense. Some technical capabilities (e.g., access control or connectivity to thousands of devices) and missing business functions (e.g., for monetization or reporting) of the Kafka ecosystem bring an API layer on top of the Event Streaming infrastructure into play in many real-world deployments.

Open API for 3rd Party Integration and Streaming API Management

API Gateways and API Management tools exist in many varieties, including open-source frameworks, commercial products, and SaaS cloud offerings. Features include technical routing, access control, monetization, and reporting.

However, most people still implement the Open API concept with RPC in mind. I guess 95+% still use HTTP(S) to make APIs accessible to other stakeholders (e.g., other business units or external parties). RPC makes little sense in a streaming Data Mesh architecture if the data needs to be processed at scale in real-time.

There is still an impedance mismatch between Event Streaming and API Management. But it gets better these days. Specifications like AsyncAPI, calling itself the “industry standard for defining asynchronous APIs”, and similar approaches bring Open API to the data streaming world. My post “Kafka versus API Management with tools like MuleSoft, Kong, or Apigee” is still pretty much accurate if you want to dive deeper into this discussion. IBM API Connect was one of the first vendors that integrated Kafka via Async API.

A great example of the evolution from RPC to streaming APIs is the machine learning space. “Streaming Machine Learning with Kafka-native Model Deployment” explores how model servers such as Seldon enhance their product with a native Kafka API besides HTTP and gRPC request-response communication:

Kafka-native Machine Learning Model Server Seldon

Journey to the Streaming Data Mesh with Kafka

The paradigm shift is enormous. Data Mesh is not a free lunch. The same was and still is true for microservice architectures, domain-driven design, Event Streaming, and other modern design principles.

In analogy to Confluent’s maturity model for Event Streaming, our team has described the journey for deploying a streaming Data Mesh:

Data Mesh Journey with Event Streaming and Apache Kafka

The efforts likely take a few years in most scenarios. The shift is not just about technologies; adjustments to organizations and business processes are just as necessary. I guess most companies are still in a very early stage. Let me know where you are on this journey!

Streaming Data Exchange as Foundation for a Data Mesh

A Data Mesh is an implementation pattern, not a specific technology. However, most modern enterprise architectures require a decentralized streaming data infrastructure to build valuable and innovative data products in independent, truly decoupled domains. Hence, Kafka, being the de facto standard for Event Streaming, comes into play in many Data Mesh architectures.

Many Data Mesh architectures span many domains in various regions or even continents. The deployments run at the edge, on-prem, and in multiple clouds. The integration connects many solutions and technologies with different communication paradigms.

A cloud-native Event Streaming infrastructure with the capability to link clusters with each other out-of-the-box enables building a modern Data Mesh. No Data Mesh will use just one technology or vendor. Learn from the inspiring posts from your favorite data product vendors like AWS, Snowflake, Databricks, Confluent, and many more to define and build your custom Data Mesh successfully. Data Mesh is a journey, not a big bang.

Did you already start building your Data Mesh? How does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Streaming Data Exchange with Kafka and a Data Mesh in Motion appeared first on Kai Waehner.

Apache Kafka in the Public Sector – Part 3: Government and Citizen Services https://www.kai-waehner.de/blog/2021/10/15/apache-kafka-public-sector-part-3-government-citizen-services/ Fri, 15 Oct 2021 06:54:26 +0000 https://www.kai-waehner.de/?p=3807 The public sector includes many different areas. Some groups, like the military, leverage cutting-edge technology. Others, like the public administration, are years or even decades behind. This blog series explores both edges to show how data in motion powered by Apache Kafka adds value for innovative new applications and modernizing legacy IT infrastructures. This is part 3: Use cases and architectures for Government and Citizen Services.

The post Apache Kafka in the Public Sector – Part 3: Government and Citizen Services appeared first on Kai Waehner.

The public sector includes many different areas. Some groups, like the military, leverage cutting-edge technology. Others, like the public administration, are years or even decades behind. This blog series explores how the public sector leverages data in motion powered by Apache Kafka to add value for innovative new applications and to modernize legacy IT infrastructures. This post is part 3: Use cases and architectures for Government and Citizen Services.

Apache Kafka for Government and Citizen Services in the Public Sector

Blog series: Apache Kafka in the Public Sector and Government

This blog series explores why many governments and public infrastructure sectors leverage event streaming for various use cases. Learn about real-world deployments and different architectures for Kafka in the public sector:

  1. Life is a Stream of Events
  2. Smart City
  3. Citizen Services (THIS POST)
  4. Energy and Utilities
  5. National Security

Subscribe to my newsletter to get updates immediately after the publication. In addition, I will update the above list with direct links to this blog series’ posts once they are published.

As a side note, if you wonder why healthcare is not on the above list: healthcare is another blog series on its own. While the government can provide public healthcare through national healthcare systems, it is part of the private sector in many other cases.

Government and Citizen Services powered by Apache Kafka

We talk a lot about customer 360 and customer experience in the private sector. The public sector is different. Increasing revenue is usually not the primary motivation. For that reason, most countries’ citizen-related processes are terrible experiences.

Nevertheless, the same fact holds true as in the private sector: Real-time data beats slow data! It increases efficiency and makes citizens happy. I want to share some real-world examples from the US and Europe where data in motion improved processes and reduced bureaucracy.

Norwegian Work and Welfare Department (NAV) – Personal Life as a Stream of Events

The Norwegian Work and Welfare Department (NAV) supports unemployment benefits, health insurance, social security, pensions, and parental benefits. The organization has 23,000 employees and handles USD 50B in disbursements per year. They assist people through all phases of life within work, family, health, retirement, and social security.

NAV had an impressive vision: Imagine a government that knows just enough about you to provide services without you applying for them first:

True Decoupling and Domain Driven Design with Apache Kafka at NAV

This vision is a reality today. NAV already presented the implementation at Kafka Summit 2018. NAV implemented the “life is a stream of events” idea by leveraging event streaming technology. Citizens get a tremendous real-time experience, while the public administration has optimized processes and reduced costs:

Life is a Stream of Events powered by Apache Kafka at NAV

NAV’s real-time event streaming infrastructure provides several vast benefits to the enterprise architecture:

  • The integration infrastructure enables proper decoupling between the applications with domain-driven design (DDD) powered by Apache Kafka. This approach supports diverging forces in the government.
  • Data as a product enables announcing the arrival of events so that the various business domains can trigger processes and update domain-specific data. You might know this concept from the new buzzword “data mesh“.
  • Digitalization reduced bureaucracy, not by turning paper into digital forms, but by reengineering and optimizing the business processes (by avoiding spaghetti architectures).
  • Data privacy and accuracy are ensured, including compliance with laws like GDPR via data minimization, purpose limitation, and data portability.

U.S. Department of Veterans Affairs (VA) – Improved Benefit Services

The United States Department of Veterans Affairs (VA) is a department of the federal government charged with providing integrated, life-long healthcare services to eligible military veterans.

Event streaming powered by Apache Kafka improved the government’s veteran benefit services for ratings, awards, and claims. The business value is enormous for the veterans, their families, and the government:

  • Assess, route, and verify the service request in real-time.
  • Improve the experience for veterans and their families.
  • Reduce management and operations complexity.

Government providing Omnichannel Claim Management leveraging Kafka

Let’s take a look at the claim example. Several ways exist to file a claim: in person (via a VSO), over the internet, or by postal mail to the VA. Such a challenge is called omnichannel customer experience in the retail industry. The IT infrastructure needs to handle changes in the status of a claim, requests for additional documentation, context when calling into the call center, due benefits when checking in at a hospital, and so on.

Event Streaming powered by Apache Kafka is a great approach to implement omnichannel requirements due to the unique combination of real-time data processing and storage, not just in retail but also in government services. VA chose Confluent for true decoupling, real-time integration, and omnichannel communication. Components include Kafka Streams, ksqlDB, Oracle CDC, Tiered Storage, etc.

Here is an excellent quote from the Kafka Summit presentation: “Implementing Confluent enables our agile teams to create reusable solutions that unlock Veteran data; provide real-time actionable information, all while reducing complexity and load on existing data sources.”

University of California, San Diego – Integration Platform as a Service (iPaaS)

The University of California, San Diego (UC) is one of the world’s leading public research universities, located in beautiful La Jolla, California. The COVID pandemic forced them into a “once in a lifetime” transition.

UC had to build out their online learning platform due to the pandemic, plus address the new #1 priority: a comfortable and reliable student testing process. The biggest challenge in this process is the integration between many legacy applications and modern technologies.

Apache Kafka is the de facto standard for modern integration and middleware projects today. You might have seen my content discussing the difference between traditional middleware like MQ, ETL, ESB, iPaaS, and the Apache Kafka ecosystem. For similar reasons, the University of California, San Diego chose Confluent as the cloud-native Integration-platform-as-a-service (iPaaS) middleware layer to set data in motion for 90 million records a day.

iPaaS “Swiss Army Knife” of Integration for the Government with Kafka

iPaaS Swiss Army Knife of Integration powered by Apache Kafka

A key benefit is the faster time to market enabled by agile development and decoupled applications. Additionally, this opens up new revenue streams for other UC campuses – including the UC Office of the President – and supports tracking student health.

Government Benefit: From Cost Center to Profit Center with Kafka

A modern, scalable, and reliable real-time middleware layer enables new use cases beyond integration. A great example from UC is providing the next best action and contextual knowledge in real-time:

Next Best Action with Kafka Streams and Stream Processing

Continuous data processing from different sources in motion (aka stream processing or streaming analytics) makes these use cases possible. UC leverages Kafka Streams. Kafka-native tools such as Kafka Streams or ksqlDB enable end-to-end data processing at scale in real-time without the need for yet another big data framework (like Storm, Flink, or Spark Streaming).

Data in Motion for Comfortable Citizen Services and Reduced Government Bureaucracy

Real-time data beats slow data. That’s not just true for the private sector. This post showed several real-world examples of how the government can improve processes, reduce costs and bureaucracy, and improve citizens’ experience.

Apache Kafka and its ecosystem provide the capabilities to implement a modern, scalable, reliable real-time middleware layer. Additionally, stream processing allows building new innovative applications that have not been possible before.

How do you leverage event streaming in the public sector? Are you working on citizen services or other government projects? What technologies and architectures do you use? Which projects have you already worked on or are planning? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Public Sector – Part 3: Government and Citizen Services appeared first on Kai Waehner.

Apache Kafka in the Financial Services Industry https://www.kai-waehner.de/blog/2021/01/18/apache-kafka-financial-services-industry-open-banking-api-finserv-payment-fraud-middleware-messaging-transactions/ Mon, 18 Jan 2021 13:24:41 +0000 https://www.kai-waehner.de/?p=3041 Event streaming in financial services is growing like crazy. Continuous real-time data integration and processing are mandatory for many use cases. Apache Kafka is deployed across the financial services business departments for mission-critical transactional workloads and big data analytics. High scalability, high reliability, and an elastic open infrastructure are the key reasons for the success of Kafka. This blog post explores different use cases, architectures, and real-world examples in the FinServ sector.

The post Apache Kafka in the Financial Services Industry appeared first on Kai Waehner.

Event streaming in financial services is growing like crazy. Continuous real-time data integration and processing are mandatory for many use cases. Many business departments in the financial services sector deploy Apache Kafka for mission-critical transactional workloads and big data analytics. High scalability, high reliability, and an elastic open infrastructure are the key reasons for Kafka’s success. This blog post explores different use cases, architectures, and real-world examples in the FinServ sector.

Innovation in Financial Services and Open Banking with Apache Kafka

FinServ Enterprise Reality: Innovate or be disrupted!

There is no way around it: The new business reality is very different from the last decades:

In the past:

  • Technology was a support function.
  • Innovation was required for growth.
  • It was “good enough” to run on yesterday’s data.

Today:

  • Technology is the business.
  • Innovation is required for survival.
  • Yesterday’s data = failure.
  • A modern, real-time data infrastructure is required.

Only two options exist for enterprises in the finance sector: Innovate or be disrupted!

Event Streaming and The Rise of Apache Kafka in Financial Services and Banking

The New FinServ Enterprise Reality – Every Company is a Software Company

Please take a look at your favorite traditional bank and how its market cap compares to new FinTech companies such as Robinhood, Stripe, Square, or Revolut.

Some traditional companies re-invented themselves to focus on innovative new products and great customer experience to stay competitive. Software is eating the world, including the finance sector.

Here are a few examples:

  • Capital One: 10,000 of 40,000 employees are software engineers
  • Goldman Sachs: 1.5B (billion!) lines of code across 7,000+ applications
  • JPMorgan Chase & Co: Employs over 50,000 people in technology and has $10B+ technology spend.

Most successful post-modern companies in the finance sector heavily rely on Apache Kafka. This is true for emerging fintechs but also for (some) traditional banks.

Apache Kafka in Financial Services

Various use cases have emerged for deploying event streaming with Apache Kafka in the finance industry. This includes mission-critical transactional workloads like payment processing or regulatory reporting, as well as big data analytics projects leveraging Machine Learning, data lakes, etc.

Examples for Real-World Deployments

Here are a few companies leveraging Apache Kafka for banking projects:

Event Streaming with Apache Kafka in Financial Services

Check past Kafka Summit video recordings and slides for details about use cases and architectures of these companies from the finance sector.

Here are a few concrete examples:

  • Capital One: Becoming truly event-driven – offering a service other parts of the bank can use.
  • ING: Significantly improved customer experience – as a differentiator + Fraud detection and cost savings.
  • Nordea: Able to meet strict regulatory requirements around real-time reporting + cost savings.
  • Paypal: Processing 400+ Billion events per day for user behavioral tracking, merchant monitoring, risk & compliance, fraud detection, and other use cases.
  • Royal Bank of Canada (RBC): Mainframe off-load, better CX & fraud detection – brought many parts of the bank together
  • 10X Banking: Cloud-native and open core-banking platform to implement a next-generation FinServ platform
  • Robinhood: Commission-free stock trading using a mobile app and website.

This is just a concise list of companies in the financial sector using Apache Kafka as the event streaming platform at the heart of their business. Plenty of other examples are available from tens of global banks leveraging Apache Kafka for many use cases.

Kafka makes your Business Real-Time

The huge advantage is that Kafka allows decoupling your applications and infrastructure in a domain-driven design (DDD). Each microservice can use its own technology or product but leverage the same data (with security and privacy in mind, of course):

Event Streaming and The Rise of Apache Kafka in Banking

It is great to see that many FinServ companies do not just leverage Kafka in their applications but also contribute to the community. For instance, Robinhood published Faust: A stream processing library, porting Kafka Streams’ ideas from Java to Python.

This is a great example of a microservice architecture and the freedom of technology choice: Robinhood did not want to use Java for (some) applications and chose Python instead. No problem with Kafka as the brokers are dumb. The data processing and business logic happen in the clients in your favorite programming language.

And to be clear: Financial services are important in every company! Payments, orders, fraud, and similar transactional and analytical data rely on FinServ applications and the integration with partners.

Slides – The Rise of Event Streaming in FinServ

The following slide deck goes into more detail. Learn about the rise of event streaming in the financial services industry. Kafka is adopted in more and more scenarios:

Kafka in Banking for Middleware, Mainframe, Machine Learning, Open API, and more

The following links share additional content related to many banking and FinServ use cases and architectures:

Please check them out to learn more about the usage of event streaming with Kafka and its ecosystem across various business units in the Finserv sector.

Last but not least, please be aware that the term “real-time” is used in many contexts and can have different meanings. Read “Kafka is NOT hard real-time” to understand why Kafka is used in most banking projects, but not for the specific use case of trading in microseconds.

Software is Eating the Banks and FinServ Industry

Software is eating the world, including financial services. Continuous real-time data integration and processing are mandatory for many use cases. Apache Kafka is deployed across industries for mission-critical transactional workloads and big data analytics. No matter if you need to integrate with legacy systems, process mission-critical payment data, or build batch reports and analytic models, Kafka is a predominant choice as part of the architecture. Hybrid, edge, and multi-cloud deployments of Kafka are the new black.

What are your experiences and plans for event streaming in the financial services industry? Did you already build applications with Apache Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Financial Services Industry appeared first on Kai Waehner.

Kafka and XML Messages – Transformation, Connector, Middleware https://www.kai-waehner.de/blog/2020/09/25/kafka-xml-messages-transformation-connector-middleware-comparison-connect-smt-esb-etl-web-services-soap-wsdl-schema/ Fri, 25 Sep 2020 07:20:15 +0000 https://www.kai-waehner.de/?p=2689 XML messages and XML Schema are not very common in the Apache Kafka and Event Streaming world! Why?…

The post Kafka and XML Messages – Transformation, Connector, Middleware appeared first on Kai Waehner.

XML messages and XML Schema are not very common in the Apache Kafka and Event Streaming world! Why? Many people call XML legacy. It is complex, verbose, and often associated with the ugly WS-* Hell (SOAP, WSDL, etc.). On the other hand, every company older than five years uses XML. It is well understood, provides a good structure, and is human- and machine-readable.

This post does not want to start another flame war between XML and other technologies such as JSON (which also provides JSON Schema now), Avro, or Protobuf. Instead, I will walk you through the three main approaches to integrate Kafka and XML messages, as there is still a vast demand for implementing this integration today (often for integrating legacy applications and middleware).

XML and XML Schema

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium’s XML 1.0 Specification of 1998 and several other related specifications – all of them free open standards – define XML.

The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services. Several schema systems exist to aid in defining XML-based languages, while programmers have developed many application programming interfaces (APIs) to assist the processing of XML data.

SOAP / WSDL Web Services – The WS-* Hell

Web Services use XML messages that follow the SOAP standard and have been popular with traditional enterprises for many years. In such systems, there is often a machine-readable description of the operations offered by the service written in the Web Services Description Language (WSDL). Web Services are one of the predominant use cases for XML integration. Some people call this the “WS-* Hell”:

XML Web Service Hell - WS-*

Kafka for any Data Format (JSON, XML, Avro, Protobuf, …)

Kafka can store and process anything, including XML. The Kafka brokers are dumb. They don’t care about data formats. The implementation of Kafka under the hood stores and processes only byte arrays. This approach follows the design principle of dumb pipes and smart endpoints (coined by Martin Fowler for microservice architectures). Dumb brokers are one of the architectural reasons why Kafka scales and performs so well.

As Kafka supports any data format, XML is no problem at all. It accepts any serializable input format. XML is just text, so plain string serializers can be used. However, if you want additional validation before pushing messages into Kafka (like checking that the content is actually XML or doing schema validation using XML Schema), then you need to write your own XML serializer/deserializer implementation.
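
As a minimal sketch (not a production-ready implementation), such a custom serializer could verify well-formedness before the bytes are written to Kafka; XML Schema validation could be plugged in at the same place. The class name is hypothetical:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.apache.kafka.common.errors.SerializationException;
    import org.apache.kafka.common.serialization.Serializer;

    public class ValidatingXmlSerializer implements Serializer<String> {
        @Override
        public byte[] serialize(String topic, String xml) {
            try {
                // Parse to verify the payload is well-formed XML; add XSD validation here if needed
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            } catch (Exception e) {
                throw new SerializationException("Payload is not well-formed XML", e);
            }
            return xml.getBytes(StandardCharsets.UTF_8);
        }
    }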

XML mappings can be very complex, including referencing other documents and using plenty of ugly open standards. XML includes generic standards such as WS-Security or industry-specific standards such as XBRL for regulatory processing or HL7 for healthcare. It is a mess. Period. Even mature tools struggle as soon as any parts of such a standard are adjusted to a company’s own needs (even though the standards support customization).

Kafka works with any programming language, and Confluent also provides a REST Proxy for HTTP(S) communication with Kafka. But these clients require the developer to implement all the complex XML mapping and processing. Hence, let’s now talk about the most common approaches to implement the integration between any XML-based application and Apache Kafka: 3rd party middleware (ETL / ESB tools) and Kafka-native Kafka Connect.

Kafka and XML via Middleware (ETL, ESB)

Apache Kafka and traditional middleware (ETL, ESB) are frenemies (friends and enemies). Check out this blog and video recording/slide deck for a more in-depth discussion and comparison. If legacy integration tools such as TIBCO BusinessWorks or Software AG webMethods do one thing well, then it is graphical mapping of complex XML structures, including good and mature (but not 100%) support of the ugly WS-* web service standards.

XML Kafka Integration with 3rd Party Middleware - ESB ETL Tools

Pros of 3rd Party Middleware for XML-Kafka Integration

  • Visual coding for a more straightforward mapping experience (especially crucial for very complex structures) – for all coders: Trust me, this is really easier and more time-efficient than writing, testing, and debugging source code!
  • Mature (10-20 years old technologies don’t have many bugs anymore – if they are still alive)
  • Support of complex XML Schema structures (but yet often issues with import, UI, and export)
  • (Often) already in place (implemented, tested, and deployed)
  • Kafka integration exists for any middleware which is still alive (i.e., maintained and supported by the vendor)

Cons of 3rd Party Middleware for XML-Kafka Integration

  • Products are legacy – as old as the source systems
  • Monolithic, inflexible architecture
  • Separate infrastructure to operate, test, maintain and pay
  • End-to-end integration is more challenging (from a technical and support perspective) as there are two systems in the middle instead of just one
  • Licensing
  • Point-to-point and tight coupling, and not event-based streaming with real decoupling
  • (Often) proprietary solution

Traditional middleware (such as TIBCO, IBM, Software AG, or Mulesoft) complements Kafka deployments. If you have the middleware already running and licensed (and do not plan to migrate away from it), then this is a viable approach to integrate XML messages from legacy systems with Kafka.

Kafka and XML via Kafka Connect

The open Kafka ecosystem provides Kafka-native support for XML integration leveraging Kafka Connect. Kafka Connect is a Kafka-native tool for scalable and reliable streaming data integration between Apache Kafka and other data systems. It makes it simple to quickly define connectors that move large data sets into and out of Kafka. Think about it as ESB-on-Kafka.

Use cases include

  • Messaging integration (MQ)
  • Mainframe offloading
  • File outputs from batch processes
  • Web services (SOAP / WSDL / WS-*)
  • Legacy applications

Pros

  • Kafka-native (leveraging Kafka under the hood for scalability, throughput, high availability, exactly-once semantics, low latency, etc.)
  • Decoupled design (Domain-driven Microservice approach instead of tight coupling)
  • Open ecosystem and flexible integration with any data sources and sinks – check out Confluent Hub for hundreds of open source and commercial connectors
  • Cloud-native to be deployed in any edge / on-premise or cloud infrastructure such as Kubernetes

Cons

  • Limited support for complex XML Schemas and standards – not all ugly documents will work well
  • No visual coding – unfortunately, no Kafka-native visual coding tools exist in 2020. Let’s go, Confluent! 🙂

Let’s take a look at two Kafka Connect approaches in more detail: A dedicated XML Connector and an SMT (Single Message Transformation) embedded into any Kafka Connect source or sink connector.

Kafka Connect Connector for XML Files

An XML connector directly accesses the XML file to parse and transform the content:

[Figure: Kafka XML Integration with Kafka Connect XML Connector]

Connect FilePulse is an open-source Kafka Connect connector built by streamthoughts to parse, transform, and stream any XML file. Other file formats are also supported. But since many other tools already support modern data formats such as JSON, CSV, Avro, or Protobuf, the real highlight of this connector is its XML support.

Features:

  • Support for recursive scanning of local directories
  • Reading and writing files into Kafka line by line
  • Support for multiple input file formats (e.g., CSV, JSON, Avro, XML)
  • Parsing and transforming data using built-in or custom processing filters
  • Error handler definition
  • Monitoring of files while they are written into Kafka
  • Support for pluggable strategies to clean up completed files

Here is an excellent tutorial for using this XML connector for Kafka Connect: Streaming data into Kafka – Loading an XML file.

The Connect FilePulse Kafka Connector is the right choice for direct integration between XML files and Kafka.
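
To make this concrete, here is a rough sketch of a FilePulse source connector configuration for XML files. The class and property names follow the FilePulse documentation at the time of writing and may differ between connector versions; the directory path, topic name, and broker address are placeholders:

```properties
# Sketch: FilePulse source connector that streams local XML files into Kafka.
name=xml-filepulse-source
connector.class=io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector
tasks.max=1

# Target Kafka topic (placeholder)
topic=xml-orders

# Scan a local directory for *.xml files every 10 seconds (placeholder path)
fs.scan.directory.path=/data/xml-input/
fs.scan.interval.ms=10000
fs.scan.filters=io.streamthoughts.kafka.connect.filepulse.scanner.local.filter.RegexFileListFilter
file.filter.regex.pattern=.*\\.xml$

# Parse each file with the XML reader and track progress per file name
task.reader.class=io.streamthoughts.kafka.connect.filepulse.reader.XMLFileInputReader
offset.strategy=name

# Clean up completed files (FilePulse ships several cleanup policies)
fs.cleanup.policy.class=io.streamthoughts.kafka.connect.filepulse.clean.LogCleanupPolicy

# Internal status reporting used by FilePulse (placeholder broker address)
internal.kafka.reporter.bootstrap.servers=broker:9092
internal.kafka.reporter.topic=connect-file-pulse-status
```

The tutorial linked above walks through a similar setup end to end.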

SMT for Embedding XML Transformations into ANY Kafka Connect Connector

An SMT (Single Message Transformation) is part of the Kafka Connect framework. SMTs are applied to messages as they flow through Kafka Connect. They transform inbound messages after a source connector has produced them, but before they are written to Kafka. SMTs transform outbound messages before they are sent to a sink connector.

An SMT can be embedded into any Kafka Connect source or sink connector. Hence, the XML SMT for Kafka Connect enables direct integration with any interface and maps XML messages on the fly, without the need to store a file or use a dedicated XML connector.

[Figure: Kafka XML Integration with SMT and ANY Source / Sink Connector]

SMTs even allow you to add or change metadata, e.g., by adding a new header in addition to the key and value of the message.

Here is an example: Receive XML messages from JMS-based messaging platforms and convert the XML payload to JSON, Avro, or Protobuf for further processing and integration into the rest of the (modern) enterprise architecture. For instance, Confluent provides a generic JMS connector but also dedicated connectors for various legacy MQ products such as IBM MQ (often running on the mainframe), TIBCO EMS, and ActiveMQ. The XML SMT allows on-the-fly transformation of the incoming XML messages. I have seen the integration (and later replacement) of these MQ tools across the globe in all kinds of industries.
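
A sketch of such a pipeline could look as follows. It assumes Confluent's generic JMS source connector and the community kafka-connect-transform-xml SMT (FromXml) from Confluent Hub; the exact connector class, connection properties, and SMT options depend on the products and versions you deploy, and the queue, topic, XSD, and host names are placeholders:

```properties
# Sketch: JMS source connector with an XML SMT that converts the payload
# into a structured record, serialized as Avro via Schema Registry.
name=jms-xml-source
connector.class=io.confluent.connect.jms.JmsSourceConnector
tasks.max=1

# Read from a JMS queue and write to a Kafka topic (placeholders)
jms.destination.name=ORDERS.QUEUE
jms.destination.type=queue
kafka.topic=orders-from-mq
# ... JMS connection properties depend on the broker (IBM MQ, TIBCO EMS, ActiveMQ) ...

# Parse the XML payload against an XSD and turn it into a structured record
transforms=fromXml
transforms.fromXml.type=com.github.jcustenborder.kafka.connect.transform.xml.FromXml$Value
transforms.fromXml.schema.path=file:///schemas/order.xsd

# Serialize the structured record as Avro using Schema Registry (placeholder URL)
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081
```

From this point on, downstream consumers work with governed Avro records and never have to deal with the original XML payload.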

Dead Letter Queue (DLQ) for Handling Bad XML Messages

Just transforming messages is often not sufficient. A Dead Letter Queue (DLQ), aka Dead Letter Channel, is an Enterprise Integration Pattern (EIP) to handle bad messages. This design pattern complements XML integration. For instance, a DLQ can store XML messages that could not be processed because they did not conform to the XSD used in the transformation. Here is an example of how to implement the rerouting to a DLQ using the above SMT.
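
As a minimal sketch, Kafka Connect's built-in error handling (introduced with KIP-298) can implement this pattern declaratively. The dead letter queue topic settings apply to sink connectors, while error tolerance and logging also work for source connectors; the topic name and replication factor are placeholders:

```properties
# Sketch: Kafka Connect error handling with a Dead Letter Queue.

# Keep the connector running when a record fails in a converter or SMT
errors.tolerance=all

# Log failed records (optionally including the message content) for debugging
errors.log.enable=true
errors.log.include.messages=true

# Route failed records to a DLQ topic, with failure context in record headers
errors.deadletterqueue.topic.name=dlq-xml-orders
errors.deadletterqueue.context.headers.enable=true
errors.deadletterqueue.topic.replication.factor=3
```

A separate consumer or connector can then inspect, repair, or replay the messages from the DLQ topic.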

Confluent Schema Registry for Data Governance

The Confluent Schema Registry is a complementary (optional) tool. It provides a smart implementation of data format and content validation (including enforcement, versioning, and other features). I see it used in ~70% of Kafka projects across the globe. As soon as you do more than just data ingestion into a data lake like HDFS or S3, the added value is enormous. Today, Confluent Schema Registry supports JSON Schema, Avro, and Protobuf.

The Schema Registry provides an open interface and is pluggable. For example, some users have asked for Schema Registry to support XML. Because the interface is pluggable, you can add XML support to Schema Registry directly and use Schema Registry to store both XML and Avro schemas at the same time. For more on how to add your own schema formats, please refer to the documentation. If you do not want to extend Schema Registry, the workaround is to do the discussed XML-to-another-format transformation first and start your event streaming data governance from that point.
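
To illustrate the hand-over, here is a sketch of the converter settings you would typically use once the XML payload has been converted to a structured format; the converter classes are the standard Confluent converters, and the Schema Registry URL is a placeholder:

```properties
# Sketch: connector- or worker-level converter settings that put Schema
# Registry in charge of schema enforcement and versioning.
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081

# JSON Schema and Protobuf converters are available as alternatives, e.g.:
# value.converter=io.confluent.connect.json.JsonSchemaConverter
# value.converter=io.confluent.connect.protobuf.ProtobufConverter
```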

Summary

XML is predominant in most enterprises and mostly used for legacy applications, batch processing, and SOAP / WSDL web services. A digital transformation can only be successful if the old world is connected well to the new world (as you might have learned in my example about how to integrate between Kafka and Mainframes).

This post explored the three most common options for integration between Kafka and XML:

  • XML integration with a 3rd party middleware
  • Kafka Connect connector for integration with XML files
  • Kafka Connect SMT to embed the transformation into ANY other Kafka Connect source or sink connector to transform XML messages on the fly

What are your experiences with XML integration for Kafka? Which implementation did you choose? What challenges did you face, and how did you or do you plan to solve this? What is your strategy? Let’s connect on LinkedIn and discuss it!

 

The post Kafka and XML Messages – Transformation, Connector, Middleware appeared first on Kai Waehner.
