Shift Left Architecture for AI and Analytics with Confluent and Databricks
https://www.kai-waehner.de/blog/2025/05/09/shift-left-architecture-for-ai-and-analytics-with-confluent-and-databricks/ (May 9, 2025)

Confluent and Databricks enable a modern data architecture that unifies real-time streaming and lakehouse analytics. By combining shift-left principles with the structured layers of the Medallion Architecture, teams can improve data quality, reduce pipeline complexity, and accelerate insights for both operational and analytical workloads. Technologies like Apache Kafka, Flink, and Delta Lake form the backbone of scalable, AI-ready pipelines across cloud and hybrid environments.

Modern enterprise architectures are evolving. Traditional batch data pipelines and centralized processing models are being replaced by more flexible, real-time systems. One of the driving concepts behind this change is the Shift Left approach. This blog compares Databricks’ Medallion Architecture with a Shift Left Architecture popularized by Confluent. It explains where each concept fits best—and how they can work together to create a more complete, flexible, and scalable architecture.

Shift Left Architecture with Confluent Data Streaming and Databricks Lakehouse Medallion

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Future articles will explore how these platforms affect data use in businesses. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter). And download my free book about data streaming use cases, including more details about the Shift Left Architecture with data streaming and lakehouses.

Medallion Architecture: Structured, Proven, but Not Always Optimal

The Medallion Architecture, popularized by Databricks, is a well-known design pattern for organizing and processing data within a lakehouse. It provides structure, modularity, and clarity across the data lifecycle by breaking pipelines into three logical layers:

  • Bronze: Ingest raw data in its original format (often semi-structured or unstructured)
  • Silver: Clean, normalize, and enrich the data for usability
  • Gold: Aggregate and transform the data for reporting, dashboards, and machine learning
Databricks Medallion Architecture for Lakehouse ETL
Source: Databricks

This layered approach is valuable for teams looking to establish governed and scalable data pipelines. It supports incremental refinement of data and enables multiple consumers to work from well-defined stages.

Challenges of the Medallion Architecture

The Medallion Architecture also introduces challenges:

  • Pipeline delays: Moving data from Bronze to Gold can take minutes or longer—too slow for operational needs
  • Infrastructure overhead: Each stage typically requires its own compute and storage footprint
  • Redundant processing: Data transformations are often repeated across layers
  • Limited operational use: Data is primarily at rest in object storage; using it for real-time operational systems often requires inefficient reverse ETL pipelines.

For use cases that demand real-time responsiveness and/or critical SLAs—such as fraud detection, personalized recommendations, or IoT alerting—this traditional batch-first model may fall short. In such cases, an event-driven streaming-first architecture, powered by a data streaming platform like Confluent, enables faster, more cost-efficient pipelines by performing validation, enrichment, and even model inference before data reaches the lakehouse.

Importantly, this data streaming approach doesn’t replace the Medallion pattern—it complements it. It allows you to “shift left” critical logic, reducing duplication and latency while still feeding trusted, structured data into Delta Lake or other downstream systems for broader analytics and governance.

In other words, shifting data processing left (i.e., before it hits a data lake or Lakehouse) is especially valuable when the data needs to serve multiple downstream systems—operational and analytical alike—because it avoids duplication, reduces latency, and ensures consistent, high-quality data is available wherever it’s needed.

Shift Left Architecture: Process Earlier, Share Faster

In a Shift Left Architecture, data processing happens earlier—closer to the source, both physically and logically. This often means:

  • Transforming and validating data as it streams in
  • Enriching and filtering in real time
  • Sharing clean, usable data quickly across teams AND different technologies/applications

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

This is especially useful for:

  • Reducing time to insight
  • Improving data quality at the source
  • Creating reusable, consistent data products
  • Operational workloads with critical SLAs

How Confluent Enables Shift Left with Databricks

In a Shift Left setup, Apache Kafka provides scalable, low-latency, and truly decoupled ingestion of data across operational and analytical systems, forming the backbone for unified data pipelines.

Schema Registry and data governance policies enforce consistent, validated data across all streams, ensuring high-quality, secure, and compliant data delivery from the very beginning.

Apache Flink enables early data processing — closer to where data is produced. This reduces complexity downstream, improves data quality, and allows real-time decisions and analytics.

Shift Left Architecture with Confluent Databricks and Delta Lake

Data Quality Governance via Data Contracts and Schema Validation

Flink can enforce data contracts by validating incoming records against predefined schemas (e.g., using JSON Schema, Apache Avro or Protobuf with Schema Registry). This ensures structurally valid data continues through the pipeline. In cases where schema violations occur, records can be automatically routed to a Dead Letter Queue (DLQ) for inspection.

Confluent Schema Registry for good Data Quality, Policy Enforcement and Governance using Apache Kafka

Additionally, data contracts can enforce policy-based rules at the schema level—such as field-level encryption, masking of sensitive data (PII), type coercion, or enrichment defaults. These controls help maintain compliance and reduce risk before data reaches regulated or shared environments.
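
To make this concrete, here is a minimal Flink sketch (not Confluent's actual data-contract implementation) that routes records failing a contract check to a side output, which a Kafka sink could then write to a DLQ topic. The `isValid` helper is a placeholder for real schema validation against the registered Avro, JSON Schema, or Protobuf schema.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class ContractValidation {

    // Side output for records that violate the data contract
    public static final OutputTag<String> DLQ = new OutputTag<String>("dlq") {};

    public static SingleOutputStreamOperator<String> validate(DataStream<String> rawEvents) {
        return rawEvents.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String value, Context ctx, Collector<String> out) {
                if (isValid(value)) {
                    out.collect(value);      // structurally valid records continue through the pipeline
                } else {
                    ctx.output(DLQ, value);  // violations go to the Dead Letter Queue side output
                }
            }
        });
    }

    // Placeholder: a real implementation validates against the schema from Schema Registry.
    static boolean isValid(String record) {
        return record != null && !record.isBlank();
    }
}
```

The main stream feeds the downstream topics, while `validate(rawEvents).getSideOutput(ContractValidation.DLQ)` can be attached to a dedicated Kafka sink for inspection and replay.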

Flink can perform the following tasks before data ever lands in a data lake or warehouse:

Filtering and Routing

Events can be filtered based on business rules and routed to the appropriate downstream system or Kafka topic. This allows different consumers to subscribe only to relevant data, optimizing both performance and cost.
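
The sketch below shows such rule-based routing with two filters feeding separate Kafka topics. The broker address, topic names, and the `Order` type are illustrative assumptions, not part of any specific product.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.datastream.DataStream;

// Minimal illustrative event type with public fields.
class Order {
    public String orderId;
    public String productId;
    public double amount;

    public String toJson() {
        return "{\"orderId\":\"" + orderId + "\",\"productId\":\"" + productId
                + "\",\"amount\":" + amount + "}";
    }
}

public class OrderRouter {

    // Helper that builds a plain String sink for a given topic.
    static KafkaSink<String> kafkaSink(String topic) {
        return KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic(topic)
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();
    }

    // Routes orders by a simple business rule into dedicated topics.
    public static void route(DataStream<Order> orders) {
        orders.filter(o -> o.amount >= 10_000)
              .map(Order::toJson)
              .sinkTo(kafkaSink("orders.high-value"));

        orders.filter(o -> o.amount < 10_000)
              .map(Order::toJson)
              .sinkTo(kafkaSink("orders.standard"));
    }
}
```

Each downstream consumer then subscribes only to the topic it actually needs.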

Metric Calculation

Use Flink to compute rolling aggregates (e.g., counts, sums, averages, percentiles) over windows of data in motion. This is useful for business metrics, anomaly detection, or feeding real-time dashboards—without waiting for batch jobs.
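
As a sketch of such a rolling metric, the snippet below computes one-minute revenue per product, reusing the illustrative `Order` type from the routing sketch above. It uses processing-time windows to keep the example short; production pipelines typically use event time with watermarks.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class RevenueMetrics {

    // Sums order amounts per product over tumbling one-minute windows.
    public static DataStream<Tuple2<String, Double>> perMinuteRevenue(DataStream<Order> orders) {
        return orders
                .map(o -> Tuple2.of(o.productId, o.amount))
                .returns(Types.TUPLE(Types.STRING, Types.DOUBLE))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
    }
}
```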

Real-Time Joins and Enrichment

Flink supports both stream-stream and stream-table joins. This enables real-time enrichment of incoming events with contextual information from reference data (e.g., user profiles, product catalogs, pricing tables), often sourced from Kafka topics, databases, or external APIs.

Streaming ETL with Apache Flink SQL
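
A hedged sketch of such an enrichment in Flink SQL, submitted from Java via the Table API, could look as follows. The table and column names are illustrative and assumed to be registered beforehand (e.g., via `CREATE TABLE ... WITH ('connector' = 'kafka', ...)`); production jobs often prefer a temporal (versioned) join over a regular join to keep state bounded.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class OrderEnrichmentJob {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Continuously enriches each order with customer context and writes to a sink table.
        // executeSql() submits the streaming INSERT as a long-running job.
        tableEnv.executeSql(
                "INSERT INTO enriched_orders " +
                "SELECT o.order_id, o.amount, c.segment, c.country " +
                "FROM orders AS o " +
                "JOIN customers AS c ON o.customer_id = c.customer_id");
    }
}
```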

By shifting this logic to the beginning of the pipeline, teams can reduce duplication, avoid unnecessary storage and compute costs in downstream systems, and ensure that data products are clean, policy-compliant, and ready for both operational and analytical use—as soon as they are created.

Example: A financial application might use Flink to calculate running balances, detect anomalies, and enrich records with reference data before pushing to Databricks for reporting and training analytic models.
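
A minimal sketch of the running-balance part could use Flink's keyed state; the `Transaction` type here is illustrative, not a real API.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Minimal illustrative event type with public fields.
class Transaction {
    public String accountId;
    public double amount;
}

// Keyed by account id; emits the updated running balance after every transaction.
public class RunningBalance extends KeyedProcessFunction<String, Transaction, String> {

    private transient ValueState<Double> balance;

    @Override
    public void open(Configuration parameters) {
        balance = getRuntimeContext().getState(
                new ValueStateDescriptor<>("balance", Double.class));
    }

    @Override
    public void processElement(Transaction tx, Context ctx, Collector<String> out) throws Exception {
        double current = balance.value() == null ? 0.0 : balance.value();
        double updated = current + tx.amount;
        balance.update(updated);
        out.collect(ctx.getCurrentKey() + ": " + updated);
    }
}
```

It would be applied via something like `transactions.keyBy(tx -> tx.accountId).process(new RunningBalance())`. Because the balance lives in Flink's fault-tolerant keyed state, the computation survives restarts and scales with the number of accounts.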

In addition to enhancing data quality and reducing time-to-insight in the lakehouse, this approach also makes data products immediately usable for operational workloads and downstream applications—without building separate pipelines.

Learn more about stateless and stateful stream processing in real-time architectures using Apache Flink in this in-depth blog post.

Combining Shift Left with Medallion Architecture

These architectures are not mutually exclusive. Shift Left is about processing data earlier. Medallion is about organizing data once it arrives.

You can use Shift Left principles to:

  • Pre-process operational data before it enters the Bronze layer
  • Ensure clean, validated data enters Silver with minimal transformation needed
  • Reduce the need for redundant processing steps between layers

Confluent’s Tableflow bridges the two worlds. It converts Kafka streams into Delta tables, integrating cleanly with the Medallion model while supporting real-time flows.

Shift Left with Delta Lake, Iceberg, and Tableflow

Confluent Tableflow makes it easy to publish Kafka streams into Delta Lake or Apache Iceberg formats. These can be discovered and queried inside Databricks via Unity Catalog.

This integration:

  • Simplifies integration, governance and discovery
  • Enables live updates to AI features and dashboards
  • Removes the need to manage Spark streaming jobs

This is a natural bridge between a data streaming platform and the lakehouse.

Confluent Tableflow to Unify Operational and Analytical Workloads with Apache Iceberg and Delta Lake
Source: Confluent

AI Use Cases for Shift Left with Confluent and Databricks

The Shift Left model benefits both predictive and generative AI:

  • Model training: Real-time data pipelines can stream features to Delta Lake
  • Model inference: In some cases, predictions can happen in Confluent (via Flink) and be pushed back to operational systems instantly
  • Agentic AI: Real-time event-driven architectures are well suited for next-gen, stateful agents

Databricks supports model training and hosting via MosaicML. Confluent can integrate with these models, or run lightweight inference directly from the stream processing application.

Data Warehouse Use Cases for Shift Left with Confluent and Databricks

  • Batch reporting: Continue using Databricks for traditional BI
  • Real-time analytics: Flink or real-time OLAP engines (e.g., Apache Pinot, Apache Druid) may be a better fit for sub-second insights
  • Hybrid: Push raw events into Databricks for historical analysis and use Flink for immediate feedback

Where you do the data processing depends on the use case.

Architecture Benefits Beyond Technology

Shift Left also brings architectural benefits:

  • Cost Reduction: Processing early can lower storage and compute usage
  • Faster Time to Market: Data becomes usable earlier in the pipeline
  • Reusability: Processed streams can be reused and consumed by multiple technologies/applications (not just Databricks teams)
  • Compliance and Governance: Validated data with lineage can be shared with confidence

These are important for strategic enterprise data architectures.

Bringing in New Types of Data

Shift Left with a data streaming platform supports a wider range of data sources:

  • Operational databases (like Oracle, DB2, SQL Server, Postgres, MongoDB)
  • ERP systems (SAP et al)
  • Mainframes and other legacy technologies
  • IoT interfaces (MQTT, OPC-UA, proprietary IIoT gateway, etc.)
  • SaaS platforms (Salesforce, ServiceNow, and so on)
  • Any other system that does not directly fit into the “table-driven analytics perspective” of a Lakehouse

With Confluent, these interfaces can be connected in real time, enriched at the edge or in transit, and delivered to analytics platforms like Databricks.

This expands the scope of what’s possible with AI and analytics.

Shift Left Using ONLY Databricks

A Shift Left Architecture built only with Databricks is possible, too. A Databricks consultant took my Shift Left slide and adjusted it accordingly:

Shift Left Architecture with Databricks and Delta Lake

 

Relying solely on Databricks for a “Shift Left Architecture” can work if all workloads (should) stay within the platform — but it’s a poor fit for many real-world scenarios.

Databricks focuses on ELT, not true ETL, and lacks native support for operational workloads like APIs, low-latency apps, or transactional systems. This forces teams to rely on reverse ETL tools – a clear anti-pattern in the enterprise architecture – just to get data where it’s actually needed. The result: added complexity, latency, and tight coupling.

The Shift Left Architecture is valuable, but in most cases it requires a modular approach, where streaming, operational, and analytical components work together — not a monolithic platform.

That said, shift left principles still apply within Databricks. Processing data as early as possible improves data quality, reduces overall compute cost, and minimizes downstream data engineering effort. For teams that operate fully inside the Databricks ecosystem, shifting left remains a powerful strategy to simplify pipelines and accelerate insight.

Meesho: Scaling a Real-Time Commerce Platform with Confluent and Databricks

Many high-growth digital platforms adopt a shift-left approach out of necessity—not as a buzzword, but to reduce latency, improve data quality, and scale efficiently by processing data closer to the source.

Meesho, one of India’s largest online marketplaces, relies on Confluent and Databricks to power its hyper-growth business model focused on real-time e-commerce. As the company scaled rapidly, supporting millions of small businesses and entrepreneurs, the need for a resilient, scalable, and low-latency data architecture became critical.

To handle massive volumes of operational events — from inventory updates to order management and customer interactions — Meesho turned to Confluent Cloud. By adopting a fully managed data streaming platform using Apache Kafka, Meesho ensures real-time event delivery, improved reliability, and faster application development. Kafka serves as the central nervous system for their event-driven architecture, connecting multiple services and enabling instant, context-driven customer experiences across mobile and web platforms.

Alongside their data streaming architecture, Meesho migrated from Amazon Redshift to Databricks to build a next-generation analytics platform. Databricks’ lakehouse architecture empowers Meesho to unify operational data from Kafka with batch data from other sources, enabling near real-time analytics at scale. This migration not only improved performance and scalability but also significantly reduced costs and operational overhead.

With Confluent managing real-time event processing and ingestion, and Databricks providing powerful, scalable analytics, Meesho is able to:

  • Deliver real-time personalized experiences to customers
  • Optimize operational workflows based on live data
  • Enable faster, data-driven decision-making across business teams

By combining real-time data streaming with advanced lakehouse analytics, Meesho has built a flexible, future-ready data infrastructure to support its mission of democratizing online commerce for millions across India.

Shift Left: Reducing Complexity, Increasing Value for the Lakehouse (and other Operational Systems)

Shift Left is not about replacing Databricks. It’s about preparing better data earlier in the pipeline—closer to the source—and reducing end-to-end complexity.

  • Use Confluent for real-time ingestion, enrichment, and transformation
  • Use Databricks for advanced analytics, reporting, and machine learning
  • Use Tableflow and Delta Lake to govern and route high-quality data to the right consumers

This architecture not only improves data quality for the lakehouse, but also enables the same real-time data products to be reused across multiple downstream systems—including operational, transactional, and AI-powered applications.

The result: increased agility, lower costs, and scalable innovation across the business.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter). And download my free book about data streaming use cases, including more details about the Shift Left Architecture with data streaming and lakehouses.

Apache Flink: Overkill for Simple, Stateless Stream Processing and ETL?
https://www.kai-waehner.de/blog/2025/01/14/apache-flink-overkill-for-simple-stateless-stream-processing/ (January 14, 2025)

Discover when Apache Flink is the right tool for your stream processing needs. Explore its role in stateful and stateless processing, the advantages of serverless Flink SaaS solutions like Confluent Cloud, and how it supports advanced analytics and real-time data integration together with Apache Kafka. Dive into the trade-offs, deployment options, and strategies for leveraging Flink effectively across cloud, on-premise, and edge environments, and when to use Kafka Streams or Single Message Transforms (SMT) within Kafka Connect for ETL instead of Flink.

When discussing stream processing engines, Apache Flink often takes center stage for its advanced capabilities in stateful stream processing and real-time data analytics. However, a common question arises: is Flink too heavyweight for simple, stateless stream processing and ETL tasks? The short answer for open-source Flink is often yes. But the story evolves significantly when looking at SaaS Flink products such as Confluent Cloud’s Flink offering, with its serverless architecture, multi-tenancy, consumption-based pricing, and no-code/low-code capabilities like Flink Actions. This post explores the considerations and trade-offs to help you decide when Flink is the right tool for your data streaming needs, and when Kafka Streams or Single Message Transforms (SMTs) within Kafka Connect are the better choice.

Apache Flink - Overkill for Simple Stateless Stream Processing

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch.

The Nature of Stateless Stream Processing

Stateless stream processing, as the name implies, processes each event independently, with no reliance on prior events or context. This simplicity lends itself to use cases such as filtering, transformations, and simple ETL operations. Stateless tasks are:

  • Efficient: They don’t require state management, reducing overhead.
  • Scalable: Easily parallelized since there is no dependency between events.
  • Minimalistic: Often achievable with simpler, lightweight frameworks like Kafka Streams or Kafka Connect’s Single Message Transforms (SMT).

For example, filtering transactions above a certain amount or transforming event formats for downstream systems are classic stateless tasks that demand minimal computational complexity.

In these scenarios, deploying a robust and feature-rich framework like open-source Apache Flink might seem excessive. Flink’s rich API and state management features are unnecessary for such straightforward use cases. Instead, tools with smaller footprints and simpler deployment models, such as Kafka Streams, often suffice.
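
To illustrate how lightweight such a stateless pipeline can be, here is a minimal Kafka Streams sketch that filters large payments and reshapes them for a downstream topic. The topic names, broker address, and the assumed payload format are placeholders; the point is that this runs as a plain Java application, with no separate processing cluster to operate.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PaymentFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments");

        // Stateless filter and format transformation: no state stores, no windows.
        // Payload format "paymentId,amount,currency" is an illustrative assumption.
        payments.filter((key, value) -> Double.parseDouble(value.split(",")[1]) > 10_000)
                .mapValues(value -> {
                    String[] fields = value.split(",");
                    return "{\"id\":\"" + fields[0] + "\",\"amount\":" + fields[1]
                            + ",\"currency\":\"" + fields[2] + "\"}";
                })
                .to("large-payments");

        new KafkaStreams(builder.build(), props).start();
    }
}
```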

Apache Flink is a powerhouse. It’s designed for advanced analytics, stateful processing, and complex event patterns. But this sophistication of the open source framework comes with complexity:

  1. Operational Overhead: Setting up and maintaining Flink in an open-source environment can require significant infrastructure and expertise.
  2. Resource Intensity: Flink’s distributed architecture and stateful processing capabilities are resource-hungry, often overkill for tasks that don’t require stateful processing.
  3. Complexity in Development: The Flink API is robust but comes with a steeper learning curve. The combination with Kafka (or another streaming engine) requires understanding two frameworks. In contrast, Kafka Streams is Kafka-native, offering a single, unified framework for stream processing, which can reduce complexity for basic tasks.

For organizations that need to perform straightforward stateless operations, investing in the full Flink stack can feel like using a sledgehammer to crack a nut. Having said this, Flink SQL simplifies development for certain personas, providing a more accessible interface beyond just Java and Python.

The conversation shifts dramatically with serverless Flink cloud offerings, such as Confluent Cloud, which address many of the challenges associated with running open-source Flink. Let’s unpack what makes serverless Flink a more attractive choice, even for simpler use cases.

1. Serverless Architecture

With a Serverless stream processing service, Flink operates on a fully serverless model, eliminating the need for heavy infrastructure management. This means:

  • No Setup Hassles: Developers focus purely on application logic, not cluster provisioning or tuning.
  • Elastic Scaling: Resources automatically scale with the workload, ensuring efficient handling of varying traffic patterns without manual intervention. One of the biggest challenges of self-managing Flink is over-provisioning resources to handle anticipated peak loads.  Elastic Scaling mitigates this inefficiency.

2. Multi-Tenancy

Multi-tenant design allows multiple applications, teams or organizations to share the same infrastructure securely. This reduces operational costs and complexity compared to managing isolated clusters for each workload.

3. Consumption-Based Pricing

One of the key barriers to adopting Flink for simple tasks is cost. A truly Serverless Flink offering mitigates this with a pay-as-you-go pricing model:

  • You only pay for the resources you use, making it cost-effective for both lightweight and high-throughput workloads.
  • It aligns with the scalability of stateless stream processing, where workloads may spike temporarily and then taper off.

4. Bridging the Gap with No-Code and Low-Code Solutions

The rise of citizen integrators and the demand for low-code/no-code solutions have reshaped how organizations approach data streaming. Less-technical users, such as business analysts or operational teams, often face challenges when trying to engage with technical platforms designed for developers.

Low-code/no-code tools address this by providing intuitive interfaces that allow users to build, deploy, and monitor pipelines without deep programming knowledge. These solutions empower business users to take charge of simple workflows and integrations, significantly reducing time-to-value while minimizing the reliance on technical teams.

For example, capabilities like Flink Actions in Confluent Cloud offer a user-friendly approach to deploying stream processing pipelines without coding. By simplifying the process and making it accessible to non-technical stakeholders, these tools enhance collaboration and ensure faster outcomes without compromising performance or scalability. For instance, you can perform ETL functions such as transformation, deduplication, or masking fields:

Confluent Cloud - Apache Flink Action UI for No Code Low Code Streaming ETL Integration
Source: Confluent

Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Products

When choosing between SaaS and PaaS for data streaming, it’s essential to understand the key differences.

SaaS solutions, like Confluent Cloud, offer a fully managed, serverless experience with automatic scaling, low operational overhead, and pay-as-you-go pricing.

In contrast, PaaS requires users to manage infrastructure, configure scaling policies, and handle more operational complexity.

While many products are marketed as “serverless,” not all truly abstract infrastructure or eliminate idle costs—so scrutinize claims carefully.

SaaS is ideal for teams focused on rapid deployment and simplicity, while PaaS suits those needing deep customization and control. Ultimately, SaaS ensures scalability and ease of use, making it a compelling choice for most modern streaming needs. Always dive into the technical details to ensure the platform aligns with your goals. Don’t trust the marketing slogans of the vendors!

Stateless vs. Stateful Stream Processing: Blurring the Lines

Even if your current use case is stateless, it’s worth considering the potential for future needs. Stateless pipelines often evolve into more complex systems as businesses grow, requiring features like:

  • State Management: For event correlation and pattern detection.
  • Windows and Aggregations: To derive insights from time-series data.
  • Joins: To enrich data streams with contextual information.
  • Integrating Multiple Data Sources: To seamlessly combine information from various streams for a comprehensive and cohesive analysis.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless and Stateful Stream Processing with Kafka Streams and Apache Flink

With a SaaS Flink service such as Confluent Cloud, you can start small with stateless tasks and seamlessly scale into stateful operations as needed, leveraging Flink’s full capabilities without a complete overhaul.

While Flink may feel like overkill for simple, stateless tasks in its open-source form, its potential is unmatched in these scenarios:

  • Enterprise Workloads: Scalable, reliable, and fault-tolerant systems for mission-critical applications.
  • Data Integration and Preparation (Streaming ETL): Flink enables preprocessing, cleansing, and enriching data at the streaming layer, ensuring high-quality data reaches downstream systems like data lakes and warehouses.
  • Complex Event Processing (CEP): Detecting patterns across events in real time.
  • Advanced Analytics: Stateful stream processing for aggregations, joins, and windowed computations.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless stream processing is often achieved using lightweight tools like Kafka Streams or Single Message Transforms (SMTs) within Kafka Connect. SMTs enable inline transformations, such as normalization, enrichment, or filtering, as events pass through the integration framework. This functionality is available in Kafka Connect (provided by Confluent, IBM/Red Hat, Amazon MSK and others) and tools like Benthos for Redpanda. SMTs are particularly useful for quick adjustments and filtering data before it reaches the Kafka cluster, optimizing resource usage and data flow.

While Kafka Streams and Kafka Connect’s SMTs handle many stateless workloads effectively, Apache Flink offers significant advantages for all types of workloads—whether simple or complex, stateless or stateful.

Stream processing in Flink enables true decoupling within the enterprise architecture (as it is not bound to the Kafka cluster like Kafka Streams and Kafka Connect). The benefits are separation of concerns with a domain-driven design (DDD) and improved data governance. And Flink provides interfaces for Java, Python and SQL. Something for (almost) everyone. This makes Flink ideal for ensuring clean, modular architectures and easier scalability.

Stream Processing and ETL with Apache Kafka Streams Connect SMT and Flink

By processing events from diverse sources and preparing them for downstream consumption, Flink supports both lightweight and comprehensive workflows while aligning with domain boundaries and governance requirements. This brings us to the shift left architecture.

The Shift Left Architecture

No matter what specific use cases you have in mind: The Shift Left Architecture brings data processing upstream with real-time stream processing, transforming raw data into high-quality data products early in the pipeline.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Apache Flink plays a key role as part of a complete data streaming platform by enabling advanced streaming ETL, data curation, and on-the-fly transformations, ensuring consistent, reliable, and ready-to-use data for both operational and analytical workloads, while reducing costs and accelerating time-to-market.

Shift Left Architecture with Apache Kafka Flink and Iceberg

The decision to use Flink boils down to your use case, expertise, and growth trajectory:

  • For basic stateless tasks, consider lightweight options like Kafka Streams or SMTs within Kafka Connect unless you’re already invested in a SaaS such as Confluent Cloud where Flink is also the appropriate choice for simple ETL processes.
  • For evolving workloads or scenarios requiring scalability and advanced analytics, a Flink SaaS offers unparalleled flexibility and ease of use.
  • For on-premise or edge deployments, Flink’s flexibility makes it an excellent choice for environments where data processing must occur locally due to latency, security, or compliance requirements.

Understanding the deployment environment—cloud, on-premise, or edge—and the capabilities of the Flink product is crucial to choosing the right streaming technology. Flink’s adaptability ensures it can serve diverse needs across these contexts.

Kafka Streams is another excellent, Kafka-native stream processing alternative. Most importantly for this discussion, Kafka Streams is “just” a lightweight Java library, not a server infrastructure like Flink. Hence, it brings different trade-offs with it. I wrote a dedicated article about the trade-offs between Apache Flink and Kafka Streams for stream processing.

In its open-source form, Flink can seem excessive for simple, stateless tasks. However, a serverless Flink SaaS like Confluent Cloud changes the equation. Multi-tenancy and pay-as-you-go pricing make it suitable for a wider range of use cases, from basic ETL to advanced analytics. Serverless features like Confluent’s Flink Actions further reduce complexity, allowing non-technical users to harness the power of stream processing without coding.

Whether you’re just beginning your journey into stream processing or scaling up for enterprise-grade applications, Flink—as part of a complete data streaming platform such as Confluent Cloud—is a future-proof investment that adapts to your needs.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation.

Virgin Australia’s Journey with Apache Kafka: Driving Innovation in the Airline Industry
https://www.kai-waehner.de/blog/2025/01/07/virgin-australias-journey-with-apache-kafka-driving-innovation-in-the-airline-industry/ (January 7, 2025)

Data streaming with Apache Kafka and Flink is transforming the airline industry, enabling real-time efficiency and exceptional customer experiences. Virgin Australia exemplifies this innovation, modernizing its Flight State Engine and overhauling its loyalty program. By embracing an event-driven architecture, the airline has improved operational reliability and personalized services, setting a benchmark for aviation digitalization.

Data streaming with Apache Kafka and Flink has revolutionized the aviation industry, enabling airlines and airports to improve efficiency, reliability, and customer experience. The airline Virgin Australia exemplifies how leveraging an event-driven architecture can address operational challenges and drive innovation. This article explores how Virgin Australia successfully implemented data streaming to modernize its flight operations and enhance its loyalty program.

Virgin Australia Journey with Apache Kafka - Innovation in the Airline and Aviation Industry

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch.

Data streaming with Apache Kafka and Flink is revolutionizing aviation by enabling real-time data processing and integration across complex airline and airport ecosystems. Airlines rely on diverse systems for flight tracking, crew scheduling, baggage handling, and passenger services, all of which generate vast volumes of data.

Event-driven Architecture with Data Streaming using Apache Kafka and Flink in Aviation, Airlines, Airports

Kafka’s event-driven architecture ensures seamless communication between these systems, allowing real-time updates and consistent data flows. Flink then processes this data to turn raw events into actionable insights.

IT Modernization, Cloud-native Middleware and Analytics with Apache Kafka at Lufthansa

For instance, Lufthansa leverages Apache Kafka as a cloud-native middleware to modernize its data integration and enable real-time analytics. Through its KUSCO platform, Kafka replaces legacy tools like TIBCO EMS, offering scalable, cost-efficient, and seamless data sharing across systems. Kafka also powers Lufthansa’s advanced analytics use cases, including:

  • Anomaly Detection: Real-time alerts using ksqlDB to enhance safety and efficiency.
  • Fleet Management: Machine learning models embedded in Kafka pipelines for real-time operational predictions.

Data Streaming with Apache Kafka at Airlines - Lufthansa Case Study

This shift enables Lufthansa to decouple systems, accelerate innovation, and reduce costs, positioning the airline to meet the demands of a rapidly evolving industry with greater efficiency and agility.

Business Value of Data Streaming at Amsterdam Airport (Schiphol Group)

The business value of data streaming in aviation is immense. Airlines gain operational efficiency by reducing delays and optimizing resource allocation. Real-time insights enhance the passenger experience with timely updates, better baggage handling, and personalized interactions.

Airport modernization and digitalization, with consistent real-time information, is another important trend. This includes data sharing with partners, such as GDS systems and airlines. Schiphol Group (Amsterdam Airport) presented various use cases for data streaming with Apache Kafka and Flink.

Schiphol Airport - Data Integration Platform with Apache Kafka Confluent Cloud 3Scale Splunk Datadog

Scalable platforms like Kafka allow airlines to integrate new technologies, future-proofing their operations in an increasingly competitive industry. By leveraging data streaming, aviation companies are not just keeping pace—they’re redefining what’s possible in airline and airport management.

Virgin Australia: Business Overview and IT Strategy

Founded in 2000, Virgin Australia is a leading airline connecting Australia to key global destinations through domestic and international flights. Known for exceptional service and innovation, the airline serves a diverse range of passengers, from leisure travelers to corporate clients.

Virgin Australia’s IT strategy drives its success, focusing on digital transformation to modernize legacy systems and integrate real-time data solutions. The airline adopts modern technology such as Apache Kafka and focuses on efficiency to deliver value and innovation in the airline industry.

This enables the airline to optimize operations, enhance on-time performance, and quickly adapt to disruptions. A customer-first approach is central, leveraging data insights to personalize every stage of the passenger journey and build lasting loyalty.

Virgin Australia partnered with Confluent and the IT consulting firm 4impact to implement Apache Kafka for event streaming, ensuring their systems could meet the airline’s evolving demands. The following is a summary of 4impact’s published success stories:

Success Story 1: Real-Time Flight Schedule Updates with the Flight State Engine (FSE)

Virgin Australia’s Flight State Engine (FSE) creates a central, authoritative view of flight status and streams real-time updates to multiple internal and external systems. Initially built on Oracle SOA, the legacy FSE faced significant limitations:

  • High costs and slow implementation of new features.
  • Limited monitoring capabilities.
  • Lack of scalability for additional event-streaming use cases.

The Solution

4impact replatformed the FSE with a Kafka-based architecture, introducing:

  • Modern Event Streaming: Kafka replaced Oracle SOA, enabling real-time, high-throughput updates.
  • Phased Rollout: To minimize disruption, the new FSE ran parallel to the legacy system during implementation.
  • Future-Proofing: Patterns, templates, and blueprints were developed for future event-streaming applications.

Key Outcomes

  • The new FSE went live in late 2022, delivering zero outages and exceeding performance expectations.
  • Speed and cost efficiency for adding new features improved significantly.
  • The platform became the foundation for other business units, enabling faster delivery of new services and innovations.
Virgin Australia IT Modernization and Middleware Replacement Oracle SOA to Apache Kafka Confluent
Source: 4impact

By replacing the legacy FSE with Kafka, Virgin Australia ensured real-time reliability and created a scalable event-streaming platform to support future projects.

Success Story 2: Transforming the Virgin Business Rewards Program

Virgin Business Rewards is a loyalty program designed to engage small and medium-sized enterprises (SMEs). Previously, the program relied on manual workflows and siloed systems, leading to:

  • Inefficient processes prone to errors.
  • Delayed updates on reward earnings and redemptions.
  • High costs due to the lack of automated communication between systems like Salesforce, Amadeus, and iFly.

The Solution

To address these challenges, 4impact implemented Kafka to automate the program’s workflows:

  • Event-Driven Architecture: Kafka topics handled asynchronous messaging between systems, avoiding point-to-point integrations.
  • Custom Microservices: Developed to transform messages and interact with APIs on target systems.
  • Monitoring and Logging: A centralized mechanism captured business events and system logs, ensuring observability.

Key Outcomes

The new reward and loyalty system went live in Q1 2023, processing thousands of messages daily with a minimal load on endpoint systems.

  • Reward data was synchronized across all systems, eliminating manual intervention.
  • Other business units began exploring Kafka’s potential to leverage data for faster, more cost-effective service enhancements.
Virgin Australia Airline Loyalty Platform Powered by Data Streaming using Apache Kafka
Source: 4impact

With Apache Kafka, Virgin Australia transformed its loyalty program, ensuring real-time updates and creating a scalable platform for future business needs.

IT Modernization with Data Streaming using Apache Kafka: A Blueprint for Innovation in the Airline Industry

Virgin Australia’s success stories illustrate how data streaming with Apache Kafka, implemented with the help of Confluent and 4impact, can address critical challenges in the aviation industry. By replacing legacy systems with modern event-streaming architectures, the airline achieved:

  • Real-Time Reliability: Ensuring up-to-date flight information and seamless customer interactions.
  • Scalability: Creating platforms that support new features and services without high costs or delays.
  • Customer-Centric Solutions: Enhancing loyalty programs and operational efficiency.

The blog post “Customer Loyalty and Rewards Platform with Apache Kafka” explores how enterprises across various industries use Apache Kafka to enhance customer retention and drive revenue growth through real-time data streaming. It presents case studies from companies like Albertsons, Globe Telecom, Virgin Australia, Disney+ Hotstar, and Porsche to show the value of data streaming in improving customer loyalty programs.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation.

Snowflake Data Integration Options for Apache Kafka (including Iceberg)
https://www.kai-waehner.de/blog/2024/04/22/snowflake-data-integration-options-for-apache-kafka-including-iceberg/ (April 22, 2024)

The integration between Apache Kafka and Snowflake is often cumbersome. Options include near real-time ingestion with a Kafka Connect connector, batch ingestion from large files, or leveraging a standard table format like Apache Iceberg. This blog post explores the alternatives and discusses their trade-offs. The end of the article shows how data streaming helps with hybrid architectures where data needs to be ingested from the private data center into Snowflake in the public cloud.

Blog Series: Snowflake and Apache Kafka

Snowflake is a leading cloud-native data warehouse. Its usability and scalability made it a prevalent data platform in thousands of companies. This blog series explores different data integration and ingestion options, including traditional ETL / iPaaS and data streaming with Apache Kafka. The discussion covers why point-to-point Zero-ETL is only a short-term win, why Reverse ETL is an anti-pattern for real-time use cases, and when a Kappa Architecture and shifting data processing “to the left” into the streaming layer help to build transactional and analytical real-time and batch use cases in a reliable and cost-efficient way.

Snowflake with Apache Kafka and Iceberg Connector

Blog series:

  1. Snowflake Integration Patterns: Zero ETL and Reverse ETL vs. Apache Kafka
  2. THIS POST: Snowflake Data Integration Options for Apache Kafka (including Iceberg)
  3. Apache Kafka + Flink + Snowflake: Cost Efficient Analytics and Data Governance

Subscribe to my newsletter to get an email about the next publications.

Data Ingestion from Apache Kafka into Snowflake (Batch vs. Streaming)

Several options exist to ingest data into Snowflake. Criteria to evaluate the options include complexity, latency, throughput and cost.

The article “Streaming on Snowflake” by Paul Needleman explored the three common architecture patterns for data ingestion from any data source into Snowflake:

Architecture Patterns to Ingest Data Into Snowflake with Apache Kafka
Source: Paul Needleman (Snowflake)

Paul’s article described the architecture options without and with Kafka. The numbered list below follows the numbers in the upper diagram:

  1. Snowpipe — This solution gives cloud storage (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage) the ability to alert Snowflake in a serverless way to auto-ingest data upon arrival. Once a file lands, Snowflake is alerted to pick up and process the file. Snowpipe is used for micro-batch file transfer, not real-time message ingestion.
  2. Kafka Connector — This connector provides a simple yet elegant solution to connect Kafka topics with Snowflake, abstracting the complexity through Snowpipe. Data from the Kafka topics is written to a Snowflake-managed internal stage, which is auto-ingested into the table using Snowpipe. The internal stage and pipe objects are created automatically as part of the process.
  3. Kafka with Snowpipe Streaming — This builds upon the first two approaches and allows for a more native connection between Snowflake and Kafka through a new channel object. This object seamlessly streams message data into a Snowflake table without needing first to store the data. The data is also stored in an optimized format to support the low-latency data interval.

Read the article “Streaming on Snowflake” for more details about these options.

Snowflake = SaaS => Integration Layer Should Be SaaS!

Snowflake is one of the first and most successful true cloud data warehouses, i.e. fully managed with no need to operate and worry about the infrastructure. As a SaaS, Snowflake offers benefits such as scalability, ease of use, vendor-managed updates and maintenance, multi-cloud support, enhanced security, cost-effectiveness with consumption-based pricing, and global accessibility. These advantages make it an attractive choice for organizations looking to leverage a modern and efficient data warehousing solution.

The same benefits exist for fully managed data integration solutions. It does not matter if you use open source-based technologies (e.g., Apache Camel), a traditional iPaaS middleware, or a data streaming solution like Kafka.

I wrote a detailed article comparing iPaaS offerings like Dell Boomi, SnapLogic, Informatica, and fully managed data streaming cloud platforms like Confluent Cloud or Amazon MSK. Read this article to understand why your next integration platform should be fully managed the same way Snowflake is.

Example: Data Ingestion with Confluent Cloud and Snowpipe Streaming

Confluent Cloud and Snowflake are a perfect combination for fully managed end-to-end data pipelines. For instance, connecting to a data source like Salesforce CRM via CDC, streaming data through Kafka, and ingesting the events into Snowflake is entirely fully managed.

Fully Managed Data Pipeline with Confluent Cloud Kafka Connect and Snowflake Data Warehouse
Source: Confluent

Using Kafka Connect with Snowpipe Streaming has several advantages:

  • Faster, more efficient data pipelines
  • Reduced architectural complexity
  • Support for exactly-once delivery
  • Ordered ingestion
  • Error handling with dead-letter queue (DLQ) support

Streaming ingestion is not meant to replace file-based ingestion. Rather, it augments the existing integration architecture for data-loading scenarios where it makes sense, such as:

  • Low-latency telemetry analytics of user-application interactions for clickstream recommendations
  • Identification of security issues in real-time streaming log analytics to isolate threats
  • Stream processing of information from IoT devices to monitor critical assets

Why should you NOT only use Snowpipe Streaming mode? Cost. Snowflake has different pricing models for the ingestion modes.

Processing Large Files in Kafka before Snowflake Ingestion?

A last aspect of data ingestion options via Kafka into Snowflake: What to do with large files?

One of the most common use cases for data ingestion into Snowflake is large CSV, XML or JSON files generated from batch legacy analytics systems.

Option 1: Send the large files via Kafka into Snowflake and process it in the data warehouse. Apache Kafka was never built for large messages. Nevertheless, more and more projects send and process 1Mb, 10Mb, and even bigger files and other large payloads via Kafka into Snowflake. Why? Because it just works.

Option 2: Split and chunk large messages into small pieces within the Kafka pipeline.

For the latter approach, events are ideally processed line by line. The enormous benefit of this approach is bringing even batch-based monolithic systems into an event-driven architecture. Snowflake and other downstream applications consume the events in near real-time. This architecture leverages the Composed Message Processor Enterprise Integration Pattern (EIP):

Composed Message Processor Enterprise Integration Pattern
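
A minimal Kafka Streams sketch of the splitting step could look like the following; topic names and the broker address are assumptions, and real pipelines usually also attach correlation metadata (file name, line number) and tune Kafka's message size limits for the large input records.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class FileSplitterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "file-splitter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // One input record carries a whole CSV file; each output record carries a single line.
        builder.<String, String>stream("legacy-csv-files")
               .flatMapValues(file -> Arrays.asList(file.split("\n")))
               .filter((key, line) -> !line.isBlank())
               .to("csv-lines");

        new KafkaStreams(builder.build(), props).start();
    }
}
```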

For a deep dive including various use cases and customer stories, check out the article “Handling Large Messages With Apache Kafka“.

Bi-Directional Integration between Apache Kafka and Snowflake with Apache Iceberg

After covering batch, file, and streaming integration from Kafka to Snowflake, let’s move to the latest innovation, which is more compelling than the old “legacy approaches”: native integration between Apache Kafka and Snowflake using Apache Iceberg.

Apache Iceberg is the leading open-source table format for storing large-scale structured data in cloud object stores or distributed file systems, designed for high-performance querying and analytics. It provides features such as schema evolution, time travel, and data versioning, making it well-suited for data lakes and modern data architectures.

Snowflake Support for Apache Iceberg

Snowflake already supports Apache Iceberg.

Snowflake Apache Iceberg Integration
Source: Snowflake

Augusto Kiniama Rosa points out in his Overview of Snowflake Apache Iceberg Tables:

Iceberg will always use customer-controlled external storage, like AWS S3 or Azure Blob Storage. Snowflake Iceberg Tables support Iceberg in two ways: an Internal Catalog (Snowflake-managed catalog) or an externally managed catalog (AWS Glue or an object store).

I won’t start a flame war of Apache Iceberg vs. Apache Hudi and Databricks’ Delta Lake here. It reminds me of the container wars between Kubernetes, Cloud Foundry and Apache Mesos. In the end, Kubernetes won, and the competitors adopted it. The same seems to be happening with Iceberg. If not, the principles and benefits will be the same, no matter if the future is Iceberg or a competing technology. As it seems today that Iceberg will win this war, I focus on this technology in the following sections.

Kafka and Iceberg to Unify Transactional and Analytical Workloads

Any data source can feed data via Apache Kafka directly into Snowflake (or any other analytics engine) as an Apache Iceberg table. This solves the challenges of the integration options between Kafka and Snowflake described above. Operational data is accessible to the analytical world without a complex, expensive, and fragile process.

Apache Kafka and Apache Iceberg Integration
Source: Confluent

Confluent Tableflow: Fully Managed Kafka-Iceberg Integration

Confluent announced Tableflow at Kafka Summit 2024 in London, UK, to demonstrate its fully managed out-of-the-box integration between a Kafka topic (and its schema) and an Iceberg table in Confluent Cloud. Confluent’s Marc Selwan writes:

“In the past, there has been a tight coupling of tables (storage) and query engines. In recent years, we’ve witnessed the rise of ‘headless’ data infrastructure where companies are building a more open lakehouse in cloud object storage that is accessible by many tools.

Just like the Apache Kafka API has evolved to be the de facto open standard for data streaming, we’re seeing Apache Iceberg grow into the de facto open-table standard for large-scale datasets stored in lakehouses. We’ve seen its ecosystem grow with robust tooling and support from compute engines such as Apache Spark, Snowflake, Amazon Athena, Dremio, Trino, Apache Druid, and many others.

Apache Iceberg Integration with Confluent Cloud via Tableflow
Source: Confluent

We believe the rise of open-table formats and the ‘headless’ data infrastructure is being driven by the needs of data engineers evolving beyond the tight coupling of table to computing platform. These factors made Apache Iceberg support a natural first choice for us.”

Check out Confluent’s blog post Announcing Tableflow. Other Kafka vendors will likely provide Apache Iceberg support in the near future, too. I am really excited about this development of unifying operational and analytical workloads with a standardized interface across open source frameworks and cloud solutions.

Hybrid Architectures with Kafka On-Premise and in the Public Cloud for Snowflake Integration

Snowflake is only available in the public cloud on AWS, GCP or Azure. Most companies across industries follow a cloud-first strategy for new applications. However, established companies have existed for years or decades and were typically not born in the cloud. Therefore, hybrid cloud architectures are the new black for most companies. Apache Kafka is the best approach to synchronize and replicate data in a single pipeline with low latency, reliability and guaranteed ordering from the data center to the public cloud.

Hybrid Cloud Architecture with Apache Kafka Mainframe Oracle IBM AWS Azure GCP

Legacy infrastructure has to be maintained, integrated, and (maybe) replaced over time. Data streaming with the Apache Kafka ecosystem is a perfect technology for building hybrid synchronization in real-time at scale. This enables bidirectional integration for transactional and analytical workloads without creating a spaghetti architecture with various point-to-point connections between on-premise and the cloud.

There is no Silver Bullet for Kafka to Snowflake Integration!

Various data integration options are available between Apache Kafka and Snowflake. Kafka Connect connectors are a great option, no matter if you do batch or near real-time ingestion. Even large files can be ingested via data streaming using the right enterprise integration patterns.

A new and innovative approach is Apache Iceberg as the integration interface. The standard table format allows connecting from Snowflake and any other analytics engine. But data needs to be stored only once. Kafka-to-Iceberg integration is even more interesting as it unifies transactional and analytical workloads.

Data streaming also helps with hybrid integrations where data needs to be replicated from the on-premise data center into the public cloud with consistent near real-time synchronization.

How do you integrate between Kafka and Snowflake? Do you already look at Apache Iceberg? Or maybe even another Table Format like Apache Hudi or Databricks’ Delta Lake? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Top 5 Data Streaming Trends for 2023
https://www.kai-waehner.de/blog/2022/12/15/top-5-data-streaming-trends-for-2023-with-apache-kafka/ (December 15, 2022)

Data Streaming is one of the most relevant buzzwords in tech to build scalable real-time applications in the cloud and innovative business models. Do you wonder about my predicted TOP 5 data streaming trends in 2023 to set data in motion? Check out the following presentation and learn what role Apache Kafka plays. Learn about decentralized Data Mesh, cloud-native lakehouse, data sharing, improved user experience, and advanced data governance.

Some followers might notice that this became a series with past posts about the top 5 data streaming trends for 2021 and the top 5 for 2022. Data streaming with Apache Kafka is a journey and evolution to set data in motion. Trends change over time, but the core value of a scalable real-time infrastructure as the central data hub stays.

Top Use Cases and Architectures for Data Streaming with Apache Kafka in 2023

The research and consulting company Gartner defines the top strategic technology trends every year. This time, the trends are more focused on particular niche concepts. On a higher level, it is all about optimizing, scaling, and pioneering. Here is what Gartner expects for 2023:

Gartner Strategic Technology Trends for 2023
Source: Gartner

It is funny (but not surprising): Gartner’s predictions overlap with and complement the five trends I focus on for data streaming with Apache Kafka looking forward to 2023. I explore how data streaming enables better time to market with decentralized, optimized architectures, cloud-native infrastructure for elastic scale, and pioneering innovative use cases to build valuable data products.

Hence, here you go with the top 5 trends in data streaming for 2023.

The top 5 data streaming trends for 2023

I see the following topics coming up more regularly in conversations with customers, prospects, and the broader Kafka community across the globe:

  1. Cloud-native lakehouses
  2. Decentralized data mesh
  3. Data sharing in real-time
  4. Improved developer and user experience
  5. Advanced data governance and policy enforcement

The following sections describe each trend in more detail. The end of the article contains the complete slide deck. The trends are relevant for various scenarios, no matter if you use open-source Apache Kafka, a commercial platform, or a fully managed cloud service like Confluent Cloud.

Kafka as data fabric for cloud-native lakehouses

Many data platform vendors pitch the lakehouse vision today. That’s the same story as the data lake in the Hadoop era, with a few new nuances. Put all your data into a single data store to save the world and solve every problem and use case:

One data lake or lakehouse for all data

In the last ten years, most enterprises realized this strategy did not work. The data lake is great for reporting and batch analytics, but not the right choice for every problem. Besides technical challenges, new challenges emerged: data governance, compliance issues, data privacy, and so on.

Applying a best-of-breed enterprise architecture for real-time and batch data analytics using the right tool for each job is a much more successful, flexible, and future-ready approach:

Data Streaming with Apache Kafka as Data Fabric for Databases, Data Lake, and Lakehouse Architectures

Data platforms like Databricks, Snowflake, Elastic, MongoDB, BigQuery, etc., have their sweet spots and trade-offs.

Data streaming increasingly becomes the real-time data fabric between all the different data platforms and other business applications, leveraging the real-time Kappa architecture instead of the much more batch-focused Lambda architecture.

Decentralized data mesh with valuable data products

Focusing on business value by building data products in independent domains with various technologies is key to success in today’s agile world with ever-changing requirements and challenges. Data mesh came to the rescue and emerged as a next-generation design pattern, succeeding service-oriented architectures and microservices.

Vendors propose two main approaches for building a data mesh: Data integration with data streaming enables fully decentralized data products. On the other side, data virtualization provides centralized queries:

Data Mesh with Data Streaming using Apache Kafka vs. Data Virtualization

Centralized queries are simple but do not provide a clean architecture and decoupled domains and applications. It might work well to solve a single problem in a project. However, I highly recommend building a decentralized data mesh with data streaming to decouple the applications, especially for strategic enterprise architectures.

Collaboration within and across organizations in real-time

Collaborating within and outside the organization with data sharing using Open APIs, streaming data exchange, and cluster linking enables many innovative business models:

Stream Data Exchange and Sharing with Data Mesh in Motion

The difference between data streaming and a database, data warehouse, or data lake is crucial: All these platforms enable data sharing at rest. The data is stored on a disk before it is replicated and shared within the organization or with partners. This is not real-time. You cannot connect a real-time consumer to data at rest.

However, real-time data beats slow data. Hence, sharing data in real-time with data streaming platforms like Apache Kafka or Confluent Cloud enables accurate data as soon as a change happens. A consumer can be real-time, near real-time, or batch. A streaming data exchange puts data in motion within the organization or for B2B data sharing and Open API business models.

AsyncAPI spec for Apache Kafka API schemas

AsyncAPI allows developers to define the interfaces of asynchronous APIs. It is protocol agnostic. Features include:

  • Specification of API contracts (= schemas in the data streaming world)
  • Documentation of APIs
  • Code generation for many programming languages
  • Data governance
  • And much more…

Confluent Cloud recently added a feature for generating an AsyncAPI specification for Apache Kafka clusters.

AsyncAPI and Apache Kafka with Confluent Cloud

We don’t know yet where the market is going. Will AsyncAPI become for data streaming what OpenAPI is for RESTful services? Maybe. I see increasing demand for this specification from customers. Let’s review the status of AsyncAPI in a few quarters or years. But it has the potential.

Improved developer experience with low-code / no-code tools for Apache Kafka

Many analysts and vendors pitch low code/no code tools. Visual coding is nothing new. Very sophisticated, powerful, and easy-to-use solutions exist as IDE or cloud applications. The significant benefit is time-to-market for developing applications and easier maintenance. At least in theory.

These tools support various personas like developers, citizen integrators, and data scientists. At least in theory.

The reality is that:

  • Code is king
  • Development is about evolution
  • Open platforms win

Low code/no code is great for some scenarios and personas. But it is just one option of many. Let’s look at a few alternatives for building Kafka-native applications:

Kafka API vs Streams vs KSQL vs Visual Coding with Stream Designer

These Kafka-native technologies have their trade-offs. For instance, the Confluent Stream Designer is perfect for building streaming ETL pipelines between various data sources and sinks. Just click the pipeline and transformations together. Then deploy the data pipeline into a scalable, reliable, and fully-managed streaming application. The difference to separate tools like Apache NiFi is that the generated code runs in the same streaming platform, i.e., one infrastructure end-to-end. This makes ensuring SLAs and latency requirements much more manageable and the whole data pipeline more cost-efficient.

However, the simpler a tool is, the less flexible it is. It is that easy. No matter which product or vendor you look at. This is not just true for Kafka-native tools.

And you are flexible with your tool choice per project or business problem. Add your favorite non-Kafka stream processing engine to the stack, for instance, Apache Flink. Or use a separate iPaaS middleware like Dell Boomi or SnapLogic.

Domain-driven design with dumb pipes and smart endpoints

The real benefit of data streaming is the freedom of choice for your favorite Kafka-native technology, open-source stream processing framework, or cloud-native iPaaS middleware.

Choose the proper library, tool, or SaaS for your project. Data streaming enables a decoupled domain-driven design with dumb pipes and smart endpoints:

Decentralized Data Mesh powered by data streaming and Apache Kafka

Data streaming with Apache Kafka is perfect for domain-driven design (DDD). On the contrary, the often-used point-to-point microservice architectures with HTTP/REST web services or push-based message brokers like RabbitMQ create much stronger dependencies between applications.

Data governance across the data streaming pipeline

An enterprise architecture powered by data streaming enables easy access to data in real-time. Many enterprises leverage Apache Kafka as the central nervous system between all data sources and sinks.

The consequence of being able to access all data easily across business domains is two conflicting pressures on organizations: Unlock the data to enable innovation versus Lock up the data to keep it safe.

Achieving data governance across the end-to-end data streams with data lineage, event tracing, policy enforcement, and time travel to analyze historical events is critical for strategic data streaming in the enterprise architecture. Data governance on top of the streaming platform is required for end-to-end visibility, compliance, and security:

Data governance for streaming data with lineage, catalog, quality, policy management

Policy enforcement with schemas and API contracts

The foundation for data governance is the management of API contracts (so-called schemas in data streaming platforms like Apache Kafka). Solutions like Confluent enforce schemas along the data pipeline, including data producer, server, and consumer:

Data Governance and Policy Enforcement in Apache Kafka with Schema and API Contracts

Additional data governance tools like data lineage, catalog, or policy enforcement are built on this foundation. The recommendation for any serious data streaming project is to use schemas from the beginning. It might seem unnecessary for the first pipeline. But the following producers and consumers need a trusted environment with enforced policies to establish a decentralized data mesh architecture with independent but connected data products.

Slides and Video for Data Streaming Use Cases in 2023

Here is the slide deck from my presentation:


And here is the free on-demand video recording:

Video Recording - Top Trends for Data Streaming in 2023

Data streaming goes up in the maturity curve in 2023

It is still an early stage for data streaming in most enterprises. But the discussion goes beyond questions like “when to use Kafka?” or “which cloud service to use?”… In 2023, most enterprises look at more sophisticated challenges around their numerous data streaming projects.

The new trends are often related to each other. A data mesh enables the building of independent data products that focus on business value. Data sharing is a fundamental requirement for a data mesh. New personas access the data stream. Often, citizen developers or data scientists need easy tools to pioneer new projects. The enterprise architecture requires and enforces data governance across the pipeline for security, compliance, and privacy reasons.

Scalability and elasticity need to be there out of the box. Fully-managed data streaming is a brilliant opportunity for getting started in 2023 and moving up in the maturity curve from single projects to a central nervous system of real-time data.

What are your most relevant and exciting trends for data streaming and Apache Kafka in 2023 to set data in motion? What are your strategy and timeline? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Top 5 Data Streaming Trends for 2023 appeared first on Kai Waehner.

]]>
When to use Request-Response with Apache Kafka? https://www.kai-waehner.de/blog/2022/06/03/apache-kafka-request-response-vs-cqrs-event-sourcing/ Fri, 03 Jun 2022 07:35:00 +0000 https://www.kai-waehner.de/?p=4518 How can I do request-response communication with Apache Kafka? That's one of the most common questions I get regularly. This blog post explores when (not) to use this message exchange pattern, the differences between synchronous and asynchronous communication, the pros and cons compared to CQRS and event sourcing, and how to implement request-response within the data streaming infrastructure.

The post When to use Request-Response with Apache Kafka? appeared first on Kai Waehner.

]]>
How can I do request-response communication with Apache Kafka? That’s one of the most common questions I get regularly. This blog post explores when (not) to use this message exchange pattern, the differences between synchronous and asynchronous communication, the pros and cons compared to CQRS and event sourcing, and how to implement request-response within the data streaming infrastructure.

Request Response Data Exchange with Apache Kafka vs CQRS and Event Sourcing

Message Queue Patterns in Data Streaming with Apache Kafka

Before I go into this post, I want to make you aware that this content is part of a blog series about “JMS, Message Queues, and Apache Kafka”:

I will link the other posts here as soon as they are available. Please follow my newsletter to get updated in real-time about new posts. (no spam or ads)

What is the Request-Response (Request-Reply) Message Exchange Pattern?

Request-response (sometimes called request-reply) is one of the primary methods computers use to communicate with each other in a network.

The first application sends a request for some data. The second application responds to the request. It is a message exchange pattern in which a requestor sends a request message to a replier system, which receives and processes the request, ultimately returning a message in response.

Request-reply is inefficient and can suffer from high latency depending on the use case. HTTP, or better gRPC, is suitable for some use cases. For streaming data, request-reply is “replaced” by the CQRS (Command Query Responsibility Segregation) pattern with Kafka. CQRS is not possible with the JMS API, since JMS provides no state capabilities and lacks event sourcing. Let’s dig deeper into these statements.

Request-Response (HTTP) vs. Data Streaming (Kafka)

Prior to discussing synchronous and asynchronous communication, let’s explore the concepts behind request-response and data streaming. Traditionally, these are two different paradigms:

Request-response (HTTP):

  • Typically synchronous
  • Point to point
  • High latency (compared to data streaming)
  • Pre-defined API

Data streaming (Kafka):

  • Continuous processing
  • Often asynchronous
  • Event-driven
  • Low latency
  • General-purpose events

Most architectures need request-response for point-to-point communication (e.g., between a server and mobile app) and data streaming for continuous data processing. With this in mind, let’s look at use cases where HTTP is used with Kafka.

Synchronous vs. Asynchronous Communication

The request-response message exchange pattern is often implemented purely synchronously. However, request-response may also be implemented asynchronously, with a response being returned at some unknown later time.

Let’s look at the most prevalent examples of message exchanges: REST, message queues, and data streaming.

Synchronous Restful APIs (HTTP)

A web service is the primary technology behind synchronous communication in application development and enterprise application integration. While WSDL and SOAP were dominant many years ago, REST / HTTP is the communication standard in almost all web services today.

I won’t go into the “HTTP vs. REST” debate in this post. In short, REST (Representational state transfer) has been employed throughout the software industry and is a widely accepted set of guidelines for creating stateless, reliable web APIs. A web API that obeys the REST constraints is informally described as RESTful. RESTful web APIs are typically loosely based on HTTP methods.

Synchronous web service calls over HTTP hold a connection open and wait until the response is delivered or the timeout period expires.

The latency of HTTP web services is relatively high. It requires setting up and tearing down a TCP connection for each request-response iteration when using HTTP. To be clear: The latency is still good enough for many use cases.

Another possible drawback is that the HTTP requests might block waiting for the head of the queue request to be processed and may require HTTP circuit breakers set up on the server if there are too many outstanding HTTP requests.

Asynchronous Message Queue (IBM MQ, RabbitMQ)

The message queue paradigm is a sibling of the publisher/subscriber design pattern and is typically one part of a more extensive message-oriented middleware system. Most messaging systems support both the publisher/subscriber and message queue models in their API, e.g., Java Message Service (JMS). Read the “JMS Message Queue vs. Apache Kafka” article if you are new to this discussion.

Producers and consumers are decoupled from each other and communicate asynchronously. The message queue stores events until they are consumed successfully.

Most message queue middleware products provide built-in request-response APIs. Its communication is asynchronous. The implementation uses correlation IDs.

The request-response API (for example, in JMS) creates a temporary queue or topic that is referenced in the request and used by the consumer, which takes the reply-to endpoint from the request. The correlation ID separates the requests of a single requestor. These queues or topics are also only available while the requestor is alive and holds a session for the reply.
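
A minimal sketch of this pattern with the plain JMS 2.0 API looks as follows (assuming an existing session and a producer bound to the request queue; all names are illustrative):

// Requestor: create a temporary reply destination and set correlation metadata
TemporaryQueue replyQueue = session.createTemporaryQueue();
TextMessage request = session.createTextMessage("get-account-balance");
request.setJMSReplyTo(replyQueue);
request.setJMSCorrelationID(UUID.randomUUID().toString());
producer.send(request);

// Wait synchronously for the correlated reply on the temporary queue
MessageConsumer replyConsumer = session.createConsumer(replyQueue);
Message reply = replyConsumer.receive(5000); // timeout in milliseconds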

Such an implementation with a temporary queue or topic does not make sense in Kafka. I have actually seen enterprises trying to do this. Kafka does not work like that. The result was way too many partitions and topics in the Kafka cluster, leading to scalability and performance issues.

Asynchronous Data Streaming (Apache Kafka)

Data streaming continuously processes ingested events from data sources. Such data should be processed incrementally using stream processing techniques without having access to all the data.

The asynchronous communication paradigm is like message queues. Contrary to message queues, data streaming provides long-term storage of events and replayability of historical information. The consequence is a true decoupling between producers and consumers. In most Apache Kafka deployments, several producers and consumers with very different communication paradigms and latency capabilities send and read events.

Apache Kafka does not provide request-response APIs built-in. This is not necessarily a bad thing, as some people think. Data streaming provides different design patterns. That’s the main reason for this blog post! Let’s explore the trade-offs of the request-response pattern in messaging systems and understand alternative approaches that fit better into a data streaming world. But this post also explores how to implement asynchronous or synchronous request-reply with Kafka.

But keep in mind: Don’t re-use your “legacy knowledge” about HTTP and MQ and try to re-build the same patterns with Apache Kafka. Having said this, request-response is possible with Kafka, too. More on this in the following sections.

Request-Reply vs. CQRS and Event Sourcing

CQRS (Command Query Responsibility Segregation) states that every method should either be a command that performs an action or a query that returns data to the caller, but not both. Services become truly decoupled from each other.

Martin Fowler has a nice diagram for CQRS:

CQRS Design Pattern

Event sourcing is an architectural pattern in which entities do not track their internal state using direct serialization or object-relational mapping, but by reading and committing events to an event store.

When event sourcing is combined with CQRS and domain-driven design, aggregate roots are responsible for validating and applying commands (often by having their instance methods invoked from a Command Handler) and then publishing events.

With CQRS, the state is updated against every relevant event message. Therefore, the state is always known. Querying the state that is stored in the materialized view (for example, a KTable in Kafka Streams) is efficient. With request-response, the server must calculate or determine the state for every request. With CQRS, it is calculated/updated only once when the relevant event occurs, regardless of the number of state queries.
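
For illustration, here is a minimal Kafka Streams sketch of such a materialized view, assuming a ‘balances’ topic keyed by account ID and pre-configured Streams properties (all names are illustrative):

// Build a KTable as the continuously updated materialized view of the event stream
StreamsBuilder builder = new StreamsBuilder();
builder.table("balances", Consumed.with(Serdes.String(), Serdes.Long()), Materialized.as("balances-store"));

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

// Query the current state via an interactive query instead of recalculating it per request
ReadOnlyKeyValueStore<String, Long> store = streams.store(
    StoreQueryParameters.fromNameAndType("balances-store", QueryableStoreTypes.keyValueStore()));
Long balance = store.get("account-4711");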

These principles fit perfectly into the data streaming world. Apache Kafka is a distributed storage that appends incoming events to the immutable commit log. True decoupling and replayability of events are built into the Kafka infrastructure out-of-the-box. Most modern microservice architectures with domain-driven design are built with Apache Kafka, not REST or MQ.

Don’t use Request-Response in Kafka if not needed!

If you build a modern enterprise architecture and new applications, apply the natural design patterns that work best with the technology. Remember: Data streaming is a different technology than web services and message queues! CQRS with event sourcing is the best pattern for most use cases in the Kafka world:

CQRS and Event Sourcing instead of Request Response in Apache Kafka

Do not use the request-response concept with Kafka if you do not really have to! Kafka was built for streaming data and true decoupling between various producers and consumers.

This is true even for transactional workloads. A transaction does NOT need synchronous communication. The Kafka API supports mission-critical transactions (although it is asynchronous and distributed by nature). Think about making a bank payment. It is never synchronous, but a complex business process with many independent events within and across organizations.

Synchronous vs. Asynchronous Request-Response with Apache Kafka

After I explained that request-response should not be the first idea when building a new Kafka application, it does not mean it is not possible. And sometimes, it is the better, simpler, or faster approach to solve a problem. Hence, let’s look at examples of synchronous and asynchronous request-response implementations with Kafka.

The request-reply pattern can be implemented with Kafka. But differently. Trying to do it like in a JMS message broker (with temporary queues etc.) will ultimately kill the Kafka cluster (because it works differently). Nevertheless, the used concepts are the same under the hood as in the JMS API, like a correlation ID.

Asynchronous Request-Response with Apache Kafka

The Spring project and its Spring Boot Kafka Template libraries have a great example of the asynchronous request-reply pattern built with Kafka.
Check out "org.springframework.kafka.requestreply.ReplyingKafkaTemplate". It creates request/reply applications using the Kafka API easily. The example is interesting since it implements the asynchronous request/reply, which is more complicated to write if you are using, for example, the JMS API.
What’s great about Spring is the availability of easy-to-use templates and method signatures. The framework enables using design patterns without custom code to implement them. For instance, here are the two Java methods for request-reply with Kafka:
RequestReplyFuture<K, V, R> sendAndReceive(ProducerRecord<K, V> record);
RequestReplyFuture<K, V, R> sendAndReceive(ProducerRecord<K, V> record, Duration replyTimeout);
The result is a ListenableFuture that is asynchronously populated with the result (or an exception, for a timeout). The result also has a sendFuture property, which is the result of calling KafkaTemplate.send(). You can use this future to determine the result of the send operation.
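
A minimal usage sketch, assuming a configured ReplyingKafkaTemplate bean and an illustrative request topic:

// Send the request; Spring sets the correlation ID and reply-topic headers
ProducerRecord<String, String> record = new ProducerRecord<>("kRequests", "ping");
RequestReplyFuture<String, String, String> future = replyingKafkaTemplate.sendAndReceive(record);

// Optionally verify that the request itself was sent successfully
SendResult<String, String> sendResult = future.getSendFuture().get(10, TimeUnit.SECONDS);

// Block (or register a callback) until the correlated reply arrives
ConsumerRecord<String, String> reply = future.get(10, TimeUnit.SECONDS);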

Synchronous Request-Response with Apache Kafka

Another excellent DZone article talks about synchronous request/reply using Spring Kafka templates. The example shows a Kafka service to calculate the sum of two numbers with synchronous request-response behavior to return the result:

Synchronous Request Response with Spring and Apache Kafka

Spring automatically sets a correlation ID in the producer record. This correlation ID is returned as-is by the @SendTo annotation at the consumer end.
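
On the consumer side, a replying service can be as simple as this sketch (topic name and calculation logic are illustrative):

// The reply goes to the topic from the request's REPLY_TOPIC header;
// Spring copies the correlation ID from the request to the reply automatically
@KafkaListener(topics = "kRequests")
@SendTo
public String sum(String request) {
    // expects input like "17+25" and returns "42"
    return String.valueOf(Arrays.stream(request.split("\\+")).mapToInt(Integer::parseInt).sum());
}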

Check out the DZone post for the complete code example.

The Spring documentation for Kafka Templates has a lot of details and code examples about the Request/Reply pattern for Kafka. Using Spring, the request/reply pattern is pretty simple to implement with Kafka. If you are not using Spring, you can learn how to do request-reply with Kafka in your framework. That’s the beauty of open-source…

Combination of Data Streaming and REST API

The above examples showed how you can implement the request-response pattern with Apache Kafka. Nevertheless, it is still only the second-best approach and is often an anti-pattern for streaming data.
Data streaming and request-response REST APIs are often combined to get the best out of both worlds. I wrote a dedicated blog post about “Use Cases and Architectures for HTTP and REST APIs with Apache Kafka“.

Apache Kafka and API Management

A very common approach is to implement applications in real-time at scale with the Kafka ecosystem, but then put an API Management layer on top to expose the events as API to the outside world (either another internal business domain or a B2B 3rd party application).

Here is an example of connecting SAP data. SAP has tens of options for integrating with Kafka, including Kafka Connect connectors, REST / HTTP, proprietary API, or 3rd party middleware.

No matter how you get data into the streaming data hub, on the right side, the Kafka REST API is used to expose events via HTTP. An API Management solution handles the security and monetization/billing requirements on top of the Kafka interface:

SAP Integration with Apache Kafka - R3 ERP S4 Hana Ariba Concur BAPI iDoc REST SOAP Web Services Java

Read more about this discussion in the blog post “Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies?“. It covers the relation between Apache Kafka and API Management platforms like Kong, MuleSoft Anypoint, or Google’s Apigee.

Stream Exchange for Internal and External Data Sharing with Apache Kafka

After discussing the relationship between APIs, request-response communication, and Kafka, let’s explore a significant trend in the market: Data Mesh (the buzzword) and stream exchange for real-time data sharing (the problem solver).

Data Mesh is a new architecture paradigm that gets a lot of buzz these days. No single technology is the perfect fit to build a Data Mesh. An open and scalable decentralized real-time platform like Apache Kafka is often the heart of the Data Mesh infrastructure, complemented by many other data platforms to solve business problems.

Stream-native data sharing instead of using request-response and REST APIs in the middle is the natural evolution for many use cases:

Stream Exchange for Data Sharing with Apache Kafka in a Data Mesh

Learn more in the post “Streaming Data Exchange with Kafka and a Data Mesh in Motion“.

Use Data Streaming and Request-Response Together!

Most architectures need request-response for point-to-point communication (e.g., between a server and mobile app) and data streaming for continuous data processing.

Synchronous and asynchronous request-response communication can be implemented with Apache Kafka. However, CQRS and event sourcing is the better and more natural approach for data streaming most times. Understand the different options and their trade-offs, and use the right tool (in this case, the correct design pattern) for the job.

What is your strategy for using request-response with data streaming? How do you implement the communication in your Apache Kafka applications? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to use Request-Response with Apache Kafka? appeared first on Kai Waehner.

]]>
Error Handling via Dead Letter Queue in Apache Kafka https://www.kai-waehner.de/blog/2022/05/30/error-handling-via-dead-letter-queue-in-apache-kafka/ Mon, 30 May 2022 11:34:22 +0000 https://www.kai-waehner.de/?p=4520 Recognizing and handling errors is essential for any reliable data streaming pipeline. This blog post explores best practices for implementing error handling using a Dead Letter Queue in Apache Kafka infrastructure. The options include a custom implementation, Kafka Streams, Kafka Connect, the Spring framework, and the Parallel Consumer. Real-world case studies show how Uber, CrowdStrike, Santander Bank, and Robinhood build reliable real-time error handling at an extreme scale.

The post Error Handling via Dead Letter Queue in Apache Kafka appeared first on Kai Waehner.

]]>
Recognizing and handling errors is essential for any reliable data streaming pipeline. This blog post explores best practices for implementing error handling using a Dead Letter Queue in Apache Kafka infrastructure. The options include a custom implementation, Kafka Streams, Kafka Connect, the Spring framework, and the Parallel Consumer. Real-world case studies show how Uber, CrowdStrike, Santander Bank, and Robinhood build reliable real-time error handling at an extreme scale.

How to do Error Handling in Data Streaming

Apache Kafka became the favorite integration middleware for many enterprise architectures. Even for a cloud-first strategy, enterprises leverage data streaming with Kafka as a cloud-native integration platform as a service (iPaaS).

Message Queue Patterns in Data Streaming with Apache Kafka

Before I go into this post, I want to make you aware that this content is part of a blog series about “JMS, Message Queues, and Apache Kafka”:

I will link the other posts here as soon as they are available. Please follow my newsletter to get updated in real-time about new posts. (no spam or ads)

What is the Dead Letter Queue Integration Pattern (in Apache Kafka)?

The Dead Letter Queue (DLQ) is a service implementation within a messaging system or data streaming platform to store messages that are not processed successfully. Instead of passively dumping the message, the system moves it to a Dead Letter Queue.

The Enterprise Integration Patterns (EIP) call the design pattern Dead Letter Channel instead. We can use both as synonyms.

Dead Letter Channel Enterprise Integration Pattern

This article focuses on the data streaming platform Apache Kafka. The main reason for putting a message into a DLQ in Kafka is usually a bad message format or invalid/missing message content. For instance, an application error occurs if a value is expected to be an Integer, but the producer sends a String. In more dynamic environments, a “Topic does not exist” exception might be another reason why the message cannot be delivered.

Therefore, as so often, don’t use the knowledge from your existing middleware experience. Message Queue middleware, such as JMS-compliant IBM MQ, TIBCO EMS, or RabbitMQ, works differently than a distributed commit log like Kafka. A DLQ in a message queue is used in message queuing systems for many other reasons that do not map one-to-one to Kafka. For instance, the message in an MQ system expires because of per-message TTL (time to live).

Hence, the main reason for putting messages into a DLQ in Kafka is a bad message format or invalid/missing message content.

Alternatives for a Dead Letter Queue in Apache Kafka

A Dead Letter Queue in Kafka is one or more Kafka topics that receive and store messages that could not be processed in another streaming pipeline because of an error. This concept allows continuing the message stream with the following incoming messages without stopping the workflow due to the error of the invalid message.

The Kafka Broker is Dumb – Smart Endpoints provide the Error Handling

The Kafka architecture does not support DLQ within the broker. Intentionally, Kafka was built on the same principles as modern microservices using the ‘dumb pipes and smart endpoints‘ principle. That’s why Kafka scales so well compared to traditional message brokers. Filtering and error handling happen in the client applications.

The true decoupling of the data streaming platform enables a much more clean domain-driven design. Each microservice or application implements its logic with its own choice of technology, communication paradigm, and error handling.

In traditional middleware and message queues, the broker provides this logic. The consequence is worse scalability and less flexibility in the domains, as only the middleware team can implement integration logic.

Custom Implementation of a Kafka Dead Letter Queue in any Programming Language

A Dead Letter Queue in Kafka is independent of the framework you use. Some components provide out-of-the-box features for error handling and Dead Letter Queues. However, it is also easy to write your Dead Letter Queue logic for Kafka applications in any programming language like Java, Go, C++, Python, etc.

The source code for a Dead Letter Queue implementation contains a try-catch block to handle expected or unexpected exceptions. The message is processed if no error occurs. Send the message to a dedicated DLQ Kafka topic if any exception occurs.

The failure cause should be added to the header of the Kafka message. The key and value should not be changed so that future re-processing and failure analysis of historical events are straightforward.
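
A simple sketch with the plain Java Kafka clients could look like this (topic names, header keys, and the process() method are illustrative):

for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
    try {
        process(record); // business logic
    } catch (Exception e) {
        // Keep key and value unchanged; describe the failure cause in headers
        ProducerRecord<String, String> dlqRecord =
            new ProducerRecord<>("orders-dlq", record.key(), record.value());
        dlqRecord.headers().add("error.message", e.toString().getBytes(StandardCharsets.UTF_8));
        dlqRecord.headers().add("error.application", "order-service".getBytes(StandardCharsets.UTF_8));
        dlqProducer.send(dlqRecord);
    }
}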

Out-of-the-box Kafka Implementations for a Dead Letter Queue

You don’t always need to implement your Dead Letter Queue. Many components and frameworks provide their DLQ implementation already.

With your own applications, you can usually control errors or fix code when there are errors. However, integration with 3rd party applications does not necessarily allow you to deal with errors that may be introduced across the integration barrier. Therefore, DLQ becomes more important and is included as part of some frameworks.

Built-in Dead Letter Queue in Kafka Connect

Kafka Connect is the integration framework of Kafka. It is included in the open-source Kafka download. No additional dependencies are needed (besides the connectors themselves that you deploy into the Connect cluster).

By default, the Kafka Connect task stops if an error occurs because of consuming an invalid message (like when the wrong JSON converter is used instead of the correct AVRO converter). Dropping invalid messages is another option. The latter tolerates errors.

The configuration of the DLQ in Kafka Connect is straightforward. Just set the two configuration options ‘errors.tolerance’ and ‘errors.deadletterqueue.topic.name’ to the right values:

Dead Letter Queue in Apache Kafka and Kafka Connect
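
As a rough sketch, the relevant part of a sink connector configuration could look like this (the DLQ topic name is illustrative):

errors.tolerance=all
errors.deadletterqueue.topic.name=dlq-orders-sink
errors.deadletterqueue.topic.replication.factor=3
errors.deadletterqueue.context.headers.enable=true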

The blog post ‘Kafka Connect Deep Dive – Error Handling and Dead Letter Queues‘ shows a detailed hands-on code example for using DLQs.

Kafka Connect can even be used to process the error messages in the DLQ. Just deploy another connector that consumes from the DLQ topic. For instance, assume your application processes Avro messages, but an incoming message is in JSON format. A connector then consumes the JSON message and transforms it into an Avro message to be re-processed successfully:

Re-processing a Dead Letter Queue in Kafka Connect


Note that Kafka Connect has no Dead Letter Queue for source connectors.

Error Handling in a Kafka Streams Application

Kafka Streams is the stream processing library of Kafka. It is comparable to other streaming frameworks, such as Apache Flink, Storm, Beam, and similar tools. However, it is Kafka-native. This means you build the complete end-to-end data streaming within a single scalable and reliable infrastructure.

If you use Java, respectively, the JVM ecosystem, to build Kafka applications, the recommendation is almost always to use Kafka Streams instead of the standard Java client for Kafka. Why?

  • Kafka Streams is “just” a wrapper around the regular Java producer and consumer API, plus plenty of additional features built-in.
  • Both are just a library (JAR file) embedded into your Java application.
  • Both are part of the open-source Kafka download – no additional dependencies or license changes.
  • Many problems are already solved out-of-the-box to build mature stream processing services (streaming functions, stateful embedded storage, sliding windows, interactive queries, error handling, and much more).

One of the built-in functions of Kafka Streams is the default deserialization exception handler. It allows you to manage records that fail to deserialize. Corrupt data, incorrect serialization logic, or unhandled record types can cause the error. The feature is not called Dead Letter Queue but solves the same problem out-of-the-box.
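
For example, the built-in LogAndContinueExceptionHandler logs and skips records that cannot be deserialized instead of failing the whole application. A minimal configuration sketch (broker address and application ID are illustrative):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Skip poison records instead of stopping the Kafka Streams application
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
          LogAndContinueExceptionHandler.class);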

Error Handling with Spring Kafka and Spring Cloud Stream

The Spring framework has excellent support for Apache Kafka. It provides many templates to avoid writing boilerplate code by yourself. Spring-Kafka and Spring Cloud Stream Kafka support various retry and error handling options, including time / count-based retry, Dead Letter Queues, etc.

Although the Spring framework is pretty feature-rich, it is a bit heavy and has a learning curve. Hence, it is a great fit for a greenfield project or if you are already using Spring for your projects for other scenarios.

Plenty of great blog posts exist that show different examples and configuration options. There is also the official Spring Cloud Stream example for dead letter queues. Spring allows building logic, such as DLQ, with simple annotations. This programming approach is a beloved paradigm by some developers, while others dislike it. Just know the options and choose the right one for yourself.
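
A sketch of the dead-letter setup with Spring Kafka (2.8 or newer), assuming an existing KafkaTemplate bean; retry counts and delays are illustrative:

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    // After two retries with a one-second delay, publish the failed record to "<original-topic>.DLT"
    DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
    return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 2));
}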

Scalable Processing and Error Handling with the Parallel Consumer for Apache Kafka

In many customer conversations, it turns out that the main reason for asking for a dead letter queue is often handling failures from connecting to external web services or databases. Timeouts or the inability to send many requests in parallel from a single consumer brings down some applications. There is an excellent solution to this problem:

The Parallel Consumer for Apache Kafka is an open-source project under Apache 2.0 license. It provides a parallel Apache Kafka client wrapper with client-side queueing, a simpler consumer/producer API with key concurrency, and extendable non-blocking IO processing.

This library lets you process messages in parallel via a single Kafka Consumer, meaning you can increase Kafka consumer parallelism without increasing the number of partitions in the topic you intend to process. For many use cases, this improves both throughput and latency by reducing the load on your Kafka brokers. It also opens up new use cases like extreme parallelism, external data enrichment, and queuing.

A key feature is handling/repeating web service and database calls within a single Kafka consumer application. The parallelization avoids the need for a single web request sent at a time:

Parallel Consumer for Retry and Error Handling of Database and Web Service Calls

The Parallel Consumer client has powerful retry logic. This includes configurable delays and dynamic error handling. Errors can also be sent to a dead letter queue.

Consuming Messages from a Dead Letter Queue

You are not finished after sending errors to a dead letter queue! The bad messages need to be processed or at least monitored!

A Dead Letter Queue is an excellent way to take data error processing out-of-band from the event processing, which means the error handlers can be created or evolved separately from the event processing code.

Plenty of error-handling strategies exist for using dead letter queues. The following DOs and DON’Ts explore the best practices and lessons learned.

Error handling strategies

Several options are available for handling messages stored in a dead letter queue:

  • Re-process: Some messages in the DLQ need to be re-processed. However, first, the issue needs to be fixed. The solution can be an automatic script, human interaction to edit the message, or returning an error to the producer asking for re-sending the (corrected) message.
  • Drop the bad messages (after further analysis): Bad messages might be expected depending on your setup. However, before dropping them, a business process should examine them. For instance, a dashboard app can consume the error messages and visualize them.
  • Advanced analytics: Instead of processing each message in the DLQ, another option is to analyze the incoming data for real-time insights or issues. For instance, a simple ksqlDB application can apply stream processing for calculations, such as the average number of error messages per hour or any other insights that help decide on the errors in your Kafka applications.
  • Stop the workflow: If bad messages are rarely expected, the consequence might be stopping the overall business process. The action can either be automated or decided by a human. Of course, stopping the workflow could also be done in the Kafka application that throws the error. The DLQ externalizes the problem and decision-making if needed.
  • Ignore: This might sound like the worst option. Just let the dead letter queue fill up and do nothing. However, even this is fine in some use cases, like monitoring the overall behavior of the Kafka application. Keep in mind that a Kafka topic has a retention time, and messages are removed from the topic after that time. Just set this up the right way for you. And monitor the DLQ topic for unexpected behavior (like filling up way too quickly).

Best Practices for a Dead Letter Queue in Apache Kafka

Here are some best practices and lessons learned for error handling using a Dead Letter Queue within Kafka applications:

  • Define a business process for dealing with invalid messages (automated vs. human)
    • Reality: Often, nobody handles DLQ messages at all
    • Alternative 1: The data owners need to receive the alerts, not just the infrastructure team
    • Alternative 2: An alert should notify the system of record team that the data was bad, and they will need to re-send/fix the data from the system of record.
    • If nobody cares or complains, consider questioning and reviewing the need for the existence of the DLQ. Instead, these messages could also be ignored in the initial Kafka application. This saves a lot of network load, infrastructure, and money.
  • Build a dashboard with proper alerts and integrate the relevant teams (e.g., via email or Slack alerts)
  • Define the error handling priority per Kafka topic (stop vs. drop vs. re-process)
  • Only push non-retryable error messages to a DLQ – connection issues are the responsibility of the consumer application.
  • Keep the original messages and store them in the DLQ (with additional headers such as the error message, time of the error, application name where the error occurred, etc.) – this makes re-processing and troubleshooting much more accessible.
  • Think about how many Dead Letter Queue Kafka topics you need. There are always trade-offs. But storing all errors in a single DLQ might not make sense for further analysis and re-processing.

Remember that a DLQ breaks guaranteed processing order and makes any sort of offline processing much harder. Hence, a Kafka DLQ is not perfect for every use case.

When NOT to use a Dead Letter Queue in Kafka?

Let’s explore what kinds of messages you should NOT put into a Dead Letter Queue in Kafka:

  • DLQ for backpressure handling? Using the DLQ for throttling because of a peak of high-volume messages is not a good idea. The storage behind the Kafka log handles backpressure automatically. The consumer pulls data at its own pace (or it is misconfigured). Scale consumers elastically if possible. A DLQ does not help, even if your storage gets full; that is a separate problem, independent of whether or not you use a DLQ.
  • DLQ for connection failures? Putting messages into a DLQ because of failed connectivity does not help (even after several retries). The following messages also cannot connect to that system. You need to fix the connection issue instead. The messages can be stored in the regular topic as long as necessary (depending on the retention time).

Schema Registry for Data Governance and Error Prevention

Last but not least, let’s explore the possibility of reducing or even eliminating the need for a Dead Letter Queue in some scenarios.

The Schema Registry for Kafka is a way to ensure clean data and prevent payload errors from producers. It enforces the correct message structure in the Kafka producer:

Confluent Schema Registry for Data Governance

Schema Registry is a client-side check of the schema. Some implementations like Confluent Server provide an additional schema check on the broker side to reject invalid or malicious messages that come from a producer which is not using the Schema Registry.
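
A minimal producer configuration sketch with Avro and the Confluent Schema Registry (the URLs and the Payment record class are illustrative):

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
// The serializer checks each message against the registered schema before sending it
props.put("schema.registry.url", "http://localhost:8081");
KafkaProducer<String, Payment> producer = new KafkaProducer<>(props);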

Case Studies for a Dead Letter Queue in Kafka

Let’s look at four case studies from Uber, CrowdStrike, Santander Bank, and Robinhood for real-world deployment of Dead Letter Queues in a Kafka infrastructure. Keep in mind that those are very mature examples. Not every project needs that much complexity.

Uber – Building Reliable Reprocessing and Dead Letter Queues

In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible.

Given the scope and pace at which Uber operates, its systems must be fault-tolerant and uncompromising when failing intelligently. Uber leverages Apache Kafka for various use cases at an extreme scale to accomplish this.

Using these properties, the Uber Insurance Engineering team extended Kafka’s role in their existing event-driven architecture by using non-blocking request reprocessing and Dead Letter Queues to achieve decoupled, observable error handling without disrupting real-time traffic. This strategy helps their opt-in Driver Injury Protection program run reliably in over 200 cities, deducting per-mile premiums per trip for enrolled drivers.

Here is an example of Uber’s error handling. Errors trickle down through levels of retry topics until landing in the DLQ:

Error Handling in Apache Kafka with Dead Letter Queue at Uber

For more information, read Uber’s very detailed technical article: ‘Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka‘.

CrowdStrike – Handling Errors for Trillions of Events

CrowdStrike is a cybersecurity technology company based in Austin, Texas. It provides cloud workload and endpoint security, threat intelligence, and cyberattack response services.

CrowdStrike’s infrastructure processes trillions of events daily with Apache Kafka. I covered related use cases for creating situational awareness and threat intelligence in real-time at any scale in my ‘Cybersecurity with Apache Kafka blog series‘.

CrowdStrike defines three best practices to implement Dead Letter Queues and error handling successfully:

  • Store error message in the right system: Define the infrastructure and code to capture and retrieve dead letters. CrowdStrike uses an S3 object store for their potentially vast volumes of error messages. Note that Tiered Storage for Kafka solves this problem out-of-the-box without needing another storage interface (for instance, leveraging Infinite Storage in Confluent Cloud).
  • Use automation: Put tooling in place to make remediation foolproof, as error handling can be very error-prone when done manually.
  • Document the business process and engage relevant teams: Standardize and document the process to ensure ease of use. Not all engineers will be familiar with the organization’s strategy for dealing with dead letter messages.

In a cybersecurity platform like CrowdStrike, real-time data processing at scale is crucial. This requirement is valid for error handling, too. The next cyberattack might be a malicious message that intentionally includes inappropriate or invalid content (like a JavaScript exploit). Hence, handling errors in real-time via a Dead Letter Queue is a MUST.

Santander Bank – Mailbox 2.0 for a Combination of Retry and DLQ

Santander Bank had enormous challenges with their synchronous data processing in their mailbox application to process mass volumes of data. They rearchitected their infrastructure and built a decoupled and scalable architecture called “Santander Mailbox 2.0”.

The team decoupled Santander’s workloads and moved to Event Sourcing powered by Apache Kafka:

Santander Mailbox 2.0

A key challenge in the new asynchronous event-based architecture was error handling. Santander solved the issues using error-handling built with retry and DLQ Kafka topics:

Retry and DLQ Error Handling at Santander Bank

Check out the details in the Kafka Summit talk “Reliable Event Delivery in Apache Kafka Based on Retry Policy and Dead Letter Topics” from Santander’s integration partner Consdata.

Robinhood – Postgresql Database and GUI for Error Handling

Blog post UPDATE October 2022:

Robinhood, a financial services company (famous for its trading app), presented another approach for handling errors in Kafka messages at Current 2022. Instead of using only Kafka topics for error handling, they insert failed messages into a Postgresql database. A client application, including a CLI, fixes the issues and republishes the messages to the Kafka topic:

Robinhood Deadletter Queue with Postgresql and Kafka

Real-world use cases at Robinhood include:

  • Accounting issue that needs manual fixes
  • Back office operations upload documents, and human error results in duplicate entries
  • Runtime checks for users fail after an order was placed but before it was executed

Currently, the error handling application is “only” usable via the command line and is relatively inflexible. New features will improve the DLQ handling in the future:

  • UI + Operational ease of use: Access controls around dead letter management (possible sensitive info in Kafka messages). Easier ordering requirements.
  • Configurable data stores: Drop-in replacements for Postgres (e.g. DynamoDB). Direct integration of DLQ Kafka topics.

Robinhood’s DLQ implementation shows that error handling is worth investing in a dedicated project in some scenarios.

Reliable and Scalable Error Handling in Apache Kafka

Error handling is crucial for building reliable data streaming pipelines and platforms. Different alternatives exist for solving this problem. The solution includes a custom implementation of a Dead Letter Queue or leveraging frameworks in use anyway, such as Kafka Streams, Kafka Connect, the Spring framework, or the Parallel Consumer for Kafka.

The case studies from Uber, CrowdStrike, Santander Bank, and Robinhood showed that error handling is not always easy to implement. It needs to be thought through from the beginning when you design a new application or architecture. Real-time data streaming with Apache Kafka is compelling but only successful if you can handle unexpected behavior. Dead Letter Queues are an excellent option for many scenarios.

Do you use the Dead Letter Queue design pattern in your Apache Kafka applications? What are the use cases and limitations? How do you implement error handling in your Kafka applications? When do you prefer a message queue instead, and why? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Error Handling via Dead Letter Queue in Apache Kafka appeared first on Kai Waehner.

]]>
Comparison: JMS Message Queue vs. Apache Kafka https://www.kai-waehner.de/blog/2022/05/12/comparison-jms-api-message-broker-mq-vs-apache-kafka/ Thu, 12 May 2022 05:13:19 +0000 https://www.kai-waehner.de/?p=4430 Comparing JMS-based message queue (MQ) infrastructures and Apache Kafka-based data streaming is a widespread topic. Unfortunately, the battle is an apple-to-orange comparison that often includes misinformation and FUD from vendors. This blog post explores the differences, trade-offs, and architectures of JMS message brokers and Kafka deployments. Learn how to choose between JMS brokers like IBM MQ or RabbitMQ and open-source Kafka or serverless cloud services like Confluent Cloud.

The post Comparison: JMS Message Queue vs. Apache Kafka appeared first on Kai Waehner.

]]>
Comparing JMS-based message queue (MQ) infrastructures and Apache Kafka-based data streaming is a widespread topic. Unfortunately, the battle is an apple-to-orange comparison that often includes misinformation and FUD from vendors. This blog post explores the differences, trade-offs, and architectures of JMS message brokers and Kafka deployments. Learn how to choose between JMS brokers like IBM MQ or RabbitMQ and open-source Kafka or serverless cloud services like Confluent Cloud.

JMS Message Queue vs Apache Kafka Comparison

Motivation: The battle of apples vs. oranges

I have to discuss the differences and trade-offs between JMS message brokers and Apache Kafka every week in customer meetings. What annoys me most are the common misunderstandings and (sometimes) intentional FUD in various blogs, articles, and presentations about this discussion.

I recently discussed this topic with Clement Escoffier from Red Hat in the “Coding over Cocktails” Podcast: JMS vs. Kafka: Technology Smackdown. A great conversation with more agreement than you might expect from such an episode where I picked the “Kafka proponent” while Clement took over the role of the “JMS proponent”.

These aspects motivated me to write a blog series about “JMS, Message Queues, and Apache Kafka”:

I will link the other posts here as soon as they are available. Please follow my newsletter to get updated in real-time about new posts. (no spam or ads)

Special thanks to my colleague and long-term messaging and data streaming expert Heinz Schaffner for technical feedback and review of this blog series. He has worked for TIBCO, Solace, and Confluent for 25 years.

10 comparison criteria: JMS vs. Apache Kafka

This blog post explores ten comparison criteria. The goal is to explain the differences between message queues and data streaming, clarify some misunderstandings about what an API or implementation is, and give some technical background to do your evaluation to find the right tool for the job.

The list of products and cloud services is long for JMS implementations and Kafka offerings. A few examples:

  • JMS implementations of the JMS API (open source and commercial offerings): Apache ActiveMQ, Apache Qpid (using AMQP), IBM MQ (formerly MQSeries, then WebSphere MQ), JBoss HornetQ, Oracle AQ, RabbitMQ, TIBCO EMS, Solace, etc.
  • Apache Kafka products, cloud services, and rewrites (beyond the valid option of using just open-source Kafka): Confluent, Cloudera, Amazon MSK, Red Hat, Redpanda, Azure Event Hubs, etc.

Here are the criteria for comparing JMS message brokers vs. Apache Kafka and its related products/cloud services:

  1. Message broker vs. data streaming platform
  2. API Specification vs. open-source protocol implementation
  3. Transactional vs. analytical workloads
  4. Push vs. pull message consumption
  5. Simple vs. powerful and complex API
  6. Storage for durability vs. true decoupling
  7. Server-side data-processing vs. decoupled continuous stream processing
  8. Complex operations vs. serverless cloud
  9. Java/JVM vs. any programming language
  10. Single deployment vs. multi-region (including hybrid and multi-cloud) replication

Let’s now explore the ten comparison criteria.

1. Message broker vs. data streaming platform

TL;DR: JMS message brokers provide messaging capabilities to produce and consume messages. Apache Kafka is a data streaming platform that combines messaging, storage, data integration, and stream processing capabilities.

The most important aspect first: The comparison of JMS and Apache Kafka is an apple-to-orange comparison for several reasons. I would even go further and say they are not even both fruit, as they are so different from each other.

JMS API (and implementations like IBM MQ, RabbitMQ, et al)

JMS (Java Message Service) is a Java application programming interface (API) that provides generic messaging models. The API handles the producer-consumer problem, which can facilitate the sending and receiving of messages between software systems.

Therefore, the central capability of JMS message brokers (that implement the JMS API) is to send messages from a source application to another destination in real-time. That’s it. And if that’s what you need, then JMS is the right choice for you! But keep in mind that projects must use additional tools for data integration and advanced data processing tasks.

Apache Kafka (open source and vendors like Confluent, Cloudera, Red Hat, Amazon, et al)

Apache Kafka is an open-source protocol implementation for data streaming. It includes:

  • Apache Kafka is the core for distributed messaging and storage. High throughput, low latency, high availability, secure.
  • Kafka Connect is an integration framework for connecting external sources/destinations to Kafka.
  • Kafka Streams is a simple Java library that enables streaming application development within the Kafka framework.

This combination of capabilities enables the building of end-to-end data pipelines and applications. That’s much more than what you can do with a message queue.

2. JMS API specification vs. Apache Kafka open-source protocol implementation

TL;DR: JMS is a specification that vendors implement and extend in their opinionated way. Apache Kafka is the open-source implementation of the underlying specified Kafka protocol.

It is crucial to clarify the terms first before you evaluate JMS and Kafka:

  • Standard API: Specified by industry consortia or other industry-neutral (often global) groups or organizations. Requires compliance tests for all features and complete certification to become standard-compliant. Example: OPC-UA.
  • De facto standard API: Originates from an existing successful solution (an open-source framework, a commercial product, or a cloud service). Examples: Amazon S3 (proprietary, from a single vendor) and Apache Kafka (open source, from a vibrant community).
  • API specification: A specification document that defines how vendors can implement a related product. There are no complete compliance tests or certifications for the implementation of all features. The consequence is a “standard API” but no portability between implementations. Example: JMS. Specifically for JMS, note that to use the JMS compliance suite at all, a commercial vendor has to sign up for very onerous reporting requirements towards Oracle.

The alternative kinds of standards have trade-offs. If you want to learn more, check out how Apache Kafka became the de facto standard for data streaming in the last few years.

Portability and migrations have become much more relevant in hybrid and multi-cloud environments than in past decades, when your workloads ran in a single data center.

JMS is a specification for message-oriented middleware

JMS is a specification currently maintained under the Java Community Process as JSR 343. The latest (not yet released) version, JMS 3.0, is in early development as part of Jakarta EE and has been rebranded to Jakarta Messaging API. Today, JMS 2.0 is the specification used in prevalent message broker implementations, and nobody knows where JMS 3.0 will go. Hence, this post focuses on the JMS 2.0 specification to solve real-world problems today.

I often use the term “JMS message broker” in the following sections as JMS (i.e., the API) does not specify or implement many features you know in your favorite JMS implementation. Usually, when people talk about JMS, they mean JMS message broker implementations, not the JMS API specification.

JMS message brokers and the JMS portability myth

The JMS specification was developed to provide a common Java library for accessing different messaging vendors’ brokers. It was intended to act as a wrapper around each messaging vendor’s proprietary API, in the same way JDBC provides similar functionality for database APIs.

Unfortunately, this simple integration turned out not to be the case. The migration of the JMS code from one vendor’s broker to another is quite complex for several reasons:

  • Not all JMS features are mandatory (security, topic/queue labeling, clustering, routing, compression, etc.)
  • There is no JMS specification for transport
  • No specification to define how persistence is implemented
  • No specification to define how fault tolerance or high availability is implemented
  • Different interpretations of the JMS specification by different vendors can result in different behaviors for the same JMS functions
  • No specification for security
  • There is no specification for value-added features in the brokers (such as topic to queue bridging, inter-broker routing, access control lists, etc.)

Therefore, simple source code migration and interoperability between JMS vendors is a myth! This sounds crazy, doesn’t it?

Vendors provide a great deal of unique functionality within the broker (such as topic-to-queue mapping, broker routing, etc.). These capabilities add architectural value for the application, but they live in the broker and are not part of the application or the JMS specification.

Apache Kafka is an open-source protocol implementation for data streaming

Apache Kafka is an implementation to do reliable and scalable data streaming in real-time. The project is open-source and available under Apache 2.0 license, and is driven by a vast community.

Apache Kafka is NOT a standard like OPC-UA or a specification like JMS. However, Kafka at least provides the source code reference implementation, protocol and API definitions, etc.

Kafka established itself as the de facto standard for data streaming. Today, over 100,000 organizations use Apache Kafka. The Kafka API became the de facto standard for event-driven architectures and event streaming. Use cases span all industries and infrastructures, including various kinds of transactional and analytical workloads at the edge, in hybrid setups, and across multiple clouds. I collected a few examples across verticals that use Apache Kafka to show the prevalence across markets.

Now, hold on. I used the term Kafka API in the above section. Let’s clarify this: As discussed, Apache Kafka is an implementation of a distributed data streaming platform including the server-side and client-side and various APIs for producing and consuming events, configuration, security, operations, etc. The Kafka API is relevant, too, as Kafka rewrites like Azure Event Hubs and Redpanda use it.

Portability of Apache Kafka – yet another myth?

If you use Apache Kafka as an open-source project, this is the complete Kafka implementation. Some vendors use the full Apache Kafka implementation and build a more advanced product around it.

Here, the migration is super straightforward, as Kafka is not just a specification that each vendor implements differently. Instead, it is the same code, libraries, and packages.

For instance, I have seen several successful migrations from Cloudera to Confluent deployments or from self-managed Apache Kafka open-source infrastructure to serverless Confluent Cloud.

The Kafka API – Kafka rewrites like Azure Event Hubs, Redpanda, Apache Pulsar

With the global success of Kafka, some vendors and cloud services did not build a product on top of the Apache Kafka implementation. Instead, they made their implementation on top of the Kafka API. The underlying implementation is proprietary (like in Azure’s cloud service Event Hubs) or open-source (like Apache Pulsar’s Kafka bridge or Redpanda’s rewrite in C++).

Be careful and analyze whether a vendor integrates the whole Apache Kafka project or rewrote the complete API. Contrary to the battle-tested Apache Kafka project, a Kafka rewrite using the Kafka API is a completely new implementation!

Many vendors even exclude some components or APIs (like Kafka Connect for data integration or Kafka Streams for stream processing) completely or exclude critical features like exactly-once semantics or long-term storage in their support terms and conditions.

It is up to you to evaluate the different Kafka offerings and their limitations. Recently, I compared Kafka vendors such as Confluent, Cloudera, Red Hat, or Amazon MSK and related technologies like Azure Event Hubs, AWS Kinesis, Redpanda, or Apache Pulsar.

Just battle-test the requirements yourself. If you find a Kafka-to-XYZ bridge with less than a hundred lines of code, or a .exe Windows Kafka server download from a middleware vendor, be skeptical! 🙂

All that glitters is not gold. Some frameworks or vendors sound too good to be true. Claiming that you support the Kafka API, provide a fully managed serverless Kafka offering, or scale much better is not trustworthy if you constantly have to spread fear, uncertainty, and doubt (FUD) about Kafka to look better. For instance, I was annoyed by Pulsar constantly trying to position itself as better than Kafka by creating a lot of FUD and myths in the open-source community. I responded in my Apache Pulsar vs. Kafka comparison two years ago. FUD is the wrong strategy for any vendor. It does not work. For that reason, Kafka’s adoption still grows like crazy while Pulsar grows much more slowly in percentage terms (even though its download numbers are at a much lower level anyway).

3. Transactional vs. analytical workloads

TL;DR: A JMS message broker provides transactional capabilities for low volumes of messages. Apache Kafka handles both low and high volumes of messages and supports transactional as well as analytical workloads.

JMS – Session and two-phase commit (XA) transactions

Most JMS message brokers have good support for transactional workloads.

A transacted session supports a single series of transactions. Each transaction groups a set of produced messages and a set of consumed messages into an atomic unit of work.

Two-phase commit transactions (XA transactions) work at a limited scale. They are used to integrate with other systems like mainframe CICS / DB2 or Oracle databases. However, they are hard to operate and cannot scale beyond a few transactions per second.

It is important to note that support for XA transactions is not mandatory with the JMS 2.0 specification. This differs from the session transaction.

Kafka – Exactly-once semantics and transaction API

Kafka is a distributed, fault-tolerant system that is resilient by nature (if you deploy and operate it correctly). It can guarantee no downtime and no data loss, just like your favorite database, mainframe, or other core platforms.

And even better: Kafka’s Transaction API, i.e., Exactly-Once Semantics (EOS), has been available since Kafka 0.11 (GA’ed many years ago). EOS makes building transactional workloads even easier as you don’t need to handle duplicates anymore.

Kafka supports atomic writes across multiple partitions through the transactions API. This allows a producer to send a batch of messages to multiple partitions. Either all messages in the batch are eventually visible to any consumer, or none are ever visible to consumers.

Kafka transactions work very differently than JMS transactions. But the goal is the same: Each consumer receives the produced event exactly once. Find more details in the blog post “Analytics vs. Transactions in Data Streaming with Apache Kafka“.
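To make this more concrete, here is a minimal sketch of a transactional Kafka producer in Java. The broker address, topic name, transactional id, and record contents are assumptions for illustration only, not taken from any specific project:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // A stable transactional.id enables exactly-once semantics across producer restarts.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-producer-1"); // assumption

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both writes are committed atomically, even though they may land on different partitions.
                producer.send(new ProducerRecord<>("payments", "account-42", "debit:100"));  // assumption: topic name
                producer.send(new ProducerRecord<>("payments", "account-99", "credit:100"));
                producer.commitTransaction();
            } catch (KafkaException e) {
                // Neither record becomes visible to read_committed consumers if the transaction is aborted.
                producer.abortTransaction();
            }
        }
    }
}
```

On the consuming side, applications that should only see committed writes set isolation.level=read_committed.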

4. Push vs. pull message consumption

TL;DR: JMS message brokers push messages to consumer applications. Kafka consumers pull messages, which provides true decoupling and backpressure handling for independent consumer applications.

Pushing messages seems to be the obvious choice for a real-time messaging system like JMS-based message brokers. However, push-based messaging has various drawbacks regarding decoupling and scalability.

JMS expects the broker to provide backpressure and implement a “pre-fetch” capability, but this is not mandatory. If used, the broker controls the backpressure, and the application cannot influence it.

With Kafka, the consumer controls the backpressure. Each Kafka consumer consumes events in real-time, batch, or only on demand – in the way the particular consumer supports and can handle the data stream. This is an enormous advantage for many inflexible and non-elastic environments.

So while JMS has some kind of backpressure (the producer stops if the queue is full), Kafka lets the consumer control the backpressure. There is also no way to scale a producer with JMS, as there are no partitions in a JMS queue or topic.

JMS consumers can be scaled, but then you lose guaranteed ordering. Guaranteed ordering in JMS message brokers only works via a single producer, single consumer, and transaction.
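The following is a minimal sketch of a Kafka consumer poll loop in Java to illustrate the pull model. The broker address, group id, and topic name are assumptions for illustration:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PullConsumerSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processing");         // assumption: group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // The consumer, not the broker, decides how many records to take per poll.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // assumption: topic name
            while (true) {
                // Pull model: the application fetches at its own pace (backpressure by design).
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                consumer.commitSync(); // commit offsets only after successful processing
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // Placeholder for business logic.
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}
```

The application decides when to call poll(), how many records to take per call, and when to commit offsets. That is where the consumer-side backpressure control comes from.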

5. Simple JMS API vs. powerful and complex Kafka API

TL;DR: The JMS API provides simple operations to produce and consume messages. Apache Kafka has a more granular API that brings additional power and complexity.

JMS vendors hide all the cool stuff in the implementation under the spec. You only get the 5% the API exposes (no control; the vendor builds the rest), and you need to build everything else yourself. Kafka, on the other hand, exposes everything, even though most developers only need 5% of it.

In summary, be aware that JMS message brokers are built to send messages from a data source to one or more data sinks. Kafka is a data streaming platform that provides many more capabilities, features, event patterns, and processing options; and a much larger scale. With that in mind, it is no surprise that the APIs are very different and have different complexity.

If your use case just requires sending a few messages per second from A to B, then JMS is the right choice and simple to use! If you need a streaming data hub at any scale, including data integration and data processing, only Kafka delivers that.
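For contrast with the Kafka producer and consumer sketches above, here is roughly what sending a message looks like with the simplified JMS 2.0 API. The ConnectionFactory is vendor-specific and the queue name is an assumption for illustration:

```java
import javax.jms.ConnectionFactory;
import javax.jms.JMSContext;
import javax.jms.Queue;

public class JmsSendSketch {

    // The ConnectionFactory comes from the vendor (ActiveMQ, IBM MQ, TIBCO EMS, etc.).
    public static void send(ConnectionFactory connectionFactory) {
        try (JMSContext context = connectionFactory.createContext()) {
            Queue queue = context.createQueue("orders"); // assumption: queue name
            context.createProducer().send(queue, "order-4711"); // that is basically the whole API
        }
    }
}
```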

Asynchronous request-reply vs. data in motion

One of the most common wishes of JMS developers is a request-response function in Kafka. Note that this design pattern differs in messaging systems from an RPC (remote procedure call) as you know it from legacy tools like CORBA or web service standards like SOAP/WSDL or HTTP. Request-reply in message brokers is asynchronous communication that leverages a correlation ID.

Asynchronous messaging to get events from a producer (say, a mobile app) to a consumer (say, a database) is a very traditional workflow. No matter whether you do fire-and-forget or request-reply, you put data at rest for further processing. JMS supports request-reply out of the box, and the API is very simple.
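Here is a minimal, hypothetical sketch of that pattern with the JMS 2.0 API, using a temporary queue and a correlation ID. The ConnectionFactory, queue name, and timeout are assumptions for illustration:

```java
import java.util.UUID;

import javax.jms.ConnectionFactory;
import javax.jms.JMSConsumer;
import javax.jms.JMSContext;
import javax.jms.Message;
import javax.jms.Queue;
import javax.jms.TemporaryQueue;
import javax.jms.TextMessage;

public class JmsRequestReplySketch {

    public static String request(ConnectionFactory connectionFactory, String payload) throws Exception {
        try (JMSContext context = connectionFactory.createContext()) {
            Queue requestQueue = context.createQueue("pricing.requests"); // assumption: queue name
            TemporaryQueue replyQueue = context.createTemporaryQueue();

            TextMessage request = context.createTextMessage(payload);
            String correlationId = UUID.randomUUID().toString();
            request.setJMSCorrelationID(correlationId);
            request.setJMSReplyTo(replyQueue);
            context.createProducer().send(requestQueue, request);

            // Block until the correlated reply arrives (or the timeout expires).
            String selector = "JMSCorrelationID = '" + correlationId + "'";
            try (JMSConsumer consumer = context.createConsumer(replyQueue, selector)) {
                Message reply = consumer.receive(5000); // assumption: 5-second timeout
                return reply != null ? reply.getBody(String.class) : null;
            }
        }
    }
}
```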

Data in motion with event streaming continuously processes data. The Kafka log is durable. The Kafka application maintains and queries the state in real-time or in batch. Data streaming is a paradigm shift for most developers and architects. The design patterns are very different. Don’t try to reimplement your JMS application within Kafka using the same pattern and API. That is likely to fail! That is an anti-pattern.

Request-reply is inefficient and can suffer from a lot of latency depending on the use case. HTTP, or better gRPC, is suitable for some use cases. With Kafka, request-reply is replaced by the CQRS (Command and Query Responsibility Segregation) pattern for streaming data. CQRS is not possible with the JMS API, since JMS provides no state capabilities and lacks event sourcing.

A Kafka example for the request-response pattern

CQRS is the better design pattern for many Kafka use cases. Nevertheless, the request-reply pattern can be implemented with Kafka, too, just differently. Trying to do it like in a JMS message broker (with temporary queues, etc.) will ultimately kill the Kafka cluster because Kafka works differently.
The Spring project shows how to do it better. The Spring for Apache Kafka (Spring Kafka) template libraries include a great example of the request-reply pattern built with Kafka.
Check out “org.springframework.kafka.requestreply.ReplyingKafkaTemplate“. It lets you create request/reply applications using the Kafka API easily. The example is interesting since it implements asynchronous request/reply, which is more complicated to write with, for example, the JMS API. Another nice DZone article talks about synchronous request/reply using Spring Kafka templates.
The Spring documentation for Kafka templates has a lot of detail about the request/reply pattern for Kafka. So if you are using Spring, the request/reply pattern is pretty simple to implement with Kafka. If you are not using Spring, you can learn how to do request-reply with Kafka in your framework of choice.
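For illustration, here is a minimal sketch of how such a request-reply call could look with ReplyingKafkaTemplate. It assumes the template is already configured as a Spring bean with a reply listener container, and the topic name "pricing-requests" is made up for this example:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.springframework.kafka.requestreply.ReplyingKafkaTemplate;
import org.springframework.kafka.requestreply.RequestReplyFuture;

public class KafkaRequestReplySketch {

    private final ReplyingKafkaTemplate<String, String, String> template;

    public KafkaRequestReplySketch(ReplyingKafkaTemplate<String, String, String> template) {
        // Assumption: the template and its reply container are configured elsewhere as Spring beans.
        this.template = template;
    }

    public String request(String payload) throws Exception {
        ProducerRecord<String, String> record = new ProducerRecord<>("pricing-requests", payload);
        // sendAndReceive handles correlation headers and the reply topic under the hood.
        RequestReplyFuture<String, String, String> future = template.sendAndReceive(record);
        ConsumerRecord<String, String> reply = future.get(); // blocks; use callbacks for fully async handling
        return reply.value();
    }
}
```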

6. Storage for durability vs. true decoupling

TL;DR: JMS message brokers use a storage system to provide high availability. Kafka’s storage system is much more advanced and enables long-term storage, backpressure handling, and replayability of historical events.

Kafka storage is more than just the persistence feature you know from JMS

When I explain the Kafka storage system to experienced JMS developers, I almost always get the same response: “Our JMS message broker XYZ also has storage under the hood. I don’t see the benefit of using Kafka!”

JMS uses an ephemeral storage system, where messages are only persisted until they are processed. Long-term storage and replayability of messages are not concepts JMS was designed for.

The core Kafka principles of append-only logs, offsets, guaranteed ordering, retention time, compacted topics, and so on provide many additional benefits beyond the durability guarantees of a JMS. Backpressure handling, true decoupling between consumers, the replayability of historical events, and more are huge differentiators between JMS and Kafka.

Check the Kafka docs for a deep dive into the Kafka storage system. I don’t want to touch on how Tiered Storage for Kafka is changing the game even more by providing even better scalability and cost-efficient long-term storage within the Kafka log.
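To give a feel for how retention and compaction are just topic configuration in Kafka, here is a minimal sketch using the Java Admin client. The broker address, topic names, partition counts, replication factor, and retention period are assumptions for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class TopicRetentionSketch {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker

        try (Admin admin = Admin.create(props)) {
            // Event log kept for 30 days so new consumers can replay history.
            NewTopic orders = new NewTopic("orders", 6, (short) 3) // assumption: 6 partitions, 3 replicas
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(30L * 24 * 60 * 60 * 1000)));

            // Compacted topic: keeps the latest value per key, like a changelog or table.
            NewTopic customers = new NewTopic("customers", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                            TopicConfig.CLEANUP_POLICY_COMPACT));

            admin.createTopics(List.of(orders, customers)).all().get();
        }
    }
}
```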

7. Server-side data-processing with JMS vs. decoupled continuous stream processing with Kafka

TL;DR: JMS message brokers provide simple server-side event processing, like filtering or routing based on the message content. Kafka brokers are dumb; data processing is executed in decoupled applications/microservices.

Server-side JMS filtering and routing

Most JMS message brokers provide some features for server-side event processing. These features are handy for some workloads!

Just be careful that server-side processing usually comes with a cost. For instance:

  • JMS pre-filtering scalability issues: The broker has to handle many additional tasks, which can kill the broker in a hidden fashion
  • JMS selectors (= routing) performance issues: They can kill 40-50% of the performance

Again, sometimes the drawbacks are acceptable; in that case, this is great functionality.

Kafka – Dumb pipes and smart endpoints

Kafka intentionally does not provide server-side processing. The brokers are dumb. The processing happens at the smart endpoints. This is a very well-known design pattern: Dumb pipes and smart endpoints.

The drawback is that you need separate applications/microservices/data products to implement the logic. This is not a big issue in serverless environments (like using a ksqlDB process running in Confluent Cloud for data processing). It gets more complex in self-managed environments.

However, the massive benefit of this architecture is the true decoupling between applications/technologies/programming languages, separation of concerns between business units for building business logic and operations of infrastructure, and the much better scalability and elasticity.
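As an illustration of the “smart endpoints” idea, here is a minimal Kafka Streams sketch that filters and routes events in a decoupled application instead of on the broker. The topic names, application id, and the JSON field it checks are assumptions for illustration only:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SmartEndpointRoutingSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-router");       // assumption: app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumption: local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders"); // assumption: input topic

        // Filtering and routing happen in this decoupled application, not inside the broker.
        orders.filter((key, value) -> value.contains("\"priority\":\"high\""))
              .to("orders-high-priority"); // assumption: output topic
        orders.filterNot((key, value) -> value.contains("\"priority\":\"high\""))
              .to("orders-standard");      // assumption: output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```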

Would I like to see a few server-side processing capabilities in Kafka, too? Yes, absolutely. Especially for small workloads, the performance and scalability impact should be acceptable! The risk, though, is that people would then misuse those features. The future will show whether Kafka gets there.

8. Complex operations vs. serverless cloud

TL;DR: Self-managed operations of scalable JMS message brokers or Kafka clusters are complex. Serverless offerings (should) take over the operations burden.

Operating a cluster is complex – no matter if JMS or Kafka

A basic JMS message broker is relatively easy to operate (including active/passive setups). However, this limits scalability and availability. The JMS API was designed to talk to a single broker, or an active/passive pair for high availability. That concept covers a single application domain.

Anything beyond that (i.e., clustering) is very complex with JMS message brokers. More advanced message broker clusters from commercial vendors are more powerful but much harder to operate.

Kafka is a powerful, distributed system. Therefore, operating a Kafka cluster is not easy by nature. Cloud-native tools like an operator for Kubernetes take over some burdens like rolling upgrades or handling fail-over.

Both JMS message brokers and Kafka clusters become more challenging to operate the more scale and reliability your SLAs demand. The JMS API is not specified for a central data hub (using a cluster). Kafka is intentionally built for the strategic enterprise architecture, not just for a single business application.

Fully managed serverless cloud to the rescue

As the JMS API was designed to talk to a single broker, it is hard to build a serverless cloud offering that provides scalability. Hence, in JMS cloud services, the customer has to set up the routing and role-based access control to the specific brokers. Such a cloud offering is not serverless; it is cloud-washing! But there is no other option, as the JMS API is not designed like Kafka with one big distributed cluster.

In Kafka, the situation is different. As Kafka is a scalable distributed system, cloud providers can build cloud-native serverless offerings. Building such a fully managed infrastructure is still super hard. Hence, evaluate the product, not just the marketing slogans!

Every Kafka cloud service is marketed as “fully managed” or “serverless” but most are NOT. Instead, most vendors just provision the infrastructure and let you operate the cluster and take over the support risk. On the other side, some fully managed Kafka offerings are super limited in functionality (like allowing a very limited number of partitions).

Some cloud vendors even exclude Kafka support from their Kafka cloud offerings. Insane, but true. Check the terms and conditions as part of your evaluation.

9. Java/JVM vs. any programming language

TL;DR: JMS focuses on the Java ecosystem for JVM programming languages. Kafka is independent of programming languages.

As the name JMS (= Java Message Service) says, JMS was officially written only for Java. Some broker vendors support their own APIs and clients for other languages, but these are proprietary to that vendor. Almost all serious JMS projects I have seen in the past use Java code.

Apache Kafka also only provides a Java client. But vendors and the community provide other language bindings for almost every programming language, plus a REST API for HTTP communication for producing/consuming events to/from Kafka. For instance, check out the blog post “12 Programming Languages Walk into a Kafka Cluster” to see code examples in Java, Python, Go, .NET, Ruby, node.js, Groovy, etc.

The true decoupling of the Kafka backend enables very different client applications to speak with each other, no matter what programming languages one uses. This flexibility allows for building a proper domain-driven design (DDD) with a microservices architecture leveraging Kafka as the central nervous system.

10. Single JMS deployment vs. multi-region (including hybrid and multi-cloud) Kafka replication

TL;DR: The JMS API is a client specification for communication between the application and the broker. Kafka is a distributed system that enables various architectures for hybrid and multi-cloud use cases.

JMS is a client specification, while multi-data center replication is a broker function. I won’t go deep here and put it simply: JMS message brokers are not built for replication scenarios across regions, continents, or hybrid/multi-cloud environments.

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Various scenarios require multi-cluster Kafka solutions. Specific requirements and trade-offs need to be looked at.

Kafka technologies like MirrorMaker (open source) or Confluent Cluster Linking (commercial) enable use cases such as disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka deployments.
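As a rough idea of what such a replication setup looks like, here is a minimal, hypothetical MirrorMaker 2 configuration (mm2.properties) for an active/passive disaster recovery scenario. The cluster names, hostnames, and the decision to replicate all topics and consumer groups are assumptions for illustration:

```properties
# Two clusters: "primary" (active) and "dr" (passive target for disaster recovery).
clusters = primary, dr

primary.bootstrap.servers = primary-broker:9092   # assumption: hostname
dr.bootstrap.servers = dr-broker:9092             # assumption: hostname

# Replicate all topics and consumer group offsets from primary to dr.
primary->dr.enabled = true
primary->dr.topics = .*
primary->dr.groups = .*
```

You would typically run this with the connect-mirror-maker.sh script shipped with Apache Kafka; Cluster Linking achieves a similar result without the extra Kafka Connect infrastructure.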

I covered hybrid cloud architectures in various other blog posts. “Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure” is a great example.

Slide deck and video recording

I created a slide deck and video recording if you prefer learning or sharing that kind of material instead of a blog post:


JMS and Kafka solve distinct problems!

The ten comparison criteria show that JMS and Kafka are very different things. While the two overlap (e.g., messaging, real-time, mission-critical workloads), they use different technical capabilities, features, and architectures and support different additional use cases.

In short, use a JMS broker for simple and low-volume messaging from A to B. Kafka is usually a real-time data hub between many data sources and data sinks. Many people call it the central real-time nervous system of the enterprise architecture.

The data integration and data processing capabilities of Kafka at any scale with true decoupling and event replayability are the major differences from JMS-based MQ systems.

However, especially in the serverless cloud, don’t fear that Kafka is too powerful (and complex). Serverless Kafka projects often start very cheaply at a very low volume, with no operations burden. They can then scale with your growing business without the need to re-architect the application.

Understand the technical differences between a JMS-based message broker and data streaming powered by Apache Kafka. Evaluate both options to find the right tool for the problem. Within messaging or data streaming, do further detailed evaluations. Every message broker is different even though they all are JMS compliant. In the same way, all Kafka products and cloud services are different regarding features, support, and cost.

Do you use JMS-compliant message brokers? What are the use cases and limitations? When did you or do you plan to use Apache Kafka instead? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Comparison: JMS Message Queue vs. Apache Kafka appeared first on Kai Waehner.

]]>
Disaster Recovery with Kafka across the Edge and Hybrid Cloud (QCon Talk) https://www.kai-waehner.de/blog/2022/04/06/disaster-recovery-kafka-across-edge-hybrid-cloud-qcon-talk/ Wed, 06 Apr 2022 11:15:46 +0000 https://www.kai-waehner.de/?p=4434 I spoke at QCon London in April 2022 about building disaster recovery and resilient real-time enterprise architectures with Apache Kafka. This blog post summarizes the use cases, architectures, and real-world examples. The slide deck and video recording of the presentation is included as well.

The post Disaster Recovery with Kafka across the Edge and Hybrid Cloud (QCon Talk) appeared first on Kai Waehner.

]]>
I spoke at QCon London in April 2022 about building disaster recovery and resilient real-time enterprise architectures with Apache Kafka. This blog post summarizes the use cases, architectures, and real-world examples. The slide deck and video recording of the presentation are included as well.

What is QCon?

QCon is a leading software development conference that has been held across the globe for 16 years. It provides a realistic look at what is trending in tech. The QCon events are organized by InfoQ, a well-known website for professional software development with over two million unique visitors per month.

QCon in 2022 uncovers emerging software trends and practices. Developers and architects learn how to solve complex engineering challenges without the product pitches.

There is no Call for Papers (CfP) for QCon. The organizers invite trusted speakers to talk about trends, best practices, and real-world stories. This makes QCon so strong and respected in the software development community.

QCon London 2022

Disaster Recovery and Resiliency with Apache Kafka

Apache Kafka is the de facto data streaming platform for analytical AND transactional workloads. Multiple options exist to design Kafka for resilient applications. For instance, MirrorMaker 2 and Confluent Replicator enable uni- or bi-directional real-time replication between independent Kafka clusters in different data centers or clouds.

Cluster Linking is a more advanced and straightforward option from Confluent leveraging the native Kafka protocol instead of additional infrastructure and complexity using Kafka Connect (like MirrorMaker 2 and Replicator).

Stretching a single Kafka cluster across multiple regions is the best option to guarantee no downtime and seamless failover in the case of a disaster. However, it is hard to operate and is only recommended across larger distances with enhanced add-ons to open-source Kafka that keep the stretched cluster consistent, stable, and mission-critical:

Disaster Recovery and Resiliency across Multi Region with Apache Kafka

QCon Presentation: Disaster Recovery with Apache Kafka

In my QCon talk, I intentionally showed the broad spectrum of real-world success stories across industries for data streaming with Apache Kafka from companies such as BMW, JPMorgan Chase, Robinhood, Royal Caribbean, and Devon Energy.

Best practices explored how to build resilient enterprise architectures with disaster recovery, keeping RPO (Recovery Point Objective) and RTO (Recovery Time Objective) in mind. The audience learned how to get SLAs and requirements for downtime and data loss right.

The examples looked at serverless cloud offerings integrating to the IoT edge, hybrid retail architectures, and the disconnected edge in military scenarios.

The agenda looks like this:

  1. Resilient enterprise architectures
  2. Real-time data streaming with the Apache Kafka ecosystem
  3. Cloud-first and serverless Industrial IoT in automotive
  4. Multi-region infrastructure for core banking
  5. Hybrid cloud for customer experiences in retail
  6. Disconnected edge for safety and security in the public sector

Slide Deck from QCon Talk:

Here is the slide deck of my presentation from QCon London 2022:

We also had a great panel that discussed lessons learned from building resilient applications on the code and infrastructure level, plus the organizational challenges and best practices:

QCon Panel about Resilient Architectures

Video Recording from QCon Talk:

With the risk of Covid in mind, InfoQ decided not to record QCon sessions live.

Kai Waehner speaking at QCon London April 2022 about Resiliency with Apache Kafka

Instead, a pre-recorded video had to be submitted by the speakers. The video recording is already available for QCon attendees (no matter if on-site in London or at the QCon Plus virtual event):

Disaster Recovery at the Edge and in Hybrid Data Streaming Architectures with Apache Kafka (QCon Talk)

QCon makes conference talks available for free some time after the event. I will update this post with the free link as soon as it is available.

Disaster Recovery with Apache Kafka across all Industries

I hope you enjoyed the slides and video on this exciting topic. Hybrid and global Kafka infrastructures for disaster recovery and other use cases are the norm, not exceptions.

Real-time data beats slow data. That is true in almost every use case. Hence, data streaming with the de facto standard Apache Kafka gets adopted more and more across all industries.

How do you leverage data streaming with Apache Kafka for building resilient applications and enterprise architectures? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Disaster Recovery with Kafka across the Edge and Hybrid Cloud (QCon Talk) appeared first on Kai Waehner.

]]>
Streaming ETL with Apache Kafka in the Healthcare Industry https://www.kai-waehner.de/blog/2022/04/01/streaming-etl-with-apache-kafka-healthcare-pharma-industry/ Fri, 01 Apr 2022 05:47:00 +0000 https://www.kai-waehner.de/?p=4402 IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. This is part three: Streaming ETL. Examples include Babylon Health and Bayer.

The post Streaming ETL with Apache Kafka in the Healthcare Industry appeared first on Kai Waehner.

]]>
IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part three: Streaming ETL. Examples include Babylon Health and Bayer.

Streaming ETL with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains. Use cases often overlap. I tried to categorize a few real-world deployments into different technical scenarios and added a few real-world examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Streaming ETL with Apache Kafka

Streaming ETL is similar to concepts you might know from traditional ETL tools. I have already explored how data streaming with Kafka differs from data integration tools and iPaaS cloud services. The critical difference is that you leverage a single platform for data integration and processing at scale in real-time. There is no need to combine several platforms to achieve this. The result is a Kappa architecture that enables real-time but also batch workloads with a single integration architecture.

Streaming ETL with Apache Kafka Streams Connect ksqlDB

Streaming ETL with Kafka combines different components and features:

  • Kafka Connect as Kafka-native integration framework
  • Kafka Connect source and sink connectors to consume and produce data from/to any other database, application, or API
  • Single Message Transform (SMT) – an optional Kafka Connect feature – to process (filter, change, remove, etc.) incoming or outgoing messages within the connector deployment
  • Kafka Streams or ksqlDB for continuous data processing in real-time at scale for stateless or stateful ETL jobs
  • Data governance via schema management, enforcement and versioning using the Schema Registry
  • Security and access control using features like role-based access control, audit logs, and end-to-end encryption

In the cloud, you can leverage a serverless Kafka offering for the whole Streaming ETL pipeline. Confluent Cloud fully manages Kafka’s end-to-end infrastructure, including connectors, ksqlDB workloads, data governance, and security.
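To make the idea of a stateless Streaming ETL step (mentioned in the list above) more tangible, here is a minimal Kafka Streams sketch in Java that drops empty events and masks a PII field before writing to a curated topic. The topic names, application id, and the naive regex-based masking are assumptions for illustration only; they are not how Babylon Health or Bayer implement their pipelines:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamingEtlSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "patient-events-etl");  // assumption: app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumption: local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("patient-events-raw");     // assumption: input topic

        raw.filter((key, value) -> value != null && !value.trim().isEmpty())    // drop empty events
           .mapValues(StreamingEtlSketch::maskNationalId)                       // stateless transformation step
           .to("patient-events-curated");                                       // assumption: output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    // Naive masking for illustration only; real pipelines would rely on schemas (e.g., Avro) and proper parsing.
    private static String maskNationalId(String json) {
        return json.replaceAll("(\"nationalId\"\\s*:\\s*\")[^\"]+(\")", "$1***$2");
    }
}
```

In a real deployment, such a job would typically run alongside Kafka Connect connectors and be governed by the Schema Registry, as described in the list above.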

One last general note: Don’t Design for Data at Rest to Reverse it! Learn more here: “When to Use Reverse ETL and when it is an Anti-Pattern“. Instead, use real-time Streaming ETL for Data in Motion and the Kappa architecture from scratch.

Let’s look at a few real-world deployments in the healthcare sector.

Babylon Health – PII and GDPR-compliant Security

Babylon Health is a digital-first health service provider and value-based care company that combines an artificial intelligence-powered platform with virtual clinical operations for patients. Patients are connected with health care professionals through its web and mobile application.

Babylon’s mission is to put an accessible and affordable health service in the hands of every person on earth. For that mission, Babylon built an agile microservice architecture with the Kafka ecosystem:

Kafka for Streaming ETL at Babylon Health

Here are the “wonders of working” in Healthcare for Babylon (= reasons to choose Kafka):

  • Real-time data processing
  • Replayability of historical information
  • Order matters and is ensured with guaranteed ordering
  • GDPR and data ownership for PII compliant security
  • Data governance via the schema registry to provide true decoupling and access via many programming languages like Java, Python, and Ruby

Bayer – Data Integration and Processing in R&D

Bayer AG is a German multinational pharmaceutical and life sciences company and one of the largest pharmaceutical companies in the world. They leverage Kafka in various use cases and business domains.

The following scenario is from the research and development department of the pharma business unit. Their focus areas are cardiovascular diseases, oncology, and women’s health. The division employs over 7,500 R&D people and spends over 2.75 billion euros on R&D.

The use case Bayer presented at a recent Kafka Summit is about analyzing clinical trials, patents, reports, news, and literature leveraging the Kafka ecosystem. The R&D team processes 250 Million documents from 30+ individual data sources. The data includes 7 TB of raw text-rich data with daily updates, additions, and deletions. Algorithms and data evolve. Bayer needs to completely reprocess the data regularly. Various document streams with different formats and schemas flow through several text processing and enrichment steps.

Research and Development from Molecules to Medicine at Bayer

Scalable, reliable Kafka pipelines with Kafka Streams (Java) and Faust (Python) replaced custom, error-prone, non-scalable scripts. Schemas are used as the data interface to ensure data governance. Avro is the first-class-citizen data format, enabling compression and better throughput.

Streaming ETL Pipeline with Apache Kafka at Bayer

The true decoupling of Kafka in conjunction with the Schema Registry guarantees interoperability among different components and technologies (Java, Python, commercial tools, open source, scientific, proprietary).

Streaming ETL with Kafka for Real-Time Data Integration at any Scale

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka Ecosystem for Streaming ETL. Real-world deployments from Babylon Health and Bayer showed how enterprises successfully deploy Kafka for different enterprise architecture use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Streaming ETL with Apache Kafka in the Healthcare Industry appeared first on Kai Waehner.

]]>