Design Pattern Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/design-pattern/

Replacing Legacy Systems, One Step at a Time with Data Streaming: The Strangler Fig Approach
https://www.kai-waehner.de/blog/2025/03/27/replacing-legacy-systems-one-step-at-a-time-with-data-streaming-the-strangler-fig-approach/
Thu, 27 Mar 2025 06:37:24 +0000

Modernizing legacy systems doesn’t have to mean a risky big-bang rewrite. This blog explores how the Strangler Fig Pattern, when combined with data streaming, enables gradual, low-risk transformation—unlocking real-time capabilities, reducing complexity, and supporting scalable, cloud-native architectures. Discover how leading organizations are using this approach to migrate at their own pace, stay compliant, and enable new business models. Plus, why Reverse ETL falls short and streaming is the future of IT modernization.

Organizations looking to modernize legacy applications often face a high-stakes dilemma: Do they attempt a complete rewrite or find a more gradual, low-risk approach? Enter the Strangler Fig Pattern, a method that systematically replaces legacy components while keeping the existing system running. Unlike the “Big Bang” approach, where companies try to rewrite everything at once, the Strangler Fig Pattern ensures smooth transitions, minimizes disruptions, and allows businesses to modernize at their own pace. Data streaming transforms the Strangler Fig Pattern into a more powerful, scalable, and truly decoupled approach. Let’s explore why this approach is superior to traditional migration strategies and how real-world enterprises like Allianz are leveraging it successfully.

The Strangler Fig Design Pattern - Migration and Replacement of Legacy IT Applications with Data Streaming using Apache Kafka

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming architectures and use cases, including various use cases from the retail industry.

What is the Strangler Fig Design Pattern?

The Strangler Fig Pattern is a gradual modernization approach that allows organizations to replace legacy systems incrementally. The pattern was coined and popularized by Martin Fowler to avoid risky “big bang” system rewrites.

Inspired by the way strangler fig trees grow around and eventually replace their host, this pattern surrounds the old system with new services until the legacy components are no longer needed. By decoupling functionalities and migrating them piece by piece, businesses can minimize disruptions, reduce risk, and ensure a seamless transition to modern architectures.

Strangler Fig Pattern to Integrate, Migrate, Replace

When combined with data streaming, this approach enables real-time synchronization between old and new systems, making the migration even smoother.

Why Strangler Fig is Better than a Big Bang Migration or Rewrite

Many organizations have learned the hard way that rewriting an entire system in one go is risky. A Big Bang migration or rewrite often:

  • Takes years to complete, leading to stale requirements by the time it’s done
  • Disrupts business operations, frustrating customers and teams
  • Requires a high upfront investment with unclear ROI
  • Involves hidden dependencies, making the transition unpredictable

The Strangler Fig Pattern takes a more incremental approach:

  • It allows gradual replacement of legacy components one service at a time
  • Reduces business risk by keeping critical systems operational during migration
  • Enables continuous feedback loops, ensuring early wins
  • Keeps costs under control, as teams modernize based on priorities

Here is an example from the Industrial IoT space for the strangler fig pattern leveraging data streaming to modernize OT middleware:

OT Middleware Integration, Offloading and Replacement with Data Streaming for IoT and IT/OT

If you come from the traditional IT world (banking, retail, etc.) and don’t care about IoT, then you can explore my article about mainframe integration, offloading and replacement with data streaming.

Instead of replacing everything at once, this method surrounds the old system with new components until the legacy system is fully replaced—just like a strangler fig tree growing around its host.

Better Than Reverse ETL: A Migration with Data Consistency across Operational and Analytical Applications

Some companies attempt to work around legacy constraints using Reverse ETL—extracting data from analytical systems and pushing it back into modern operational applications. On paper, this looks like a clever workaround. In reality, Reverse ETL is a fragile, high-maintenance anti-pattern that introduces more complexity than it solves.

Data at Rest and Reverse ETL

Reverse ETL carries several critical flaws:

  • Batch-based by nature: Data remains at rest in analytical lakehouses like Snowflake, Databricks, Google BigQuery, Microsoft Fabric or Amazon Redshift. It is then periodically moved—usually every few hours or once a day—back into operational systems. This results in outdated and stale data, which is dangerous for real-time business processes.
  • Tightly coupled to legacy systems: Reverse ETL pipelines still depend on the availability and structure of the original legacy systems. A schema change, performance issue, or outage upstream can break downstream workflows—just like with traditional ETL.
  • Slow and inefficient: It introduces latency at multiple stages, limiting the ability to react to real-world events at the right moment. Decision-making, personalization, fraud detection, and automation all suffer.
  • Not cost-efficient: Reverse ETL tools often introduce double processing costs—you pay to compute and store the data in the warehouse, then again to extract, transform, and sync it back into operational systems. This increases both financial overhead and operational burden, especially as data volumes scale.

In short, Reverse ETL is a short-term fix for a long-term challenge. It’s a temporary bridge over the widening gap between real-time operational needs and legacy infrastructure.

Many modernization efforts fail because they tightly couple old and new systems, making transitions painful. Data streaming with Apache Kafka and Flink changes the game by enabling real-time, event-driven communication.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

1. True Decoupling of Old and New Systems

Data streaming using Apache Kafka with its event-driven streaming and persistence layer enables organizations to:

  • Decouple producers (legacy apps) from consumers (modern apps)
  • Process real-time and historical data without direct database dependencies
  • Allow new applications to consume events at their own pace

This avoids dependency on old databases and enables teams to move forward without breaking existing workflows.
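To make this concrete, here is a minimal, hypothetical sketch of the idea in plain Java with the standard Kafka clients: a legacy application publishes change events to a Kafka topic, and a new microservice consumes them at its own pace. The topic name, serializers, and configuration values are illustrative assumptions, not a prescribed setup.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class StranglerFigDecouplingSketch {

    public static void main(String[] args) {
        // Legacy side: publish change events to a topic instead of exposing the old database directly.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("customer-events", "customer-42", "{\"status\":\"UPDATED\"}"));
        }

        // Modern side: a new microservice consumes the same events at its own pace,
        // without any direct dependency on the legacy database.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "new-crm-service"); // each new application uses its own consumer group
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        consumerProps.put("auto.offset.reset", "earliest"); // replay history when onboarding a new service

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("customer-events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("Migrating %s -> %s%n", r.key(), r.value()));
        }
    }
}
```

Because the topic persists the events, additional consumers can be added later and replay the same history, which is exactly what makes the step-by-step replacement possible.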

2. Real-Time Replication with Persistence

Unlike traditional migration methods, data streaming supports:

  • Real-time synchronization of data between legacy and modern systems
  • Durable persistence to handle retries, reprocessing, and recovery
  • Scalability across multiple environments (on-prem, cloud, hybrid)

This ensures data consistency without downtime, making migrations smooth and reliable.

3. Supports Any Technology and Communication Paradigm

Data streaming’s power lies in its event-driven architecture, which supports any integration style—without compromising scalability or real-time capabilities.

The data product approach using Kafka Topics with data contracts handles:

  • Real-time messaging for low-latency communication and operational systems
  • Batch processing for analytics, reporting and AI/ML model training
  • Request-response for APIs and point-to-point integration with external systems
  • Hybrid integration—syncing legacy databases with cloud apps, uni- or bi-directionally

This flexibility lets organizations modernize at their own pace, using the right communication pattern for each use case while unifying operational and analytical workloads on a single platform.

4. No Time Pressure – Migrate at Your Own Pace

One of the biggest advantages of the Strangler Fig Pattern with data streaming is flexibility in timing.

  • No need for overnight cutovers—migrate module by module
  • Adjust pace based on business needs—modernization can align with other priorities
  • Handle delays without data loss—thanks to durable event storage

This ensures that companies don’t rush into risky migrations but instead execute transitions with confidence.

5. Intelligent Processing in the Data Migration Pipeline

In a Strangler Fig Pattern, it’s not enough to just move data from old to new systems—you also need to transform, validate, and enrich it in motion.

While Apache Kafka provides the backbone for real-time event streaming and durable storage, Apache Flink adds powerful stream processing capabilities that elevate the modernization journey.

With Apache Flink, organizations benefit from:

  • Real-time preprocessing: Clean, filter, and enrich legacy data before it’s consumed by modern systems.
  • Data product migration: Transform old formats into modern, domain-driven event models.
  • Improved data quality: Validate, deduplicate, and standardize data in motion.
  • Reusable business logic: Centralize processing logic across services without embedding it in application code.
  • Unified streaming and batch: Support hybrid workloads through one engine.

Apache Flink allows you to roll out trusted, enriched, and well-governed data products—gradually, without disruption. Together with Kafka, it provides the real-time foundation for a smooth, scalable transition from legacy to modern.
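As an illustration, the following sketch uses Flink's DataStream API to read raw legacy events from one Kafka topic, filter and clean them in motion, and write a curated data product to a new topic. The topic names and the trivial transformation are assumptions for demonstration; a real pipeline would apply domain-specific validation, deduplication, and enrichment with proper schemas.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LegacyEventCleansingJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: raw events produced by (or captured from) the legacy system.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")          // assumption: local broker
                .setTopics("legacy-claims-raw")                 // assumption: illustrative topic name
                .setGroupId("legacy-cleansing-job")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> raw = env.fromSource(source, WatermarkStrategy.noWatermarks(), "legacy-claims");

        // Preprocessing in motion: drop empty records and normalize the payload
        // (stand-in for real validation, deduplication, and enrichment logic).
        DataStream<String> curated = raw
                .filter(value -> value != null && !value.isBlank())
                .map(String::trim);

        // Sink: a curated topic that modern services consume as a data product.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("claims-curated")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        curated.sinkTo(sink);
        env.execute("Strangler Fig - Legacy Claims Cleansing");
    }
}
```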

Allianz’s Digital Transformation and Transition to Hybrid Cloud: An IT Modernization Success Story

Allianz, one of the world’s largest insurers, set out to modernize its core insurance systems while maintaining business continuity. A full rewrite was too risky, given the critical nature of insurance claims and regulatory requirements. Instead, Allianz embraced the Strangler Fig Pattern with data streaming.

This approach allows an incremental, low-risk transition. By implementing data streaming with Kafka as an event backbone, Allianz could gradually migrate from legacy mainframes to a modern, scalable cloud architecture. This ensured that new microservices could process real-time claims data, improving speed and efficiency without disrupting existing operations.

Allianz Core Insurance Layer with Data Streaming for Integration Architecture Powered by Confluent using Apache Kafka
Source: Allianz at Data in Motion Tour Frankfurt 2022

To achieve this IT modernization, Allianz leveraged automated microservices, real-time analytics, and event-driven processing to enhance key operations, such as pricing, fraud detection, and customer interactions.

A crucial component of this shift was the Core Insurance Service Layer (CISL), which enabled decoupling applications via a data streaming platform to ensure seamless integration across various backend systems.

With the goal of migrating over 75% of its applications to the cloud, Allianz significantly increased agility, reduced operational complexity, and minimized technical debt, positioning itself for long-term digital success, as you can read (in German) in the CIO.de article:

As one of the largest insurers in the world, Allianz plans to migrate over 75 percent of its applications to the cloud and modernize its core insurance system.

Event-Driven Innovation: Community and Knowledge Sharing at Allianz

Beyond technology, Allianz recognized that successful modernization also required cultural transformation. To drive internal adoption of event-driven architectures, the company launched Allianz Eventing Day—an initiative to educate teams on the benefits of Kafka-based streaming.

Allianz Eventing Day about Apache Kafka and Data Streaming in Insurance

Hosted in partnership with Confluent, the event brought together Allianz experts and industry leaders to discuss real-world implementations, best practices, and future innovations. This collaborative environment reinforced Allianz’s commitment to continuous learning and agile development, ensuring that data streaming became a core pillar of its IT strategy.

Allianz also extended this engagement to the broader insurance industry, organizing the first Insurance Roundtable on Event-Driven Architectures with top insurers from across Germany and Switzerland. Experts from Allianz, Generali, HDI, Swiss Re, and others exchanged insights on topics like real-time data analytics, API decoupling, and event discovery. The discussions highlighted how streaming technologies drive competitive advantage, allowing insurers to react faster to customer needs and continuously improve their services. By embracing domain-driven design (DDD) and event storming, Allianz ensured that event-driven architectures were not just a technical upgrade, but a fundamental shift in how insurance operates in the digital age.

The Future of IT Modernization and Legacy Migrations with Strangler Fig using Data Streaming

The Strangler Fig Pattern is the most pragmatic approach to modernizing enterprise systems, and data streaming makes it even more powerful.

By decoupling old and new systems, enabling real-time synchronization, and supporting multiple architectures, data streaming with Apache Kafka and Flink provides the flexibility, reliability, and scalability needed for successful migrations and long-term integrations between legacy and cloud-native applications.

As enterprises continue to evolve, this approach ensures that modernization doesn’t become a bottleneck, but a business enabler. If your organization is considering legacy modernization, data streaming is the key to making the transition seamless and low-risk.

Are you ready to transform your legacy systems without the risks of a big bang rewrite? What’s one part of your legacy system you’d “strangle” first—and why? If you could modernize just one workflow with real-time data tomorrow, what would it be?

Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And download my free book about data streaming architectures, use cases and success stories in the retail industry.

When to use Request-Response with Apache Kafka?
https://www.kai-waehner.de/blog/2022/06/03/apache-kafka-request-response-vs-cqrs-event-sourcing/
Fri, 03 Jun 2022 07:35:00 +0000

How can I do request-response communication with Apache Kafka? That’s one of the most common questions I get regularly. This blog post explores when (not) to use this message exchange pattern, the differences between synchronous and asynchronous communication, the pros and cons compared to CQRS and event sourcing, and how to implement request-response within the data streaming infrastructure.

Request Response Data Exchange with Apache Kafka vs CQRS and Event Sourcing

Message Queue Patterns in Data Streaming with Apache Kafka

Before I go into this post, I want to make you aware that this content is part of a blog series about “JMS, Message Queues, and Apache Kafka”:

I will link the other posts here as soon as they are available. Please follow my newsletter to get updated in real-time about new posts. (no spam or ads)

What is the Request-Response (Request-Reply) Message Exchange Pattern?

Request-response (sometimes called request-reply) is one of the primary methods computers use to communicate with each other in a network.

The first application sends a request for some data. The second application responds to the request. It is a message exchange pattern in which a requestor sends a request message to a replier system, which receives and processes the request, ultimately returning a message in response.

Request-reply is inefficient and can suffer from high latency depending on the use case. HTTP, or better gRPC, is suitable for some use cases. For streaming data with Kafka, request-reply is “replaced” by the CQRS (Command Query Responsibility Segregation) pattern. CQRS is not possible with the JMS API, since JMS provides no state or event sourcing capabilities. Let’s dig deeper into these statements.

Request-Response (HTTP) vs. Data Streaming (Kafka)

Prior to discussing synchronous and asynchronous communication, let’s explore the concepts behind request-response and data streaming. Traditionally, these are two different paradigms:

Request-response (HTTP):

  • Typically synchronous
  • Point to point
  • High latency (compared to data streaming)
  • Pre-defined API

Data streaming (Kafka):

  • Continuous processing
  • Often asynchronous
  • Event-driven
  • Low latency
  • General-purpose events

Most architectures need request-response for point-to-point communication (e.g., between a server and mobile app) and data streaming for continuous data processing. With this in mind, let’s look at use cases where HTTP is used with Kafka.

Synchronous vs. Asynchronous Communication

The request-response message exchange pattern is often implemented purely synchronously. However, request-response may also be implemented asynchronously, with a response being returned at some unknown later time.

Let’s look at the most prevalent examples of message exchanges: REST, message queues, and data streaming.

Synchronous Restful APIs (HTTP)

A web service is the primary technology behind synchronous communication in application development and enterprise application integration. While WSDL and SOAP were dominant many years ago, REST / HTTP is the communication standard in almost all web services today.

I won’t go into the “HTTP vs. REST” debate in this post. In short, REST (Representational state transfer) has been employed throughout the software industry and is a widely accepted set of guidelines for creating stateless, reliable web APIs. A web API that obeys the REST constraints is informally described as RESTful. RESTful web APIs are typically loosely based on HTTP methods.

Synchronous web service calls over HTTP hold a connection open and wait until the response is delivered or the timeout period expires.

The latency of HTTP web services is relatively high. It requires setting up and tearing down a TCP connection for each request-response iteration when using HTTP. To be clear: The latency is still good enough for many use cases.

Another possible drawback is that the HTTP requests might block waiting for the head of the queue request to be processed and may require HTTP circuit breakers set up on the server if there are too many outstanding HTTP requests.

Asynchronous Message Queue (IBM MQ, RabbitMQ)

The message queue paradigm is a sibling of the publisher/subscriber design pattern and is typically one part of a more extensive message-oriented middleware system. Most messaging systems support both the publisher/subscriber and message queue models in their API, e.g., Java Message Service (JMS). Read the “JMS Message Queue vs. Apache Kafka” article if you are new to this discussion.

Producers and consumers are decoupled from each other and communicate asynchronously. The message queue stores events until they are consumed successfully.

Most message queue middleware products provide built-in request-response APIs. Its communication is asynchronous. The implementation uses correlation IDs.

The request-response API (for example, in JMS) creates a temporary queue or topic that is referenced in the request; the consumer takes the reply-to endpoint from the request and sends its response there. The correlation ID is used to match responses to the requests of a single requestor. These temporary queues or topics are only available while the requestor is alive and holds a session.

Such an implementation with a temporary queue or topic does not make sense in Kafka. I have actually seen enterprises trying to do this. Kafka does not work like that. The consequence was way too many partitions and topics in the Kafka cluster. Scalability and performance issues were the consequence.

Asynchronous Data Streaming (Apache Kafka)

Data streaming continuously processes ingested events from data sources. Such data should be processed incrementally using stream processing techniques without having access to all the data.

The asynchronous communication paradigm is like message queues. Contrary to message queues, data streaming provides long-term storage of events and replayability of historical information. The consequence is a true decoupling between producers and consumers. In most Apache Kafka deployments, several producers and consumers with very different communication paradigms and latency capabilities send and read events.

Apache Kafka does not provide request-response APIs built-in. This is not necessarily a bad thing, as some people think. Data streaming provides different design patterns. That’s the main reason for this blog post! Let’s explore the trade-offs of the request-response pattern in messaging systems and understand alternative approaches that fit better into a data streaming world. But this post also explores how to implement asynchronous or synchronous request-reply with Kafka.

But keep in mind: Don’t re-use your “legacy knowledge” about HTTP and MQ and try to re-build the same patterns with Apache Kafka. Having said this, request-response is possible with Kafka, too. More on this in the following sections.

Request-Reply vs. CQRS and Event Sourcing

CQRS (Command Query Responsibility Segregation) states that every method should either be a command that performs an action or a query that returns data to the caller, but not both. Services become truly decoupled from each other.

Martin Fowler has a nice diagram for CQRS:

CQRS Design Pattern

Event sourcing is an architectural pattern in which entities do not track their internal state using direct serialization or object-relational mapping, but by reading and committing events to an event store.

When event sourcing is combined with CQRS and domain-driven design, aggregate roots are responsible for validating and applying commands (often by having their instance methods invoked from a Command Handler) and then publishing events.

With CQRS, the state is updated against every relevant event message. Therefore, the state is always known. Querying the state that is stored in the materialized view (for example, a KTable in Kafka Streams) is efficient. With request-response, the server must calculate or determine the state for every request. With CQRS, it is calculated/updated only once, at the time the relevant event occurs, regardless of the number of state queries.

These principles fit perfectly into the data streaming world. Apache Kafka is a distributed storage that appends incoming events to the immutable commit log. True decoupling and replayability of events are built into the Kafka infrastructure out-of-the-box. Most modern microservice architectures with domain-driven design are built with Apache Kafka, not REST or MQ.
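A minimal Kafka Streams sketch of this idea: the current state is continuously materialized as a KTable from the event stream, and queries read the local state store instead of recomputing anything per request. The topic and store names are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class CqrsMaterializedViewSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-state-view");  // assumption
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // The "command" side writes order events to the topic; the KTable is the
        // continuously updated materialized view (the "query" side).
        builder.table("order-events",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("orders-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive query: read the current state directly from the local store.
        // A real application waits until the instance is in the RUNNING state before querying.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("orders-store", QueryableStoreTypes.keyValueStore()));
        System.out.println("Current state of order-42: " + store.get("order-42"));
    }
}
```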

Don’t use Request-Response in Kafka if not needed!

If you build a modern enterprise architecture and new applications, apply the natural design patterns that work best with the technology. Remember: Data streaming is a different technology than web services and message queues! CQRS with event sourcing is the best pattern for most use cases in the Kafka world:

CQRS and Event Sourcing instead of Request Response in Apache Kafka

Do not use the request-response concept with Kafka if you do not really have to! Kafka was built for streaming data and true decoupling between various producers and consumers.

This is true even for transactional workloads. A transaction does NOT need synchronous communication. The Kafka API supports mission-critical transactions (although it is asynchronous and distributed by nature). Think about making a bank payment. It is never synchronous, but a complex business process with many independent events within and across organizations.

Synchronous vs. Asynchronous Request-Response with Apache Kafka

After I explained that request-response should not be the first idea when building a new Kafka application, it does not mean it is not possible. And sometimes, it is the better, simpler, or faster approach to solve a problem. Hence, let’s look at examples of synchronous and asynchronous request-response implementations with Kafka.

The request-reply pattern can be implemented with Kafka. But differently. Trying to do it like in a JMS message broker (with temporary queues etc.) will ultimately kill the Kafka cluster (because it works differently). Nevertheless, the underlying concepts are the same as in the JMS API, such as a correlation ID.

Asynchronous Request-Response with Apache Kafka

The Spring project and its Spring for Apache Kafka template libraries have a great example of the asynchronous request-reply pattern built with Kafka.
Check out “org.springframework.kafka.requestreply.ReplyingKafkaTemplate“. It makes it easy to create request-reply applications using the Kafka API. The example is interesting since it implements the asynchronous request-reply pattern, which is more complicated to write if you are using, for example, the JMS API.
What’s great about Spring is the availability of easy-to-use templates and method signatures. The framework enables using design patterns without custom code to implement them. For instance, here are the two Java methods for request-reply with Kafka:
RequestReplyFuture<K, V, R> sendAndReceive(ProducerRecord<K, V> record);
RequestReplyFuture<K, V, R> sendAndReceive(ProducerRecord<K, V> record, Duration replyTimeout);
The result is a ListenableFuture that is asynchronously populated with the result (or an exception, for a timeout). The result also has a sendFuture property, which is the result of calling KafkaTemplate.send(). You can use this future to determine the result of the send operation.
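For illustration, here is a hedged sketch of how the client side of this asynchronous request-reply might look with Spring Kafka. It assumes a ReplyingKafkaTemplate bean has already been configured elsewhere with a producer factory and a reply listener container, and the topic name is made up for the example.

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.springframework.kafka.requestreply.ReplyingKafkaTemplate;
import org.springframework.kafka.requestreply.RequestReplyFuture;
import org.springframework.stereotype.Service;

@Service
public class KafkaRequestReplyClient {

    private final ReplyingKafkaTemplate<String, String, String> replyingTemplate;

    public KafkaRequestReplyClient(ReplyingKafkaTemplate<String, String, String> replyingTemplate) {
        this.replyingTemplate = replyingTemplate; // assumption: configured elsewhere as a Spring bean
    }

    public String requestReply(String payload) throws Exception {
        // Spring adds the correlation ID and reply-topic headers automatically.
        ProducerRecord<String, String> request = new ProducerRecord<>("requests", payload);

        RequestReplyFuture<String, String, String> future =
                replyingTemplate.sendAndReceive(request, Duration.ofSeconds(10));

        // Blocking here is only for demonstration; the future can also be consumed asynchronously.
        ConsumerRecord<String, String> reply = future.get();
        return reply.value();
    }
}
```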

Synchronous Request-Response with Apache Kafka

Another excellent DZone article talks about synchronous request/reply using Spring Kafka templates. The example shows a Kafka service to calculate the sum of two numbers with synchronous request-response behavior to return the result:

Synchronous Request Response with Spring and Apache Kafka

Spring automatically sets a correlation ID in the producer record. This correlation ID is returned as-is by the @SendTo annotation at the consumer end.
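On the server side, the replying service can be as small as the following sketch: a listener consumes the request, and the @SendTo annotation routes the return value back to the reply topic taken from the request headers. The topic names and the addition logic are assumptions based on the DZone example described above.

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.messaging.handler.annotation.SendTo;
import org.springframework.stereotype.Component;

@Component
public class SumRequestListener {

    // Consumes a request like "3+5" and returns the reply; Spring copies the
    // correlation ID and sends the result to the reply topic from the request headers.
    @KafkaListener(topics = "sum-requests", groupId = "sum-service") // assumption: illustrative names
    @SendTo
    public String handleSumRequest(String request) {
        String[] parts = request.split("\\+");
        int sum = Integer.parseInt(parts[0].trim()) + Integer.parseInt(parts[1].trim());
        return String.valueOf(sum);
    }
}
```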

Check out the DZone post for the complete code example.

The Spring documentation for Kafka Templates has a lot of details and code examples about the Request/Reply pattern for Kafka. Using Spring, the request/reply pattern is pretty simple to implement with Kafka. If you are not using Spring, you can learn how to do request-reply with Kafka in your framework. That’s the beauty of open-source…

Combination of Data Streaming and REST API

The above examples showed how you can implement the request-response pattern with Apache Kafka. Nevertheless, it is still only the second-best approach and is often an anti-pattern for streaming data.
Data streaming and request-response REST APIs are often combined to get the best out of both worlds. I wrote a dedicated blog post about “Use Cases and Architectures for HTTP and REST APIs with Apache Kafka“.

Apache Kafka and API Management

A very common approach is to implement applications in real-time at scale with the Kafka ecosystem, but then put an API Management layer on top to expose the events as APIs to the outside world (either to another internal business domain or to a B2B 3rd-party application).

Here is an example of connecting SAP data. SAP offers dozens of options for integrating with Kafka, including Kafka Connect connectors, REST / HTTP, proprietary APIs, or 3rd-party middleware.

No matter how you get data into the streaming data hub, on the right side, the Kafka REST API is used to expose events via HTTP. An API Management solution handles the security and monetization/billing requirements on top of the Kafka interface:

SAP Integration with Apache Kafka - R3 ERP S4 Hana Ariba Concur BAPI iDoc REST SOAP Web Services Java

Read more about this discussion in the blog post “Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies?“. It covers the relation between Apache Kafka and API Management platforms like Kong, MuleSoft Anypoint, or Google’s Apigee.

Stream Exchange for Internal and External Data Sharing with Apache Kafka

After discussing the relationship between APIs, request-response communication, and Kafka, let’s explore a significant trend in the market: Data Mesh (the buzzword) and stream exchange for real-time data sharing (the problem solver).

Data Mesh is a new architecture paradigm that gets a lot of buzz these days. No single technology is the perfect fit to build a Data Mesh. An open and scalable decentralized real-time platform like Apache Kafka is often the heart of the Data Mesh infrastructure, complemented by many other data platforms to solve business problems.

Stream-native data sharing instead of using request-response and REST APIs in the middle is the natural evolution for many use cases:

Stream Exchange for Data Sharing with Apache Kafka in a Data Mesh

Learn more in the post “Streaming Data Exchange with Kafka and a Data Mesh in Motion“.

Use Data Streaming and Request-Response Together!

Most architectures need request-response for point-to-point communication (e.g., between a server and mobile app) and data streaming for continuous data processing.

Synchronous and asynchronous request-response communication can be implemented with Apache Kafka. However, CQRS and event sourcing are the better and more natural approach for data streaming most of the time. Understand the different options and their trade-offs, and use the right tool (in this case, the correct design pattern) for the job.

What is your strategy for using request-response with data streaming? How do you implement the communication in your Apache Kafka applications? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Error Handling via Dead Letter Queue in Apache Kafka
https://www.kai-waehner.de/blog/2022/05/30/error-handling-via-dead-letter-queue-in-apache-kafka/
Mon, 30 May 2022 11:34:22 +0000

Recognizing and handling errors is essential for any reliable data streaming pipeline. This blog post explores best practices for implementing error handling using a Dead Letter Queue in Apache Kafka infrastructure. The options include a custom implementation, Kafka Streams, Kafka Connect, the Spring framework, and the Parallel Consumer. Real-world case studies show how Uber, CrowdStrike, Santander Bank, and Robinhood build reliable real-time error handling at an extreme scale.

How to do Error Handling in Data Streaming

Apache Kafka became the favorite integration middleware for many enterprise architectures. Even for a cloud-first strategy, enterprises leverage data streaming with Kafka as a cloud-native integration platform as a service (iPaaS).

Message Queue Patterns in Data Streaming with Apache Kafka

Before I go into this post, I want to make you aware that this content is part of a blog series about “JMS, Message Queues, and Apache Kafka”:

I will link the other posts here as soon as they are available. Please follow my newsletter to get updated in real-time about new posts. (no spam or ads)

What is the Dead Letter Queue Integration Pattern (in Apache Kafka)?

The Dead Letter Queue (DLQ) is a service implementation within a messaging system or data streaming platform to store messages that are not processed successfully. Instead of silently dropping the message, the system moves it to a Dead Letter Queue.

The Enterprise Integration Patterns (EIP) call this design pattern Dead Letter Channel instead. We can use both terms as synonyms.

Dead Letter Channel Enterprise Integration Pattern

This article focuses on the data streaming platform Apache Kafka. The main reason for putting a message into a DLQ in Kafka is usually a bad message format or invalid/missing message content. For instance, an application error occurs if a value is expected to be an Integer, but the producer sends a String. In more dynamic environments, a “Topic does not exist” exception might be another reason why the message cannot be delivered.

Therefore, as so often, don’t simply transfer the knowledge from your existing middleware experience. Message queue middleware, such as JMS-compliant IBM MQ, TIBCO EMS, or RabbitMQ, works differently than a distributed commit log like Kafka. A DLQ is used in message queuing systems for many other reasons that do not map one-to-one to Kafka. For instance, a message in an MQ system expires because of a per-message TTL (time to live).

Hence, the main reason for putting messages into a DLQ in Kafka is a bad message format or invalid/missing message content.

Alternatives for a Dead Letter Queue in Apache Kafka

A Dead Letter Queue in Kafka is one or more Kafka topics that receive and store messages that could not be processed in another streaming pipeline because of an error. This concept allows continuing the message stream with the following incoming messages without stopping the workflow due to the error of the invalid message.

The Kafka Broker is Dumb – Smart Endpoints provide the Error Handling

The Kafka architecture does not support DLQ within the broker. Intentionally, Kafka was built on the same principles as modern microservices using the ‘dumb pipes and smart endpoints‘ principle. That’s why Kafka scales so well compared to traditional message brokers. Filtering and error handling happen in the client applications.

The true decoupling of the data streaming platform enables a much more clean domain-driven design. Each microservice or application implements its logic with its own choice of technology, communication paradigm, and error handling.

In traditional middleware and message queues, the broker provides this logic. The consequence is worse scalability and less flexibility in the domains, as only the middleware team can implement integration logic.

Custom Implementation of a Kafka Dead Letter Queue in any Programming Language

A Dead Letter Queue in Kafka is independent of the framework you use. Some components provide out-of-the-box features for error handling and Dead Letter Queues. However, it is also easy to write your Dead Letter Queue logic for Kafka applications in any programming language like Java, Go, C++, Python, etc.

The source code for a Dead Letter Queue implementation contains a try-catch block to handle expected or unexpected exceptions. The message is processed if no error occurs. If any exception occurs, the message is sent to a dedicated DLQ Kafka topic instead.

The failure cause should be added to the header of the Kafka message. The key and value should not be changed so that future re-processing and failure analysis of historical events are straightforward.
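Below is a minimal sketch of such a custom DLQ in plain Java, assuming String records and illustrative topic names: the consumer processes each record in a try-catch block and, on failure, forwards the unchanged key and value to a DLQ topic with the failure cause added as headers.

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeadLetterQueueSketch {

    private final KafkaConsumer<String, String> consumer;   // assumption: configured elsewhere
    private final KafkaProducer<String, String> dlqProducer; // assumption: configured elsewhere

    public DeadLetterQueueSketch(KafkaConsumer<String, String> consumer,
                                 KafkaProducer<String, String> dlqProducer) {
        this.consumer = consumer;
        this.dlqProducer = dlqProducer;
    }

    public void run() {
        consumer.subscribe(List.of("orders"));
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                try {
                    process(record); // business logic that may throw
                } catch (Exception e) {
                    // Key and value stay unchanged; the failure cause goes into headers
                    // so the message can be analyzed and re-processed later.
                    ProducerRecord<String, String> dlqRecord =
                            new ProducerRecord<>("orders-dlq", record.key(), record.value());
                    dlqRecord.headers().add("error.reason",
                            String.valueOf(e.getMessage()).getBytes(StandardCharsets.UTF_8));
                    dlqRecord.headers().add("error.original.topic",
                            record.topic().getBytes(StandardCharsets.UTF_8));
                    dlqProducer.send(dlqRecord);
                }
            }
        }
    }

    private void process(ConsumerRecord<String, String> record) {
        // placeholder for real processing, e.g. deserialization and validation
    }
}
```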

Out-of-the-box Kafka Implementations for a Dead Letter Queue

You don’t always need to implement your Dead Letter Queue. Many components and frameworks provide their DLQ implementation already.

With your own applications, you can usually control errors or fix code when there are errors. However, integration with 3rd party applications does not necessarily allow you to deal with errors that may be introduced across the integration barrier. Therefore, DLQ becomes more important and is included as part of some frameworks.

Built-in Dead Letter Queue in Kafka Connect

Kafka Connect is the integration framework of Kafka. It is included in the open-source Kafka download. No additional dependencies are needed (besides the connectors themselves that you deploy into the Connect cluster).

By default, the Kafka Connect task stops if an error occurs because of consuming an invalid message (like when the wrong JSON converter is used instead of the correct AVRO converter). Dropping invalid messages is another option. The latter tolerates errors.

The configuration of the DLQ in Kafka Connect is straightforward. Just set the values for the two configuration options ‘errors.tolerance’ and ‘errors.deadletterqueue.topic.name’ to the right values:
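A hedged example of what such a sink connector configuration might look like, shown here in properties-file style (the topic name is a placeholder; these error-handling options apply to sink connectors):

```properties
# Tolerate bad records instead of failing the connector task
errors.tolerance=all

# Route failed records to a dedicated DLQ topic
errors.deadletterqueue.topic.name=my-sink-dlq
errors.deadletterqueue.topic.replication.factor=3

# Add the failure reason to the record headers in the DLQ topic
errors.deadletterqueue.context.headers.enable=true

# Optionally log the error details as well
errors.log.enable=true
errors.log.include.messages=true
```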

Dead Letter Queue in Apache Kafka and Kafka Connect

The blog post ‘Kafka Connect Deep Dive – Error Handling and Dead Letter Queues‘ shows a detailed hands-on code example for using DLQs.

Kafka Connect can even be used to process the error messages in the DLQ. Just deploy another connector that consumes from the DLQ topic. For instance, suppose your application processes Avro messages and an incoming message arrives in JSON format. A connector then consumes the JSON message and transforms it into an Avro message to be re-processed successfully:

Re-processing a Dead Letter Queue in Kafka Connect


Note that Kafka Connect has no Dead Letter Queue for source connectors.

Error Handling in a Kafka Streams Application

Kafka Streams is the stream processing library of Kafka. It is comparable to other streaming frameworks, such as Apache Flink, Storm, Beam, and similar tools. However, it is Kafka-native. This means you build the complete end-to-end data streaming within a single scalable and reliable infrastructure.

If you use Java, or more generally the JVM ecosystem, to build Kafka applications, the recommendation is almost always to use Kafka Streams instead of the standard Java client for Kafka. Why?

  • Kafka Streams is “just” a wrapper around the regular Java producer and consumer API, plus plenty of additional features built-in.
  • Both are just a library (JAR file) embedded into your Java application.
  • Both are part of the open-source Kafka download – no additional dependencies or license changes.
  • Many problems are already solved out-of-the-box to build mature stream processing services (streaming functions, stateful embedded storage, sliding windows, interactive queries, error handling, and much more).

One of the built-in functions of Kafka Streams is the default deserialization exception handler. It allows you to manage records that fail to deserialize. Corrupt data, incorrect serialization logic, or unhandled record types can cause the error. The feature is not called Dead Letter Queue but solves the same problem out-of-the-box.
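A brief sketch of how this handler is typically enabled, assuming the built-in LogAndContinueExceptionHandler is sufficient; a custom handler implementing the DeserializationExceptionHandler interface could forward failed records to a DLQ topic instead. Application ID and broker address are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

public class StreamsErrorHandlingConfig {

    public static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-processing");  // assumption
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumption

        // Log records that cannot be deserialized and continue processing
        // instead of crashing the whole stream processing application.
        props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
                LogAndContinueExceptionHandler.class);
        return props;
    }
}
```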

Error Handling with Spring Kafka and Spring Cloud Stream

The Spring framework has excellent support for Apache Kafka. It provides many templates to avoid writing boilerplate code by yourself. Spring-Kafka and Spring Cloud Stream Kafka support various retry and error handling options, including time / count-based retry, Dead Letter Queues, etc.

Although the Spring framework is pretty feature-rich, it is a bit heavy and has a learning curve. Hence, it is a great fit for a greenfield project or if you are already using Spring for your projects for other scenarios.

Plenty of great blog posts exist that show different examples and configuration options. There is also the official Spring Cloud Stream example for dead letter queues. Spring allows building logic, such as DLQ, with simple annotations. This programming approach is a beloved paradigm by some developers, while others dislike it. Just know the options and choose the right one for yourself.
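As one illustrative configuration (the class names are from Spring Kafka, but the retry values and bean wiring are assumptions): a DefaultErrorHandler retries a failed record a few times and then hands it to a DeadLetterPublishingRecoverer, which by default publishes it to a "<topic>.DLT" dead letter topic.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaOperations;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class KafkaDlqConfiguration {

    @Bean
    public DefaultErrorHandler errorHandler(KafkaOperations<Object, Object> template) {
        // After the retries are exhausted, publish the failed record to "<topic>.DLT".
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);

        // Retry twice with a one-second delay before giving up (illustrative values).
        return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 2));
    }
}
```

The handler is then registered on the Kafka listener container factory; the Spring Kafka documentation covers the full wiring and additional options such as per-exception routing.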

Scalable Processing and Error Handling with the Parallel Consumer for Apache Kafka

In many customer conversations, it turns out that the main reason for asking for a dead letter queue is often handling failures when connecting to external web services or databases. Time-outs or the inability to send various requests in parallel bring down some applications. There is an excellent solution to this problem:

The Parallel Consumer for Apache Kafka is an open-source project under Apache 2.0 license. It provides a parallel Apache Kafka client wrapper with client-side queueing, a simpler consumer/producer API with key concurrency, and extendable non-blocking IO processing.

This library lets you process messages in parallel via a single Kafka Consumer, meaning you can increase Kafka consumer parallelism without increasing the number of partitions in the topic you intend to process. For many use cases, this improves both throughput and latency by reducing the load on your Kafka brokers. It also opens up new use cases like extreme parallelism, external data enrichment, and queuing.

A key feature is handling and retrying web service and database calls within a single Kafka consumer application. The parallelization avoids being limited to a single web request at a time:

Parallel Consumer for Retry and Error Handling of Database and Web Service Calls

The Parallel Consumer client has powerful retry logic. This includes configurable delays and dynamic error handling. Errors can also be sent to a dead letter queue.

Consuming Messages from a Dead Letter Queue

You are not finished after sending errors to a dead letter queue! The bad messages need to be processed or at least monitored!

A Dead Letter Queue is an excellent way to take data error processing out-of-band from the event processing, which means the error handlers can be created or evolved separately from the event processing code.

Plenty of error-handling strategies exist for using dead letter queues. The following DOs and DONTs explore the best practices and lessons learned.

Error handling strategies

Several options are available for handling messages stored in a dead letter queue:

  • Re-process: Some messages in the DLQ need to be re-processed. However, first, the issue needs to be fixed. The solution can be an automatic script, human interaction to edit the message, or returning an error to the producer asking for re-sending the (corrected) message.
  • Drop the bad messages (after further analysis): Bad messages might be expected depending on your setup. However, before dropping them, a business process should examine them. For instance, a dashboard app can consume the error messages and visualize them.
  • Advanced analytics: Instead of processing each message in the DLQ, another option is to analyze the incoming data for real-time insights or issues. For instance, a simple ksqlDB application can apply stream processing for calculations, such as the average number of error messages per hour or any other insights that help decide on the errors in your Kafka applications.
  • Stop the workflow: If bad messages are rarely expected, the consequence might be stopping the overall business process. The action can either be automated or decided by a human. Of course, stopping the workflow could also be done in the Kafka application that throws the error. The DLQ externalizes the problem and decision-making if needed.
  • Ignore: This might sound like the worst option. Just let the dead letter queue fill up and do nothing. However, even this is fine in some use cases, like monitoring the overall behavior of the Kafka application. Keep in mind that a Kafka topic has a retention time, and messages are removed from the topic after that time. Just set this up the right way for you. And monitor the DLQ topic for unexpected behavior (like filling up way too quickly).

Best Practices for a Dead Letter Queue in Apache Kafka

Here are some best practices and lessons learned for error handling using a Dead Letter Queue within Kafka applications:

  • Define a business process for dealing with invalid messages (automated vs. human)
    • Reality: Often, nobody handles DLQ messages at all
    • Alternative 1: The data owners need to receive the alerts, not just the infrastructure team
    • Alternative 2: An alert should notify the system of record team that the data was bad, and they will need to re-send/fix the data from the system of record.
    • If nobody cares or complains, consider questioning and reviewing the need for the existence of the DLQ. Instead, these messages could also be ignored in the initial Kafka application. This saves a lot of network load, infrastructure, and money.
  • Build a dashboard with proper alerts and integrate the relevant teams (e.g., via email or Slack alerts)
  • Define the error handling priority per Kafka topic (stop vs. drop vs. re-process)
  • Only push non-retryable error messages to a DLQ – connection issues are the responsibility of the consumer application.
  • Keep the original messages and store them in the DLQ (with additional headers such as the error message, time of the error, application name where the error occurred, etc.) – this makes re-processing and troubleshooting much easier.
  • Think about how many Dead Letter Queue Kafka topics you need. There are always trade-offs. But storing all errors in a single DLQ might not make sense for further analysis and re-processing.

Remember that a DLQ kills processing in guaranteed order and makes any sort of offline processing much harder. Hence, a Kafka DLQ is not perfect for every use case.

When NOT to use a Dead Letter Queue in Kafka?

Let’s explore what kinds of messages you should NOT put into a Dead Letter Queue in Kafka:

  • DLQ for backpressure handling? Using the DLQ for throttling because of a peak of a high volume of messages is not a good idea. The storage behind the Kafka log handles backpressure automatically. The consumer pulls data at its own pace (or it is misconfigured). Scale consumers elastically if possible. A DLQ does not help, even if your storage gets full. That is a capacity problem, independent of whether or not you use a DLQ.
  • DLQ for connection failures? Putting messages into a DLQ because of failed connectivity does not help (even after several retries). The following messages also cannot connect to that system. You need to fix the connection issue instead. The messages can be stored in the regular topic as long as necessary (depending on the retention time).

Schema Registry for Data Governance and Error Prevention

Last but not least, let’s explore the possibility of reducing or even eliminating the need for a Dead Letter Queue in some scenarios.

The Schema Registry for Kafka is a way to ensure clean data and prevent errors in the payloads sent by producers. It enforces the correct message structure in the Kafka producer:

Confluent Schema Registry for Data Governance

Schema Registry is a client-side check of the schema. Some implementations like Confluent Server provide an additional schema check on the broker side to reject invalid or malicious messages that come from a producer which is not using the Schema Registry.
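For illustration, here is a hedged sketch of a producer configured with Confluent’s Avro serializer and Schema Registry, so that malformed payloads are rejected on the producer side before they ever reach a consumer. The broker and registry URLs are placeholders, and the value type would typically be a generated Avro class or GenericRecord.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class SchemaRegistryProducerConfig {

    public static KafkaProducer<String, Object> avroProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());

        // The Avro serializer validates every record against the registered schema
        // and fails fast in the producer if the payload does not match.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumption: local registry

        return new KafkaProducer<>(props);
    }
}
```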

Case Studies for a Dead Letter Queue in Kafka

Let’s look at four case studies from Uber, CrowdStrike, Santander Bank, and Robinhood for real-world deployment of Dead Letter Queues in a Kafka infrastructure. Keep in mind that those are very mature examples. Not every project needs that much complexity.

Uber – Building Reliable Reprocessing and Dead Letter Queues

In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible.

Given the scope and pace at which Uber operates, its systems must be fault-tolerant and uncompromising when failing intelligently. Uber leverages Apache Kafka for various use cases at an extreme scale to accomplish this.

Building on these capabilities, the Uber Insurance Engineering team extended Kafka’s role in their existing event-driven architecture by using non-blocking request reprocessing and Dead Letter Queues to achieve decoupled, observable error handling without disrupting real-time traffic. This strategy helps their opt-in Driver Injury Protection program run reliably in over 200 cities, deducting per-mile premiums per trip for enrolled drivers.

Here is an example of Uber’s error handling. Errors trickle down through levels of retry topics until landing in the DLQ:

Error Handling in Apache Kafka with Dead Letter Queue at Uber

For more information, read Uber’s very detailed technical article: ‘Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka‘.

CrowdStrike – Handling Errors for Trillions of Events

CrowdStrike is a cybersecurity technology company based in Austin, Texas. It provides cloud workload and endpoint security, threat intelligence, and cyberattack response services.

CrowdStrike’s infrastructure processes trillions of events daily with Apache Kafka. I covered related use cases for creating situational awareness and threat intelligence in real-time at any scale in my ‘Cybersecurity with Apache Kafka blog series‘.

CrowdStrike defines three best practices to implement Dead Letter Queues and error handling successfully:

  • Store error messages in the right system: Define the infrastructure and code to capture and retrieve dead letters. CrowdStrike uses an S3 object store for their potentially vast volumes of error messages. Note that Tiered Storage for Kafka solves this problem out-of-the-box without needing another storage interface (for instance, leveraging Infinite Storage in Confluent Cloud).
  • Use automation: Put tooling in place to make remediation foolproof, as error handling can be very error-prone when done manually.
  • Document the business process and engage relevant teams: Standardize and document the process to ensure ease of use. Not all engineers will be familiar with the organization’s strategy for dealing with dead letter messages.

In a cybersecurity platform like CrowdStrike, real-time data processing at scale is crucial. This requirement is valid for error handling, too. The next cyberattack might be a malicious message that intentionally includes inappropriate or invalid content (like a JavaScript exploit). Hence, handling errors in real-time via a Dead Letter Queue is a MUST.

Santander Bank – Mailbox 2.0 for a Combination of Retry and DLQ

Santander Bank had enormous challenges with their synchronous data processing in their mailbox application to process mass volumes of data. They rearchitected their infrastructure and built a decoupled and scalable architecture called “Santander Mailbox 2.0”.

The new architecture decoupled Santander’s workloads and moved to Event Sourcing powered by Apache Kafka:

Santander Mailbox 2.0

A key challenge in the new asynchronous event-based architecture was error handling. Santander solved the issues using error-handling built with retry and DLQ Kafka topics:

Retry and DLQ Error Handling at Santander Bank

Check out the details in the Kafka Summit talk “Reliable Event Delivery in Apache Kafka Based on Retry Policy and Dead Letter Topics” from Santander’s integration partner Consdata.

Robinhood – Postgresql Database and GUI for Error Handling

Blog post UPDATE October 2022:

Robinhood, a financial services company (famous for its trading app), presented another approach for handling errors in Kafka messages at Current 2022. Instead of using only Kafka topics for error handling, they insert failed messages into a PostgreSQL database. A client application, including a CLI, fixes the issues and republishes the messages to the Kafka topic:

Robinhood Deadletter Queue with Postgresql and Kafka

Real-world use cases at Robinhood include:

  • Accounting issue that needs manual fixes
  • Back office operations upload documents, and human error results in duplicate entries
  • Runtime checks for users that fail after an order was placed but before it was executed

Currently, the error handling application is “only” usable via the command line and is relatively inflexible. New features will improve the DLQ handling in the future:

  • UI + Operational ease of use: Access controls around dead letter management (possible sensitive info in Kafka messages). Easier ordering requirements.
  • Configurable data stores: Drop-in replacements for Postgres (e.g. DynamoDB). Direct integration of DLQ Kafka topics.

Robinhood’s DLQ implementation shows that, in some scenarios, error handling is worth investing in as a dedicated project.

Reliable and Scalable Error Handling in Apache Kafka

Error handling is crucial for building reliable data streaming pipelines and platforms. Different alternatives exist for solving this problem. The solution includes a custom implementation of a Dead Letter Queue or leveraging frameworks in use anyway, such as Kafka Streams, Kafka Connect, the Spring framework, or the Parallel Consumer for Kafka.

The case studies from Uber, CrowdStrike, Santander Bank, and Robinhood showed that error handling is not always easy to implement. It needs to be thought through from the beginning when you design a new application or architecture. Real-time data streaming with Apache Kafka is compelling but only successful if you can handle unexpected behavior. Dead Letter Queues are an excellent option for many scenarios.

Do you use the Dead Letter Queue design pattern in your Apache Kafka applications? What are the use cases and limitations? How do you implement error handling in your Kafka applications? When do you prefer a message queue instead, and why? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.
