ETL Archives - Kai Waehner

The Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming)

Kai Waehner — Tue, 01 Apr 2025 07:20:23 +0000

Batch processing has long been the default approach for moving and transforming data in enterprise systems. It works on fixed schedules, processes data in large chunks, and often relies on complex chains of jobs that run overnight. While this was acceptable in the past, today’s digital businesses operate in real time—and can’t afford to wait hours for fresh insights. Delays, errors, and inconsistencies caused by batch workflows lead to poor decisions, missed opportunities, and growing operational costs. In this post, we’ll look at common issues with batch processing and show why data streaming is the modern alternative for fast, reliable, and scalable data infrastructure.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming architectures and use cases to understand the benefits over batch processing.

The Issues of Batch Processing

While batch processing has powered data pipelines for decades, it introduces a range of problems that make it increasingly unfit for today’s real-time, scalable, and reliable data needs.

Adi Polak @ Current 2024 (Austin, USA)

Adi Polak’s keynote about the issues of batch processing at Current in Austin, USA, inspired me to explore each point with a concrete example and how data streaming with technologies such as Apache Kafka and Flink helps.

Real-time Data Streaming Beats Slow Data and Batch Processing

Across industries, companies are modernizing their data infrastructure to react faster, reduce complexity, and deliver better outcomes. Whether it’s fraud detection in banking, personalized recommendations in retail, or vehicle telemetry in mobility services—real-time data has become essential.

Let’s look at why batch processing falls short in today’s world, and how real-time data streaming changes the game. Each problem outlined below is grounded in real-world challenges seen across industries—from finance and manufacturing to retail and energy.

Corrupted Data and Null Values

Example: A bank’s end-of-day batch job fails because one transaction record has a corrupt timestamp.

In batch systems, a single bad record can poison the entire job. Often, that issue is only discovered hours later when reports are wrong or missing. In real-time streaming systems, bad data can be rejected or rerouted instantly without affecting valid records, leveraging enforcing contracts on the fly.

Thousands of Batch Jobs and Complexity

Example: A large logistics company runs 2,000+ daily batch jobs just to sync inventory and delivery status across regions.

Over time, batch pipelines become deeply entangled and hard to manage. Real-time pipelines are typically simpler and more modular, allowing teams to scale, test, and deploy independently.

Missing Data and Manual Backfilling

Example: A retailer’s point of sale (POS) system goes offline for several hours—sales data is missing from the batch and needs to be manually backfilled.

Batch systems struggle with late-arriving data. Real-time pipelines with built-in buffering and replay capabilities handle delays gracefully, without human intervention.

Data Inconsistencies and Data Copies

Example: A manufacturer reports conflicting production numbers from different analytics systems fed by separate batch jobs.

In batch architectures, multiple data copies lead to discrepancies. A data streaming platform provides a central source of truth via shared topics and schemas to ensure data consistency across real-time, batch and request-response applications.

Exactly-Once Not Guaranteed

Example: A telecom provider reruns a failed billing batch job and accidentally double-charges thousands of customers.

Without exactly-once guarantees, batch retries risk duplication. Real-time data streaming platforms support exactly-once semantics to ensure each record is processed once and only once.

Invalid and Incompatible Schemas

Example: An insurance company adds a new field to customer records, breaking downstream batch jobs that weren’t updated.

Batch systems often have poor schema enforcement. Real-time streaming with a schema registry and data contracts validates data at write time, catching errors early.

Compliance Challenges

Example: A user requests data deletion under GDPR. The data exists in dozens of batch outputs stored across systems.

Data subject requests are nearly impossible to fulfill accurately when data is copied across batch systems. In an event-driven architecture with data streaming, data is processed once, tracked with lineage, and deleted centrally.

Duplicated Data and Small Files

Example: A healthcare provider reruns a batch ETL job after a crash, resulting in duplicate patient records and thousands of tiny files in their data lake.

Data streaming prevents over-processing and file bloats by handling data continuously and appending to optimized storage formats.

High Latency and Outdated Information

Example: A rideshare platform calculates driver incentives daily, based on data that’s already 24 hours old.

By the time decisions are made, they’re irrelevant. Data streaming enables near-instant insights, powering real-time analytics, alerts, and user experiences.

Brittle Pipelines and Manual Fixes

Example: A retailer breaks their holiday sales reporting pipeline due to one minor schema change upstream.

Batch pipelines are fragile and tightly coupled. Real-time systems, with schema evolution support and observability, are more resilient and easier to debug.

Logically and Semantically Invalid Data

Example: A supermarket receives transactions with negative quantities—unnoticed until batch reconciliation fails.

Real-time systems allow inline validation and enrichment, preventing bad data from entering downstream systems.

Exhausted Deduplication and Inaccurate Results

Example: A news app batch-processes user clicks but fails to deduplicate properly, inflating ad metrics.

Deduplication across batch windows is error prone. Data streaming supports sophisticated, stateful deduplication logic in stream processing engines like Kafka Streams or Apache Flink.

Schema Evolution Compatibility Issues

Example: A SaaS company adds optional metadata to an event—but their batch pipeline breaks because downstream systems weren’t ready.

In data streaming, you evolve schemas safely with backward and forward compatibility—ensuring changes don’t break consumers.

Similar Yet Different Datasets

Example: Two teams at a FinTech startup build separate batch jobs for “transactions”, producing similar but subtly different datasets.

Data streaming architectures encourage shared schemas and centralized topics, reducing redundant logic and fragmentation.

Inaccurate Data

Example: A manufacturer bases production forecasts on batch-aggregated sensor data—too late to respond to real-time issues.

Batch introduces delay, distortion, and disconnect. Data streaming delivers accurate, granular, and current data for timely decision-making.

Data Streaming Is the New Standard to Avoid Batch Processing

The limitations of batch processing are no longer acceptable in a digital-first world. From inconsistent data and operational fragility to compliance risk and customer dissatisfaction—batch can’t keep up.

Data streaming isn’t just faster—it’s cleaner, smarter, and more sustainable.

Apache Kafka and Apache Flink make it possible to build a modern, real-time architecture that scales with your business, reduces complexity, and delivers immediate value.

Ready to Modernize?

If you’re exploring the shift from batch to real-time, check out my free book:

The Ultimate Guide to Data Streaming

It’s packed with use cases, architecture patterns, and success stories across industries—designed to help you become a data streaming champion.

Let’s leave batch in the past—and move forward with streaming.

And connect with me on LinkedIn to discuss data streaming! Or join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post The Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming) appeared first on Kai Waehner.

Streaming ETL with Apache Kafka in the Healthcare Industry

Kai Waehner — Fri, 01 Apr 2022 05:47:00 +0000

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part three: Streaming ETL. Examples include Babylon Health and Bayer.

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains. Use cases often overlap. I tried to categorize a few real-world deployments into different technical scenarios and added a few real-world examples:

Overview – Data Streaming Use Cases and Architectures for Healthcare (including Slide Deck)
Legacy Modernization and Hybrid Cloud (Optum / UnitedHealth Group, Centene, Bayer)
THIS POST: Streaming ETL (Bayer, Babylon Health)
Real-time Analytics (Cerner, Celmatix, CDC/Centers for Disease Control and Prevention)
Machine Learning and Data Science (Recursion, Humana)
Open API and Omnichannel (Care.com, Invitae)

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Streaming ETL with Apache Kafka

Streaming ETL is similar to concepts you might know from traditional ETL tools. I have already explored how data streaming with Kafka differs from data integration tools and iPaaS cloud services. The critical difference is that you leverage a single platform for data integration and processing at scale in real-time. There is no need to combine several platforms to achieve this. The result is a Kappa architecture that enables real-time but also batch workloads with a single integration architecture.

Streaming ETL with Kafka combines different components and features:

Kafka Connect as Kafka-native integration framework
Kafka Connect source and sink connectors to consume and produce data from/to any other database, application, or API
Single Message Transform (SMT) – an optional Kafka Connect feature – to process (filter, change, remove, etc.) incoming or outgoing messages within the connector deployment
Kafka Streams or ksqlDB for continuous data processing in real-time at scale for stateless or stateful ETL jobs
Data governance via schema management, enforcement and versioning using the Schema Registry
Security and access control using features like role-based access control, audit logs, and end-to-end encryption

In the cloud, you can leverage a serverless Kafka offering for the whole Streaming ETL pipeline. Confluent Cloud fully manages Kafka’s end-to-end infrastructure, including connectors, ksqlDB workloads, data governance, and security.

One last general note: Don’t Design for Data at Rest to Reverse it! Learn more here: “When to Use Reverse ETL and when it is an Anti-Pattern“. Instead, use real-time Streaming ETL for Data in Motion and the Kappa architecture from scratch.

Let’s look at a few real-world deployments in the healthcare sector.

Babylon Health – PII and GDRP compliant Security

Babylon Health is a digital-first health service provider and value-based care company that combines an artificial intelligence-powered platform with virtual clinical operations for patients. Patients are connected with health care professionals through its web and mobile application.

Babylon’s mission is to put an accessible and affordable health service in the hands of every person on earth. For that mission, Babylon built an agile microservice architecture with the Kafka ecosystem:

Here are the “wonders of working” in Healthcare for Babylon (= reasons to choose Kafka):

Real-time data processing
Replayability of historical information
Order matters and is ensured with guaranteed ordering
GDPR and data ownership for PII compliant security
Data governance via the schema registry to provide true decoupling and access via many programming languages like Java, Python, and Ruby

Bayer – Data Integration and Processing in R&D

Bayer AG is a German multinational pharmaceutical and life sciences company and one of the largest pharmaceutical companies in the world. They leverage Kafka in various use cases and business domains.

The following scenario is from the research and development department of the pharma business unit. Their focus areas are cardiovascular diseases, oncology, and women’s health. The division employs over 7,500 R&D people and expenses over 2.75 billion euros for R&D.

The use case Bayer presented at a recent Kafka Summit is about analyzing clinical trials, patents, reports, news, and literature leveraging the Kafka ecosystem. The R&D team processes 250 Million documents from 30+ individual data sources. The data includes 7 TB of raw text-rich data with daily updates, additions, and deletions. Algorithms and data evolve. Bayer needs to completely reprocess the data regularly. Various document streams with different formats and schemas flow through several text processing and enrichment steps.

Scalable, reliable Kafka pipelines with Kafka Streams (Java) and Faust (Python) replaced custom, error-prone, non-scalable scripts. Schemas are used as the data interface to ensure data governance. Avro is the first-class citizen data format to enable compression and better throughput.

The true decoupling of Kafka in conjunction with the Schema Registry guarantees interoperability among different components and technologies (java, python, commercial tools, open-source, scientific, proprietary).

Streaming ETL with Kafka for Real-Time Data Integration at any Scale

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka Ecosystem for Streaming ETL. Real-world deployments from Babylon Health and Bayer showed how enterprises successfully deploy Kafka for different enterprise architecture use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Streaming ETL with Apache Kafka in the Healthcare Industry appeared first on Kai Waehner.

When to use Apache Camel vs. Apache Kafka?

Kai Waehner — Fri, 28 Jan 2022 06:31:02 +0000

Should I use Apache Camel or Apache Kafka for my next integration project? The question is very valid and comes up regularly. This blog post explores both open-source frameworks and explains the difference between application integration and event streaming. The comparison discusses when to use Kafka or Camel, when to combine them, when not to use them at all. A decision tree shows how you can quickly qualify out one for the other.

The history of application integration and event streaming

My personal history and experience in application integration and event streaming are the following. It shows my background and how I see the integration and data streaming markets.

A discussion that started over a decade ago…

With my background of work in the last decade at Talend, TIBCO, and Confluent, the comparison between Camel and Kafka is very exciting as I have spent a lot of time with both open-source frameworks:

Apache Camel powered Talend ESB. Talend had a visual coding tool to design Camel routes with code generation. Unfortunately, the tool’s primary focus was Talend Data Integration (ETL and batch). The Camel-powered ESB code was integrated, but it was neither perfect nor complete.

TIBCO BusinessWorks competed with Talend ESB while TIBCO StreamBase competed with other stream processing solutions. The Kafka ecosystem came up more and more in conversations with customers.

I posted about “When to use Apache Camel” in 2011 already. In 2012, I did my first talk at an international software conference in the US. The name of the conference? CamelOne! A forum only about Apache Camel. What an exciting time. Claus Ibsen, THE Camel guy, wrote an excellent summary of CamelOne 2012 in Boston.

In my conference summary, I talked about my two talks. One of them covered a comparison between Apache Camel, Spring Integration, and Mulesoft ESB. The presentation has over 35000 views, and the number still goes up today.

… from application integration to event streaming

Over time, the buzzword “big data” came up more and more. I spent some time at Talend and TIBCO to learn new programming concepts such as Map-Reduce and Shuffling, mainly powered by Apache Hadoop and Apache Spark. The big data ecosystem snowballed with tens of frameworks such as Hive, HBase, Pig, and many more.

However, the first people realized that real-time data beats slow data in almost all use cases. The Lambda architecture was invented to separate real-time workloads from batch workloads. Event Streaming was born. Apache Kafka became the de facto standard for data streaming. Like CamelOne a decade ago, Kafka Summit is the one-stop-show for Kafka use cases, architectures, and success stories. Contrary to the small CamelOne, Kafka Summit is a global event with events across the globe, plus online events.

Data in motion with the Kappa architecture replacing Lambda

In 2014, a guy called Jay Kreps (few people knew him) was already questioning the Lambda architecture. Instead, he proposed to provide a single real-time layer to provide data for real-time and batch consumers. The Kappa architecture was born. Today, the Kappa architecture is mainstream, replacing Lambda. Various vendors adopt Kafka in the meantime.

Confluent became the clear leader in the event streaming software category. Confluent Platform is powered by Apache Kafka. The focus is on event streaming. That’s different from most other vendors like Cloudera; they focus on 10-20 frameworks or products and try to combine and integrate them somehow. Today, Confluent Cloud is a complete game-changer providing Apache Kafka and its ecosystem for application integration and stream processing as a serverless cloud offering.

This is where we are today in 2022. Application integration (= Camel) and event streaming (= Kafka) play a critical role in every modern enterprise architecture. Open-source is widely adopted and usually preferred compared to proprietary solutions for various reasons, including avoiding vendor lock-in. That’s true for self-managed and serverless cloud offerings.

Hence, the question arises: Should I use Apache Camel for application integration or Apache Kafka for event streaming? Or both? Or does one solve the other, too? These questions will be answered in the following sections, concluding with a decision tree to help you make the right choice for your project.

Let’s look at the similarities between Camel and Kafka, when to use which framework, when and how to combine them, and when not to use them at all.

Features in Apache Camel AND Apache Kafka

Camel and Kafka have many positive and negative characteristics in common. Hence, it is no surprise that people compare the two frameworks:

Open source under Apache 2.0 license
Vibrant community and adoption in the industry
Mature framework with deployments in enterprises across the globe
Fixing point-to-point spaghetti architectures with a central integration backbone
Open architecture and extensibility with custom functions and connectors
Small and big deployments possible, plus single-node deployments for non-mission-critical use cases
Re-engineered and optimized for cloud-native deployments (container, Kubernetes, cloud)
Connectivity to any technology, API, communication paradigm, and SaaS
Transformation of any data types and formats
Processes transactional and analytical workloads
Domain-specific language (DSL) for message at a time processing, with similar logic such as aggregation, filtering, conditional processing
Relative complex frameworks because of their robust feature set, hence not suitable for solving a minor problem
Not a replacement of a database, data warehouse, or data lake

Beyond the similarities, Kafka and Camel have very different sweet spots built to solve distinct problems. Hence, comparing these two tools is a bit comparison of apples and oranges. Some minor projects might use one or the other to solve the problem, but critical enterprise projects show the differences more quickly.

When to use Apache Camel?

The mission of Camel

Apache Camel is an integration framework. It solves a particular problem: Data integration between different applications, APIs, protocols, and communication paradigms. This concept is often called application integration or enterprise integration. Camel implements the famous Enterprise Integration Patterns (EIP). EIPs are based on messaging principles.

Camel’s strengths

Event-based backbone based on well-known and adopted EIP concepts
Connectivity to almost any API
Integration, processing, and routing of information with an intuitive domain-specific language (DSL) with a focus on integration; providing the ability of composability in a programming context for finer grain control in code for doing conditional logic or transformations/reformatting
Powerful routing capabilities with many built-in EIPs
Many deployment options (standalone, web container, application server, Spring, OSGi, Kubernetes via the Camel K sub-project) – okay, I guess some options are not relevant in this decade anymore
Lightweight alternative to proprietary ETL and ESB tools

Camel’s weaknesses

Only a “routing machine”, i.e., not built for long-term storage (additional cache or storage needed), for that reason, Camel is not the right choice for a central nervous system like Kafka
No stream processing (like you know it from Kafka Streams or Apache Flink)
Limited scalability, not built for massive volumes of data
No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
No serverless cloud offering, with that also not competing with other iPaaS offerings
Red Hat is the only vendor supporting it
Built to be deployed in a single data center or cloud region, not across hybrid or multi-cloud scenarios

The evolution of Apache Camel

Camel is widely adopted and has a strong community. Unfortunately, from a vendor and support perspective, the offerings declined in the last few years. One of the most significant pain points: I still don’t see a serverless cloud offering anywhere today:

Camel TL;DR

Camel is an application integration framework to connect different applications and interfaces. Camel is NOT built for processing data in motion continuously, i.e., stream processing. Hence, it should be compared to ETL and ESB tools, not data streaming technologies like Kafka, Kinesis, or Flink. If you look for a serverless cloud offering, you are out of luck. If you look for vendor support, Red Hat is the only option.

When to use Apache Kafka?

The mission of Kafka

Real-time data beats slow data at any scale. The event streaming platform enables processing data in motion. Kafka is the de facto standard for event streaming, including messaging, data integration, stream processing, and storage. Kafka provides all capabilities in one infrastructure at scale. It is reliable and allows to process analytics and transactional workloads.

Kafka’s strengths

Event-based streaming platform
A unique combination of pub/sub messaging, data processing, data integration, and storage in a single framework
Built for massive volumes of data and extreme scale from the beginning, with that a single framework can be used for transactional (low volume) and analytics (high volume) use cases
True decoupling between producers and consumers because of its storage component makes it the de facto standard for microservice architectures
Guaranteed ordering of events in the distributed commit log
Distributed data processing with fault-tolerance and recoverability built-in
Replayability of events
The de facto standard for event streaming
Built with hybrid and multi-cloud data replication in mind (with included tools like MirrorMaker and separate, more advanced, and more straightforward tools like Confluent Cluster Linking)
Support from many vendors, including Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and many more
Paradigm shift: Built to process data in motion end-to-end from source to one or more sinks

Kafka’s weaknesses

Paradigm shift: Enterprises need to learn and understand the added value of event streaming, a new software category that enables new use cases but also requires different design patterns and operations approaches
No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
Limited out-of-the-box routing capabilities (Kafka Connect SMT or Kafka Streams / ksqlDB app do the job very well, but not as simple as Camel)
Complex operations (if you run it by yourself instead of using 3rd party tools or even better a serverless cloud offering)

The evolution of Apache Kafka

Kafka was built at LinkedIn to process high volumes of data, as no other open-source framework could do this. Kafka found quick adoption after LinkedIn open-sourced it. Several vendors adopted Kafka and added it to their product portfolio. Some vendors just added Kafka for the sake of having it. Others innovated and used additional tools to make Kafka cloud-native for the next generation of event streaming. Kafka as a serverless cloud offering is a critical piece of many modern enterprise architectures today:

Kafka TL;DR

Kafka is an event streaming platform to process data in motion continuously. If you “just” need an integration framework to route data from a source to one or more sinks (= ETL / ESB), then Camel can be used, too. However, Kafka kills two birds with one stone (= integrating data AND processing it in motion where needed).

Plenty of Kafka offerings are available on the market. Check out the Apache Kafka landscape and comparison to understand the differences between offerings from Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and others.

Decision tree – Camel or Kafka?

The above sections explored when to use Camel and Kafka. So far, so good. Nevertheless, both frameworks overlap with their capabilities. Let’s get some help to choose the right one in that case.

Qualify out – the easiest way to start an evaluation!

The easiest way to decide on a specific option is to qualify out other frameworks that cannot fulfill the requirements.

Therefore, do you need

Big data processing?
A storage component for true decoupling and replayability of events?
Stateless or stateful stream processing?
A serverless cloud offering?

The above section discussed these differentiators of Kafka. In all these cases, you can qualify out Camel. It does not fulfill these requirements. These requirements are not necessarily a complete list. And you might also find a few aspects to qualify out Kafka from the beginning. Hence, you could also start from the Camel perspective and ask yourself: When should I not use Kafka. But I think it is easier the other way round.

Qualifying out solutions because of their limitations makes the decision tree and evaluation process much easier from the beginning.

Decision Tree for Camel and Kafka

Here is my decision tree to find out if Camel or Kafka is the right choice and what vendors you could evaluate:

When to use Camel and Kafka together?

It is possible to use Camel and Kafka together in a single integration architecture. Should you do that? Two options exist. One makes more sense than the other:

Kafka for event streaming and Camel for ETL

Camel and Kafka integrate well with each other. The native Kafka component of Camel is the best native integration point as a bridge between both environments:

The above architecture shows how Camel and Kafka live next to each other. Camel is used in a business domain for application integration. Kafka is the central nervous system between the Camel integration application and many other applications. I also added Kong as API Gateway to clarify that Camel or Kafka is not a silver bullet to solve every problem.

Once again, the vast advantage of Kafka as central integration layer is its unique combination of characteristics within a single infrastructure, including:

Real-time messaging at any scale
Storage for true decoupling between different applications and communication paradigms
Built-in backpressure handling and replayability of events
Data integration
Stream processing

Real-time data replication across hybrid and multi-cloud is not shown in the above picture but is also part of the enterprise architecture out-of-the-box leveraging take Kafka protocol.

With true decoupling within modern microservice architecture, each business team can decide whether they need application integration (using Camel) or event streaming (using Kafka). Often, both could be used. Additional questions around single vs. multi frameworks and APIs, vendor support, scalability needs, and other characteristics need to be evaluated to make the right choice for your business problem.

Camel connectors embedded into Kafka Connect

There is another way to combine Kafka and Camel: The “Camel Kafka Connector” sub-project of Apache Camel. Don’t get confused. This feature is not the Kafka component (= connector) of Camel! Instead, it is a relatively new initiative to deploy camel components into the Kafka Connect infrastructure.

The obvious benefit: This way, you get hundreds of new connectors “for free” within the Kafka ecosystem. This capability sounds excellent. And it is!

However, consider the total cost of ownership and the overall efforts using this approach. Application integration is one of the most challenging problems in computer science – especially if you talk about transactional data sets that require zero data loss, exactly-once semantics, and no downtime. The more components you combine in the end-to-end data flow, the harder it gets to keep your performance and reliability SLAs.

Hence, using Camel components within Kafka Connect has a considerable disadvantage: Combining two frameworks with complexities and different design concepts. Just a few examples:

Kafka world: Partitions, Offsets, Leader and Follower, Key/Value/Header, connectors (based on Kafka Connect), Bootstrap Server, ConsumerRecord, Retention Time, etc.
Camel world: Routes, RouteBuilder CamelContext, Exchange, Processor, components (Camel connectors), Endpoints, Type Converters, Registry, etc.

Please think twice before mixing two integration tools that are powerful but complex on their own. Getting this running is just one piece of the puzzle (the simple part). Don’t forget end-to-end testing, resiliency, SLAs, support across technologies and APIs. Even buying support for Camel and Kafka from Red Hat (i.e., a single vendor) does not improve this approach.

It is likely better to take the business logic and API calls out of the Camel component and copy it into a Kafka Connect connector template to run the integration natively with only Kafka code. This workaround allows a clean architecture, end-to-end integration with a single framework, a single vendor behind it, and much easier testing / debugging / monitoring.

TL;DR: I recommend only using the “Camel Kafka Connector” sub-project if the following options do not work:

Use only Apache Camel for application integration
Leverage Apache Kafka for event streaming and application integration
Choose separate deployments of Camel and Kafka and use the Camel-Kafka-Bridge

When NOT to use Camel or Kafka at all?

Once again, the easiest way for your evaluation to start is qualifying out tools that do not work to solve the problem.

Both Camel and Kafka are NOT built for the following scenarios:

A proxy for millions of clients (like mobile apps) – but native proxies (like a REST or MQTT Proxy for Kafka) exist for some use cases.
An API Management platform – but these tools are usually complementary and used to create life cycle management or monetize APIs deployed with Camel or Kafka.
A database for complex queries and batch analytics workloads
an IoT platform with features such as device management – but direct native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the approach for (some) use cases.
A technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different software than Camel or Kafka!

I wrote a very detailed post about this topic from a Kafka perspective. It maps almost 1:1 to the Camel world, too (and any related technology such as Flink, Spark, Pulsar, etc.): “When NOT to use Apache Kafka?”

Apache Camel vs. Apache Kafka – Who is the winner?

Simple answer: Both!

When you compare apples and oranges, you might become happy when you are hungry as both are good to eat. The same is true for Camel and Kafka. Both can do application integration. But they serve very different needs.

Many integration scenarios can use Camel or Kafka.

Camel is the right tool if you need to integrate data within an application context or business unit (with no need for stream processing, true decoupling, replayability, large scale, replication across data centers or cloud regions).

Kafka is the central event-based nervous system across business units, regions, and hybrid clouds. Kafka is all about event streaming. Application integration is just a piece of this puzzle. On the other side, I have seen plenty of integration projects powered by Apache Kafka. It is often replacing other middleware. That’s true for ETL/ESB legacy modernization and in discussions about using a cloud-native iPaaS.

Do you use Camel or Kafka today? What use cases? How do you decide which one to choose? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to use Apache Camel vs. Apache Kafka? appeared first on Kai Waehner.

Kafka and XML Messages – Transformation, Connector, Middleware

Kai Waehner — Fri, 25 Sep 2020 07:20:15 +0000

XML messages and XML Schema are not very common in the Apache Kafka and Event Streaming world! Why? Many people call XML legacy. It is complex, verbose, and often associated with the ugly WS-* Hell (SOAP, WSDL, etc). On the other side, every company older than five years uses XML. It is well understood, provides a good structure, and is human- and machine-readable.

This post does not want to start another flame war between XML and other technologies such as JSON (which also provides JSON Schema now), Avro, or Protobuf. Instead, I will walk you through the three main approaches to integrate between Kafka and XML messages as there is still a vast demand for implementing this integration today (often for integrating legacy applications and middleware).

XML and XML Schema

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium’s XML 1.0 Specification of 1998 and several other related specifications – all of them free open standards – define XML.

The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services. Several schema systems exist to aid in defining XML-based languages, while programmers have developed many application programming interfaces (APIs) to assist the processing of XML data.

SOAP / WSDL Web Services – The WS-* Hell

Web Services use XML messages that follow the SOAP standard and have been popular with traditional enterprises for many years. In such systems, there is often a machine-readable description of the operations offered by the service written in the Web Services Description Language (WSDL). Web Services are one of the predominant use cases for XML integration. Some people call this the “WS-* Hell”:

Kafka for any Data Format (JSON, XML, Avro, Protobuf, …)

Kafka can store and process anything, including XML. The Kafka brokers are dumb. They don’t care about data formats. The implementation of Kafka under the hood stores and processes only byte arrays. This approach follows the design principle of dumb pipes and smart endpoints (coined by Martin Fowler for microservice architectures). Dumb brokers are one of the architectural reasons why Kafka scales and performs so well.

As Kafka supports any data format, XML is no problem at all. It accepts any serializable input format. XML is just text so that plain string serializers can be used. However, if you want additional validation before pushing messages into Kafka (like checking the content is actually XML or doing Schema Validation using XML Schema), then you need to write your own XML Serializer/Deserializer implementation.

XML mappings can be very complex, including referencing other documents and using plenty of ugly open standards. XML includes generic standards such as WS-Security or industry-specific standards such as XBRL for regulatory processing or HL7 for healthcare. It is a mess. Period. Even mature tools struggle as soon as any parts of such a standard are adjusted to their own needs (even though the standards support customization).

Kafka works with any programming language, and Confluent also provides a REST Proxy for HTTP(S) communication with Kafka. But these clients require the developer to implement all the complex XML mapping and processing. Hence, let’s now talk about the most common approaches to implement the integration between any XML-based application and Apache Kafka: 3rd party middleware (ETL / ESB tools) and Kafka-native Kafka Connect.

Kafka and XML via Middleware (ETL, ESB)

Apache Kafka and traditional middleware (ETL, ESB) are frenemies (friends and enemies). Check out this blog and video recording/slide deck for a more in-depth discussion and comparison. If these legacy integration tools such as TIBCO BusinessWorks or Software AG webMethods do one thing well, then it is graphical mappings of complex XML structures, including good and mature (but not 100%) support of the ugly WS-* web service standards.

Pros of 3rd Party Middleware for XML-Kafka Integration

Visual coding for a more straightforward mapping experience (especially crucial for very complex structures) – for all coders: Trust me, this is really easier and more time-efficient than writing, testing, and debugging source code!
Mature (10-20 years old technologies don’t have many bugs anymore – if they are still alive)
Support of complex XML Schema structures (but yet often issues with import, UI, and export)
(Often) already in place (implemented, tested, and deployed)
Kafka integration exists for any middleware which is still alive (i.e., maintained and supported by the vendor)

Cons of 3rd Party Middleware for XML-Kafka Integration

Products are legacy – as old as the source systems
Monolithic, inflexible architecture
Separate infrastructure to operate, test, maintain and pay
End-to-end integration is more challenging (from a technical and support perspective) as two systems in the middle instead of just one
Licensing
Point-to-point and tight coupling, and not event-based streaming with real decoupling
(Often) proprietary solution

Traditional middleware (such as TIBCO, IBM, Software AG, or Mulesoft) complements Kafka deployments. If you have the middleware already running and licensed (and do not plan to migrate away from it), then this is a viable approach to integrate between XML messages from legacy systems and Kafka.

Kafka and XML via Kafka Connect

The open Kafka ecosystem provides Kafka-native support for XML integration leveraging Kafka Connect. Kafka Connect is a Kafka-native tool for scalably and reliably streaming data integration between Apache Kafka and other data systems. It makes it simple to quickly define connectors that move large data sets into and out of Kafka. Think about it as ESB-on-Kafka.

Use cases include

Messaging integration (MQ)
Mainframe offloading
File outputs from batch processes
Web services (SOAP / WSDL / WS-*)
Legacy applications

Pros

Kafka-native (leveraging Kafka under the hood for scalability, throughput, high availability, exactly-once semantics, low latency, etc.)
Decoupled design (Domain-driven Microservice approach instead of tight coupling)
Open ecosystem and flexible integration with any data sources and sinks – check out Confluent Hub for hundreds of open source and commercial connectors
Cloud-native to be deployed in any edge / on-premise or cloud infrastructure such as Kubernetes

Cons

Limited support for complex XML Schemas and standards – not all ugly documents will work well
No visual coding – unfortunately, no Kafka-native visual coding tools exist in 2020. Let’s go, Confluent!

Let’s take a look at two Kafka Connect approaches in more detail: A dedicated XML Connector and an SMT (Single Message Transformation) embedded into any Kafka Connect source or sink connector.

Kafka Connect Connector for XML Files

An XML connector directly accesses the XML file to parse and transform the content:

Connect FilePulse is an open-source Kafka Connect connector built by streamthoughts to parse, transform and stream any XML file. Other file formats are also supported. But as many other tools support the modern data formats such as JSON, CSV, Avro, or Protobuf, I really think the highlight of this connector is the XML support.

Features:

Support for recursive scanning of local directories.
Reading and writing files into Kafka line by line.
Support multiple input file formats (e.g: CSV, JSON, AVRO, XML).
Parsing and transforming data using built-in or custom processing filters
Error handler definition
Monitoring files while they are written into Kafka
Support pluggable strategies to cleanup up completed files

Here is an excellent tutorial for using this XML connector for Kafka Connect: Streaming data into Kafka – Loading an XML file.

The Connect FilePulse Kafka Connector is the right choice for direct integration between XML files and Kafka.

SMT for Embedding XML Transformations into ANY Kafka Connect Connector

An SMT (Single Message Transformation) is part of the Kafka Connect framework. SMTs are applied to messages as they flow through Kafka Connect. They transform inbound messages after a source connector has produced them, but before they are written to Kafka. SMTs transform outbound messages before they are sent to a sink connector.

An SMT can be embedded into any Kafka Connect source or sink connector. Hence, the XML SMT for Kafka Connect allows direct integration with any interface and mapping XML messages without the need for storing the file or using a specific XML connector.

SMTs even allow to add or change metadata, e.g., by adding a new header in addition to the key and value of the message.

Here is an example: Receive XML messages from JMS-based messaging platforms and convert the XML payload to JSON, AVRO, or Protobuf for further processing and integration into the rest of the (modern) enterprise architecture. For instance, Confluent provides a generic JMS connector but also dedicated connectors for various legacy MQ products such as IBM MQ (often running on the mainframe), TIBCO EMS, and ActiveMQ. The XML SMT allows on-the-fly transformation of the incoming XML messages. I have seen integration and later replacement of these MQ tools across the globe in any kind of industry.

Dead Letter Queue (DLQ) for Handling Bad XML Messages

Just transforming messages is often not sufficient. A Dead Letter Queue (DLQ), aka Dead Letter Channel, is an Enterprise Integration Pattern (EIP) to handle bad messages. This design pattern is complementary for XML integration. For instance, a DLQ can store badly processed XML that didn’t fit the XSD in the transform. Here is an example of how to implement the rerouting to a DLQ using the above SMT.

Confluent Schema Registry for Data Governance

The Confluent Schema Registry is a complimentary (optional) tool. It provides a smart implementation of data format and content validation (including enforcement, versioning, and other features). I see it used in ~70% of Kafka projects across the globe. As soon as you do more than just data ingestion into a data lake like HDFS or S3, the added value is enormous. Today, Confluent Schema Registry supports JSON Schema, Avro, and Protobuf.

The Schema Registry provides an open interface and is pluggable. For example, some users have asked for Schema Registry to support XML. Now, you can add XML support to Schema Registry directly, and use the Schema Registry to store both XML and Avro at the same time. For more on how to add your schema formats, please refer to the documentation. The workaround is to do the discussed XML-to-another-format transformation first and start your event streaming data governance from that point.

Summary

XML is predominant in most enterprises and mostly used for legacy applications, batch processing, and SOAP / WSDL web services. A digital transformation can only be successful if the old world is connected well to the new world (as you might have learned in my example about how to integrate between Kafka and Mainframes).

This post explored the three most common options for integration between Kafka and XML:

XML integration with a 3rd party middleware
Kafka Connect connector for integration with XML files
Kafka Connect SMT to embed the transformation into ANY other Kafka Connect source or sink connector to transform XML messages on the flight

What are your experiences with XML integration for Kafka? Which implementation did you choose? What challenges did you face, and how did you or do you plan to solve this? What is your strategy? Let’s connect on LinkedIn and discuss it!

The post Kafka and XML Messages – Transformation, Connector, Middleware appeared first on Kai Waehner.

Kafka SAP Integration – APIs, Tools, Connector, ERP et al

Kai Waehner — Tue, 25 Aug 2020 13:58:42 +0000

A question I get every week from customers across the globe: How can I integrate my SAP system with Apache Kafka? This post explores various alternatives, including connectors, 3rd party tools, custom glue code, and trade-offs between the different options.

After exploring what SAP is, I will discuss several integration options between Apache Kafka and SAP systems:

Traditional middleware (ETL/ESB)
Web services (SOAP/REST)
3rd party turnkey solutions
Kafka-native connectivity with Kafka Connect
Custom glue code using SAP SDKs

Disclaimer before you read on:

I am not an SAP expert. It is tough to stay up-to-date with the vast and complex ecosystem of SAP products, (re-)brands, versions, services, SDKs, and APIs. I am sorry if some of the below information is not 100% accurate or outdated. Always double-check on the SAP website (if the links from Google still work – I had some issues with some pages “no longer available” while researching for this blog post). If you see any inaccurate or missing information, please let me know and I will update the blog post.

What is SAP?

SAP is a German multinational software corporation that makes enterprise software to manage business operations and customer relations. In 2019, SAP had revenue of €27.553 billion, a net income of €3.387 billion, and ~100.000 employees.

It is quite interesting: Nobody asks how to integrate with IBM or Oracle. Instead, people more specifically ask how to integrate with IBM MQ, IBM DB2, IBM Mainframe (still very ambiguous), or any other of the 100s of IBM products.

For SAP, people ask: How can I integrate with SAP? Let’s clarify what SAP is before exploring integration options.

The company is primarily known for its ERP software. But if you check out the official “What is SAP?” page, you find out that SAP offers solutions across a wide range of areas:

ERP and Finance
CRM and Customer Experience
Network and Spend Management
Digital Supply Chain
HR and People Engagement
Experience Management
Business Technology Platform
Digital Transformation
Small and Midsize Enterprises
Industry Solutions

SAP’s Software Portfolio

SAP’s stack includes homegrown products like SAP ERP and acquisitions with their own codebase, including Ariba for supplier network, hybris for e-commerce solutions, Concur for travel & expense management, and Qualtrics for experience management.

Even if you talk about SAP ERP, the situation is still not that easy. Most companies still run SAP ERP Central Component (ECC, formerly called SAP R/3), SAP’s sophisticated (and aged) ERP product. ECC runs on a third-party relational database from Oracle, IBM, or Microsoft, while HANA is SAP’s in-memory database. The new ERP product is SAP S4/Hana (no, this is not just the famous in-memory database). Oh, and there is SAP S4/Hana Cloud. And before you wonder: No, this is not the same feature set as the on-premise version!

Various interfaces exist depending on your product. An interface can be an (awful) proprietary technologies like BAPI or iDoc, (okayish) standards-based web service APIs using SOAP or REST / HTTP, a (non-scalable) JDBC database connectivity, or if you are lucky even a (scalable and real-time) Event / Messaging API. The article “The ERP is Dead. Long live the Distributed Planning System” from the SAP blog describes the situation very well.

And sorry, we are still not done yet. Even if you talk about ERP systems, this can mean anything from a zoo of products or components, depending on who you are talking to:

So, before you want to discuss the integration of your SAP product with Kafka, please please please find out the product, version, and deployment infrastructure of your SAP components.

Different Integration Options between Kafka and SAP

After this introduction, you hopefully understand that there is no silver bullet for SAP integration. The following will explore different integration options between Kafka and SAP and their trade-offs. The main focus is on SAP ERP (old ECC and new S4/Hana), but the overview is more generic, including integration capabilities with other components and products.

Also, keep in mind that you typically need or want to integrate with a function or service. Direct integration with the data object does not make much sense in most cases, as you would have to re-implement the mapping and denormalization between the data objects. Especially for source integration, i.e., building pipelines from SAP to Kafka. In the case of SAP ERP, you typically integrate with RFC/BAPI/iDoc or any other web service interface for this reason.

Traditional Middleware (ETL / ESB) for SAP Integration

Integration tools exist just for the sake of integrating different sources and sinks:

Extract-Transform-Load (ETL) for batch integration, like Informatica, Talend or SAP NetWeaver Process Integration
Enterprise Service Bus (ESB) for integration via web services and messaging, like TIBCO BusinessWorks or Software AG webMethods
Integration Platform as a Service (iPaaS) for cloud-native integration, similar to ETL/ESB tools, but provided as a fully managed service, such as Boomi, Mulesoft, or SAP Cloud Integration (and some cloud-washed products from legacy middleware vendors).

Most traditional middleware products were built to integrate with complex, proprietary systems from the last 20+ years, such as IBM Mainframe, EDIFACT, and guess what ERP systems like SAP ECC. In the meantime, all of them also have a Kafka connector. There are plenty of good reasons why many companies chose Kafka as a modern integration platform instead of a legacy of traditional middleware.

Most traditional ETL and ESB tools provide SAP connectivity. SAP Cloud Platform Integration (SAP CPI) is SAP’s own “modern” middleware solution. CPI includes a Kafka adapter to send and receive Kafka messages.

Pros:

In place: Typically already in place, no new project is required.
Maturity: Built over the years (because of the complexity), running in production for a long time already
Tooling: Visual coding for the integration (required because of the complexity), directly map iDoc / BAPI / Hana / SOAP schemas to other data structures
Integration: Not just connectors to the legacy systems but also Kafka for producing and consuming messages (due to market pressure)

Cons:

Legacy: Products are as old as the source and sink systems,
Scalability: Monolithic, inflexible architecture
Tight coupling: Integration has to be developed and runs on the middleware, no real decoupling and domain-driven design DDD like in Kafka
Licensing: High-cost per server, often already planned to be replaced (e.g., you can replace 100+ IBM MQ or TIBCO EMS servers with a single Kafka cluster)
Point-to-point: No streaming architecture, most integrations are based on web services (even if the core under the hood is based on a messaging system)

TL;DR:

Traditional integration tools are mature and have great tooling, but limited scalability/flexibility and high licensing cost. Often a quick win as it is already running, and you just need to add the Kafka connector.

Custom Glue Code for Kafka Integration using SAP SDKs

Writing your custom integration between SAP systems and Kafka is a viable option. This glue code typically leverages the same SDKs as 3rd party tools use under the hood:

Legacy: SAP NetWeaver RFC SDK – a C/C++ interface for connecting to SAP systems from release R/3 4.6C up to today’s SAP S/4HANA systems.
Legacy: SAP Java Connector (SAP JCo) – the famous JCO.jar library – is a Java SDK for integration to the SAP ECC / ERP (this is just a wrapper around the C/C++ SDK).
Legacy: SAP ACO is an integrated ABAP component that is designed to consume RFC Services on remote ABAP systems.
Legacy: SAP ABAP TCP Push Channel if you are forced to use custom ABAP code and need or want to use TCP instead of the Confluent REST Proxy for HTTP communication.
Legacy: JMS Adapter to integrate via the standard messaging protocol. Great option (if you get it running and working for your use case and functions). For instance, JMS integration can be done via SAP PI.
Modern: SAP Cloud SDK allows developing applications with Java or JavaScript that communicate with SAP solutions and services such as SAP S/4 Hana Cloud, SAP SuccessFactors, and others (the term ‘Cloud’ actually means ‘Cloud-native’ in this case, i.e., this SDK also works with SAP’s on-premise products).
Modern: SAP Cloud Platform Enterprise Messaging: S4/Hana provides an asynchronous messaging interface (running on Solace on CloudFoundry under the hood). Different messaging standards are supported, including AMQP 1.0 and JMS (depending on the specific product you look at). Some examples demonstrate how to connect via the Java Client using the JMS API.
Modern: SAP ODP (Operational Data Provisioning): Technical infrastructure for operational analytics, and data extraction + replication. Some kind of CDC (Change Data Capture) with out-of-the-box support for various SAP products, including SAP BW, SAP BW/4HANA, SAP Data Services, and SAP HANA Smart Data Integration. ODP is not just for SAP interfaces but also integrates with 3rd party technologies (via a custom connector, not out-of-the-box) such as HDFS or Kafka.

Pros:

Flexibility: Custom coding allows you to implement precisely what you need.

Cons:

Maintenance: No vendor support – develop, maintain, operate, support by yourself.
Point-to-point: No streaming architecture, most integrations are based on web services (even if the core under the hood is based on a messaging system).

TL;DR:

“Build vs. Buy” always has trade-offs. I have only seen custom glue code for SAP integration in the field if no solution from a vendor was available and affordable. SAP Cloud Platform Enterprise Messaging is a possible integration pattern for Kafka, but it also adds yet another messaging layer to the architecture.

SOAP / REST Web Services for SAP Integration

The last 15 years brought us web services for building a Service-oriented Architecture (SOA) to integrate applications. A web service typically uses SOAP or REST / HTTP as technology. I will not start yet another FUD war here. Both have their use cases and trade-offs.

Pros:

Standards-based: Different SDKs, products, and services talk the same language (at least in theory; true for HTTP, not so true for SOAP); most middleware tools have proper support for building HTTP services.
Simplicity (HTTP): Well-understood, supported by most programming languages and APIs, established for many use cases – middleware is just yet another one.

Cons:

Point-to-point: No streaming architecture, most integrations are based on web services (even if the core under the hood is based on a messaging system).
Tight coupling: Integration has to be developed and runs on the middleware, no real decoupling, and domain-driven design DDD like in Kafka.
Complexity (SOAP): SOAP/WSDL is just the tip of the iceberg! Check out the list of WS-* standards to understand why this is often called the “WS star hell”. The AXIS framework (Apache extensible Interaction System) is one example of SAP’s SOAP integration using an open framework. While the Apache project was last updated in 2006, SAP still recommends using this interface in 2020.
Missing features (REST / HTTP): Representational state transfer (REST) is a concept, but most people mean synchronous HTTP communication. Most middleware tools (and most other applications) only just a small fraction of the full standard. HTTP is an excellent standard, but all the tooling and features need to be built on top of it.
Only indirect support: Several SAP products do not provide open interfaces. While using SOAP or HTTP under the hood, you are forced to use the licensed tooling to create web services. For instance, SAP Business Connector (restricted license version of webMethods Integration Server), SAP NetWeaver Process Integration (PI), SAP Process Orchestration (PO), Cloud Platform Integration (CPI), or SAP Cloud Integration.

TL;DR:

SOAP and REST web services work well for point-to-point communication and have good tool support. Both have their trade-offs, make sure to choose the right one – if your SAP product provides both interfaces. Unfortunately, you will often not have a choice. Even worse: You cannot use any tool but are forced to use the right licensed SAP tool or wrapper interface. Large scale, high volume, and continuous processing of data are not ideal requirements for these (legacy) integration products.

For direct HTTP(S) communication with Kafka, Confluent Rest Proxy is an excellent option for producing, consuming, and administrating from any Kafka client (including custom SAP applications). For instance, SAP Cloud Platform Integration (CPI) can use this integration pattern to integrate between SAP and Kafka.

SAP-specific 3rd Party Tools for Kafka

SAP integration is a huge market globally. SAP provides several tools for data integration (some legacy, some modern – honestly, I don’t have a full overview of their complex product and API portfolio). Additionally, plenty of other software vendors have built specific integration software for SAP systems.

A few examples I have seen in the field recently:

Examples:

SAP itself has various integration tools, though many are already deprecated. One example of using an SAP solution is the SAP OpenHub Service. It allows you to distribute data from a BW system to non-SAP data marts, analytical applications, and other applications. SAP Data Hub (rebranded to SAP Data Intelligence – always search for both terms) can also connect to Kafka as demonstrated in this example using ODP under the hood. Limitations of SAP Data Hub include that you have to operate a licensed BI system, that dedicated “OpenHub” components have to be implemented in BW for each scenario, that a rather batch-oriented, request-based processing takes place and that this is a classic point-to-point scenario. Furthermore, the additional license fees have to be taken into account.
There are more SAP products in the confusing SAP universe that are able to integrate with Kafka. Honestly, the list of deprecated products in conjunction with new products and rebranding of old products is a total mess. Even the SAP websites struggle with this a lot!
ASAPIO: Their Cloud Integrator is designed for SAP ERP and SAP S/4HANA. It enables the replication of required data between system environments based on SAP NetWeaver technology.
Advantco: The Kafka Adapter is fully integrated with the SAP Adapter Framework, which means the Kafka connections can be configured, managed, and monitored using the standard SAP PI/PO & CPI tools. Includes support for Confluent Schema Registry (+ Avro / Protobuf).
workato: The company provides SAP OData Integration to Kafka and various pre-built recipe templates.
INIT Software has implemented a Kafka Connect ODP connector called i-OhJa. The Kafka-native nature provides benefits such as high performance, high scalability, and exactly-once semantics.
KaTe Kafka Adapter to connect SAP PO with Kafka in both directions.

These are just a few examples. Many more exist for on-premise, cloud, and hybrid integration with different SAP products and interfaces.

Some of these tools are natively integrated into SAP’s integration tools instead of being completely independent runtimes. This can be good or bad. An advantage of this approach is that you can leverage the SAP-native features for complex iDoc / BAPI mappings and the integrated 3rd party connector for Kafka communication.

Pros:

Turnkey solution: Built for SAP integration, often combined with other additional helpful features beyond just doing the connectivity, more lightweight than traditional generic middleware.
Focus: Many 3rd party solutions focus on a few specific use cases and/or products and technologies. It is much harder to integrate with “SAP in general” than focusing on a particular niche, e.g., Human Resources processes and related HTTP interfaces.
Maturity: Built over the years
Tooling: Visual coding for the integration (required because of the complexity), directly map iDoc / BAPI / Hana / SOAP schemas to other data structures
Integration: Not just connectors to the legacy systems but also modern technologies such as Kafka

Cons:

Scalability: Often monolithic, inflexible architecture (but focusing on SAP integration only, therefore often “okayish”)
Tight coupling: Integration has to be developed and runs on the tool, but separated from other middleware, thus decoupling and domain-driven design DDD in conjunction with Kafka
Licensing: Moderate cost per server (typically cheaper than the traditional generic middleware)
Point-to-point: No streaming architecture, most integrations are based on web services (even if the core under the hood is based on a messaging system)

TL;DR:

A turnkey solution is an excellent choice in many scenarios. I see this pattern of combining Kafka with a dedicated 3rd party solution for SAP integration very often. I like it as the architecture is still decoupled, but no vast efforts required for doing a (complex) SAP integration. And there is still hope that even SAP themselves releases a nice Kafka-native integration platform.

Kafka-native SAP Integration with Kafka Connect

Kafka Connect, an open-source component of Apache Kafka, is a framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems.

Kafka Connect connectors are available for SAP ERP databases:

Confluent Hana connector and SAP Hana connector for S4/Hana
Confluent JDBC connector for R/3 / ECC to integrate with Oracle / IBM DB2 / MS SQL Server.

Pros:

Kafka-native: Kafka under the hood, providing real-time processing for high volumes of data with high scalability and reliability.
Simplicity: Just one infrastructure for messaging and data integration, much easier to develop, test, operate, scale, and license than using different frameworks or products (e.g., Kafka for messaging plus an ESB for data integration).
Real decoupling: Kafka’s architecture uses smart endpoints and dumb pipes by design, one of the key design principles of microservices. Not just for the applications, but also for the integration components. Leverage all the benefits of a domain-driven architecture for your Kafka-native middleware.
Custom connectors: Kafka Connect provides an open template. If no connector is available, you (or your favorite system integrator or Kafka-vendor) can build an SAP-specific connector once, and you can roll it out everywhere.

Cons:

Only database connectors: No connectors available beyond the native JDBC database integration are available at the time of writing this.
Anti-pattern of direct database access: In most cases, you want or need to integrate with a function or service, not with the data objects. In most cases, you don’t even get direct access from the database admin anyway.
Efforts: Build your own SAP-native (i.e., non-JDBC) connector or ask (and pay) your favorite SI or Kafka vendor.

UPDATE January 2021: A Kafka-native integration is available with INIT’s ODP connector (as discussed in section “3rd party tools”). It eliminates the above cons and might be a great 3rd party option for some use cases.

TL;DR:

Kafka Connect is a great framework and used in most Kafka architectures for various good reasons. For SAP integration, the situation is different because no connectors are available (beyond direct database access). It took 3rd party vendors many years to implement RFC/BAPI/iDoc integration with their tools. Such an implementation will probably not happen again for Kafka because it is very complex, and these proprietary legacy interfaces are dying anyway. The situation is different for modern SAP interfaces: Some 3rd party providers leverage Kafka Connect for their product. For instance, INIT Software’s Kafka Connect ODP connector.

A Kafka Connect connector for SAP Cloud Platform Enterprise Messaging using its Java client would be a feasible and best option. I assume we will see such a connector on the market sooner or later.

Embedded Kafka in SAP Products

We have seen various integration options between SAP and Kafka. Unfortunately, all of them are based on the principle of “data at rest” in contrary to Kafka processing “data in motion”. The closest fit until here is the integration via the SAP Cloud Platform Enterprise Messaging because you can at least leverage an asynchronous messaging API.

The real added value comes when Kafka is leveraged not just for real-time messaging but for event streaming. Kafka provides a combination of messaging, data integration, data processing, and real decoupling using its distributed storage infrastructure.

Native Event Streaming with Kafka in SAP Products

Interestingly, some of SAPs acquisitions leverage Kafka under the hood for event streaming. Two public examples:

SAP Concur: Wanny Morellato, a director of engineering, lead the effort of refactoring Concur monolithic travel and expense backend into an event-driven distributed system of microservices using Kafka: Breaking Down a SQL Monolith with Change Tracking, Kafka and KStreams/KSQL
SAP Qualtrics: A particular challenge amidst their growth was implementing standardizations around different types of data. They were working out how to blend numerical data – about companies’ subscriptions, or sales, or content engagement – with experience data collected from surveys. To do that, they utilize technologies like Kafka and Spark and really fast data stores to create a real-time engine to transform data into actionable observational reports.

Obviously, people are also waiting for the Kafka-native SAP S4/Hana interface so that they can leverage events in real-time for processing data in motion and correlate real-time and historical data together. A native Kafka integration with SAP S4/Hana should be the next step for SAP! HERE Technologies provides a great example of how to provide a Kafka-native interface (and an alternative REST option) for their product.

Having said this, current SAP blogs (from mid of 2019) still talk about replacing the 20+ years old BAPI and RFC integration style with SOAP and OData (Open Data Protocol, an open protocol that allows the creation and consumption of queryable and interoperable REST) APIs in SAP S/4HANA Public Cloud.

My personal feeling and hope are that a native Kafka interface is just a matter of time as the market demand is everywhere across the globe (I talk to many customers in EMEA, US, and APAC), and even several non-S4/Hana SAP products use Kafka internally.

I have also seen a two-fold approach from some other vendors: Provide a Kafka-native interface to the outside world first (in SAP terms you could e.g. provide a Kafka-interface on top of BAPIs. At a later point, reengineer the internal architecture away from the non-scalable technology to Kafka under the hood (in SAP terms you could replace RFC / BAPI functions with a more scalable Kafka-native version – even using the same API interface and message structure).

Native Streaming Replication between Products, Departments, and Companies

Native Kafka integration does not just happen within a product or company. A widespread trend I see on the market in different industries is to integrate with partners via Kafka-native streaming replication instead of REST APIs:

Think about it: If you use Kafka in different application infrastructures, but the interface is just a web service or database, then all the benefits might go away because scalability and/or real-time data correlation capabilities go away.

More and more vendors of standard software use Kafka as the backbone of their internal architecture. If the interface between products (imaginatively say SAP’s ERP system, SAP’s MES system, and the SCM application of an OEM customer) is just a SOAP or REST API, then this does not scale and perform well for the requirements of use cases in the digital transformation and Industry 4.0.

Hence, more and more companies leverage Kafka not just internally but also between departments or even different companies. Streaming replication between companies is possible with tools like MirrorMaker 2.0 or Confluent Replicator. Or you use the much simpler Cluster Linking from Confluent, which enables integration between hybrid, multi-cloud, or 3rd party integration using the Kafka protocol under the hood.

SAP + Apache Kafka = The Future for ERP et al

There is huge demand across the globe to integrate SAP applications with Apache Kafka for real-time messaging, data integration, and data processing at scale. The demand is true for SAP ERP (ECC and S4/Hana) but also for most other products from the vast SAP portfolio.

Kafka is deployed in many modern and innovative use cases for supply chain management, manufacturing, customer experience, and so on. Edge, hybrid and multi-cloud Kafka deployments is the norm, not an exception.

Kafka integrates with SAP systems well. Different integration options are available via SAP SDKs and 3rd party products for proprietary interfaces, open standards, and modern messaging and event streaming concepts. Choose the right option for your need and get started with Kafka SAP integration…

If you want to build a modernize your existing ERP infrastructure (no matter if SAP or any other vendor), also check out the article “Building a Postmodern ERP with Apache Kafka“.

What are your experiences with SAP Kafka integration? How did it work? What challenges did you face and how did you or do you plan to solve this? What is your strategy? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka SAP Integration – APIs, Tools, Connector, ERP et al appeared first on Kai Waehner.