When to use Apache Camel vs. Apache Kafka?
https://www.kai-waehner.de/blog/2022/01/28/when-to-use-apache-camel-vs-apache-kafka-for-etl-application-integration-event-streaming/

Should I use Apache Camel or Apache Kafka for my next integration project? The question is very valid and comes up regularly. This blog post explores both open-source frameworks and explains the difference between application integration and event streaming. The comparison discusses when to use Kafka or Camel, when to combine them, and when not to use them at all. A decision tree shows how you can quickly qualify one out in favor of the other.

Apache Camel vs Apache Kafka Comparison

 

The history of application integration and event streaming

Here is my personal history and experience in application integration and event streaming. It shows my background and how I see the integration and data streaming markets.

A discussion that started over a decade ago…

With my work at Talend, TIBCO, and Confluent over the last decade, the comparison between Camel and Kafka is very exciting for me, as I have spent a lot of time with both open-source frameworks:

Apache Camel powered Talend ESB. Talend had a visual coding tool to design Camel routes with code generation. Unfortunately, the tool’s primary focus was Talend Data Integration (ETL and batch). The Camel-powered ESB code was integrated, but it was neither perfect nor complete.

TIBCO BusinessWorks competed with Talend ESB while TIBCO StreamBase competed with other stream processing solutions. The Kafka ecosystem came up more and more in conversations with customers.

CamelOne Kai Waehner Conference Speaker Apache Camel Open Source

I posted about “When to use Apache Camel” back in 2011. In 2012, I gave my first talk at an international software conference in the US. The name of the conference? CamelOne! A forum only about Apache Camel. What an exciting time. Claus Ibsen, THE Camel guy, wrote an excellent summary of CamelOne 2012 in Boston.

In my conference summary, I talked about my two talks. One of them covered a comparison between Apache Camel, Spring Integration, and Mulesoft ESB. The presentation has over 35,000 views, and the number still goes up today.

 

… from application integration to event streaming

Over time, the buzzword “big data” came up more and more. I spent some time at Talend and TIBCO to learn new programming concepts such as MapReduce and shuffling, mainly powered by Apache Hadoop and Apache Spark. The big data ecosystem snowballed with dozens of frameworks such as Hive, HBase, Pig, and many more.

However, people soon realized that real-time data beats slow data in almost all use cases. The Lambda architecture was invented to separate real-time workloads from batch workloads. Event Streaming was born. Apache Kafka became the de facto standard for data streaming. Like CamelOne a decade ago, Kafka Summit is the one-stop-show for Kafka use cases, architectures, and success stories. Contrary to the small CamelOne, Kafka Summit is a global series with in-person events across the globe, plus online events.

Data in motion with the Kappa architecture replacing Lambda

In 2014, a guy called Jay Kreps (few people knew him back then) was already questioning the Lambda architecture. Instead, he proposed a single real-time layer that provides data for both real-time and batch consumers. The Kappa architecture was born. Today, the Kappa architecture is mainstream, replacing Lambda. Various vendors have adopted Kafka in the meantime.

Kappa Architecture with one Pipeline for Real Time and Batch

Confluent became the clear leader in the event streaming software category. Confluent Platform is powered by Apache Kafka. The focus is on event streaming. That’s different from most other vendors like Cloudera; they focus on 10-20 frameworks or products and try to combine and integrate them somehow. Today, Confluent Cloud is a complete game-changer providing Apache Kafka and its ecosystem for application integration and stream processing as a serverless cloud offering.

This is where we are today in 2022. Application integration (= Camel) and event streaming (= Kafka) play a critical role in every modern enterprise architecture. Open-source is widely adopted and usually preferred compared to proprietary solutions for various reasons, including avoiding vendor lock-in. That’s true for self-managed and serverless cloud offerings.

Hence, the question arises: Should I use Apache Camel for application integration or Apache Kafka for event streaming? Or both? Or does one solve the other, too? These questions will be answered in the following sections, concluding with a decision tree to help you make the right choice for your project.

Let’s look at the similarities between Camel and Kafka, when to use which framework, when and how to combine them, and when not to use them at all.

Features in Apache Camel AND Apache Kafka

Camel and Kafka have many positive and negative characteristics in common. Hence, it is no surprise that people compare the two frameworks:

  • Open source under Apache 2.0 license
  • Vibrant community and adoption in the industry
  • Mature framework with deployments in enterprises across the globe
  • Fixing point-to-point spaghetti architectures with a central integration backbone
  • Open architecture and extensibility with custom functions and connectors
  • Small and big deployments possible, plus single-node deployments for non-mission-critical use cases
  • Re-engineered and optimized for cloud-native deployments (container, Kubernetes, cloud)
  • Connectivity to any technology, API, communication paradigm, and SaaS
  • Transformation of any data types and formats
  • Processes transactional and analytical workloads
  • Domain-specific language (DSL) for message-at-a-time processing, with similar logic such as aggregation, filtering, and conditional processing
  • Relatively complex frameworks because of their robust feature sets, hence not suitable for solving a minor problem
  • Not a replacement of a database, data warehouse, or data lake

Beyond the similarities, Kafka and Camel have very different sweet spots built to solve distinct problems. Hence, comparing these two tools is a bit like comparing apples and oranges. Some minor projects might use one or the other to solve the problem, but critical enterprise projects show the differences more quickly.

When to use Apache Camel?

The mission of Camel

Apache Camel is an integration framework. It solves a particular problem: Data integration between different applications, APIs, protocols, and communication paradigms. This concept is often called application integration or enterprise integration. Camel implements the famous Enterprise Integration Patterns (EIP). EIPs are based on messaging principles.
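To make this tangible, here is a minimal sketch of a Camel route in the Java DSL implementing a classic EIP (content-based router) for incoming order files. The endpoint URIs, the XPath expression, and the queue names are illustrative assumptions, and camel-core, camel-xpath, and camel-jms (with a configured JMS connection factory) are assumed to be on the classpath.

```java
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class OrderRoutes extends RouteBuilder {

    @Override
    public void configure() {
        // EIP: polling consumer + content-based router
        from("file:orders/inbox")
            .choice()
                .when(xpath("/order[@priority = 'high']"))
                    .to("jms:queue:orders.high")       // assumes a JMS connection factory is configured
                .otherwise()
                    .to("jms:queue:orders.standard")
            .end();
    }

    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new OrderRoutes());
        context.start();                               // routes run until the context is stopped
        Thread.sleep(60_000);
        context.stop();
    }
}
```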

Camel’s strengths

  • Event-based backbone based on well-known and adopted EIP concepts
  • Connectivity to almost any API
  • Integration, processing, and routing of information with an intuitive domain-specific language (DSL) focused on integration; composability in a programming context gives finer-grained control in code for conditional logic or transformations/reformatting
  • Powerful routing capabilities with many built-in EIPs
  • Many deployment options (standalone, web container, application server, Spring, OSGi, Kubernetes via the Camel K sub-project) – okay, I guess some options are not relevant in this decade anymore 🙂
  • Lightweight alternative to proprietary ETL and ESB tools

Camel’s weaknesses

  • Only a “routing machine”, i.e., not built for long-term storage (an additional cache or storage is needed); for that reason, Camel is not the right choice for a central nervous system like Kafka
  • No stream processing (like you know it from Kafka Streams or Apache Flink)
  • Limited scalability, not built for massive volumes of data
  • No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
  • No serverless cloud offering; hence, it also does not compete with other iPaaS offerings
  • Red Hat is the only vendor supporting it
  • Built to be deployed in a single data center or cloud region, not across hybrid or multi-cloud scenarios

The evolution of Apache Camel

Camel is widely adopted and has a strong community. Unfortunately, from a vendor and support perspective, the offerings declined in the last few years. One of the most significant pain points: I still don’t see a serverless cloud offering anywhere today:

The Evolution of Apache Camel 2

Camel TL;DR

Camel is an application integration framework to connect different applications and interfaces. Camel is NOT built for processing data in motion continuously, i.e., stream processing. Hence, it should be compared to ETL and ESB tools, not data streaming technologies like Kafka, Kinesis, or Flink. If you look for a serverless cloud offering, you are out of luck. If you look for vendor support, Red Hat is the only option.

When to use Apache Kafka?

The mission of Kafka

Real-time data beats slow data at any scale. The event streaming platform enables processing data in motion. Kafka is the de facto standard for event streaming, including messaging, data integration, stream processing, and storage. Kafka provides all capabilities in one infrastructure at scale. It is reliable and allows processing analytical and transactional workloads.
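As a small illustration of processing data in motion, here is a minimal Kafka Streams sketch that continuously filters a stream of payment events and writes the matches to another topic. The topic names, the bootstrap address, and the crude JSON matching are placeholders, not a recommendation for production code.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class PaymentFilterApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments");
        payments.filter((key, value) -> value != null && value.contains("\"currency\":\"EUR\""))
                .to("payments-eur");                   // continuous processing, event by event

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```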

Kafka’s strengths

  • Event-based streaming platform
  • A unique combination of pub/sub messaging, data processing, data integration, and storage in a single framework
  • Built for massive volumes of data and extreme scale from the beginning; hence, a single framework can be used for transactional (low volume) and analytics (high volume) use cases
  • True decoupling between producers and consumers because of its storage component makes it the de facto standard for microservice architectures
  • Guaranteed ordering of events in the distributed commit log
  • Distributed data processing with fault-tolerance and recoverability built-in
  • Replayability of events
  • The de facto standard for event streaming
  • Built with hybrid and multi-cloud data replication in mind (with included tools like MirrorMaker and separate, more advanced, and more straightforward tools like Confluent Cluster Linking)
  • Support from many vendors, including Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and many more
  • Paradigm shift: Built to process data in motion end-to-end from source to one or more sinks

Kafka’s weaknesses

  • Paradigm shift: Enterprises need to learn and understand the added value of event streaming, a new software category that enables new use cases but also requires different design patterns and operations approaches
  • No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
  • Limited out-of-the-box routing capabilities (a Kafka Connect SMT or a Kafka Streams / ksqlDB app does the job very well, but not as simply as Camel)
  • Complex operations (if you run it by yourself instead of using 3rd party tools or even better a serverless cloud offering)

The evolution of Apache Kafka

Kafka was built at LinkedIn to process high volumes of data, as no other open-source framework could do this. Kafka found quick adoption after LinkedIn open-sourced it. Several vendors adopted Kafka and added it to their product portfolio. Some vendors just added Kafka for the sake of having it. Others innovated and used additional tools to make Kafka cloud-native for the next generation of event streaming. Kafka as a serverless cloud offering is a critical piece of many modern enterprise architectures today:

The Evolution of Apache Kafka 2

Kafka TL;DR

Kafka is an event streaming platform to process data in motion continuously. If you “just” need an integration framework to route data from a source to one or more sinks (= ETL / ESB), then Camel can be used, too. However, Kafka kills two birds with one stone (= integrating data AND processing it in motion where needed).

Plenty of Kafka offerings are available on the market. Check out the Apache Kafka landscape and comparison to understand the differences between offerings from Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and others.

Decision tree – Camel or Kafka?

The above sections explored when to use Camel and Kafka. So far, so good. Nevertheless, both frameworks overlap with their capabilities. Let’s get some help to choose the right one in that case.

Qualify out – the easiest way to start an evaluation!

The easiest way to decide on a specific option is to qualify out other frameworks that cannot fulfill the requirements.

Therefore, do you need

  • Big data processing?
  • A storage component for true decoupling and replayability of events?
  • Stateless or stateful stream processing?
  • A serverless cloud offering?

The above section discussed these differentiators of Kafka. In all these cases, you can qualify out Camel. It does not fulfill these requirements. These requirements are not necessarily a complete list. And you might also find a few aspects to qualify out Kafka from the beginning. Hence, you could also start from the Camel perspective and ask yourself: When should I not use Kafka? But I think it is easier the other way round.

Qualifying out solutions because of their limitations makes the decision tree and evaluation process much easier from the beginning.

Decision Tree for Camel and Kafka

Here is my decision tree to find out if Camel or Kafka is the right choice and what vendors you could evaluate:

Decision Tree Apache Camel vs Apache Kafka Comparison

When to use Camel and Kafka together?

It is possible to use Camel and Kafka together in a single integration architecture. Should you do that? Two options exist. One makes more sense than the other:

Kafka for event streaming and Camel for ETL

Camel and Kafka integrate well with each other. The native Kafka component of Camel is the best native integration point as a bridge between both environments:

Apache Camel and Apache Kafka in the Enterprise Architecture

The above architecture shows how Camel and Kafka live next to each other. Camel is used in a business domain for application integration. Kafka is the central nervous system between the Camel integration application and many other applications. I also added Kong as an API gateway to clarify that neither Camel nor Kafka is a silver bullet for every problem.
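A minimal sketch of such a bridge, assuming the camel-jms and camel-kafka components are available: a Camel route consumes from a legacy JMS queue and publishes to a Kafka topic via Camel's native Kafka component. The queue name, topic name, and broker address are placeholders.

```java
import org.apache.camel.builder.RouteBuilder;

public class LegacyToKafkaBridge extends RouteBuilder {

    @Override
    public void configure() {
        // Consume from the legacy JMS queue and publish to Kafka via Camel's kafka component.
        from("jms:queue:legacy.orders")
            .to("kafka:orders?brokers=localhost:9092");
    }
}
```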

Once again, the vast advantage of Kafka as the central integration layer is its unique combination of characteristics within a single infrastructure, including:

  • Real-time messaging at any scale
  • Storage for true decoupling between different applications and communication paradigms
  • Built-in backpressure handling and replayability of events
  • Data integration
  • Stream processing

Real-time data replication across hybrid and multi-cloud is not shown in the above picture but is also part of the enterprise architecture out-of-the-box, leveraging the Kafka protocol.

With true decoupling within a modern microservice architecture, each business team can decide whether they need application integration (using Camel) or event streaming (using Kafka). Often, both could be used. Additional questions around single vs. multiple frameworks and APIs, vendor support, scalability needs, and other characteristics need to be evaluated to make the right choice for your business problem.

Camel connectors embedded into Kafka Connect

There is another way to combine Kafka and Camel: The “Camel Kafka Connector” sub-project of Apache Camel. Don’t get confused. This feature is not the Kafka component (= connector) of Camel! Instead, it is a relatively new initiative to deploy Camel components into the Kafka Connect infrastructure.

The obvious benefit: This way, you get hundreds of new connectors “for free” within the Kafka ecosystem. This capability sounds excellent. And it is!

However, consider the total cost of ownership and the overall efforts using this approach. Application integration is one of the most challenging problems in computer science – especially if you talk about transactional data sets that require zero data loss, exactly-once semantics, and no downtime. The more components you combine in the end-to-end data flow, the harder it gets to keep your performance and reliability SLAs.

Hence, using Camel components within Kafka Connect has a considerable disadvantage: Combining two frameworks with complexities and different design concepts. Just a few examples:

  • Kafka world: Partitions, Offsets, Leader and Follower, Key/Value/Header, connectors (based on Kafka Connect), Bootstrap Server, ConsumerRecord, Retention Time, etc.
  • Camel world: Routes, RouteBuilder, CamelContext, Exchange, Processor, components (Camel connectors), Endpoints, Type Converters, Registry, etc.

Please think twice before mixing two integration tools that are powerful but complex on their own. Getting this running is just one piece of the puzzle (the simple part). Don’t forget end-to-end testing, resiliency, SLAs, support across technologies and APIs. Even buying support for Camel and Kafka from Red Hat (i.e., a single vendor) does not improve this approach.

It is likely better to take the business logic and API calls out of the Camel component and copy it into a Kafka Connect connector template to run the integration natively with only Kafka code. This workaround allows a clean architecture, end-to-end integration with a single framework, a single vendor behind it, and much easier testing / debugging / monitoring.
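For illustration, here is a hedged sketch of such a Kafka Connect source connector template. The class and config names are arbitrary, and fetchFromBackend() stands in for whatever API call or business logic the Camel component performed before; offset management and error handling are intentionally simplified.

```java
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class CustomSourceConnector extends SourceConnector {
    private Map<String, String> config;

    @Override public void start(Map<String, String> props) { this.config = props; }
    @Override public Class<? extends Task> taskClass() { return CustomSourceTask.class; }
    @Override public List<Map<String, String>> taskConfigs(int maxTasks) { return Collections.nCopies(maxTasks, config); }
    @Override public void stop() { }
    @Override public ConfigDef config() {
        return new ConfigDef().define("topic", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "Target Kafka topic");
    }
    @Override public String version() { return "0.1.0"; }

    public static class CustomSourceTask extends SourceTask {
        private String topic;

        @Override public void start(Map<String, String> props) { this.topic = props.get("topic"); }

        @Override public List<SourceRecord> poll() throws InterruptedException {
            String payload = fetchFromBackend();                              // placeholder for the real API call
            SourceRecord record = new SourceRecord(
                    Collections.singletonMap("source", "backend"),            // source partition
                    Collections.singletonMap("offset", System.currentTimeMillis()),
                    topic, Schema.STRING_SCHEMA, payload);
            return Collections.singletonList(record);
        }

        private String fetchFromBackend() throws InterruptedException {
            Thread.sleep(1000);                                               // simulate polling the external system
            return "{\"example\":true}";
        }

        @Override public void stop() { }
        @Override public String version() { return "0.1.0"; }
    }
}
```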

TL;DR: I recommend only using the “Camel Kafka Connector” sub-project if the following options do not work:

  • Use only Apache Camel for application integration
  • Leverage Apache Kafka for event streaming and application integration
  • Choose separate deployments of Camel and Kafka and use the Camel-Kafka-Bridge

When NOT to use Camel or Kafka at all?

Once again, the easiest way to start your evaluation is to qualify out tools that cannot solve the problem.

Both Camel and Kafka are NOT built for the following scenarios:

  • A proxy for millions of clients (like mobile apps) – but native proxies (like a REST or MQTT Proxy for Kafka) exist for some use cases.
  • An API Management platform – but these tools are usually complementary and used to create life cycle management or monetize APIs deployed with Camel or Kafka.
  • A database for complex queries and batch analytics workloads
  • An IoT platform with features such as device management – but direct native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the right approach for (some) use cases.
  • A technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different kind of software than Camel or Kafka!

I wrote a very detailed post about this topic from a Kafka perspective. It maps almost 1:1 to the Camel world, too (and any related technology such as Flink, Spark, Pulsar, etc.): “When NOT to use Apache Kafka?”

Apache Camel vs. Apache Kafka – Who is the winner?

Simple answer: Both!

When you compare apples and oranges, you can still be happy when you are hungry, as both are good to eat. The same is true for Camel and Kafka. Both can do application integration. But they serve very different needs.

Many integration scenarios can use Camel or Kafka.

Camel is the right tool if you need to integrate data within an application context or business unit (with no need for stream processing, true decoupling, replayability, large scale, replication across data centers or cloud regions).

Kafka is the central event-based nervous system across business units, regions, and hybrid clouds. Kafka is all about event streaming. Application integration is just a piece of this puzzle. On the other side, I have seen plenty of integration projects powered by Apache Kafka. It is often replacing other middleware. That’s true for ETL/ESB legacy modernization and in discussions about using a cloud-native iPaaS.

Do you use Camel or Kafka today? What use cases? How do you decide which one to choose? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka in the Financial Services Industry
https://www.kai-waehner.de/blog/2021/01/18/apache-kafka-financial-services-industry-open-banking-api-finserv-payment-fraud-middleware-messaging-transactions/

Event streaming in financial services is growing like crazy. Continuous real-time data integration and processing are mandatory for many use cases. Many business departments in the financial services sector deploy Apache Kafka for mission-critical transactional workloads and big data analytics. High scalability, high reliability, and an elastic open infrastructure are the key reasons for Kafka’s success. This blog post explores different use cases, architectures, and real-world examples in the FinServ sector.

Innovation in Financial Services and Open Banking with Apache Kafka

FinServ Enterprise Reality: Innovate or be disrupted!

There is no way around it: The new business reality is very different from the last decades:

  • In the past: Technology was a support function. Innovation was required for growth. It was “good enough” to run on yesterday’s data.
  • Today: Technology is the business. Innovation is required for survival. Yesterday’s data = failure. A modern, real-time data infrastructure is required.

Only two options exist for enterprises in the finance sector: Innovate or be disrupted!

Event Streaming and The Rise of Apache Kafka in Financial Services and Banking

The New FinServ Enterprise Reality – Every Company is a Software Company

Please take a look at your favorite traditional bank and how its market cap compares to new FinTech companies such as Robinhood, Stripe, Square, or Revolut.

Some traditional companies re-invented themselves to focus on innovative new products and great customer experience to stay competitive. Software is eating the world, including the finance sector.

Here are a few examples:

  • Capital One: 10,000 of 40,000 employees are software engineers
  • Goldman Sachs: 1.5B (billion!) lines of code across 7,000+ applications
  • JPMorgan Chase & Co: Employs over 50,000 people in technology and has $10B+ technology spend.

Most successful post-modern companies in the finance sector heavily rely on Apache Kafka. This is true for emerging fintechs but also for (some) traditional banks.

Apache Kafka in Financial Services

Various use cases emerged to deploy event streaming with Apache Kafka in the finance industry. This includes mission-critical transactional workloads like payment processing or regulatory reporting and big data analytics projects leveraging Machine Learning, data lakes, etc.

Examples for Real-World Deployments

Here are a few companies leveraging Apache Kafka for banking projects:

Event Streaming with Apache Kafka in Financial Services

Check past Kafka Summit video recordings and slides for details about use cases and architectures of these companies from the finance sector.

Here are a few concrete examples:

  • Capital One: Becoming truly event-driven – offering a service other parts of the bank can use.
  • ING: Significantly improved customer experience – as a differentiator + Fraud detection and cost savings.
  • Nordea: Able to meet strict regulatory requirements around real-time reporting + cost savings.
  • Paypal: Processing 400+ Billion events per day for user behavioral tracking, merchant monitoring, risk & compliance, fraud detection, and other use cases.
  • Royal Bank of Canada (RBC): Mainframe off-load, better CX & fraud detection – brought many parts of the bank together
  • 10X Banking: Cloud-native and open core-banking platform to implement a next-generation FinServ platform
  • Robinhood: Commission-free stock trading using a mobile app and website.

This is just a concise list of companies in the financial sector using Apache Kafka as an event streaming platform at the heart of their business. Plenty of other examples are available from dozens of global banks leveraging Apache Kafka for many use cases.

Kafka makes your Business Real-Time

The huge advantage is that Kafka allows decoupling your applications and infrastructure in a domain-driven design (DDD). Each microservice can use its own technology or product but leverage the same data (with security and privacy in mind, of course):

Event Streaming and The Rise of Apache Kafka in Banking

It is great to see that many FinServ companies do not just leverage Kafka in their applications but also contribute to the community. For instance, Robinhood published Faust: A stream processing library, porting Kafka Streams’ ideas from Java to Python.

This is a great example of a microservice architecture and the freedom of technology choice: Robinhood did not want to use Java for (some) applications and chose Python instead. No problem with Kafka as the brokers are dumb. The data processing and business logic happen in the clients in your favorite programming language.

And to be clear: Financial services are important in every company! Payments, orders, fraud, and similar transactional and analytical data rely on FinServ applications and the integration with partners.

Slides – The Rise of Event Streaming in FinServ

The following slide deck goes into more detail. Learn about the rise of event streaming in the financial services industry. Kafka is adopted in more and more scenarios:

Kafka in Banking for Middleware, Mainframe, Machine Learning, Open API, and more

The following links share additional content related to many banking and FinServ use cases and architectures:

Please check them out to learn more about the usage of event streaming with Kafka and its ecosystem across various business units in the Finserv sector.

Last but not least, please be aware that the term “real-time” is used in many contexts and can have different meanings. Read “Kafka is NOT hard real-time” to understand why Kafka is used in most banking projects, but not for the specific use case of trading in microseconds.

Software is Eating the Banks and FinServ Industry

Software is eating the world, including financial services. Continuous real-time data integration and processing are mandatory for many use cases. Apache Kafka is deployed across industries for mission-critical transactional workloads and big data analytics. No matter if you need to integrate with legacy systems, process mission-critical payment data, or build batch reports and analytic models, Kafka is a predominant choice as part of the architecture. Hybrid, edge, and multi-cloud deployments of Kafka are the new black.

What are your experiences and plans for event streaming in the financial services industry? Did you already build applications with Apache Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Kafka and XML Messages – Transformation, Connector, Middleware
https://www.kai-waehner.de/blog/2020/09/25/kafka-xml-messages-transformation-connector-middleware-comparison-connect-smt-esb-etl-web-services-soap-wsdl-schema/

XML messages and XML Schema are not very common in the Apache Kafka and Event Streaming world! Why? Many people call XML legacy. It is complex, verbose, and often associated with the ugly WS-* Hell (SOAP, WSDL, etc.). On the other hand, every company older than five years uses XML. It is well understood, provides a good structure, and is human- and machine-readable.

This post does not want to start another flame war between XML and other technologies such as JSON (which also provides JSON Schema now), Avro, or Protobuf. Instead, I will walk you through the three main approaches to integrate between Kafka and XML messages as there is still a vast demand for implementing this integration today (often for integrating legacy applications and middleware).

XML and XML Schema

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium’s XML 1.0 Specification of 1998 and several other related specifications – all of them free open standards – define XML.

The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services. Several schema systems exist to aid in defining XML-based languages, while programmers have developed many application programming interfaces (APIs) to assist the processing of XML data.

SOAP / WSDL Web Services – The WS-* Hell

Web Services use XML messages that follow the SOAP standard and have been popular with traditional enterprises for many years. In such systems, there is often a machine-readable description of the operations offered by the service written in the Web Services Description Language (WSDL). Web Services are one of the predominant use cases for XML integration. Some people call this the “WS-* Hell”:

XML Web Service Hell - WS-*

Kafka for any Data Format (JSON, XML, Avro, Protobuf, …)

Kafka can store and process anything, including XML. The Kafka brokers are dumb. They don’t care about data formats. The implementation of Kafka under the hood stores and processes only byte arrays. This approach follows the design principle of dumb pipes and smart endpoints (coined by Martin Fowler for microservice architectures). Dumb brokers are one of the architectural reasons why Kafka scales and performs so well.

As Kafka supports any data format, XML is no problem at all. It accepts any serializable input format. XML is just text so that plain string serializers can be used. However, if you want additional validation before pushing messages into Kafka (like checking the content is actually XML or doing Schema Validation using XML Schema), then you need to write your own XML Serializer/Deserializer implementation.
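Here is a minimal sketch of such a custom serializer in Java: it checks that the payload is well-formed XML before producing the bytes. The class name is arbitrary; full XML Schema validation could be plugged in via javax.xml.validation, which is omitted here.

```java
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class ValidatingXmlSerializer implements Serializer<String> {

    @Override
    public byte[] serialize(String topic, String xml) {
        if (xml == null) {
            return null;
        }
        try {
            // Well-formedness check before the message hits the broker.
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new SerializationException("Payload for topic " + topic + " is not valid XML", e);
        }
        return xml.getBytes(StandardCharsets.UTF_8);
    }
}
```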

XML mappings can be very complex, including referencing other documents and using plenty of ugly open standards. XML includes generic standards such as WS-Security or industry-specific standards such as XBRL for regulatory processing or HL7 for healthcare. It is a mess. Period. Even mature tools struggle as soon as any parts of such a standard are adjusted to their own needs (even though the standards support customization).

Kafka works with any programming language, and Confluent also provides a REST Proxy for HTTP(S) communication with Kafka. But these clients require the developer to implement all the complex XML mapping and processing. Hence, let’s now talk about the most common approaches to implement the integration between any XML-based application and Apache Kafka: 3rd party middleware (ETL / ESB tools) and Kafka-native Kafka Connect.

Kafka and XML via Middleware (ETL, ESB)

Apache Kafka and traditional middleware (ETL, ESB) are frenemies (friends and enemies). Check out this blog and video recording/slide deck for a more in-depth discussion and comparison. If these legacy integration tools such as TIBCO BusinessWorks or Software AG webMethods do one thing well, then it is graphical mappings of complex XML structures, including good and mature (but not 100%) support of the ugly WS-* web service standards.

XML Kafka Integration with 3rd Party Middleware - ESB ETL Tools

Pros of 3rd Party Middleware for XML-Kafka Integration
  • Visual coding for a more straightforward mapping experience (especially crucial for very complex structures) – for all coders: Trust me, this is really easier and more time-efficient than writing, testing, and debugging source code!
  • Mature (10-20 years old technologies don’t have many bugs anymore – if they are still alive)
  • Support of complex XML Schema structures (but often still issues with import, UI, and export)
  • (Often) already in place (implemented, tested, and deployed)
  • Kafka integration exists for any middleware which is still alive (i.e., maintained and supported by the vendor)
Cons of 3rd Party Middleware for XML-Kafka Integration
  • Products are legacy – as old as the source systems
  • Monolithic, inflexible architecture
  • Separate infrastructure to operate, test, maintain, and pay for
  • End-to-end integration is more challenging (from a technical and support perspective) as there are two systems in the middle instead of just one
  • Licensing
  • Point-to-point and tight coupling, and not event-based streaming with real decoupling
  • (Often) proprietary solution

Traditional middleware (such as TIBCO, IBM, Software AG, or Mulesoft) complements Kafka deployments. If you have the middleware already running and licensed (and do not plan to migrate away from it), then this is a viable approach to integrate between XML messages from legacy systems and Kafka.

Kafka and XML via Kafka Connect

The open Kafka ecosystem provides Kafka-native support for XML integration leveraging Kafka Connect. Kafka Connect is a Kafka-native tool for scalable and reliable streaming data integration between Apache Kafka and other data systems. It makes it simple to quickly define connectors that move large data sets into and out of Kafka. Think about it as ESB-on-Kafka.

Use cases include

  • Messaging integration (MQ)
  • Mainframe offloading
  • File outputs from batch processes
  • Web services (SOAP / WSDL / WS-*)
  • Legacy applications
Pros
  • Kafka-native (leveraging Kafka under the hood for scalability, throughput, high availability, exactly-once semantics, low latency, etc.)
  • Decoupled design (Domain-driven Microservice approach instead of tight coupling)
  • Open ecosystem and flexible integration with any data sources and sinks – check out Confluent Hub for hundreds of open source and commercial connectors
  • Cloud-native to be deployed in any edge / on-premise or cloud infrastructure such as Kubernetes
Cons
  • Limited support for complex XML Schemas and standards – not all ugly documents will work well
  • No visual coding – unfortunately, no Kafka-native visual coding tools exist in 2020. Let’s go, Confluent! 🙂

Let’s take a look at two Kafka Connect approaches in more detail: A dedicated XML Connector and an SMT (Single Message Transformation) embedded into any Kafka Connect source or sink connector.

Kafka Connect Connector for XML Files

An XML connector directly accesses the XML file to parse and transform the content:

Kafka XML Integration with Kafka Connect XML Connector

Connect FilePulse is an open-source Kafka Connect connector built by streamthoughts to parse, transform and stream any XML file. Other file formats are also supported. But as many other tools support modern data formats such as JSON, CSV, Avro, or Protobuf, I really think the highlight of this connector is its XML support.

Features:

  • Support for recursive scanning of local directories
  • Reading and writing files into Kafka line by line
  • Support for multiple input file formats (e.g., CSV, JSON, Avro, XML)
  • Parsing and transforming data using built-in or custom processing filters
  • Error handler definition
  • Monitoring files while they are written into Kafka
  • Support for pluggable strategies to clean up completed files

Here is an excellent tutorial for using this XML connector for Kafka Connect: Streaming data into Kafka – Loading an XML file.

The Connect FilePulse Kafka Connector is the right choice for direct integration between XML files and Kafka.

SMT for Embedding XML Transformations into ANY Kafka Connect Connector

An SMT (Single Message Transformation) is part of the Kafka Connect framework. SMTs are applied to messages as they flow through Kafka Connect. They transform inbound messages after a source connector has produced them, but before they are written to Kafka. SMTs transform outbound messages before they are sent to a sink connector.

An SMT can be embedded into any Kafka Connect source or sink connector. Hence, the XML SMT for Kafka Connect allows direct integration with any interface and mapping XML messages without the need for storing the file or using a specific XML connector.

Kafka XML Integration with SMT and ANY Source Sink Connector

SMTs even allow you to add or change metadata, e.g., by adding a new header in addition to the key and value of the message.

Here is an example: Receive XML messages from JMS-based messaging platforms and convert the XML payload to JSON, Avro, or Protobuf for further processing and integration into the rest of the (modern) enterprise architecture. For instance, Confluent provides a generic JMS connector but also dedicated connectors for various legacy MQ products such as IBM MQ (often running on the mainframe), TIBCO EMS, and ActiveMQ. The XML SMT allows on-the-fly transformation of the incoming XML messages. I have seen integration and later replacement of these MQ tools across the globe in any kind of industry.
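As an illustration, a connector configuration with an SMT in the chain could look like the following sketch (standalone worker .properties style). The transforms mechanism (transforms and transforms.&lt;name&gt;.type) is standard Kafka Connect; the connector class, the SMT class, and the connector-specific settings are placeholders that must be replaced with the classes and options of the products you actually deploy.

```properties
name=legacy-mq-xml-source
connector.class=<class of your JMS / IBM MQ source connector>
tasks.max=1
# ... connector-specific connection and target topic settings go here ...

# Standard Kafka Connect SMT chain; the XML SMT class and its options are product-specific.
transforms=xmlToStruct
transforms.xmlToStruct.type=<class of the XML transformation (SMT) you deploy>
```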

Dead Letter Queue (DLQ) for Handling Bad XML Messages

Just transforming messages is often not sufficient. A Dead Letter Queue (DLQ), aka Dead Letter Channel, is an Enterprise Integration Pattern (EIP) to handle bad messages. This design pattern is complementary for XML integration. For instance, a DLQ can store badly processed XML that didn’t fit the XSD in the transform. Here is an example of how to implement the rerouting to a DLQ using the above SMT.
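For reference, Kafka Connect ships error-handling options (introduced with KIP-298) that implement this pattern; the dead letter topic is available for sink connectors in vanilla Kafka Connect. A sketch of the relevant properties, with an assumed DLQ topic name:

```properties
# Tolerate bad records instead of failing the task, and capture them in a dead letter topic.
errors.tolerance=all
errors.deadletterqueue.topic.name=dlq-xml-orders
errors.deadletterqueue.topic.replication.factor=3
errors.deadletterqueue.context.headers.enable=true
# Additionally log failed operations (and optionally the message contents) for debugging.
errors.log.enable=true
errors.log.include.messages=true
```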

Confluent Schema Registry for Data Governance

The Confluent Schema Registry is a complementary (optional) tool. It provides a smart implementation of data format and content validation (including enforcement, versioning, and other features). I see it used in ~70% of Kafka projects across the globe. As soon as you do more than just data ingestion into a data lake like HDFS or S3, the added value is enormous. Today, Confluent Schema Registry supports JSON Schema, Avro, and Protobuf.

The Schema Registry provides an open interface and is pluggable. For example, some users have asked for Schema Registry to support XML. Now, you can add XML support to Schema Registry directly, and use the Schema Registry to store both XML and Avro at the same time. For more on how to add your schema formats, please refer to the documentation. The workaround is to do the discussed XML-to-another-format transformation first and start your event streaming data governance from that point.
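As a small sketch of that workaround: once the XML payload is converted to Avro, a standard producer with Confluent's Avro serializer registers and validates the schema against Schema Registry. The schema, topic, and URLs below are placeholders, and the kafka-avro-serializer dependency is assumed to be on the classpath.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AvroOrderProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        // Minimal Avro schema; in practice this is generated from the XML-to-Avro mapping.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
        GenericRecord order = new GenericData.Record(schema);
        order.put("id", "4711");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders-avro", order));
        }
    }
}
```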

Summary

XML is predominant in most enterprises and mostly used for legacy applications, batch processing, and SOAP / WSDL web services. A digital transformation can only be successful if the old world is connected well to the new world (as you might have learned in my example about how to integrate between Kafka and Mainframes).

This post explored the three most common options for integration between Kafka and XML:

  • XML integration with a 3rd party middleware
  • Kafka Connect connector for integration with XML files
  • Kafka Connect SMT to embed the transformation into ANY other Kafka Connect source or sink connector to transform XML messages on the fly

What are your experiences with XML integration for Kafka? Which implementation did you choose? What challenges did you face, and how did you or do you plan to solve this? What is your strategy? Let’s connect on LinkedIn and discuss it!

 

Kafka SAP Integration – APIs, Tools, Connector, ERP et al
https://www.kai-waehner.de/blog/2020/08/25/kafka-sap-integration-alternatives-connectors-erp-r3-ecc-s4-hana-soap-rest-http-web-service-api-sdk-java/

A question I get every week from customers across the globe: How can I integrate my SAP system with Apache Kafka? This post explores various alternatives, including connectors, 3rd party tools, custom glue code, and trade-offs between the different options.

After exploring what SAP is, I will discuss several integration options between Apache Kafka and SAP systems:

  • Traditional middleware (ETL/ESB)
  • Web services (SOAP/REST)
  • 3rd party turnkey solutions
  • Kafka-native connectivity with Kafka Connect
  • Custom glue code using SAP SDKs

Disclaimer before you read on:

I am not an SAP expert. It is tough to stay up-to-date with the vast and complex ecosystem of SAP products, (re-)brands, versions, services, SDKs, and APIs. I am sorry if some of the below information is not 100% accurate or outdated. Always double-check on the SAP website (if the links from Google still work – I had some issues with some pages “no longer available” while researching for this blog post). If you see any inaccurate or missing information, please let me know and I will update the blog post.

What is SAP?

SAP is a German multinational software corporation that makes enterprise software to manage business operations and customer relations. In 2019, SAP had revenue of €27.553 billion, a net income of €3.387 billion, and ~100,000 employees.

It is quite interesting: Nobody asks how to integrate with IBM or Oracle. Instead, people more specifically ask how to integrate with IBM MQ, IBM DB2, IBM Mainframe (still very ambiguous), or any other of the 100s of IBM products.

For SAP, people ask: How can I integrate with SAP? Let’s clarify what SAP is before exploring integration options.

The company is primarily known for its ERP software. But if you check out the official “What is SAP?” page, you find out that SAP offers solutions across a wide range of areas:

  • ERP and Finance
  • CRM and Customer Experience
  • Network and Spend Management
  • Digital Supply Chain
  • HR and People Engagement
  • Experience Management
  • Business Technology Platform
  • Digital Transformation
  • Small and Midsize Enterprises
  • Industry Solutions

SAP’s Software Portfolio

SAP’s stack includes homegrown products like SAP ERP and acquisitions with their own codebase, including Ariba for supplier network, hybris for e-commerce solutions, Concur for travel & expense management, and Qualtrics for experience management.

Even if you talk about SAP ERP, the situation is still not that easy. Most companies still run SAP ERP Central Component (ECC, formerly called SAP R/3), SAP’s sophisticated (and aged) ERP product. ECC runs on a third-party relational database from Oracle, IBM, or Microsoft, while HANA is SAP’s in-memory database. The new ERP product is SAP S4/Hana (no, this is not just the famous in-memory database). Oh, and there is SAP S4/Hana Cloud. And before you wonder: No, this is not the same feature set as the on-premise version!

Various interfaces exist depending on your product. An interface can be an (awful) proprietary technology like BAPI or iDoc, an (okayish) standards-based web service API using SOAP or REST / HTTP, a (non-scalable) JDBC database connection, or, if you are lucky, even a (scalable and real-time) Event / Messaging API. The article “The ERP is Dead. Long live the Distributed Planning System” from the SAP blog describes the situation very well.

And sorry, we are still not done yet. Even if you talk about ERP systems, this can mean anything within a zoo of products or components, depending on who you are talking to:

SAP ERP System - Zoo of Products including MES CRM PLM WMS LMS

So, before you want to discuss the integration of your SAP product with Kafka, please please please find out the product, version, and deployment infrastructure of your SAP components.

Different Integration Options between Kafka and SAP

After this introduction, you hopefully understand that there is no silver bullet for SAP integration. The following will explore different integration options between Kafka and SAP and their trade-offs. The main focus is on SAP ERP (old ECC and new S4/Hana), but the overview is more generic, including integration capabilities with other components and products.

SAP Integration with Apache Kafka - R3 ERP S4 Hana Ariba Concur BAPI iDoc REST SOAP Web Services Java

Also, keep in mind that you typically need or want to integrate with a function or service. Direct integration with the data object does not make much sense in most cases, as you would have to re-implement the mapping and denormalization between the data objects. This is especially true for source integration, i.e., building pipelines from SAP to Kafka. In the case of SAP ERP, you typically integrate with RFC/BAPI/iDoc or any other web service interface for this reason.

Traditional Middleware (ETL / ESB) for SAP Integration

Integration tools exist just for the sake of integrating different sources and sinks:

  • Extract-Transform-Load (ETL) for batch integration, like Informatica, Talend or SAP NetWeaver Process Integration
  • Enterprise Service Bus (ESB) for integration via web services and messaging, like TIBCO BusinessWorks or Software AG webMethods
  • Integration Platform as a Service (iPaaS) for cloud-native integration, similar to ETL/ESB tools, but provided as a fully managed service, such as Boomi,  Mulesoft, or SAP Cloud Integration (and some cloud-washed products from legacy middleware vendors).

Most traditional middleware products were built to integrate with complex, proprietary systems from the last 20+ years, such as IBM Mainframe, EDIFACT, and, guess what, ERP systems like SAP ECC. In the meantime, all of them also have a Kafka connector. There are plenty of good reasons why many companies chose Kafka as a modern integration platform instead of traditional legacy middleware.

Most traditional ETL and ESB tools provide SAP connectivity. SAP Cloud Platform Integration (SAP CPI) is SAP’s own “modern” middleware solution. CPI includes a Kafka adapter to send and receive Kafka messages.

Pros:
  • In place: Typically already in place, no new project is required.
  • Maturity: Built over the years (because of the complexity), running in production for a long time already
  • Tooling: Visual coding for the integration (required because of the complexity), directly map iDoc / BAPI / Hana / SOAP schemas to other data structures
  • Integration: Not just connectors to the legacy systems but also Kafka for producing and consuming messages (due to market pressure)
Cons:
  • Legacy: Products are as old as the source and sink systems
  • Scalability: Monolithic, inflexible architecture
  • Tight coupling: Integration has to be developed on and runs on the middleware, no real decoupling and domain-driven design (DDD) like in Kafka
  • Licensing: High-cost per server, often already planned to be replaced (e.g., you can replace 100+ IBM MQ or TIBCO EMS servers with a single Kafka cluster)
  • Point-to-point: No streaming architecture, most integrations are based on web services (even if the core under the hood is based on a messaging system)
TL;DR:

Traditional integration tools are mature and have great tooling, but limited scalability/flexibility and high licensing cost. Often a quick win as it is already running, and you just need to add the Kafka connector.

Custom Glue Code for Kafka Integration using SAP SDKs

Writing your custom integration between SAP systems and Kafka is a viable option. This glue code typically leverages the same SDKs as 3rd party tools use under the hood:

  • Legacy: SAP NetWeaver RFC SDK – a C/C++ interface for connecting to SAP systems from release R/3 4.6C up to today’s SAP S/4HANA systems.
  • Legacy: SAP Java Connector (SAP JCo) – the famous JCO.jar library – is a Java SDK for integration to the SAP ECC / ERP (this is just a wrapper around the C/C++ SDK).
  • Legacy: SAP ACO is an integrated ABAP component that is designed to consume RFC Services on remote ABAP systems.
  • Legacy: SAP ABAP TCP Push Channel if you are forced to use custom ABAP code and need or want to use TCP instead of the Confluent REST Proxy for HTTP communication.
  • Legacy: JMS Adapter to integrate via the standard messaging protocol. Great option (if you get it running and working for your use case and functions). For instance, JMS integration can be done via SAP PI.
  • Modern: SAP Cloud SDK allows developing applications with Java or JavaScript that communicate with SAP solutions and services such as SAP S/4 Hana Cloud, SAP SuccessFactors, and others (the term ‘Cloud’ actually means ‘Cloud-native’ in this case, i.e., this SDK also works with SAP’s on-premise products).
  • Modern: SAP Cloud Platform Enterprise Messaging: S4/Hana provides an asynchronous messaging interface (running on Solace on CloudFoundry under the hood). Different messaging standards are supported, including AMQP 1.0 and JMS (depending on the specific product you look at). Some examples demonstrate how to connect via the Java Client using the JMS API.
  • Modern: SAP ODP (Operational Data Provisioning): Technical infrastructure for operational analytics, and data extraction + replication. Some kind of CDC (Change Data Capture) with out-of-the-box support for various SAP products, including SAP BW, SAP BW/4HANA, SAP Data Services, and SAP HANA Smart Data Integration. ODP is not just for SAP interfaces but also integrates with 3rd party technologies (via a custom connector, not out-of-the-box) such as HDFS or Kafka.
Pros:
  • Flexibility: Custom coding allows you to implement precisely what you need.
Cons:
  • Maintenance: No vendor support – develop, maintain, operate, support by yourself.
  • Point-to-point: No streaming architecture, most integrations are based on web services (even if the core under the hood is based on a messaging system).
TL;DR:

“Build vs. Buy” always has trade-offs. I have only seen custom glue code for SAP integration in the field if no solution from a vendor was available and affordable. SAP Cloud Platform Enterprise Messaging is a possible integration pattern for Kafka, but it also adds yet another messaging layer to the architecture.
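For completeness, here is a hedged sketch of such glue code using SAP JCo (listed above) together with the Kafka producer API. It assumes a configured JCo destination file; the BAPI name, table name, and field names follow the classic JCo examples and must be adapted to your system, and error handling and delivery guarantees are omitted.

```java
import com.sap.conn.jco.JCoDestination;
import com.sap.conn.jco.JCoDestinationManager;
import com.sap.conn.jco.JCoFunction;
import com.sap.conn.jco.JCoTable;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SapToKafkaGlueCode {

    public static void main(String[] args) throws Exception {
        // The JCo destination "ABAP_AS" must be defined in an ABAP_AS.jcoDestination properties file.
        JCoDestination destination = JCoDestinationManager.getDestination("ABAP_AS");
        JCoFunction function = destination.getRepository().getFunction("BAPI_COMPANYCODE_GETLIST");
        function.execute(destination);
        JCoTable companyCodes = function.getTableParameterList().getTable("COMPANYCODE_LIST");

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < companyCodes.getNumRows(); i++) {
                companyCodes.setRow(i);
                String value = companyCodes.getString("COMP_CODE") + "," + companyCodes.getString("COMP_NAME");
                producer.send(new ProducerRecord<>("sap.companycodes", companyCodes.getString("COMP_CODE"), value));
            }
        }
    }
}
```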

 

SOAP / REST Web Services for SAP Integration

The last 15 years brought us web services for building a Service-oriented Architecture (SOA) to integrate applications. A web service typically uses SOAP or REST / HTTP as technology. I will not start yet another FUD war here. Both have their use cases and trade-offs.

Pros:
  • Standards-based: Different SDKs, products, and services talk the same language (at least in theory; true for HTTP, not so true for SOAP); most middleware tools have proper support for building HTTP services.
  • Simplicity (HTTP): Well-understood, supported by most programming languages and APIs, established for many use cases – middleware is just yet another one.
Cons:
  • Point-to-point: No streaming architecture, most integrations are based on web services (even if the core under the hood is based on a messaging system).
  • Tight coupling: Integration has to be developed and runs on the middleware, no real decoupling, and domain-driven design DDD like in Kafka.
  • Complexity (SOAP): SOAP/WSDL is just the tip of the iceberg! Check out the list of WS-* standards to understand why this is often called the “WS star hell”. The AXIS framework (Apache extensible Interaction System) is one example of SAP’s SOAP integration using an open framework. While the Apache project was last updated in 2006, SAP still recommends using this interface in 2020.
  • Missing features (REST / HTTP): Representational state transfer (REST) is a concept, but most people mean synchronous HTTP communication. Most middleware tools (and most other applications) implement only a small fraction of the full standard. HTTP is an excellent standard, but all the tooling and features need to be built on top of it.
  • Only indirect support: Several SAP products do not provide open interfaces. While using SOAP or HTTP under the hood, you are forced to use the licensed tooling to create web services. For instance, SAP Business Connector (restricted license version of webMethods Integration Server), SAP NetWeaver Process Integration (PI), SAP Process Orchestration (PO), Cloud Platform Integration (CPI), or SAP Cloud Integration.
TL;DR:

SOAP and REST web services work well for point-to-point communication and have good tool support. Both have their trade-offs; make sure to choose the right one – if your SAP product provides both interfaces. Unfortunately, you will often not have a choice. Even worse: You cannot use any tool but are forced to use the right licensed SAP tool or wrapper interface. Large scale, high volume, and continuous processing of data are not ideal requirements for these (legacy) integration products.

For direct HTTP(S) communication with Kafka, the Confluent REST Proxy is an excellent option for producing, consuming, and administrating from any Kafka client (including custom SAP applications). For instance, SAP Cloud Platform Integration (CPI) can use this integration pattern to integrate between SAP and Kafka.
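A minimal sketch of this pattern against the REST Proxy v2 API, assuming the proxy runs at localhost:8082 and the topic name is a placeholder: any SAP-side application that can issue HTTP requests can produce events this way, without a Kafka client library.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyProducer {

    public static void main(String[] args) throws Exception {
        // One record with a JSON value, wrapped in the REST Proxy v2 envelope.
        String body = "{\"records\":[{\"value\":{\"orderId\":\"4711\",\"status\":\"CREATED\"}}]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8082/topics/sap-orders"))
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```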

SAP-specific 3rd Party Tools for Kafka

SAP integration is a huge market globally. SAP provides several tools for data integration (some legacy, some modern – honestly, I don’t have a full overview of their complex product and API portfolio). Additionally, plenty of other software vendors have built specific integration software for SAP systems.

A few examples I have seen in the field recently:

Examples:

These are just a few examples. Many more exist for on-premise, cloud, and hybrid integration with different SAP products and interfaces.

Some of these tools are natively integrated into SAP’s integration tools instead of being completely independent runtimes. This can be good or bad. An advantage of this approach is that you can leverage the SAP-native features for complex iDoc / BAPI mappings and the integrated 3rd party connector for Kafka communication.

Pros:
  • Turnkey solution: Built for SAP integration, often combined with other additional helpful features beyond just doing the connectivity, more lightweight than traditional generic middleware.
  • Focus: Many 3rd party solutions focus on a few specific use cases and/or products and technologies. It is much harder to integrate with “SAP in general” than focusing on a particular niche, e.g., Human Resources processes and related HTTP interfaces.
  • Maturity: Built over the years
  • Tooling: Visual coding for the integration (required because of the complexity), directly map iDoc / BAPI / Hana / SOAP schemas to other data structures
  • Integration: Not just connectors to the legacy systems but also modern technologies such as Kafka
Cons:
  • Scalability: Often monolithic, inflexible architecture (but focusing on SAP integration only, therefore often “okayish”)
  • Tight coupling: The integration logic has to be developed on and runs within the 3rd party tool. It is at least separated from other middleware, so some decoupling and domain-driven design (DDD) in conjunction with Kafka is still possible.
  • Licensing: Moderate cost per server (typically cheaper than the traditional generic middleware)
  • Point-to-point: No streaming architecture, most integrations are based on web services (even if the core under the hood is based on a messaging system)
TL;DR:

A turnkey solution is an excellent choice in many scenarios. I see this pattern of combining Kafka with a dedicated 3rd party solution for SAP integration very often. I like it because the architecture is still decoupled, but no vast efforts are required for doing a (complex) SAP integration. And there is still hope that even SAP itself releases a nice Kafka-native integration platform. 🙂

Kafka-native SAP Integration with Kafka Connect

Kafka Connect, an open-source component of Apache Kafka, is a framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems.

Kafka Connect connectors are available for SAP ERP databases:

Pros:
  • Kafka-native: Kafka under the hood, providing real-time processing for high volumes of data with high scalability and reliability.
  • Simplicity: Just one infrastructure for messaging and data integration, much easier to develop, test, operate, scale, and license than using different frameworks or products (e.g., Kafka for messaging plus an ESB for data integration).
  • Real decoupling: Kafka’s architecture uses smart endpoints and dumb pipes by design, one of the key design principles of microservices. Not just for the applications, but also for the integration components. Leverage all the benefits of a domain-driven architecture for your Kafka-native middleware.
  • Custom connectors: Kafka Connect provides an open template. If no connector is available, you (or your favorite system integrator or Kafka-vendor) can build an SAP-specific connector once, and you can roll it out everywhere.
Cons:
  • Only database connectors: No connectors beyond the native JDBC database integration are available at the time of writing.
  • Anti-pattern of direct database access: In most cases, you want or need to integrate with a function or service, not with the data objects. And in most cases, you won’t even get direct database access from the database admin anyway.
  • Efforts: Build your own SAP-native (i.e., non-JDBC) connector or ask (and pay) your favorite SI or Kafka vendor.

UPDATE January 2021: A Kafka-native integration is available with INIT’s ODP connector (as discussed in section “3rd party tools”). It eliminates the above cons and might be a great 3rd party option for some use cases.

TL;DR:

Kafka Connect is a great framework and is used in most Kafka architectures for various good reasons. For SAP integration, the situation is different because no connectors are available (beyond direct database access). It took 3rd party vendors many years to implement RFC/BAPI/iDoc integration with their tools. Such an implementation will probably not happen again for Kafka because it is very complex, and these proprietary legacy interfaces are dying anyway. The situation is different for modern SAP interfaces: Some 3rd party providers leverage Kafka Connect for their product. For instance, INIT Software’s Kafka Connect ODP connector.

A Kafka Connect connector for SAP Cloud Platform Enterprise Messaging using its Java client would be feasible and probably the best option. I assume we will see such a connector on the market sooner or later.
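To make the database option above more concrete, here is a minimal, hedged sketch that registers a Confluent JDBC source connector against an SAP HANA database via the Kafka Connect REST API. The Connect worker URL, HANA host, schema and table names, credentials, and the JDBC URL are assumptions for illustration, and the SAP HANA JDBC driver would have to be available on the Connect worker's classpath.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSourceConnector {

    public static void main(String[] args) throws Exception {
        // Hypothetical Connect worker URL.
        String connectUrl = "http://connect:8083/connectors";

        // Connector name, connection details and table names are placeholders.
        String connectorConfig = """
            {
              "name": "sap-hana-orders-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:sap://hana-host:39015/?databaseName=HDB",
                "connection.user": "KAFKA_READER",
                "connection.password": "change-me",
                "table.whitelist": "SAPABAP1.ORDERS",
                "mode": "incrementing",
                "incrementing.column.name": "ORDER_ID",
                "topic.prefix": "sap-",
                "poll.interval.ms": "5000"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(connectUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorConfig))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

This is exactly the kind of setup that runs into the "anti-pattern of direct database access" mentioned above, so treat it as a fallback rather than the preferred integration style.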

Embedded Kafka in SAP Products

We have seen various integration options between SAP and Kafka. Unfortunately, all of them are based on the principle of “data at rest”, in contrast to Kafka processing “data in motion”. The closest fit so far is the integration via SAP Cloud Platform Enterprise Messaging because you can at least leverage an asynchronous messaging API.

The real added value comes when Kafka is leveraged not just for real-time messaging but for event streaming. Kafka provides a combination of messaging, data integration, data processing, and real decoupling using its distributed storage infrastructure.

Native Event Streaming with Kafka in SAP Products

Interestingly, some of SAP’s acquisitions leverage Kafka under the hood for event streaming. Two public examples:

Obviously, people are also waiting for the Kafka-native SAP S4/Hana interface so that they can leverage events in real-time for processing data in motion and correlate real-time and historical data together. A native Kafka integration with SAP S4/Hana should be the next step for SAP! HERE Technologies provides a great example of how to provide a Kafka-native interface (and an alternative REST option) for their product.

Having said this, current SAP blogs (from mid-2019) still talk about replacing the 20+ years old BAPI and RFC integration style with SOAP and OData APIs (Open Data Protocol, an open protocol that allows the creation and consumption of queryable and interoperable REST APIs) in SAP S/4HANA Public Cloud.

My personal feeling and hope are that a native Kafka interface is just a matter of time as the market demand is everywhere across the globe (I talk to many customers in EMEA, US, and APAC), and even several non-S4/Hana SAP products use Kafka internally.

I have also seen a two-fold approach from some other vendors: Provide a Kafka-native interface to the outside world first (in SAP terms, you could e.g. provide a Kafka interface on top of BAPIs). At a later point, reengineer the internal architecture away from the non-scalable technology to Kafka under the hood (in SAP terms, you could replace RFC / BAPI functions with a more scalable Kafka-native version – even using the same API interface and message structure).

Native Streaming Replication between Products, Departments, and Companies

Native Kafka integration does not just happen within a product or company. A widespread trend I see on the market in different industries is to integrate with partners via Kafka-native streaming replication instead of REST APIs:

Cross Company Streaming Kafka API Integration with MirrorMaker and Cluster Linking

Think about it: If you use Kafka in different application infrastructures, but the interface between them is just a web service or database, then most of the benefits are gone because scalability and real-time data correlation capabilities are lost.

More and more vendors of standard software use Kafka as the backbone of their internal architecture. If the interface between products (say, SAP’s ERP system, SAP’s MES system, and the SCM application of an OEM customer) is just a SOAP or REST API, then this does not scale and perform well for the requirements of use cases in the digital transformation and Industry 4.0.

Hence, more and more companies leverage Kafka not just internally but also between departments or even different companies. Streaming replication between companies is possible with tools like MirrorMaker 2.0 or Confluent Replicator. Or you use the much simpler Cluster Linking from Confluent, which enables hybrid, multi-cloud, and 3rd party integration using the Kafka protocol under the hood.
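From the consuming side, a replicated topic looks like any other Kafka topic. The sketch below assumes MirrorMaker 2 with its default replication policy, which prefixes the topic name with the source cluster alias; the broker address, the "partner" alias, and the topic name are assumptions for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplicatedTopicConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical bootstrap servers of the local (target) cluster.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-local:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "partner-orders-processor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // MirrorMaker 2's default replication policy prefixes the topic
            // with the source cluster alias ("partner" is an assumption here).
            consumer.subscribe(List.of("partner.orders"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```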

SAP + Apache Kafka = The Future for ERP et al

There is huge demand across the globe to integrate SAP applications with Apache Kafka for real-time messaging, data integration, and data processing at scale. The demand is true for SAP ERP (ECC and S4/Hana) but also for most other products from the vast SAP portfolio.

Kafka is deployed in many modern and innovative use cases for supply chain management, manufacturing, customer experience, and so on. Edge, hybrid and multi-cloud Kafka deployments are the norm, not an exception.

Kafka integrates with SAP systems well. Different integration options are available via SAP SDKs and 3rd party products for proprietary interfaces, open standards, and modern messaging and event streaming concepts. Choose the right option for your needs and get started with Kafka SAP integration…

If you want to modernize your existing ERP infrastructure (no matter if SAP or any other vendor), also check out the article “Building a Postmodern ERP with Apache Kafka”.

What are your experiences with SAP Kafka integration? How did it work? What challenges did you face and how did you or do you plan to solve this? What is your strategy? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka SAP Integration – APIs, Tools, Connector, ERP et al appeared first on Kai Waehner.

Apache Kafka is the New Black at the Edge in Industrial IoT, Logistics and Retailing https://www.kai-waehner.de/blog/2020/01/01/apache-kafka-edge-computing-industrial-iot-retailing-logistics/ Wed, 01 Jan 2020 10:45:28 +0000 https://www.kai-waehner.de/?p=1969 The following question comes up almost every week in conversations with customers: Can and should I deploy Apache…

The following question comes up almost every week in conversations with customers: Can and should I deploy Apache Kafka at the edge? Or should I just deploy Kafka in a “real” data center or public cloud infrastructure? I am glad that people ask because it is a valid question in various industries, including manufacturing, the automation industry, aviation, logistics, and retailing. This blog post explains why Apache Kafka is the New Black at the Edge in Internet of Things (IoT) projects. I cover use cases and different architectures. The last section discusses how Kafka as an event streaming platform complements other IoT frameworks and products at the edge for data integration and edge processing in real time at scale.

Multiple Kafka Clusters Became the Norm, not an Exception!

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. A Kafka deployment at the edge can be an independent project. However, in most cases, Kafka at the edge is part of an overall Kafka architecture.

Many reasons exist to create more than just one Kafka cluster in your organization:

  • Independent Projects
  • Hybrid integration
  • Edge computing
  • Aggregation
  • Migration
  • Disaster recovery
  • Global infrastructure (regional or even cross continent communication)
  • Cross-company communication

This blog post focuses on the deployment of Apache Kafka at the edge. The relation to all the other kinds of Kafka architectures will be discussed in February 2020 at DevNexus in Atlanta. There, I will explain in detail the “architecture patterns for distributed, hybrid and global Apache Kafka deployments“. I will share the slides and a video recording in another blog post after the conference.

Before we think about running Kafka at the edge, we need to define the term “edge”.

What is “The Edge” or “Edge Computing”?

Wikipedia says “edge computing is a distributed computing paradigm which brings computation and data storage closer to the location where it is needed, to improve response times and save bandwidth”. Other benefits include cost reduction, a flexible architecture and separation of concerns.

Edge computing infrastructure

Apache Kafka at the Edge

Different options exist to discuss “Kafka for edge computing”:

  1. Only Clients at the Edge: Kafka Clients running at the edge. Kafka Cluster deployed in a Data Center or Public Cloud environment.
  2. Everything at the Edge: Kafka Cluster and Kafka Clients (e.g. sensors in the factory) deployed at the edge.
  3. Edge and Beyond: Kafka Cluster deployed at the edge. Kafka Clients (e.g. smartphones in the region) running close to the edge.

Therefore, the range of “Kafka at the Edge” is gigantic:

  • Edge in an IIoT shop floor could be a Kafka client written in C and deployed to a microcontroller in a sensor. Such a sensor typically has just a few kilobytes of memory. It lives for a defined time span before it dies and gets replaced.
  • Edge in telco business could be a full distributed Kafka cluster running on StarlingX. This is an open source private cloud infrastructure software stack based on Kubernetes for the edge used by the most demanding applications in industrial IoT, telecom, video delivery and other ultra-low latency use cases. The hardware requirements for such an edge deployment are higher than what some teams in a traditional bank or insurance company get after waiting six months for management approvals.
  • In most cases, edge is somewhere in the middle of the two extreme scenarios described above.

In most scenarios, “Kafka at the edge” means the deployment of a Kafka cluster on site at the edge. The Kafka Clients are either running on site or close to the site (“close” could be several miles away in some cases). Sometimes, the edge site is offline and disconnected from the cloud regularly. This is my definition for this blog post, too. Just make sure to define what you mean with the term “edge” in your conversations with colleagues and customers.

Now let’s think about use cases for running Kafka at the edge.

Use Cases for Kafka at the Edge

Use cases for Kafka deployments at the edge exist in various industries. No matter if “your things” are smartphones, machines on the shop floor, sensors, cars, robots, cash point machines, or anything else.

Let’s take a look at a few examples of what I have seen in 2019 in many different enterprises:

  • Industrial IoT (IIoT): Edge integration and processing in real time are key for success in modern IoT architectures. Various use cases exist for Industry 4.0, including predictive maintenance, quality assurance, process optimization and cyber security. Building a Digital Twin with Kafka is one of the most frequent scenarios and a perfect fit for many use cases.
  • Retailing: The digital transformation enables many new, innovative use cases, including customer 360 experiences, cross-selling and collaboration with partner vendors. This is relevant for any kind of retailer. No matter if you think about retail stores like Walmart, coffee shops like Starbucks, or cutting-edge shops like the Amazon Go store.
  • Logistics: Correlation of data in real time at scale is a game changer for any logistics scenario: Track and trace end-to-end for package delivery, communication between delivery drones (or humans using cars) and the local self-service collection booths, accelerated processing in the logistics center, coordination and planning of car sharing / ride sharing, traffic light management in a smart city, and so on.

Kafka at the Edge – High Level Architecture

No matter which use case you want to implement: The architecture for Apache Kafka at the edge typically looks more or less like the following (on a very high level):

Apache Kafka at the IoT Edge

Why and How does Kafka help at the Edge?

The ideas and use cases discussed above are nothing new. But think about your favorite consumer application, retailer or coffee shop: How many enterprises have already rolled out real time applications for context-specific push messages, customer experience or other services at scale in a reliable way?

Honestly, not many. Only some tech companies like Uber and Lyft did a great job. However, these companies could start on a green field a few years ago. Not surprisingly, all these tech companies use the event streaming platform Kafka as the heart of their real time infrastructure.

But the situation gets better and better these days. More and more traditional enterprises have already rolled out an event streaming platform in their data centers or in the cloud to process data in real time at scale and build innovative applications.

Challenges at the Edge

However, enterprises often struggle to bring these innovative real time applications into scenarios where data is distributed across local sites, like factories, retail stores, coffee shops, etc.

Challenges include:

  • Integration with all the hardware, machines, devices at the edge is hard due to bad network and many other limitations
  • Processing at scale and in real time is mandatory for many use cases. Thus, processing should happen on site, not in a remote data center or cloud (often hundreds of miles away).
  • Various technologies and protocols have to be integrated at the edge. Often, legacy and proprietary protocols need to communicate with modern big data tools on the other side of the tunnel.
  • Limited hardware resources and human expertise at the edge. IT experts cannot go on site to every location. Teams cannot spend a lot of money on hardware and operations continuously in every site.

Data has to be stored and processed locally on site in real time at scale. Plus, data needs to be replicated to the data center or cloud for further processing and analytics with the aggregated data from different sites. In the best case, communication is bidirectional so that the smaller local sites can be controlled from one single point by sending commands and control events back to them.

The Same Infrastructure and Technology at the Edge and in the Data Center  / Cloud

Implementing edge use cases takes a lot of effort. Rolling out the solution to all the sites is another big challenge. Ideally, enterprises can leverage the same infrastructure, technologies and applications BOTH on site at the edge in factories or stores AND in the big data center or cloud.

This is why Apache Kafka comes into play in more and more edge scenarios: Leverage the same infrastructure, technologies and applications everywhere. Real time stream processing, reliability and flexible scalability are core features of Kafka. This allows large scale scenarios in the cloud and small or medium scale scenarios at the edge.

Let’s take a look at different options for deploying Kafka at the edge.

Kafka Architectures for Edge Computing

The key question you have to ask yourself: Do I need high availability at the edge?

Edge Computing does not always require high availability. If it does, then you deploy a traditional Kafka cluster. If it does not, then you choose the simple and less costly option with just one single Kafka broker at the edge. A hardware appliance could make this even easier to roll out in tens or hundreds of sites.

The following example shows three edge sites. One Kafka cluster is deployed at each site including additional Kafka components:

Kafka and Confluent Platform Deployment at the Edge

Resilient Deployment at the Edge with 3+ Kafka Brokers

Kafka and its ecosystem are built for high availability and zero downtime (even if individual nodes fail). Usually, you deploy a distributed system. Kafka and ZooKeeper require at least three nodes. Other components require at least two nodes to guarantee robust operations without data loss:

Resilient Kafka Configuration at the Edge

Check out the Apache Kafka and Confluent Platform Reference Architecture for more details about deployment best practices. These best practices do not change at the edge. Having said this, the load and throughput are often lower at the edge. Therefore, less memory and smaller disks are often sufficient. It all depends on your SLAs, requirements and use cases.
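As a small illustration of what "resilient at the edge" means in practice, the following sketch creates a topic with replication factor 3 and min.insync.replicas=2 via the AdminClient. The broker addresses, topic name, partition count, and retention are assumptions you would adapt to your own SLAs.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateResilientEdgeTopic {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical addresses of the three edge brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "edge-broker-1:9092,edge-broker-2:9092,edge-broker-3:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3 -> survives the loss of one broker.
            NewTopic sensorTopic = new NewTopic("machine-sensors", 6, (short) 3)
                    .configs(Map.of(
                            "min.insync.replicas", "2",     // a producer with acks=all needs 2 in-sync replicas
                            "retention.ms", "604800000"));  // keep 7 days of data locally

            admin.createTopics(List.of(sensorTopic)).all().get();
        }
    }
}
```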

Non-Resilient Deployment at the Edge with One Single Kafka Broker

The demand to deploy a “lightweight Kafka cluster” at the edge and synchronize / replicate data with a bigger central Kafka cluster comes up more and more. Due to hardware limitations or lower SLAs regarding high availability, the deployment of just one single Kafka broker (plus one Zookeeper) at the edge is totally fine. You can even deploy the whole Kafka environment on one single server:

Non-Resilient Kafka Configuration at the Edge

This creates some obvious drawbacks: No replication, downtime in case of failure of the node or network, risk of data loss. However, a single-node Kafka deployment works and still provides many benefits of the Kafka fundamentals:

  • Decoupling between producers and consumers
  • Handling of back-pressure
  • High volume real time processing (even one broker has a lot of power)
  • Storage on disks
  • Ability to reprocess data
  • All Kafka-native components available (Kafka Connect for integration, Kafka Streams or ksqlDB for stream processing, Schema Registry for governance).

TL;DR: A single-broker Kafka deployment works well. Just be aware of the drawbacks and double-check if this is okay for your SLAs and requirements.
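For a single-broker edge deployment, producer settings matter more than broker-side replication. The following sketch shows one possible configuration; the broker address, topic name, and tuning values are assumptions for illustration, not recommendations for every workload.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EdgeSensorProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical address of the single edge broker.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "edge-broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // With replication factor 1 there is only one replica anyway,
        // so focus on batching and surviving short broker outages.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");    // no duplicates on retry
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");       // save bandwidth and disk
        props.put(ProducerConfig.LINGER_MS_CONFIG, "50");               // batch small sensor events
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "300000"); // keep retrying for 5 minutes

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("machine-sensors",
                    "machine-42", "{\"temperature\":87.5,\"rpm\":1210}"));
        }
    }
}
```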

ZooKeeper Removal – A Great Help for Kafka at the Edge

Kafka is a powerful distributed infrastructure. Hence, Kafka is not the easiest piece of infrastructure to operate. A key reason is the dependency on ZooKeeper (the same is true for many other distributed systems like Hadoop or Spark). I won’t go into details about the challenges and problems with ZooKeeper here. There is enough information on the web. TL;DR: ZooKeeper makes Kafka harder to operate and less scalable. Many P1 and P2 support tickets are not about Kafka but about ZooKeeper (not because ZooKeeper is unstable, but because it is hard to operate).

The good news: Kafka will get rid of ZooKeeper in the next few releases. Check out “KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum” for more details. KIP-500 will make Kafka much more lightweight, scalable, elastic and simple to operate. This is relevant BOTH at extreme scale in the cloud AND in small single-broker  deployments at the edge.

As most IoT projects do not plan just one year ahead but are rolled out to different sites over five or more years, remember that Kafka will be much easier to operate and more lightweight without ZooKeeper in roughly a year from now.

Kafka as Gateway between Edge Devices and Cloud

A Kafka gateway is an additional, optional component in the architecture. In some configurations, you might want the edge devices to communicate with a local gateway on premise.

As an example, in an industrial plant, you might see multiple machines or production lines as edge devices. They integrate with their own Kafka cluster and send this data to a gateway Kafka cluster. At the gateway Kafka cluster, you can do some analytics directly on premise, and maybe filter or transform the data, before sending it to a large remote aggregation Kafka cluster:

Hybrid Kafka Architecture with Single Node to Factory to Cloud

In the above example, we have two independent factories. Both use non-resilient single-broker Kafka deployments for local processing. A resilient gateway Kafka cluster with three Kafka brokers aggregates and processes the data locally in the factory. Only the important and pre-processed data is forwarded to a remote Kafka cluster; in this case Confluent Cloud. The Kafka cluster in the cloud aggregates the data from different factories to integrate with other business applications or analytics tools.

Kafka at the Edge as OEM or in Hardware Appliances

At the edge, enterprises do not have the same capabilities and possibilities as in their data centers or in the public cloud. Installing hardware is a huge challenge. Operations are even harder. A standardized way of installing Kafka components at the edge reduces the efforts and risks.

Tens of hardware vendors are available to build your own OEM or hardware appliance. Or you just pick a ready-to-ship box and install all required software components with some DevOps tools via remote management.

Plenty of options exist to ease installation and operations of a Kafka cluster at the edge. Hivecell is one interesting example which you could use. “Hivecell enables companies to deploy and maintain software at the edge without an army of technicians” is the slogan of this startup. You just ship one or more boxes to the site, connect them to the local WiFi, and everything else can be done remotely. The Hivecell boxes can be shipped in a pre-configured way. For instance, the hardware could already have Kubernetes, the Kafka ecosystem and other business applications installed. Tooling like Confluent Operator can run on the boxes to ease and automate operations of the Kafka environment at the edge. This way, remote management is typically sufficient.

Communication, Connectivity, Integration, Data Processing

As already shown in the diagrams above, a Kafka environment includes more than just the Kafka broker and mandatory Zookeeper. Communication, connectivity, integration and data processing are important components in a Kafka infrastructure, no matter if in the cloud, on premise or at the edge:

Communication between Kafka Brokers and Kafka Clients:

  1. Edge-Only: Device -> Kafka at the edge -> Device
  2. Edge-to-Remote: Device -> Kafka at the edge -> Replication -> Kafka (Data Center / Cloud) -> Analytics / Real Time Processing
  3. Bidirectional: A combination of 1) and 2) for bidirectional communication between the edge and remote Kafka cluster

Kafka-native Connectivity, Integration, Data Processing:

Kafka-native components leverage Kafka under the hood. This way, you have to manage just one platform for communication, integration and processing of data in real time at scale:

  • Kafka Connect: MQTT, OPC-UA, FTP, CSV, PLC4X (Legacy and proprietary IIoT protocols and PLCs like Modbus, Siemens S7, Beckhoff, Allen Bradley), etc.
  • Mirrormaker 2 / Confluent Replicator: Uni- or Bidirectional replication between two Kafka Clusters
  • Kafka Clients (Producers / Consumers): Java, Python, C++, C, Go, Javascript, …
  • Data Processing: Stream Processing (stateless streaming ETL or stateful applications) with Kafka Streams or ksqlDB
  • Proxies: REST Proxy for HTTP(S) communication, MQTT Proxy for MQTT integration
  • Schema Registry: Governance and schema enforcement.

Make sure to plan the whole infrastructure from the beginning, especially at the edge where you typically have limited hardware resources. Double-check if you really need another database or external processing framework. Maybe the Kafka stack is good enough for your needs?
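As an example of keeping the footprint small, a stateless Kafka Streams application can do edge filtering without any extra processing framework. This is a minimal sketch; the topic names, broker address, naive JSON handling, and the 85-degree threshold are assumptions for illustration only.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EdgeFilterApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "edge-sensor-filter");
        // Hypothetical address of the local edge broker.
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "edge-broker:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> sensors = builder.stream("machine-sensors");

        // Stateless ETL: keep only readings above a threshold and forward them
        // to a topic that gets replicated to the central cluster.
        sensors
                .filter((machineId, json) -> extractTemperature(json) > 85.0)
                .to("machine-sensors-critical");

        new KafkaStreams(builder.build(), props).start();
    }

    // Naive parsing for illustration only - use a real JSON serde in practice.
    private static double extractTemperature(String json) {
        int idx = json.indexOf("\"temperature\":");
        if (idx < 0) return 0.0;
        String rest = json.substring(idx + 14);
        return Double.parseDouble(rest.split("[,}]")[0]);
    }
}
```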

Kafka is NOT an IoT framework

Kafka can be used in very different scenarios. It is best for event streaming at scale and building a reliable and open infrastructure to integrate IoT at the edge and the rest of the enterprise.

However, Kafka is not the silver bullet for every problem. A specific IoT product might be the better choice for integration and processing of IoT interfaces and shop floors. This depends on the specific requirements, the existing ecosystem and the software already in use. Complexity and cost of solutions need to be evaluated. “Build vs. buy” is always a valid question. Often, the best solution is a mix: build an open, flexible central streaming infrastructure yourself and buy COTS products for specific edge integration and processing scenarios.

Hybrid Architecture – When and Why to use Additional IoT  Frameworks and Products?

You should always ask yourself: Is Kafka alone sufficient? If yes, why use an additional framework or product? End-to-End integration gets harder with each additional technology. 24/7 deployments, zero data loss, and real time processing without latency spikes are requirements Kafka works best for.

If Kafka is not sufficient alone, combine it with another IoT framework or solution:

Apache Kafka and IIoT CTOS Solution Mindsphere Kinetic Azure IoT

Sometimes, the shop floor connects to an IoT solution. The IoT solution is used as a gateway or proxy. This can be a broad, powerful (but also more complex and expensive) solution like Siemens MindSphere. Or the choice is to deploy “just” a specific, more lightweight solution to solve one problem. For instance, HiveMQ could be deployed as a scalable MQTT cluster to connect to machines and devices. This IoT gateway or proxy connects to Kafka. Kafka is either deployed in the same infrastructure or in another data center or cloud. The Kafka cluster connects to the rest of the enterprise.

In other scenarios, Kafka is used as IoT gateway or proxy to connect to the PLCs or Distributed Control System (DCS) directly. Kafka then connects to an IoT Solution like AWS IoT or Google Cloud’s MQTT Bridge where further processing and analytics happen.

Communication is often bidirectional. No matter what architecture you choose: The data is ingested from the shop floor or other IoT devices, processed and correlated in real time, and finally control events are sent back to the machines. For instance, in predictive analytics, you first train analytic models with tools like TensorFlow in the cloud. Then you deploy the analytic model at the edge for real time predictions.
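A hedged sketch of this bidirectional pattern: a consumer scores incoming sensor events with a locally deployed model and sends control commands back via a command topic. The `AnomalyModel` interface is a hypothetical stand-in for whatever model format your training pipeline (e.g. TensorFlow) exports; topic names and the broker address are assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EdgeScoringLoop {

    // Hypothetical stand-in for a model trained in the cloud and exported to the edge.
    // The real scoring call depends on your ML tooling.
    interface AnomalyModel {
        double score(String sensorJson);
    }

    public static void main(String[] args) {
        AnomalyModel model = json -> 0.0; // placeholder - load your exported model here

        Properties common = new Properties();
        common.put("bootstrap.servers", "edge-broker:9092"); // hypothetical edge broker
        common.put("group.id", "edge-anomaly-scoring");
        common.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        common.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        common.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        common.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(common);
             KafkaProducer<String, String> producer = new KafkaProducer<>(common)) {

            consumer.subscribe(List.of("machine-sensors"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    if (model.score(record.value()) > 0.9) {
                        // Control event back to the machine via a command topic.
                        producer.send(new ProducerRecord<>("machine-commands",
                                record.key(), "{\"action\":\"reduce-load\"}"));
                    }
                }
            }
        }
    }
}
```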

Do you really need NiFi / MiNiFi or an ESB / ETL Tool?

I just explained why you might combine Kafka with other IoT frameworks or solutions. These are very complementary to the Kafka ecosystem and focus on different use cases like device management, training an analytic model, reporting or building a digital twin.

For example, the major cloud providers provide IoT services for device management, proxies to their own cloud services, and analytics tools. At the edge, Eclipse IoT alone provides various different IoT frameworks. For instance, Eclipse ditto is a great open source framework for building a digital twin.

For integration problems, I think differently!

24/7 Uptime and Zero Data Loss Required?

This is a long discussion, but TL;DR: Most integration projects have critical SLAs. The infrastructure has to run 24/7 without downtime and without data loss. The more middleware components you combine, the harder it gets to ensure your SLAs and requirements. If you can run your Kafka infrastructure with 99.95% availability, the end-to-end availability goes down with each additional middleware component you combine with it (for example, two components that each reach 99.95% availability yield only roughly 99.9% end to end). Additionally, you have to develop, test, operate and pay for two or more middleware components instead of just focusing on one single infrastructure.

SLAs Important? Kafka Eats its Own Dog Food

This is one of the key benefits of Kafka: Kafka eats its own dog food. All components leverage the Kafka protocol and its features like offsets, replication, consumer groups, etc. under the hood:

Kafka vs. ESB ETL Nifi - Eat your own Dog Food

I see the added value of tools like Apache NiFi or Node-RED: You have a drag & drop UI to build pipelines. This is really nice! If you just have to build a pipeline to send data from the edge to the data lake for reporting and analytics, NiFi et al. are great tools – if you can live with the risk of downtime and data loss!

Kafka + NiFi + XYZ = Too Many Distributed Systems to Operate!

If you have to build a scalable, reliable streaming infrastructure for edge computing and hybrid architectures without downtime and without data loss: Trust me, you will have a lot of pain. I have seen many deployments where the end-to-end integration combining different middleware tools did not make it through the integration tests. The idea looks nice in the beginning but is not robust enough for mission-critical scenarios.

Think about it again: The more tools you combine with each other, the higher the risk of an outage or data loss! NiFi, as one example, runs its own distributed infrastructure. This means you have to guarantee 24/7 end-to-end uptime from the producers via NiFi and Kafka to the final consumer. Kafka-native tools like Kafka Connect or Kafka Streams use Kafka topics (including all the high availability features of Kafka) under the hood. This means you have to operate just one single infrastructure in 24/7 mode to guarantee end-to-end integration without downtime and without data loss.

My heart aches when I see architecture recommendations where you have pipelines like “Sensor ABC -> NiFi (Ingestion) -> Kafka Topic A -> NiFi (Transformation) -> Kafka Topic B -> NiFi (Load) -> Application XYZ”. Again, this is fine for batch ETL pipelines and I see the added value of a nice UI tool like NiFi. But this is NOT the right way to build a 24/7 infrastructure for real time processing at scale with zero data loss.

I have published a lot of material on the differences between an event-driven streaming platform like Apache Kafka and middleware like NiFi, Node-RED, Message Queues (MQ), Extract-Transform-Load (ETL) and Enterprise Service Bus (ESB).

Kafka at the Edge to Consolidate the Hybrid IoT Architecture

“Kafka at the edge” is the new black. Many industries deploy Kafka in hybrid architectures. Edge computing allows innovative use cases, increases processing speed, reduces network cost and makes the overall infrastructure more scalable, reliable and robust.

Start small, roll out Kafka at the edge in one site, connect it to the remote Kafka cluster. Then, connect more and more sites step-by-step or build bidirectional use cases.

Edge computing is just a part of the overall architecture. It is okay to deploy just one single Kafka broker in small sites like a coffee shop, retail store or small plant. For mission-critical use cases, a “real” Kafka cluster with three or more brokers has to be deployed on site. Local processing allows mission-critical processing in real time at scale without the need for remote communication. This reduces costs and increases security.

However, added value often comes from combining the data from different sites to use it for real time decisions. IoT Kafka infrastructures often combine small edge deployments with bigger Kafka deployments in the data center or public cloud. In the meantime, you can even run a single Apache Kafka cluster across multiple datacenters to build regional and global Kafka infrastructures – and connect these to the local edge Kafka clusters.

What are your thoughts about Kafka at the edge and in hybrid architectures? Please let me know. Also, check out the “Infrastructure Checklist for Apache Kafka at the Edge” if you plan to go that direction!

Let’s connect on LinkedIn and discuss your use cases, architectures, and requirements. Also, stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka is the New Black at the Edge in Industrial IoT, Logistics and Retailing appeared first on Kai Waehner.

IoT Live Demo – 100.000 Connected Cars with Kubernetes, Kafka, MQTT, TensorFlow https://www.kai-waehner.de/blog/2019/11/08/live-demo-iot-100-000-connected-cars-kubernetes-kafka-mqtt-tensorflow/ Fri, 08 Nov 2019 13:38:34 +0000 https://www.kai-waehner.de/?p=1890 Live Demo - 100.000 Connected Cars - Real Time Processing and Analytics with Kubernetes, Kafka, MQTT and TensorFlow leveraging Confluent and HiveMQ.

You want to see an Internet of Things (IoT) example at huge scale? Not just 100 or 1000 devices producing data, but a really scalable demo with millions of messages from tens of thousands of devices? This is the right demo for you! We leverage Kubernetes, Apache Kafka, MQTT and TensorFlow.

The demo shows how you can integrate with tens or hundreds of thousands of IoT devices and process the data in real time. The demo use case is predictive maintenance (i.e. anomaly detection) in a connected car infrastructure to predict motor engine failures:

IoT Use Case - Kafka MQTT TensorFlow and Kubernetes

IoT Infrastructure – MQTT and Kafka on Kubernetes

We deploy Kubernetes, Kafka, MQTT and TensorFlow in a scalable, cloud-native infrastructure to integrate and analyze sensor data from 100,000 cars in real time. The infrastructure is built with Terraform. We use GCP, but you could do the same on AWS, Azure, Alibaba or on premises.

Data processing and analytics is done in real time at scale with GCP GKE, HiveMQ, Confluent and TensorFlow I/O for streaming machine learning / deep learning and bi-directional communication in a scalable, elastic and reliable infrastructure:

IoT Architecture - Kafka MQTT TensorFlow and Kubernetes

Github Project – 100000 Connected Cars

The project is available on Github. You can set the demo up in ~30min by just installing a few CLI tools and executing two or three shell scripts.

Check out the Github project “Streaming Machine Learning at Scale from 100000 IoT Devices with HiveMQ, Apache Kafka and TensorFlow“.

Please try out the demo. Feedback and PRs are welcome.

20min Live Demo – IoT at Scale on GCP with GKE, Confluent, HiveMQ and TensorFlow IO

Here is the video recording of the live demo:

If your area of interest is Industrial IoT (IIoT), you might also check out the following example. It covers the integration of machines and PLCs like Siemens S7, Modbus or Beckhoff in factories and shop floors:

Apache Kafka, KSQL and Apache PLC4X for IIoT Data Integration and Processing

The post IoT Live Demo – 100.000 Connected Cars with Kubernetes, Kafka, MQTT, TensorFlow appeared first on Kai Waehner.

Apache Kafka, KSQL and Apache PLC4X for IIoT Data Integration and Processing https://www.kai-waehner.de/blog/2019/09/02/apache-kafka-ksql-and-apache-plc4x-for-iiot-data-integration-and-processing/ Mon, 02 Sep 2019 09:35:03 +0000 http://www.kai-waehner.de/blog/?p=1564 Data integration and processing in Industrial IoT (IIoT, aka Industry 4.0 or Automation Industry). Apache Kafka, its ecosystem (Kafka Connect, KSQL) and Apache PLC4X are a great open source choice to implement this integration end to end in a scalable, reliable and flexible way.

Data integration and processing is a huge challenge in Industrial IoT (IIoT, aka Industry 4.0 or Automation Industry) due to monolithic systems and proprietary protocols. Apache Kafka, its ecosystem (Kafka Connect, KSQL) and Apache PLC4X are a great open source choice to implement this IIoT integration end to end in a scalable, reliable and flexible way.

This blog post covers a high level overview about the challenges and a good, flexible architecture to solve the problems. At the end, I share a video recording and the corresponding slide deck. These provide many more details and insights.

Challenges in IIoT / Industry 4.0

Here are some of the key challenges in IIoT / Industry 4.0:

  • IoT != IIoT: The automation industry typically does not use MQTT or other open standards; its protocols are slow, insecure, not scalable and proprietary.
  • Product life cycles are very long (tens of years); no simple changes or upgrades are possible.
  • IIoT usually uses incompatible protocols, typically proprietary and just built for one specific vendor.
  • Automation industry uses proprietary and expensive monoliths which are not scalable and not extendible.
  • Machines and PLCs are insecure by nature with no authentication, no authorization, no encryption.

This is still the state of the art in the automation industry. With such long product life cycles this is no surprise, but it is still very concerning.

Evolution of Convergence between IT and Automation Industry

Today, everybody talks about cloud, big data analytics, machine learning and real time processing at scale. The convergence between IT and Automation Industry is coming, as the analyst report from IoT research company IOT Analytics shows:

Evolution of convergence between IT and Automation Industry

There is huge demand to build an open, flexible, scalable platform. It opens up many opportunities from a business and technical perspective:

  • Cost reduction
  • Flexibility
  • Standards-based
  • Scalability
  • Extendibility
  • Security
  • Infrastructure-independent

So, how to get from legacy technologies and proprietary IIoT protocols to cloud, big data, machine learning, real time processing? How to build a reliable, scalable and flexible architecture and infrastructure?

Apache Kafka and Apache PLC4X for End-to-End IIoT Integration

I assume you already know it: Apache Kafka is the De-facto Standard for Real-Time Event Streaming. It provides

  • Open Source (Apache 2.0 License)
  • Global-scale
  • Real-time
  • Persistent Storage
  • Stream Processing


If you need more details about Apache Kafka, check out the Kafka website, the extensive Confluent documentation or some free video recordings and slides from any Kafka Summit to learn about the technology and use cases.

The only very important thing I want to point out is that Apache Kafka includes Kafka Connect and Kafka Streams:

Apache Kafka includes Kafka Connect and Kafka Streams

Kafka Connect enables reliable and scalable integration of Kafka with other systems. Kafka Streams allows you to write standard Java apps and microservices to continuously process your data in real time with a lightweight stream processing API. And finally, KSQL enables stream processing using SQL-like semantics.

Apache PLC4X for PLC Integration (Siemens S7, Modbus,  Allen Bradley, Beckhoff ADS, etc.)

Apache PLC4X is less established on the market than Apache Kafka. It also “just covers a niche” (a big one, of course) compared to Kafka, which is used in any industry for many different use cases. However, PLC4X is a very interesting top level Apache project for automation industry.

The goal is to open up PLC interfaces from the IIoT world to the outside world. PLC4X allows vertical integration and writing software independent of the PLCs, using JDBC-like adapters for various protocols like Siemens S7, Modbus, Allen Bradley, Beckhoff ADS, OPC-UA, Emerson, Profinet, BACnet and Ethernet.

PLC4X provides a Kafka Connect connector. Therefore, you can leverage the benefits of Apache Kafka (high availability, high throughput, high scalability, reliability, real time processing) to deploy PLC4X integration pipelines. With this, you can build one single architecture and infrastructure for

  • legacy IIoT connectivity using PLC4X and Kafka Connect
  • data processing using Kafka Streams / KSQL
  • integration with the rest of the enterprise using Kafka Connect and any other sink (database, big data analytics, machine learning, ERP, CRM, cloud services, custom business applications, etc.)

Apache Kafka and PLC4X Architecture for IIoT Automation Industry

As Kafka decouples the producers from the consumers, you can consume the IIoT machine sensor data from any application – some might be real time, some might be batch, and some might be request-response communication for human interaction on a web or mobile app.
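To give an impression of the vertical integration, here is a small sketch that reads one field from a Siemens S7 PLC with the PLC4X Java API and forwards it to a Kafka topic. The PLC address, the S7 field syntax, the topic name, and the broker address are examples only, and the exact PLC4X API details differ between versions, so treat this as an illustrative sketch rather than reference code.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.plc4x.java.PlcDriverManager;
import org.apache.plc4x.java.api.PlcConnection;
import org.apache.plc4x.java.api.messages.PlcReadRequest;
import org.apache.plc4x.java.api.messages.PlcReadResponse;

public class S7ToKafkaBridge {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // hypothetical broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // PLC address and S7 field syntax are examples; adjust them to your PLC.
        try (PlcConnection plc = new PlcDriverManager().getConnection("s7://192.168.0.10");
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {

            PlcReadRequest readRequest = plc.readRequestBuilder()
                    .addItem("temperature", "%DB1.DBW100:INT")
                    .build();

            PlcReadResponse response = readRequest.execute().get(5, TimeUnit.SECONDS);
            int temperature = response.getInteger("temperature");

            // Forward the reading to Kafka, keyed by production line.
            producer.send(new ProducerRecord<>("plc-telemetry", "line-1",
                    "{\"temperature\":" + temperature + "}"));
        }
    }
}
```

In production you would rather run this logic inside the PLC4X Kafka Connect connector so that scaling, offsets and fault tolerance are handled by the Connect framework, but the direct API shows nicely how little code the vertical integration needs.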

Apache PLC4X vs. OPC-UA

A little bit off-topic: How do you choose between Apache PLC4X (an open source framework for IIoT) and OPC-UA (an open standard for IIoT)? In short, they are different things and can also be complementary. Here is a comparison:

OPC-UA

  • Open standard
  • All the pros and cons of an open standard (works with different vendors; slow adoption; inflexible, etc.)
  • Often poorly implemented by the vendors
  • Requires app server on top of PLC
  • Every device has to be retrofitted with the ability to speak a new protocol and use a common client to speak with these devices
  • Often over-engineering for just reading the data
  • Activating OPC-UA support on existing PLCs greatly increases the load on the PLCs
  • With licensing cost for every machine

Apache PLC4X

  • Open source framework (Apache 2.0 license)
  • Provides unified API by implementing drivers for communicating with most industrial controllers in the protocols they natively understand
  • No need to modify existing hardware
  • No increased load on the PLCs
  • No need to pay for licenses to activate OPC-UA support
  • Drivers being implemented from the specs or by reverse engineering protocols in order to be fully Apache 2.0 licensed
  • PLC4X adapter for OPC-UA available -> Both can be used together!

As you see, both have their pros and cons. To me, and this is clearly my subjective opinion, PLC4X provides a great alternative with high flexibility and a low footprint.

Confluent and IoT Platform Solutions

Many IoT Platform Solutions are available on the market. This includes products like Siemens MindSphere or Cisco Kinetic, and cloud services from the major cloud providers like AWS, GCP or Azure. And you have Kafka + PLC4X as you just learned above. Often, this is not an “either … or” decision:

Confluent and IoT Platform Solutions

You can either use

  • just Kafka and PLC4X for lightweight and flexible IIoT integration based on a scalable, reliable and open event streaming platform
  • just an IoT Platform Solution if the pros of such a specific product (dedicated support for a specific vendor protocol, a nice GUI, etc.) outweigh the cons (like high cost and a proprietary, inflexible solution)
  • both together, where you use the IoT Platform Solution to integrate with the PLCs and then send the data to Kafka to integrate with the rest of the enterprise (with all the benefits and added value Kafka brings)
  • both together, where you use Kafka and PLC4X for PLC integration and one of the consumers is the IoT Platform Solution (while other consumers can also get the data from Kafka – fully decoupled from the IoT Platform Solution)

All alternatives have their pros and cons. There is no single solution which fits every use case! Therefore, no surprise that most IoT Solution Platforms provide Kafka source and sink connectors.

Apache Kafka and Apache PLC4X – Slides / Video Recording / Github Code Example

If you got curious about more details and insights, please check out my video recording and slide deck.

Slide Deck – Apache Kafka and PLC4X:

Video Recording – Apache Kafka and PLC4X:

Github Code Example – Apache Kafka and PLC4X:

We are also building a nice and simple demo on Github these days:

Kafka-native end-to-end IIoT Data Integration and Processing with Kafka Connect, KSQL and Apache PLC4X

PLC4X gets most exciting if you try it out by yourself and connect to your machines or tools. So, check out the example and adjust it to connect to your infrastructure.

Feedback and Questions?

Please let me know your feedback and questions about Kafka, its ecosystem and PLC4X for IIoT integration. Let’s also connect on LinkedIn to discuss interesting IIoT use cases and technologies in the future.

 

The post Apache Kafka, KSQL and Apache PLC4X for IIoT Data Integration and Processing appeared first on Kai Waehner.

IoT Integration with Kafka Connect, REST / HTTP, MQTT, OPC-UA – Lightboard Video https://www.kai-waehner.de/blog/2019/07/26/iot-integration-apache-kafka-rest-http-mqtt-opcua-lightboard-video/ Fri, 26 Jul 2019 08:06:35 +0000 http://www.kai-waehner.de/blog/?p=1532 I just want to share my lightboard video recording. I talk about IoT integration and processing with Apache…

I just want to share my lightboard video recording. I talk about IoT integration and processing with Apache Kafka using Kafka Connect, Kafka Streams, KSQL, REST / HTTP, MQTT and OPC-UA. Use cases, alternative architectures and different integration options are discussed on the whiteboard.

End-to-End IoT Integration from Edge to Confluent Cloud

In this lightboard, Confluent’s Kai Waehner (Technology Evangelist) and Konstantin Karantasis (Software Engineer) discuss use cases leveraging the Apache Kafka open source ecosystem as a streaming platform to process IoT data. The session shows architectural alternatives of how devices like cars, machines or mobile devices connect to Apache Kafka via IoT standards like MQTT or OPC-UA.

Learn how to analyze the IoT data either natively on Apache Kafka with Kafka Streams / KSQL or other tools leveraging Kafka Connect. Kai Waehner will also discuss the benefits of Confluent Cloud and other tools like Confluent Replicator or MQTT Proxy to build bidirectional real time integration from the edge to the cloud.

This presentation may help you:

  • Understand end-to-end use cases from different industries where you integrate IoT devices with enterprise IT using open source technologies and standards
  • See how Apache Kafka enables bidirectional end-to-end integration processing from IoT data to various backend applications in the cloud
  • Compare different architectural alternatives and see their benefits and caveats
  • Learn about various standards, APIs and tools of integrating and processing IoT data with different open source components of the Apache Kafka ecosystem
  • Understand the benefits of Confluent Cloud, which provides a highly available and scalable Apache Kafka ecosystem as a managed service

IoT Edge to Confluent Cloud

Here is the video recording:

Please let me know if you have any comments or questions.

Integration between Kafka and Siemens S7, Modbus, Beckhoff ADS, OPC-UA using PLC4X

Stay tuned for a dedicated blog post, slide deck and video recording about a core theme in the IoT space: integration with Industrial IoT (IIoT). I will talk about the integration between Apache Kafka and technologies / proprietary standards from the automation industry. I will discuss how to integrate with proprietary protocols like Siemens S7, Modbus, Beckhoff ADS, OPC-UA, etc. leveraging the Kafka Connect connector for PLC4X. This content is in the works these days, and I will probably publish it in August or September 2019.

IoT Integration Streaming Platform using Apache Kafka, Connect, MQTT, OPC-UA, PLC4X

Please also check out the following related material about IoT Integration:

The post IoT Integration with Kafka Connect, REST / HTTP, MQTT, OPC-UA – Lightboard Video appeared first on Kai Waehner.

Apache Kafka vs. Middleware (MQ, ETL, ESB) – Slides + Video https://www.kai-waehner.de/blog/2019/03/07/apache-kafka-middleware-mq-etl-esb-comparison/ Thu, 07 Mar 2019 15:45:15 +0000 http://www.kai-waehner.de/blog/?p=1423 This post shares a slide deck and video recording of the differences between an event-driven streaming platform like Apache Kafka and middleware like Message Queues (MQ), Extract-Transform-Load (ETL) and Enterprise Service Bus (ESB).

Learn the differences between an event-driven streaming platform like Apache Kafka and middleware like Message Queues (MQ), Extract-Transform-Load (ETL) and Enterprise Service Bus (ESB). Including best practices and anti-patterns, but also how these concepts and tools complement each other in an enterprise architecture.

This blog post shares my slide deck and video recording. I discuss the differences between Apache Kafka as Event Streaming Platform and integration middleware. Learn if they are friends, enemies or frenemies.

Problems of Legacy Middleware

Extract-Transform-Load (ETL) is still a widely-used pattern to move data between different systems via batch processing. Due to its challenges in today’s world where real time is the new standard, an Enterprise Service Bus (ESB) is used in many enterprises as integration backbone between any kind of microservice, legacy application or cloud service to move data via SOAP / REST Web Services or other technologies. Stream Processing is often added as its own component in the enterprise architecture for correlation of different events to implement contextual rules and stateful analytics. Using all these components introduces challenges and complexities in development and operations.

Legacy Middleware (MQ, ESB, ETL, etc)

Apache Kafka – An Open Source Event Streaming Platform

This session discusses how teams in different industries solve these challenges by building a native event streaming platform from the ground up instead of using ETL and ESB tools in their architecture. This allows teams to build and deploy independent, mission-critical, real time streaming applications and microservices. The architecture leverages distributed processing and fault tolerance with fast failover, no-downtime rolling deployments and the ability to reprocess events, so you can recalculate output when your code changes. Integration and stream processing are still key functionality but can be realized in real time natively instead of using additional ETL, ESB or stream processing tools.

A concrete example architecture shows how to leverage the widely-adopted open source framework Apache Kafka to build a complete, mission-critical, scalable, highly performant streaming platform. Messaging, integration and stream processing are all built on top of the same strong foundation of Kafka, deployed on premise, in the cloud or in hybrid environments. In addition, the open source Confluent projects, built on top of Apache Kafka, add additional features like a Schema Registry, additional clients for programming languages like Go or C, and many pre-built connectors for various technologies.

Apache Kafka Middleware

Slides: Apache Kafka vs. Integration Middleware

Here is the slide deck:

Video Recording: Kafka vs. MQ / ETL / ESB – Friends, Enemies or Frenemies?

Here is the video recording where I walk you through the above slide deck:

Article: Apache Kafka vs. Enterprise Service Bus (ESB)

I also published a detailed blog post on Confluent blog about this topic in 2018:

Apache Kafka vs. Enterprise Service Bus (ESB)

Talk and Slides from Kafka Summit London 2019

The slides and video recording from Kafka Summit London 2019 (which are similar to above) are also available for free.

Why Apache Kafka instead of Traditional Middleware?

If you don’t want to spend a lot of time on the slides and recording, here is a short summary of the differences of Apache Kafka compared to traditional middleware:

Why Apache Kafka instead of Integration Middleware

Questions and Discussion…

Please share your thoughts, too!

Does your infrastructure see similar architectures? Do you face similar challenges? Do you like the concepts behind an Event Streaming Platform (aka Apache Kafka)? How do you combine legacy middleware with Kafka? What’s your strategy to integrate the modern and the old (technology) world? Is Kafka part of that architecture?

Please let me know either via a comment or via LinkedIn, Twitter, email, etc. I am curious about other opinions and experiences (and people who disagree with my presentation).

The post Apache Kafka vs. Middleware (MQ, ETL, ESB) – Slides + Video appeared first on Kai Waehner.

Apache Kafka vs. ESB / ETL / MQ https://www.kai-waehner.de/blog/2018/07/18/apache-kafka-vs-esb-etl-mq/ Wed, 18 Jul 2018 18:33:56 +0000 http://www.kai-waehner.de/blog/?p=1320 Read why enterprises leverage the open source ecosystem of Apache Kafka for successful integration of different legacy and modern applications instead of ESB, ETL or MQ.

Apache Kafka and Enterprise Service Bus (ESB) are complementary, not competitive!

In the meantime, Apache Kafka is much more than messaging. It evolved into a streaming platform including Kafka Connect, Kafka Streams, KSQL and many other open source components. Kafka leverages events as a core principle. You think in data flows of events and process the data while it is in motion. Many concepts, such as event sourcing, and design patterns, such as Enterprise Integration Patterns (EIPs), are based on event-driven architecture.

Kafka is unique because it combines messaging, storage, and processing of events all in one platform. It does this in a distributed architecture using a distributed commit log and topics divided into multiple partitions.
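A small sketch illustrating the partitioned log: records with the same key always land in the same partition, so ordering is preserved per key (here per order id). The broker address and topic name are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedOrderEvents {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition -> ordering is guaranteed per order id.
            for (String event : new String[]{"CREATED", "PAID", "SHIPPED"}) {
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("orders", "order-4711", event))
                        .get();
                System.out.printf("%s -> partition %d, offset %d%n",
                        event, meta.partition(), meta.offset());
            }
        }
    }
}
```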

ETL and ESB have excellent tooling, including graphical mappings for doing complex integration with legacy systems and technologies such as SOAP, EDIFACT, SAP BAPI, COBOL, etc. (Trust me, you don’t want to write the code for this.)

Therefore, existing MQ and ESB solutions, which already integrate with your legacy world, are not competitive to Apache Kafka. Rather, they are complementary!

Read more details about this question in my Confluent blog post:

Apache Kafka® vs. Enterprise Service Bus (ESB) | Confluent

As always, I appreciate any feedback, comments or criticism.

The post Apache Kafka vs. ESB / ETL / MQ appeared first on Kai Waehner.
