Kafka Connect Archives - Kai Waehner

Snowflake Data Integration Options for Apache Kafka (including Iceberg)

Kai Waehner — Mon, 22 Apr 2024 05:40:32 +0000

The integration between Apache Kafka and Snowflake is often cumbersome. Options include near real-time ingestion with a Kafka Connect connector, batch ingestion from large files, or leveraging a standard table format like Apache Iceberg. This blog post explores the alternatives and discusses its trade-offs. The end shows how data streaming helps with hybrid architectures where data needs to be ingested from the private data center into Snowflake in the public cloud.

Blog Series: Snowflake and Apache Kafka

Snowflake is a leading cloud-native data warehouse. Its usability and scalability made it a prevalent data platform in thousands of companies. This blog series explores different data integration and ingestion options, including traditional ETL / iPaaS and data streaming with Apache Kafka. The discussion covers why point-to-point Zero-ETL is only a short term win, why Reverse ETL is an anti-pattern for real-time use cases and when a Kappa Architecture and shifting data processing “to the left” into the streaming layer helps to build transactional and analytical real-time and batch use cases in a reliable and cost-efficient way.

Blog series:

Snowflake Integration Patterns: Zero ETL and Reverse ETL vs. Apache Kafka
THIS POST: Snowflake Data Integration Options for Apache Kafka (including Iceberg)
Apache Kafka + Flink + Snowflake: Cost Efficient Analytics and Data Governance

Subscribe to my newsletter to get an email about the next publications.

Data Ingestion from Apache Kafka into Snowflake (Batch vs. Streaming)

Several options exist to ingest data into Snowflake. Criteria to evaluate the options include complexity, latency, throughout and cost.

The article “Streaming on Snowflake” by Paul Needleman explored the three common architecture patterns for data ingestion from any data source into Snowflake:

Source: Paul Needleman (Snowflake)

Paul’s article described the architecture options without and with Kafka. The numbered list below follows the numbers in the upper diagram:

Snowpipe — This solution provides Cloud storage (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage) the ability for serverless alerting Snowflake to auto-ingest data upon arrival. Once a file lands, Snowflake is alerted to pick up and process the file. Snowpipe is used for micro-batch file transfer, not real-time message ingestion.
Kafka Connector — This connector provides a simple yet elegant solution to connect Kafka Topics with Snowflake, abstracting the complexity through Snowpipe. The Kafka Topics write the data to a Snowflake-managed internal stage, which is auto-ingested to the table using Snowpipe. The internal stage and pipe objects are created automatically as part of the process.
Kafka with Snowpipe Streaming — This builds upon the first two approaches and allows for a more native connection between Snowflake and Kafka through a new channel object. This object seamlessly streams message data into a Snowflake table without needing first to store the data. The data is also stored in an optimized format to support the low-latency data interval.

Read the article “Streaming on Snowflake” for more details about these options.

Snowflake = SaaS => Integration Layer Should Be SaaS!

Snowflake is one of the first most successful true cloud data warehouses, i.e. fully managed with no need to operate and worry about the infrastructure. SaaS, Snowflake offers benefits such as scalability, ease of use, vendor-managed updates and maintenance, multi-cloud support, enhanced security, cost-effectiveness with consumption-based pricing, and global accessibility. These advantages make it an attractive choice for organizations looking to leverage a modern and efficient data warehousing solution.

The same benefits exist for fully managed data integration solutions. It does not matter if you use open source-based technologies (e.g., Apache Camel), a traditional iPaaS middleware, or a data streaming solution like Kafka.

I wrote a detailed article comparing iPaaS offerings like Dell Boomi, SnapLogic, Informatica, and fully managed data streaming cloud platforms like Confluent Cloud or Amazon MSK. Read this article to understand why your next integration platform should be fully managed the same way Snowflake is.

Example: Data Ingestion with Confluent Cloud and Snowpipe Streaming

Confluent Cloud and Snowflake are a perfect combination for fully managed end-to-end data pipelines. For instance, connecting to a data source like Salesforce CRM via CDC, streaming data through Kafka, and ingesting the events into Snowflake is entirely fully managed.

Source: Confluent

Using Kafka Connect with Snowpipe Streaming has several advantages:

Faster, more efficient data pipelines
Reduced architectural complexity
Support for exactly-once delivery
Ordered ingestion
Error handling with dead-letter queue (DLQ) support

Streaming ingestion is not meant to replace file-based ingestion. Rather, it augments the existing integration architecture for data-loading scenarios where it makes sense, such as

Low-latency telemetry analytics of user-application interactions for clickstream recommendations
Identification of security issues in real-time streaming log analytics to isolate threats
Stream processing of information from IoT devices to monitor critical assets

Why should you NOT only use Snowpipe Streaming mode? Cost. Snowflake has different pricing models for the ingestion modes.

Processing Large Files in Kafka before Snowflake Ingestion?

A last aspect of data ingestion options via Kafka into Snowflake: What to do with large files?

One of the most common use cases for data ingestion into Snowflake is large CSV, XML or JSON files generated from batch legacy analytics systems.

Option 1: Send the large files via Kafka into Snowflake and process it in the data warehouse. Apache Kafka was never built for large messages. Nevertheless, more and more projects send and process 1Mb, 10Mb, and even bigger files and other large payloads via Kafka into Snowflake. Why? Because it just works.

Option 2: Apache Kafka splits up and chunks large messages into small pieces.

For the latter approach, ideally, events are processed line by line, if possible. The enormous benefit of this approach is bringing even batch-based monolithic systems into an event-driven architecture. Snowflake and other downstream applications consume the events in near real-time. This architecture leverages the Composed Message Processor Enterprise Integration Pattern (EIP):

For a deep dive including various use cases and customer stories, check out the article “Handling Large Messages With Apache Kafka“.

Bi-Directional Integration between Apache Kafka and Snowflake with Apache Iceberg

After covering batch, file, and streaming integration from Kafka to Snowflake, let’s move to the latest innovation that is more compelling than old the “legacy approaches”: Native integration between Apache Kafka and Snowflake using Apache Iceberg.

Apache Iceberg is the leading open-source table format for storing large-scale structured data in cloud object stores or distributed file systems, designed for high-performance querying and analytics. It provides features such as schema evolution, time travel, and data versioning, making it well-suited for data lakes and modern data architectures.

Snowflake Support for Apache Iceberg

Snowflake already supports Apache Iceberg.

Source: Snowflake

Augusto Kiniama Rosa points out in his Overview of Snowflake Apache Iceberg Tables:

“Iceberg will always use customer-controlled external storage, like an AWS S3 or Azure Blog Storage. Snowflake Iceberg Tables support Iceberg in two ways: an Internal Catalog (Snowflake-managed catalog) or an externally managed catalog (AWS Glue or Objectstore).”

I won’t start a flame war of Apache Iceberg vs. Apache Hudi and Databricks’ Delta Lake here. It reminds me about the containers wars with Kubernetes, Cloud Foundry and Apache Mesos. In the end, Kubernetes won. The competitors adopted it. The same seems to be happening with Iceberg. If not, the as principles and benefits will be the same, no matter if the future is Iceberg or a competing technology. As it seems today like Iceberg will win this war, I focus on this technology in the following sections.

Kafka and Iceberg to Unify Transactional and Analytical Workloads

Any data source can feed data via Apache Kafka directly into Snowflake (or any other analytics engine) as Apache Iceberg table. This solves the challenges of the above described integration options between Kafka and Snowflake. Operational data is accessible to the analytical world without a complex, expensive, and fragile process.

Source: Confluent

Confluent Tableflow: Fully Managed Kafka-Iceberg Integration

Confluent announced Tableflow at Kafka Summit 2024 in London, UK, to demonstrate its fully managed out-of-the-box integration between a Kafka Topic and Schema and an Iceberg Tables in Confluent Cloud. Confluent’s Marc Selwan writes:

“In the past, there has been a tight coupling of tables (storage) and query engines. In recent years, we’ve witnessed the rise of ‘headless’ data infrastructure where companies are building a more open lakehouse in cloud object storage that is accessible by many tools.

Just like the Apache Kafka API has evolved to be the de facto open standard for data streaming, we’re seeing Apache Iceberg grow into the de facto open-table standard for large-scale datasets stored in lakehouses. We’ve seen its ecosystem grow with robust tooling and support from compute engines such as Apache Spark, Snowflake, Amazon Athena, Dremio, Trino, Apache Druid, and many others.

Source: Confluent

We believe the rise of open-table formats and the ‘headless’ data infrastructure is being driven by the needs of data engineers evolving beyond the tight coupling of table to computing platform. These factors made Apache Iceberg support a natural first choice for us.”

Check out Confluent’s blog post Announcing Tableflow. Other Kafka vendors will likely provide Apache Iceberg support in the near future, too. I am really excited about this development of unifying operational and analytics with a standardized interface across open source frameworks and cloud solutions.

Hybrid Architectures with Kafka On-Premise and in the Public Cloud for Snowflake Integration

Snowflake is only available in the public cloud on AWS, GCP or Azure. Most companies across industries follow a cloud-first strategy for new applications. However, as existing companies exist for years or decades, they are typically not born in the cloud. Therefore, hybrid cloud architectures are the new black for most companies. Apache Kafka is the best approach to synchronize and replicate data in a single pipeline with low latency, reliability and guaranteed ordering from the data center to the public cloud.

Legacy infrastructure has to be maintained, integrated, and (maybe) replaced over time. Data Streaming with the Apache Kafka ecosystem is a perfect technology for building hybrid synchronization in real-time at scale. This enables bidirectional integration for transactional and analytical workloads without creating a spaghetti architecture with various point-to-point connections between on-prise and the cloud.

There is no Silver Bullet for Kafka to Snowflake Integration!

Various data integration options are available between Apache Kafka and Snowflake. Kafka Connect connectors are a great option, no matter if you do batch or near real-time ingestion. Even large files can be ingested via data streaming using the right enterprise integration patterns.

A new and innovative approach is Apache Iceberg as the integration interface. The standard table format allows connecting from Snowflake; and any other analytics engine. But data needs to be stored only once. Kafka to Iceberg integration is even more interesting as it unifies transactional and analytical workloads.

Data Streaming also helps with hybrid integrations where data needs to be replicated from the on-premise data center into the public cloud in near real-time with consistent near real-time synchronization.

How do you integrate between Kafka and Snowflake? Do you already look at Apache Iceberg? Or maybe even another Table Format like Apache Hudi or Databricks’ Delta Lake? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Snowflake Data Integration Options for Apache Kafka (including Iceberg) appeared first on Kai Waehner.

GenAI Demo with Kafka, Flink, LangChain and OpenAI

Kai Waehner — Mon, 29 Jan 2024 14:32:13 +0000

Generative AI (GenAI) enables automation and innovation across industries. This blog post explores a simple but powerful architecture and demo for the combination of Python, and LangChain with OpenAI LLM, Apache Kafka for event streaming and data integration, and Apache Flink for stream processing. The use case shows how data streaming and GenAI help to correlate data from Salesforce CRM, searching for lead information in public datasets like Google and LinkedIn, and recommending ice-breaker conversations for sales reps.

The Emergence of Generative AI

Generative AI (GenAI) refers to a class of artificial intelligence (AI) systems and models that generate new content, often as images, text, audio, or other types of data. These models can understand and learn the underlying patterns, styles, and structures present in the training data and then generate new, similar content on their own.

Generative AI has applications in various domains, including:

Image Generation: Generating realistic images, art, or graphics.
Text Generation: Creating human-like text, including natural language generation.
Music Composition: Generating new musical compositions or styles.
Video Synthesis: Creating realistic video content.
Data Augmentation: Generating additional training data for machine learning models.
Drug Discovery: Generating molecular structures for new drugs.

A key challenge of Generative AI is the deployment in production infrastructure with context, scalability, and data privacy in mind. Let’s explore an example of using CRM and customer data to integrate GenAI into an enterprise architecture to support sales and marketing.

Demo: LangChain + Kafka + Flink = Automated Cold Calls of Sales Leads with Salesforce CRM and LinkedIn Data

This article shows a demo that combines real-time data streaming powered by Apache Kafka and Flink with a large language model from OpenAI within LangChain. If you want to learn more about data streaming with Kafka and Flink in conjunction with Generative AI, check out these two articles:

The following demo is about supporting sales reps or automated tools with Generative AI:

The Salesforce CRM creates new leads through other interfaces or by the human manually.
The sales rep / SDR receives lead information in real time to call the prospect.
A special GenAI service leverages the lead information (name and company) to search the web (mainly LinkedIn) to generate helpful content for the cold call of the lead, including: Summary, two interesting facts, topic of interest, and two creative ice-breaker for initiating a conversation.

Kudos to my colleague Carsten Muetzlitz who built the demo. The code is available on Github. Here is the architecture of the demo:

Technologies and Infrastructure in the Demo

The following technologies and infrastructure are used to implement and deploy the GenAI demo.

Python: The programming language almost every data engineer and data scientist uses.
LangChain: The Python framework implements the application to support sales conversations.
OpenAI: The language model and API help to build simple but powerful GenAI applications.
Salesforce: The cloud CRM tool stores the lead information and other sales and marketing data.
Apache Kafka: Scalable real-time data hub decoupling the data sources (CRM) and data sinks (GenAI application and other services).
Kafka Connect: Data integration via Change Data Capture (CDC) from Salesforce CRM.
Apache Flink: Stream processing for enrichment and data quality improvements of the CRM data.
Confluent Cloud: Fully managed Kafka (stream and store), Flink (process), and Salesforce connector (integrate).
SerpAPI: Scrape Google and other search engines with the lead information.
proxyCurl: Pull rich data about the lead from LinkedIn without worrying about scaling a web scraping and data-science team.

Live Demo of Python + LangChain + OpenAI + Kafka + Flink

Here is a 15 minute video walking you through the demo:

Use case
Technical architecture
GitHub project with Python code using Kafka and LangChain
Fully managed Kafka and Flink in the Confluent Cloud UI
Push new leads in real-time from Salesforce CRM via CDC using Kafka Connect
Streaming ETL with Apache Flink
Generative AI with Python, LangChain and OpenAI

Missing: No Vector DB and RAG with Model Embeddings in the LangChain Demo

This demo does NOT use advanced GenAI technologies for RAG (retrieval augmented generation), model embeddings, or vector search via a Vector database (Vector DB) like Pinecone, Weaviate, MongoDB or Oracle.

The principle of the demo is KISS (“keep it as simple as possible”). These technologies can and will be integrated into many real-world architectures.

The demo has limitations regarding latency and scale. Kafka and Flink run as fully managed and elastic SaaS. But the AI/ML part around LangChain could have improved latency, using a SaaS for hosting, and integration with other dedicated AI platforms. Especially data-intensive applications will need a vector database and advanced retrieval and semantic search technologies like RAG.

Fun fact: The demo breaks when I search for my name instead of Carsten’s. Because the web scraper finds too much content in the web about me and as a result the LangChain app crashes… This is a compelling event for complementary technologies like Pinecone or MongoDB that can do indexing, RAG and semantic search at scale. These technologies provide fully managed integration with Confluent Cloud so the demo could easily be extended.

The Role of LangChain in GenAI

LangChain is an open-source framework for developing applications powered by language models. LangChain is also the name of the commercial vendor behind the framework. The tool provides the needed “glue code” for data engineers to build GenAI applications with intuitive APIs for chaining together large language models (LLM), prompts with context, agents that drive decision making with stateful conversations, and tools that integrate with external interfaces.

LangChain supports:

Context-awareness: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)
Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)

The main value props of the LangChain packages are:

Components: composable tools and integrations for working with language models. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not.
Off-the-shelf chains: built-in assemblages of components for accomplishing higher-level tasks.

Together, these products simplify the entire application lifecycle:

Develop: Write your applications in LangChain/LangChain.js. Hit the ground running using Templates for reference.
Productionize: Use LangSmith to inspect, test and monitor your chains, so that you can constantly improve and deploy with confidence.
Deploy: Turn any chain into an API with LangServe.

LangChain in the Demo

The demo uses several LangChain concepts such as Prompts, Chat Models, Chains using the LangChain Expression Language (LCEL), Agents using a language model to choose a sequence of actions to take

Here is the logical flow of the LangChain business process:

Get new leads: Collect full name and company of the lead from Salesforce CRM in real-time from a Kafka Topic.
Find LinkedIn profile: Use the Google Search API “SerpAPI” to search for the URL of the lead’s LinkedIn profile.
Collect information about the lead: Use Proxycurl to collect the required information about the lead from LinkedIn.
Create cold call recommendations for the sales rep or automated script: Ingest all information into the ChatGPT LLM via OpenAI API and send the generated text to a Kafka Topic.

The following screenshot shows a snippet of the generated content. It includes context-specific icebreaker conversations based on the LinkedIn profile. For the context, Carsten worked at Oracle for 24 years before joining Confluent. The LLM uses this context of the LangChain prompt to generate related content:

The Role of Apache Kafka in GenAI

Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. It plays a crucial role in handling and managing large volumes of data streams efficiently and reliably.

Generative AI typically involves models and algorithms for creating new data, such as images, text, or other types of content. Apache Kafka supports Generative AI by providing a scalable and resilient infrastructure for managing data streams. In a Generative AI context, Kafka can be used for:

Data Ingestion: Kafka can handle the ingestion of large datasets, including the diverse and potentially high-volume data needed to train Generative AI models.
Real-time Data Processing: Kafka’s real-time data processing capabilities help in scenarios where data is constantly changing, allowing for the rapid updating and training of Generative AI models.
Event Sourcing: Event sourcing with Kafka captures and stores events that occur over time, providing a historical record of data changes. This historical data is valuable for training and improving Generative AI models.
Integration with other Tools: Kafka can be integrated into larger data processing and machine learning pipelines, facilitating the flow of data between different components and tools involved in Generative AI workflows.

While Apache Kafka itself is a tool specifically designed for Generative AI, its features and capabilities contribute to the overall efficiency and scalability of the data infrastructure. Kafka’s capabilities are crucial when working with large datasets and complex machine learning models, including those used in Generative AI applications.

Apache Kafka in the Demo

Kafka is the data fabric connecting all the different applications. Ensuring data consistency is a sweet spot of Kafka. No matter if a data source or sink is real time, batch or a request-response API.

In this demo, Kafka consumes events from Salesforce CRM as the main data source of customer data. Different applications (Flink, LangChain, Salesforce) consume the data in different steps of the business process. Kafka Connect provides the capability for data integration with no need for another ETL, ESB or iPaaS tool. This demo uses Confluent’s Change Data Capture (CDC) connector to consume changes from the Salesforce database in real-time for further processing.

Fully managed Confluent Cloud is the infrastructure for the entire Kafka and Flink ecosystem in this demo. The focus of the developer should always build business logic, not worrying about operating infrastructure.

While the heart of Kafka is event-based, real-time and scalable, it also enables domain-driven design and data mesh enterprise architectures out-of-the-box.

The Role of Apache Flink in GenAI

Apache Flink is an open-source distributed stream processing framework for real-time analytics and event-driven applications. Its primary focus is on processing continuous streams of data efficiently and at scale. While Apache Flink itself is not a specific tool for Generative AI, it plays a role in supporting certain aspects of Generative AI workflows. Here are a few ways in which Apache Flink is relevant:

Real-time Data Processing: Apache Flink can process and analyze data in real-time, which can be useful for scenarios where Generative AI models need to operate on streaming data, adapting to changes and generating responses in real-time.
Event Time Processing: Flink has built-in support for event time processing, allowing for the handling of events in the order they occurred, even if they arrive out of order. This can be beneficial in scenarios where temporal order is crucial, such as in sequences of data used for training or applying Generative AI models.
Stateful Processing: Flink supports stateful processing, enabling the maintenance of state across events. This can be useful in scenarios where the Generative AI business process needs to maintain context or memory of past events to generate coherent and context-aware outputs.
Integration with Machine Learning Libraries: While Flink itself is not a machine learning framework, it can be integrated with other tools and libraries that are used in machine learning, including those relevant to Generative AI. This integration can facilitate the deployment and execution of machine learning models within Flink-based streaming applications.

The specific role of Apache Flink in Generative AI depends on the particular use case and the architecture of the overall system.

Apache Flink in the Demo

This demo leverages Apache Flink for streaming ETL (enrichment, data quality improvements) of the incoming Salesforce CRM events.

FlinkSQL provides a simple and intuitive way to implement ETL with any Java or Python code. Fully managed Confluent Cloud is the infrastructure for Kafka and Flink in this demo. Serverless FlinkSQL allows you to scale up as much as needed, but also scale down to zero if no events are consumed and processed.

The demo is just the starting point. Many powerful applications can be built with Apache Flink. This includes streaming ETL, but also business applications like you find them at Netflix, Uber and many other tech giants.

LangChain + Fully-Managed Kafka and Flink = Simple, Powerful Real-Time GenAI

LangChain is an easy-to-use AI/ML framework to connect large language models to other data sources and create valuable applications. The flexibility and open approach enables developers and data engineers to build all sorts of applications, from chatbots to smart systems that answer your questions.

Data streaming with Apache Kafka and Flink provide a reliable and scalable data fabric for data pipelines and stream processing. The event store of Kafka ensures data consistency across real-time, batch, and request-response APIs. Domain-driven design, microservice architectures and data products build in a data mesh more and more leverage on Kafka for these reasons.

The combination of LangChain, GenAI technologies like OpenAI and data streaming with Kafka and Flink make a powerful combination for context-specific decision in real-time powered by AI.

Most enterprises have a cloud-first strategy for AI use cases. Data streaming infrastructure is available in SaaS like Confluent Cloud so that the developers can focus on business logic with much faster time-to-market. Plenty of alternatives exist for building AI applications with Python (the de facto standard for AI). For instance, you could build a user-defined function (UDF) in a FlinkSQL application executing the Python code and consuming from Kafka. Or use a separate application development framework and cloud platform like Quix Streams or Bytewax for Python apps instead of a framework like LangChain.

How do you combine Python, LangChain and LLMs with data streaming technologies like Kafka and Flink? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post GenAI Demo with Kafka, Flink, LangChain and OpenAI appeared first on Kai Waehner.

When to choose Redpanda instead of Apache Kafka?

Kai Waehner — Wed, 16 Nov 2022 03:19:39 +0000

Data streaming emerged as a new software category. It complements traditional middleware, data warehouse, and data lakes. Apache Kafka became the de facto standard. New players enter the market because of Kafka’s success. One of those is Redpanda, a lightweight Kafka-compatible C++ implementation. This blog post explores the differences between Apache Kafka and Redpanda, when to choose which framework, and how the Kafka ecosystem, licensing, and community adoption impact a proper evaluation.

Disclaimer: I work for Confluent. However, the post is not about comparing features but explaining the concepts behind the alternatives of using Apache Kafka (and related products, including Confluent) or Redpanda. I talk to enterprises across the globe every week. Below, I summarize common misunderstandings or missing knowledge about both technologies. I hope it helps you to make the right decision. Either choose to run open-source Apache Kafka, one of the various commercial Kafka offerings or cloud services, or Redpanda. All are great options with pros and cons…

Data streaming: A new software category

Data-driven applications are the new black. As part of this, data streaming is a new software category. If you don’t understand yet how it differs from other data management platforms like a data warehouse or data lake, check out the following blog series:

And if you wonder how Apache Kafka differs from other middleware, check out how Kafka fits into comparison with ETL, ESB, and iPaas.

Apache Kafka: The de facto standard for data streaming

Apache Kafka became the de facto standard for data streaming similar to Amazon S3 is the de facto standard for object storage. Kafka is used across industries for many use cases.

The adoption curve of Apache Kafka

The growth of the Apache Kafka community in the last years is impressive:

>100,000 organizations using Apache Kafka
>41,000 Kafka Meetup attendees
>32,000 Stack Overflow Questions
>12,000 Jiras for Apache Kafka
>31,000 Open Job Listings Request Kafka Skills

And look at the increased number of active monthly unique users downloading the Kafka Java client library with Maven:

Source: Sonatype

The numbers grow exponentially. That’s no surprise to me as the adoption pattern and maturity curve for Kafka are similar in most companies:

Start with one or few use cases (that prove the business value quickly)
Deploy the first applications to production and operate them 24/7
Tap into the data streams from many domains, business units, and technologies
Move to a strategic central nervous system with a decentralized data hub

Kafka use cases by business value across industries

The main reason for the incredible growth of Kafka’s adoption curve is the variety of potential use cases for data streaming. The potential is almost endless. Kafka’s characteristics of combing low latency, scalability, reliability, and true decoupling establish benefits across all industries and use cases:

Search my blog for your favorite industry to find plenty of case studies and architectures. Or to get started, read about use cases for Apache Kafka across industries.

The emergence of many Kafka vendors

The market for data streaming is enormous. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products. Most vendors use Kafka or implement its protocol because Kafka has become the de facto standard for data streaming.

Learn more about the various data streaming vendors in the following blog posts:

To be clear: An increasing number of Kafka vendors is a great thing! It proves the creation of a new software category. Competition pushes innovation. The market share is big enough for many vendors. And I am 100% convinced that we are still in a very early stage of the data streaming hype cycle…

After a lengthy introduction to set the context, let’s now review a new entrant into the Kafka market: Redpanda…

Introducing Redpanda: Kafka-compatible data streaming

Redpanda is a data streaming platform. Its website explains its positioning in the market and product strategy as follows (to differentiate it from Apache Kafka):

No Java: A JVM-free and ZooKeeper-free infrastructure.
Designed in C++: Designed for a better performance than Apache Kafka.
A single-binary architecture: No dependencies to other libraries or nodes.
Self-managing and self-healing: A simple but scalable architecture for on-premise and cloud deployments.
Kafka-compatible: Out-of-the-box support for the Kafka protocol with existing applications, tools, and integrations.

This sounds great. You need to evaluate whether Redpanda is the right choice for your next project or if you should stick with “real Apache Kafka”.

How to choose the proper “Kafka” implementation for your project?

A recommendation that some people find surprising: Qualify out first! That’s much easier. Similarly, like I explained when NOT to use Apache Kafka.

As part of the evaluation, the question is if Kafka is the proper protocol for you. And for Kafka, pick different offerings and begin with the comparison.

Start your evaluation with the business case requirements and define your most critical needs like uptime SLAs, disaster recovery strategy, enterprise support, operations tooling, self-managed vs. fully-managed cloud service, capabilities like messaging vs. data ingestion vs. data integration vs. applications, and so on. Based on your use cases and requirements, you can start qualifying out vendors like Confluent, Redpanda, Cloudera, Red Hat / IBM, Amazon MSK, Amazon Kinesis, Google Pub Sub, and others to create a shortlist.

The following sections compare the open-source project Apache Kafka versus the re-implementation of the Kafka protocol of Redpanda. You can use these criteria (and information from other blogs, articles, videos, and so on) to evaluate your options.

Similarities between Redpanda and Apache Kafka

The high-level value propositions are the same in Redpanda and Apache Kafka:

Data streaming to process data in real-time at scale continuously
Decouple applications and domains with a distributed storage layer
Integrate with various data sources and data sinks
Leverage stream processing to correlate data and take action in real-time
Self-managed operations or consuming a fully-managed cloud offering

However, the devil is in the details and facts. Don’t trust marketing, but look deeper into the various products and cloud services.

Deployment options: Self-managed vs. cloud service

Data streaming is required everywhere. While most companies across industries have a cloud-first strategy, some workloads must stay at the edge for different reasons: Cost, latency, or security requirements. My blog about use cases for Apache Kafka at the edge is still one of the most read articles I have written in recent years.

Besides operating Redpanda by yourself, you can buy Redpanda as a product and deploy it in your environment. Instead of just self-hosting Redpanda, you can deploy it as a data plane in your environment using Kubernetes (supported by the vendor’s external control plane) or leverage a cloud service (fully managed by the vendor).

The different deployment options for Redpanda are great. Pick what you need. This is very similar to Confluent’s deployment options for Apache Kafka. Some other Kafka vendors only provide either self-managed (e.g., Cloudera) or fully managed (e.g., Amazon MSK Serverless) deployment options.

What I miss from Redpanda: No official documentation about SLAs of the cloud service and enterprise support. I hope they do better than Amazon MSK (excluding Kafka support from their cloud offerings). I am sure you will get that information if you reach out to the Redpanda team, who will probably soon incorporate some information into their website.

Bring your own Cluster (BYOC)

There is a third option besides self-managing a data streaming cluster and leveraging a fully managed cloud service: Bring your own Cluster (BYOC). This alternative allows end users to deploy a solution partially managed by the vendor in your own infrastructure (like your data center or your cloud VPC).

Here is Redpanda’s marketing slogan: “Redpanda clusters hosted on your cloud, fully managed by Redpanda, so that your data never leaves your environment!”

This sounds very appealing in theory. Unfortunately, it creates more questions and problems than it solves:

How does the vendor access your data center or VPC?
Who decides how and when to scale a cluster?
When to act on issues? How and when do you roll a cluster to incorporate bug fixes or version upgrades?
What about cost management? What is the total cost of ownership? How much value does the vendor solution bring?
How do you guarantee SLAs? Who has to guarantee them, you or the vendor?
For regulated industries, how are security controls and compliance supported? How are you sure about what the vendor does in an environment you ostensibly control? How much harder will a bespoke third-party risk assessment be if you aren’t using pure SaaS?

For these reasons, cloud vendors only host managed services in the cloud vendor’s environment. Look at Amazon MSK, Azure Event Hubs, Google Pub Sub, Confluent Cloud, etc. All fully managed cloud services are only in the VPC of the vendor for the above reasons.

There are only two options: Either you hand over the responsibility to a SaaS offering or control it yourself. Everything in the middle is still your responsibility in the end.

Community vs. commercial offerings

The sales approach of Redpanda looks almost identical to how Confluent sells data streaming. A free community edition is available, even for production usage. The enterprise edition adds enterprise features like tiered storage, automatic data balancing, or 24/7 enterprise support.

No surprise here. And a good strategy, as data streaming is required everywhere for different users and buyers.

Technical differences between Apache Kafka and Redpanda

There are plenty of technical and non-functional differences between Apache Kafka products and Redpanda. Keep in mind that Redpanda is NOT Kafka. Redpanda uses the Kafka protocol. This is a small but critical difference. Let’s explore these details in the following sections.

Apache Kafka vs. Kafka protocol compatibility

Redpanda is NOT an Apache Kafka distribution like Confluent Platform, Cloudera, or Red Hat. Instead, Redpanda re-implements the Kafka protocol to provide API compatibility. Being Kafka-compatible is not the same as using Apache Kafka under the hood, even if it sounds great in theory.

Two other examples of Kafka-compatible offerings:

Azure Event Hubs: A Kafka-compatible SaaS cloud service offering from Microsoft Azure. The service itself works and performs well. However, its Kafka compatibility has many limitations. Microsoft lists a lot of them on its website. Some limitations of the cloud service are the consequence of a different implementation under the hood, like limited retention time and message sizes.
Apache Pulsar: An open-source framework competing with Kafka. The feature set overlaps a lot. Unfortunately, Pulsar often only has good marketing for advanced features to compete with Kafka or to differentiate. And one example is its Kafka mapper to be compatible with the Kafka protocol. Contrary to Azure Event Hubs as a serious implementation (with some limitations), Pulsar’s compatibility wrapper provides a basic implementation that is compatible with only minor parts of the Kafka protocol. So, while alleged “Kafka compatibility” sounds nice on paper, one shouldn’t seriously consider this for migrating your running Kafka infrastructure to Pulsar.

We have seen compatible products for open-source frameworks in the past. Re-implementations are usually far away from being complete and perfect. For instance, MongoDB compared the official open source protocol to its competitor Amazon DocumentDB to pinpoint the fact that DocumentDB only passes ~33% of the MongoDB integration test chain.

In summary, it is totally fine to use these non-Kafka solutions like Azure Event Hubs, Apache Pulsar, or Redpanda for a new project if they fulfill your requirements better than Apache Kafka. But keep in mind that it is not Kafka. There is no guarantee that additional components from the Kafka ecosystem (like Kafka Connect, Kafka Streams, REST Proxy, and Schema Registry) behave the same when integrated with a non-Kafka solution that only uses the Kafka protocol with its own implementation.

How good is Redpanda’s Kafka protocol compatibility?

Frankly, I don’t know. Probably and hopefully, Redpanda has better Kafka compatibility than Pulsar. The whole product is based on this value proposition. Hence, we can assume that the Redpanda team spends plenty of time on compatibility. Redpanda has NOT achieved 100% API compatibility yet.

Time will tell when we see more case studies from enterprises across industries that migrated some Apache Kafka projects to Redpanda and successfully operated the infrastructure for a few years. Why wait a few years to see? Well, I compare it to what I see from people starting with Amazon MSK. It is pretty easy to get started. However, after a few months, the first issues happen. Users find out that Amazon MSK is not a fully-managed product and does not provide serious Kafka SLAs. Hence, I see too many teams starting with Amazon MSK and then migrating to Confluent Cloud after some months.

But let’s be clear: If you run an application against Apache Kafka and migrate to a re-implementation supporting the Kafka protocol, you should NOT expect 100% the same behavior as with Kafka!

Some underlying behavior will differ even if the API is 100% compatible. This is sometimes a benefit. For instance, Redpanda focuses on performance optimization with C++. This is only possible in some workloads because of the re-implementation. C++ is superior compared to Java and the JVM for some performance and memory scenarios.

Redpanda = Apache Kafka – Kafka Connect – Kafka Streams

Apache Kafka includes Kafka Connect for data integration and Kafka Streams for stream processing.

Like most Kafka-compatible projects, Redpanda does exclude these critical pieces from its offering. Hence, even 100 percent protocol compatibility would not mean a product re-implements everything in the Apache Kafka project.

Lower latency vs. benchmarketing

Always think about your performance requirements before starting a project. If necessary, do a proof of concept (POC) with Apache Kafka, Apache Pulsar, and Redpanda. I bet that in 99% of scenarios, all three of them will show a good enough performance for your use case.

Don’t trust opinionated benchmarks from others! Your use case will have different requirements and characteristics. And performance is typically just one of many evaluation dimensions.

I am not a fan of most “benchmarks” of performance and throughput. Benchmarks are almost always opinionated and configured for a specific problem (whether a vendor, independent consultant or researcher conducts them).

My colleague Jack Vanlightly explained this concept of benchmarketing with excellent diagrams:

Source: Jack Vanlightly

Here is one concrete example you will find in one of Redpanda’s benchmarks: Kafka was not built for very high throughput producers, and this is what Redpanda is exploiting when they claim that Kafka’s throughput is inferior to Redpanda. Ask yourself this question: Of 1GB/s use cases, who would create that throughput with just 4 producers? Benchmarketing at its finest.

Hence, once again, start with your business requirements. Then choose the right tool for the job. Benchmarks are always built for winning against others. Nobody will publish a benchmark where the competition wins.

Soft real-time vs. hard real-time

When we speak about real-time in the IT world, we mean end-to-end data processing pipelines that need at least a few milliseconds. This is called soft real-time. And this is where Apache Kafka, Apache Pulsar, Redpanda, Azure Event Hubs, Apache Flink, Amazon Kinesis, and similar platforms fit into. None of these can do hard real time.

Hard real-time requires a deterministic network with zero latency and no spikes. Typical scenarios include embedded systems, field buses, and PLCs in manufacturing, cars, robots, securities trading, etc. Time-Sensitive Networking (TSN) is the right keyword if you want more research.

I wrote a dedicated blog post about why data streaming is NOT hard real-time. Hence, don’t try to use Kafka or Redpanda for these use cases. That’s OT (operational technology), not IT (information technology). OT is plain C or Rust on embedded software.

No ZooKeeper with Redpanda vs. no ZooKeeper with Kafka

Besides being implemented in C++ instead of using the JVM, the second big differentiator of Redpanda is no need for ZooKeeper and two complex distributed systems… Well, with Apache Kafka 3.3, this differentiator is gone. Kafka is now production-ready without ZooKeeper! KIP-500 was a multi-year journey and an operation at Kafka’s heart.

To be fair, it will still take some time until the new ZooKeeper-less architecture goes into production. Also, today, it is only supported by new Kafka clusters. However, migration scenarios with zero downtime and without data loss will be supported in 2023, too. But that’s how a severe release cycle works for a mature software product: Step-by-step implementation and battle-testing instead of starting with marketing and selling of alpha and beta features.

ZooKeeper-less data streaming with Kafka is not just a massive benefit for the scalability and reliability of Kafka but also makes operations much more straightforward, similar to ZooKeeper-less Redpanda.

By the way, this was one of the major arguments why I did not see the value of Apache Pulsar. The latter requires not just two but three distributed systems: Pulsar broker, ZooKeeper, and BookKeeper. That’s nonsense and unnecessary complexity for virtually all projects and use cases.

Lightweight Redpanda + heavyweight ecosystem = middleweight data streaming?

Redpanda is very lightweight and efficient because of its C++ implementation. This can help in limited compute environments like edge hardware. As an additional consequence, Redpanda has fewer latency spikes than Apache Kafka. That are significant arguments for Redpanda for some use cases!

However, you need to look at the complete end-to-end data pipeline. If you use Redpanda as a message queue, you get these benefits compared to the JVM-based Kafka engine. You might then pick a message queue like RabbitMQ or NATs instead. I don’t start this discussion here as I focus on the much more powerful and advanced data streaming use cases.

Even in edge use cases where you deploy a single Kafka broker, the hardware, like an industrial computer (IPC), usually provides at least 4GB or 8GB of memory. That is sufficient for deploying the whole data streaming platform around Kafka and other technologies.

Data streaming is more than messaging or data ingestion

My fundamental question is, what is the benefit of a C++ implementation of the data hub if all the surrounding systems are built with JVM technology or even worse and slow technologies like Python?

Kafka-compatible tools like Redpanda integrate well with the Kafka ecosystem, as they use the same protocol. Hence, tools like Kafka Connect, Kafka Streams, KSQL, Apache Flink, Faust, and all other components from the Kafka ecosystem work with Redpanda. You will find such an example for almost every existing Kafka tool on the Redpanda blog.

However, these combinations kill almost all the benefits of having a C++ layer in the middle. All integration and processing components would also need to be as efficient as Redpanda and use C++ (or Go or Rust) under the hood. These tools do not exist today (likely, as they are not needed by many people). And here is an additional drawback: The debugging, testing, and monitoring infrastructure must combine C++, Python, and JVM platforms if you combine tools like Java-based Kafka Connect and Python-based Faust with C++-based Redpanda. So, I don’t get the value proposition here.

Data replication across clusters

Having more than one Kafka cluster is the norm, not an exception. Use cases like disaster recovery, aggregation, data sovereignty in different countries, or migration from on-premise to the cloud require multiple data streaming clusters.

Replication across clusters is part of open-source Apache Kafka. MirrorMaker 2 (based on Kafka Connect) supports these use cases. More advanced (proprietary) tools from vendors like Confluent Replicator or Cluster Linking make these use cases more effortless and reliable.

Data streaming with the Kafka ecosystem is perfect as the foundation of a decentralized data mesh:

How do you build these use cases with Redpanda?

It is the same story as for data integration and stream processing: How much does it help to have a very lightweight and performant core if all other components rely on “3rd party” code bases and infrastructure? In the case of data replication, Redpanda uses Kafka’s Mirrormaker.

And make sure to compare MirrorMaker to Confluent Cluster Linking – the latter uses the Kafka protocol for replications and does not need additional infrastructure, operations, offset sync, etc.

Non-functional differences between Apache Kafka and Redpanda

Technical evaluations are dominant when talking about Redpanda vs. Apache Kafka. However, the non-functional differences are as crucial before making the strategic decision to choose the data streaming platform for your next project.

Licensing, adoption curve and the total cost of ownership (TCO) are critical for the success of establishing a data streaming platform.

Open source (Kafka) vs. source available (Redpanda)

As the name says, Apache Kafka is under the very permissive Apache license 2.0. Everyone, including cloud providers, can use the framework for building internal applications, commercial products, and cloud services. Committers and contributions are spread across various companies and individuals.

Redpanda is released under the more restrictive Source Available License (BSL). The intention is to deter cloud providers from offering Redpanda’s work as a service. For most companies, this is fine, but it limits broader adoption across different communities and vendors. The likelihood of external contributors, committers, or even other vendors picking the technology is much smaller than in Apache projects like Kafka.

This has a significant impact on the (future) adoption curve…

Maturity, community and ecosystem

The introduction of this article showed the impressive adoption of Kafka. Just keep in mind: Redpanda is NOT Apache Kafka! It just supports the Kafka protocol.

Redpanda is a brand-new product and implementation. Operations are different. The behavior of the engine is different. Experts are not available. Job offerings do not exist. And so on.

Kafka is significantly better documented, has a tremendously larger community of experts, and has a vast array of supporting tooling that makes operations more straightforward.

There are many local and online Kafka training options, including online courses, books, meetups, and conferences. You won’t find much for Redpanda beyond the content of the vendor behind it.

And don’t trust marketing! That’s true for every vendor, of course. If you read a great feature list on the Redpanda website, double-check if the feature truly exists and in what shape it is. Example: RBAC (role-based access control) is available for Redpanda. The devil lies in the details. Quote from the Redpanda RBAC documentation: “This page describes RBAC in Redpanda Console and therefore manages access only for Console users but not clients that interact via the Kafka API. To restrict Kafka API access, you need to use Kafka ACLs.” There are plenty of similar examples today. Just try to use the Redpanda cloud service. You will find many things that are more alpha than beta today. Make sure not to fall into the same myths around the marketing of product features as some users did with Apache Pulsar a few years ago.

The total cost of ownership and business value

When you define your project’s business requirements and SLAs, ask yourself how much downtime or data loss is acceptable. The RTO (recovery time objective) and RPO (recovery point objective) impact a data streaming platform’s architecture and overall process to ensure business continuity, even in the case of a disaster.

The TCO is not just about the cost of a product or cloud service. Full-time engineers need to operate and integrate the data streaming platform. Expensive project leads, architects, and developers build applications.

Project risk includes the maturity of the product and the expertise you can bring in for consulting and 24/7 support.

Similar to benchmarketing regarding latency, vendors use the same strategy for TCO calculations! Here is one concrete example you always hear from Redpanda: “C++ does enable more efficient use of CPU resources.”

This statement is correct. However, the problem with that statement is that Kafka is rarely CPU-bound and much more IO-bound. Redpanda has the same network and disk requirements as Kafka, which means Redpanda has limited differences from Kafka in terms of TCO regarding infrastructure.

When to choose Redpanda instead of Apache Kafka?

You need to evaluate whether Redpanda is the right choice for your next project or if you should stick with the “real Apache Kafka” and related products or cloud offerings. Read articles and blogs, watch videos, search for case studies in your industry, talk to different competitive vendors, and build your proof of concept or pilot project. Qualifying out products is much easier than evaluating plenty of offerings.

When to seriously consider Redpanda?

You need C++ infrastructure because your ops team cannot handle and analyze JVM logs – but be aware that this is only the messaging core, not the data integration, data processing, or other capabilities of the Kafka ecosystem
The slight performance differences matter to you – and you still don’t need hard real-time
Simple, lightweight development on your laptop and in automated test environments – but you should then also run Redpanda in production (using different implementations of an API for TEST and PROD is a risky anti-pattern)

You should evaluate Redpanda against Apache Kafka distributions and cloud services in these cases.

This post explored the trade-offs Redpanda has from a technical and non-functional perspective. If you need an enterprise-grade solution or fully-managed cloud service, a broad ecosystem (connectors, data processing capabilities, etc.), and if 10ms latency is good enough and a few p99 spikes are okay, then I don’t see many reasons why you would take the risk of adopting Redpanda instead of an actual Apache Kafka product or cloud service.

The future will tell us if Redpanda is a severe competitor…

I didn’t even cover the fact that a startup always has challenges finding great case studies, especially with big enterprises like fortune 500 companies. The first great logos are always the hardest to find. Sometimes, startups never get there. In other cases, a truly competitive technology and product are created. Such a journey takes years. Let’s revisit this blog post in one, two, and five years to see the evolution of Redpanda (and Apache Kafka).

What are your thoughts? When do you consider using Redpanda instead of Apache Kafka? Are you using Redpanda already? Why and for what use cases? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to choose Redpanda instead of Apache Kafka? appeared first on Kai Waehner.

Disaster Recovery with Kafka across the Edge and Hybrid Cloud (QCon Talk)

Kai Waehner — Wed, 06 Apr 2022 11:15:46 +0000

I spoke at QCon London in April 2022 about building disaster recovery and resilient real-time enterprise architectures with Apache Kafka. This blog post summarizes the use cases, architectures, and real-world examples. The slide deck and video recording of the presentation is included as well.

What is QCon?

QCon is a leading software development conference held across the globe for 16 years. It provides a realistic look at what is trending in tech. The QCon events are organized by InfoQ, a well-known website for professional software development with over two million unique visitors per month.

QCon in 2022 uncovers emerging software trends and practices. Developers and architects learn how to solve complex engineering challenges without the product pitches.

There is no Call for Papers (CfP) for QCon. The organizers invite trusted speakers to talk about trends, best practices, and real-world stories. This makes QCon so strong and respected in the software development community.

Disaster Recovery and Resiliency with Apache Kafka

Apache Kafka is the de facto data streaming platform for analytical AND transactional workloads. Multiple options exist to design Kafka for resilient applications. For instance, MirrorMaker 2 and Confluent Replicator enable uni- or bi-directional real-time replication between independent Kafka clusters in different data centers or clouds.

Cluster Linking is a more advanced and straightforward option from Confluent leveraging the native Kafka protocol instead of additional infrastructure and complexity using Kafka Connect (like MirrorMaker 2 and Replicator).

Stretching a single Kafka cluster across multiple regions is the best option to guarantee no downtime and seamless failover in the case of a disaster. However, it is hard to operate and only recommended (i.e., consistent, stable, and mission-critical) across distances with enhanced add-ons to open-source Kafka:

QCon Presentation: Disaster Recovery with Apache Kafka

In my QCon talk, I intentionally showed the broad spectrum of real-world success stories across industries for data streaming with Apache Kafka from companies such as BMW, JPMorgan Chase, Robinhood, Royal Caribbean, and Devon Energy.

Best practices explored how to build resilient enterprises architecture with disaster recovery with RPO (Recovery Point Object) and RTO (Recovery Time Objective) in mind. The audience learns how to get your SLAs and requirements for downtime and data loss right.

The examples looked at serverless cloud offerings integrating to the IoT edge, hybrid retail architectures, and the disconnected edge in military scenarios.

The agenda looks like this:

Resilient enterprise architectures
Real-time data streaming with the Apache Kafka ecosystem
Cloud-first and serverless Industrial IoT in automotive
Multi-region infrastructure for core banking
Hybrid cloud for customer experiences in retail
Disconnected edge for safety and security in the public sector

Slide Deck from QCon Talk:

Here is the slide deck of my presentation from QCon London 2022:

We also had a great panel that discussed lessons learned from building resilient applications on the code and infrastructure level, plus the organizational challenges and best practices:

Video Recording from QCon Talk:

With the risk of Covid in mind, InfoQ decided not to record QCon sessions live.

Instead, a pre-recorded video had to be submitted by the speakers. The video recording is already available for QCon attendees (no matter if on-site in London or at the QCon Plus virtual event):

Qcon makes conference talks available for free sometime after the event. I will update this post with the free link as soon as it is available.

Disaster Recovery with Apache Kafka across all Industries

I hope you enjoyed the slides and video on this exciting topic. Hybrid and global Kafka infrastructures for disaster recovery and other use cases are the norm, not exceptions.

Real-time data beats slow data. That is true in almost every use case. Hence, data streaming with the de facto standard Apache Kafka gets adopted more and more across all industries.

How do you leverage data streaming with Apache Kafka for building resilient applications and enterprise architectures? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Disaster Recovery with Kafka across the Edge and Hybrid Cloud (QCon Talk) appeared first on Kai Waehner.

Streaming ETL with Apache Kafka in the Healthcare Industry

Kai Waehner — Fri, 01 Apr 2022 05:47:00 +0000

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part three: Streaming ETL. Examples include Babylon Health and Bayer.

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains. Use cases often overlap. I tried to categorize a few real-world deployments into different technical scenarios and added a few real-world examples:

Overview – Data Streaming Use Cases and Architectures for Healthcare (including Slide Deck)
Legacy Modernization and Hybrid Cloud (Optum / UnitedHealth Group, Centene, Bayer)
THIS POST: Streaming ETL (Bayer, Babylon Health)
Real-time Analytics (Cerner, Celmatix, CDC/Centers for Disease Control and Prevention)
Machine Learning and Data Science (Recursion, Humana)
Open API and Omnichannel (Care.com, Invitae)

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Streaming ETL with Apache Kafka

Streaming ETL is similar to concepts you might know from traditional ETL tools. I have already explored how data streaming with Kafka differs from data integration tools and iPaaS cloud services. The critical difference is that you leverage a single platform for data integration and processing at scale in real-time. There is no need to combine several platforms to achieve this. The result is a Kappa architecture that enables real-time but also batch workloads with a single integration architecture.

Streaming ETL with Kafka combines different components and features:

Kafka Connect as Kafka-native integration framework
Kafka Connect source and sink connectors to consume and produce data from/to any other database, application, or API
Single Message Transform (SMT) – an optional Kafka Connect feature – to process (filter, change, remove, etc.) incoming or outgoing messages within the connector deployment
Kafka Streams or ksqlDB for continuous data processing in real-time at scale for stateless or stateful ETL jobs
Data governance via schema management, enforcement and versioning using the Schema Registry
Security and access control using features like role-based access control, audit logs, and end-to-end encryption

In the cloud, you can leverage a serverless Kafka offering for the whole Streaming ETL pipeline. Confluent Cloud fully manages Kafka’s end-to-end infrastructure, including connectors, ksqlDB workloads, data governance, and security.

One last general note: Don’t Design for Data at Rest to Reverse it! Learn more here: “When to Use Reverse ETL and when it is an Anti-Pattern“. Instead, use real-time Streaming ETL for Data in Motion and the Kappa architecture from scratch.

Let’s look at a few real-world deployments in the healthcare sector.

Babylon Health – PII and GDRP compliant Security

Babylon Health is a digital-first health service provider and value-based care company that combines an artificial intelligence-powered platform with virtual clinical operations for patients. Patients are connected with health care professionals through its web and mobile application.

Babylon’s mission is to put an accessible and affordable health service in the hands of every person on earth. For that mission, Babylon built an agile microservice architecture with the Kafka ecosystem:

Here are the “wonders of working” in Healthcare for Babylon (= reasons to choose Kafka):

Real-time data processing
Replayability of historical information
Order matters and is ensured with guaranteed ordering
GDPR and data ownership for PII compliant security
Data governance via the schema registry to provide true decoupling and access via many programming languages like Java, Python, and Ruby

Bayer – Data Integration and Processing in R&D

Bayer AG is a German multinational pharmaceutical and life sciences company and one of the largest pharmaceutical companies in the world. They leverage Kafka in various use cases and business domains.

The following scenario is from the research and development department of the pharma business unit. Their focus areas are cardiovascular diseases, oncology, and women’s health. The division employs over 7,500 R&D people and expenses over 2.75 billion euros for R&D.

The use case Bayer presented at a recent Kafka Summit is about analyzing clinical trials, patents, reports, news, and literature leveraging the Kafka ecosystem. The R&D team processes 250 Million documents from 30+ individual data sources. The data includes 7 TB of raw text-rich data with daily updates, additions, and deletions. Algorithms and data evolve. Bayer needs to completely reprocess the data regularly. Various document streams with different formats and schemas flow through several text processing and enrichment steps.

Scalable, reliable Kafka pipelines with Kafka Streams (Java) and Faust (Python) replaced custom, error-prone, non-scalable scripts. Schemas are used as the data interface to ensure data governance. Avro is the first-class citizen data format to enable compression and better throughput.

The true decoupling of Kafka in conjunction with the Schema Registry guarantees interoperability among different components and technologies (java, python, commercial tools, open-source, scientific, proprietary).

Streaming ETL with Kafka for Real-Time Data Integration at any Scale

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka Ecosystem for Streaming ETL. Real-world deployments from Babylon Health and Bayer showed how enterprises successfully deploy Kafka for different enterprise architecture use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Streaming ETL with Apache Kafka in the Healthcare Industry appeared first on Kai Waehner.

Apache Kafka in the Insurance Industry

Kai Waehner — Mon, 07 Jun 2021 12:54:20 +0000

The rise of data in motion in the insurance industry is visible across all lines of business, including life, healthcare, travel, vehicle, and others. Apache Kafka changes how enterprises rethink data. This blog post explores use cases and architectures for event streaming. Real-world examples from Generali, Centene, Humana, and Tesla show innovative insurance-related data integration and stream processing in real-time.

Digital Transformation in the Insurance Industry

Most insurance companies have similar challenges:

Challenging market environments
Stagnating economy
Regulatory pressure
Changes in customer expectations
Proprietary and monolithic legacy applications
Emerging competition from innovative insurtechs
Emerging competition from other verticals that add insurance products

Only a good transformation strategy guarantees a successful future for traditional insurance companies. Nobody wants to become the next Nokia (mobile phone), Kodak (photo camera), or BlockBuster (video rental). If you fail to innovate in time, you are done.

Real-time beats slow data. Automation beats manual processes. The combination of these two game changers creates completely new business models in the insurance industry. Some examples:

Claims processing including review, investigation, adjustment, remittance or denial of the claim
Claim fraud detection by leveraging analytic models trained with historical data
Omnichannel customer interactions including a self-service portal and automated tools like NLP-powered chatbots
Risk prediction based on lab testing, biometric data, claims data, patient-generated health data (depending on the laws of a specific country)

These are just a few examples.

The shift to real-time data processing and automation is key for many other use cases, too. Machine learning and deep learning enable the automation of many manual and error-prone processes like document and text processing.

The Need for Brownfield Integration

Traditional insurance companies usually (have to) start with brownfield integration before building new use cases. The integration of legacy systems with modern application infrastructures and the replication between data centers and public or private cloud-native infrastructures are a key piece of the puzzle.

Common integration scenarios use traditional middleware that is already in place. This includes MQ, ETL, ESB, and API tools. Kafka is complementary to these middleware tools:

More details about this topic are available in the following two posts:

Greenfield Applications at Insurtech Companies

Insurtechs have a huge advantage: They can start greenfield. There is no need to integrate with legacy applications and monolithic architectures. Hence, some traditional insurance companies go the same way. They start from scratch with new applications instead of trying to integrate old and new systems.

This setup has a huge architectural advantage: There is no need for traditional middleware as only modern protocols and APIs need to be integrated. No monolithic and proprietary interfaces such as Cobol, EDI, or SAP BAPI/iDoc exist in this scenario. Kafka makes new applications agile, scalable, and flexible with open interfaces and real-time capabilities.

Here is an example of an event streaming architecture for claim processing and fraud detection with the Kafka ecosystem:

Real-World Deployments of Kafka in the Insurance Industry

Let’s take a look at a few examples of real-world deployments of Kafka in the insurance industry.

Generali – Kafka as Integration Platform

Generali is one of the top ten largest insurance companies in the world. The digital transformation from Generali Switzerland started with Confluent as a strategic integration platform. They started their journey by integrating with hundreds of legacy systems like relational databases. Change Data Capture (CDC) pushes changes into Kafka in real-time. Kafka is the central nervous system and integration platform for the data. Other real-time and batch applications consume the events.

From here, other applications consume the data for further processing. All applications are decoupled from each other. This is one of the unique benefits of Kafka compared to other messaging systems. Real decoupling and domain-driven design (DDD) are not possible with traditional MQ systems or SOAP / REST web services.

Design Principles of Generali’s Cloud-Native Architecture

The key design principles for the next-generation platform at Generali include agility, scalability, cloud-native, governance, data, and event processing. Hence, Generali’s architecture is powered by a cloud-native infrastructure leveraging Kubernetes and Apache Kafka:

The following integration flow shows the scalable microservice architecture of Generali. The streaming ETL process includes data integration and data processing decoupled environments:

Centene – Integration and Data Processing at Scale in Real-Time

Centene is the largest Medicaid and Medicare Managed Care Provider in the US. Their mission statement is “transforming the health of the community, one person at a time”. The healthcare insurer acts as an intermediary for both government-sponsored and privately insured health care programs.

Centene’s key challenge is growth. Many mergers and acquisitions require a scalable and reliable data integration platform. Centene chose Kafka due to the following capabilities:

highly scalable
high autonomy and decoupling
high availability and data resiliency
real-time data transfer
complex stream processing

Centene’s architecture uses Kafka for data integration and orchestration. Legacy databases, MongoDB, and other applications and APIs leverage the data in real-time, batch, and request-response:

Swiss Mobiliar – Decoupling and Orchestration

Swiss Mobiliar (Schweizerische Mobiliar aka Die Mobiliar) is is the oldest private insurer in Switzerland.

Event Streaming with Kafka supports various use cases at Swiss Mobiliar:

Orchestrator application to track the state of a billing process
Kafka as database and Kafka Streams for data processing
Complex stateful aggregations across contracts and re-calculations
Continuous monitoring in real-time

Their architecture shows the decoupling of applications and orchestration of events:

Also, check out the on-demand webinar with Mobiliar and Spoud to learn more about their Kafka usage.

Humana – Real-Time Integration and Analytics

Humana Inc. is a for-profit American health insurance. In 2020, the company ranked 52 on the Fortune 500 list.

Humana leverages Kafka for real-time integration and analytics. They built an interoperability platform to transition from an insurance company with elements of health to truly a health company with elements of insurance.

Here are the key characteristics of their Kafka-based platform:

Consumer-centric
Health plan agnostic
Provider agnostic
Cloud resilient and elastic
Event-driven and real-time

Kafka integrates conversations between the users and the AI platform powered by IBM Watson. The platform captures conversational flows and processes them with natural language processing (NLP) – a deep learning concept.

Some benefits of the platform:

Adoption of open standards
Standardized integration partners
In-workflow integration
Event-driven for real-time patient interactions
Highly scalable

freeyou – Stateful Streaming Analytics

freeyou is an insurtech for vehicle insurance. Streaming analytics for real-time price adjustments powered by Kafka and ksqlDB enable new business models. Their marketing slogan shows how they innovate and differentiate from traditional competitors:

“We make insurance simple. With our car insurance, we make sure that you stay mobile in everyday life – always and everywhere. You can take out the policy online in just a few minutes and manage it easily in your freeyou customer account. And if something should happen to your vehicle, we’ll take care of it quickly and easily.”

A key piece of freeyou’s strategy is a great user experience and automatic price adjustments in real-time in the backend. Obviously, Kafka and its stream processing ecosystem are a perfect fit here.

As discussed above, the huge advantage of an insurtech is the possibility to start from the greenfield. No surprise that freeyou’s architectures leverage cutting-edge design and technology. Kafka and KQL enable streaming analytics within the pricing engine, recalculation modules, and other applications:

Tesla – Carmaker and Utility Company, now also Car Insurer

Everybody knows: Tesla is an automotive company that sells cars, maintenance, and software upgrades.

More and more people know: Tesla is a utility company that sells energy infrastructure, solar panels, and smart home integration.

Almost nobody knows: Tesla is a car insurer for their car fleet (limited to a few regions in the early phase). That is the obvious next step if you already collect all the telemetry data from all your cars on the street.

Tesla has built a Kafka-based data platform infrastructure “to support millions of devices and trillions of data points per day”. Tesla showed an interesting history and evolution of their Kafka usage at a Kafka Summit in 2019:

Tesla’s infrastructure heavily relies on Kafka.

There is no public information about Telsa using Kafka for their specific insurance applications. But at a minimum, the data collection from the cars and parts of the data processing relies on Kafka. Hence, I thought this is a great example to think about innovation in car insurance.

Tesla: “Much Better Feedback Loop”

Elon Musk made clear: “We have a much better feedback loop” instead of being statistical like other insurers. This is a key differentiator!

There is no doubt that many vehicle insurers will use fleet data to calculate insurance quotes and provide better insurance services. For sure, some traditional insurers will partner with vehicle manufacturers and fleet providers. This is similar to smart city development, where several enterprises partner to build new innovative use cases.

Connected vehicles and V2X (Vehicle to X) integrations are the starting point for many new business models. No surprise: Kafka plays a key role in the connected vehicles space (not just for Tesla).

Many benefits are created by a real-time integration pipeline:

Shift from human experts to automation driven by big data and machine learning
Real-time telematics data from all its drivers’ behavior and the performance of its vehicle technology (cameras, sensors, …)
Better risk estimation of accidents and repair costs of vehicles
Evaluation of risk reduction through new technologies (autopilot, stability control, anti-theft systems, bullet-resistant steel)

For these reasons, event streaming should be a strategic component of any next-generation insurance platform.

Slide Deck: Kafka in the Insurance Industry

The following slide deck covers the above discussion in more detail:

Kafka Changes How to Think Insurance

Apache Kafka changes how enterprises rethink data in the insurance industry. This includes brownfield data integration scenarios and greenfield cutting-edge applications. The success stories from traditional insurance companies such as Generali and insurtechs such as freeyou prove that Kafka is the right choice everywhere.

Kafka and its ecosystem enable data processing at scale in real-time. Real decoupling allows the integration between monolith legacy systems and modern cloud-native infrastructure. Kafka runs everywhere, from edge deployments to multi-cloud scenarios.

What are your experiences and plans for low latency use cases? What use case and architecture did you implement? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Insurance Industry appeared first on Kai Waehner.

Apache Kafka in the Airline, Aviation and Travel Industry

Kai Waehner — Fri, 19 Feb 2021 11:21:55 +0000

Aviation and travel are notoriously vulnerable to social, economic, and political events, as well as the ever-changing expectations of consumers. Coronavirus is just a piece of the challenge. This post explores use cases, architectures, and references for Apache Kafka in the aviation industry, including airline, airports, global distribution systems (GDS), aircraft manufacturers, and more. Kafka was relevant pre-covid and will become even more important post-covid.

Airlines and Aviation are Changing – Beyond Covid-19!

Aviation and travel are notoriously vulnerable to social, economic, and political events. These months have been particularly testing one due to the global pandemic with Covid-19. But the upcoming change is coming not just due to the Coronavirus but because of the ever-changing expectations of consumers.

Right now is the time to lay the ground for the future of the aviation and travel industry.

Consumer behaviors and expectations are changing. Whole industries are being disrupted, and the aviation industry is not immune to these sweeping forces of change.

The future business of airlines and airports will be digitally integrated into the ecosystem of partners and suppliers. Companies will provide more personalized customer experiences and be enabled by a new suite of the latest technologies, including automation, robotics, and biometrics.

For instance, new customer notification mobile apps provide customers with relevant and timely updates throughout their journeys. Other major improvements support the front line service teams at various touchpoints throughout the airports and end to end travel journey.

Apache Kafka in the Airline Industry

Apache Kafka is the de facto standard for event streaming use cases across industries. Many use cases can be applied to the aviation industry, too. Concepts like payment, customer experience, and manufacturing differ in detail. But in the end, it is about integrating systems and processing data in real-time at scale.

For instance, omnichannel retail with Apache Kafka applies to airline, airports, global distribution systems (GDS), and other aviation industry sectors.

However, it is always easier to learn from other companies in the same industry. Therefore, the following explores a few public Apache Kafka success stories from the aviation industry.

Lufthansa – Kafka Unified Streaming Cloud Operations

Lufthansa talks about the benefits of using Apache Kafka instead of traditional messaging queues (TIBCO EMS, IBM MQ) for data processing.

The journey started with the question if Lufthansa can do data processing better, cheaper, and faster.

Lufthansa’s Kafka architecture does not have any surprises. A key lesson learned from many companies: The real added value is created when you leverage Kafka not just for messaging, but its entire ecosystem, including different clients/proxies, connectors, stream processing, and data governance.

The result at Lufthansa: A better, cheaper, and faster infrastructure for real-time data processing at scale.

My two favorite statements (once again: not really a surprise, as I see the same at many other customers):

“Scaling Kafka is really inexpensive”
“Kafka adopted and integrated within 3 months”

Watch the full talk from Marcos Carballeira Rodríguez from Lufthansa recorded at the Confluent Streaming Days 2020 to see all the architectures and quotes from Lufthansa.

And check out this exciting video recording of Lufthansa discussing their Kafka use cases for middleware modernization and machine learning:

Singapore Airlines – Predictive Maintenance with Kafka Connect, Kafka Streams, and ksqlDB

Singapore Airlines is an early adopter of KSQL to continuously process sensor data and apply analytic models to the events. They already talked about their Kafka ecosystem usage (including Kafka Connect, Kafka Streams, and KSQL) back in 2018. The use case is predictive maintenance with a scalable real-time infrastructure, as you can see in my summary slide:

Check out the complete slide deck from Singapore Airlines for more details.

Air France Hop – Scalable Real-Time Microservices

I really like the Kafka Summit talk title: “Hop! Airlines Jets to Real-Time“. Air France Hop leverage Change Data Capture (CDC) with HVR and Kafka for real-time data processing and integration with legacy monoliths. A pretty common pattern to integrate the old and the new software and IT world:

The complete slide deck and on-demand video recording about this case study are available on the Kafka Summit page.

Amadeus – Real-Time and Batch Log Processing

As I said initially, Kafka is not just relevant for each airline, airport, and aircraft manufacturers. The global distribution system (GDS) from Amadeus is one of the world’s biggest (competing mainly with Sabre). Passenger name record (PNR) is a record in the computer reservation system (CRS) and a crucial part of any GDS vendor. While many end-users don’t even know about Amadeus, the aviation industry could not survive without them. Their workloads are mission-critical and need to run 24/7 in real-time, plus connect to their partners’ systems (like an airline) in a very stable and mature manner!

Amadeus is relying on Apache Kafka for both real-time and batch data processing, as they explain on the official Apache Kafka website:

Streaming Data Exchange for the Travel Industry

After looking at some examples, let’s now cover one more key topic: Data integration and correlation between partners in the aviation industry. Airline, airports, GDS, travel companies, and many other companies need to integrate very well. Obviously, this is already implemented. Otherwise, there is no way to operate flights with passengers and cargo. At least in theory. Honestly, one of the most significant pain points of the travel industry for customers is bad integration across companies. Some examples:

Late or (even worse) no notification about a delay or cancellation
Issues with the display of available seats or upgrade
Broken booking process on the website because of different flight numbers, connecting flights,
Booking class issues for upgrades or rebookings
Display of technical error messages instead of business information (for instance, I can’t count how often I had seen an “IBM WebSphere” error message when I tried to book a flight on the website of my most commonly used airline)
The list goes on and on and on… No matter which airline you pick. That’s at my experience as a frequent traveler across all continents and timezones.

There are reasons for these issues. The aviation network is very complex. For instance, Lufthansa group sells tickets for all their own brands (like Swiss or Austrian Airlines), plus tickets from Star Alliance partners (such as United or Singapore Airlines). Hence, airline, airports, GDS, and many partner systems have to work together. 24/7. In real-time. For this reason, more and more companies in the aviation industry rely on Kafka internally.

But that’s only half of the story… How do you integrate with partners?

Event Streaming vs. REST / HTTP APIs

I explored the discussion around event streaming with Kafka vs. RESTful web services with HTTP in much more detail in another article: “Comparison: Apache Kafka vs. API Management / API Gateway tools like Mulesoft or Kong“. In short: Kafka and REST APIs have their trade-offs. Both are complementary and used together in many architectures. API Management is a great add-on for many applications and microservices, no matter if they are built with HTTP or Kafka under the hood.

But one point is clear: If you need a scalable real-time integration with a partner system, then HTTP is not the right choice. You can either pick gRPC as a request-response alternative or use Kafka natively for the integration with partners, as you use it internally already anyway:

Kafka-native replication between partners works very well. No matter what Kafka vendor and version you and your partner are running. Obviously, the biggest challenge is the security (not from a technical but an organizational perspective). Kafka requires TCP. That’s much harder to get approval for opening it to a partner than HTTP ports.

But from a technical point of view, streaming replication often makes much more sense. I have seen the first customers implementing integration via tools like Confluent Replicator. I am sure that we will see this pattern much more in the future and with better out-of-the-box tool support from vendors.

Data Integration and Correlation at an Airport with Airline Data using KSQL

So, let’s assume that you have the data streams connected at an airport. No matter if just internal data or also partner data. Data correlation adds the business value. Sönke Liebau from OpenCore presented a great airport demo with Kafka and KSQL at a Kafka Summit.

Let’s take a look at some events at an airport:

These events exist in various structures and with different technologies and formats. Some data streams arrive in real-time. However, some other data sets come from a monolithic mainframe in batch via a file integration. Kafka Connect is a Kafka-native middleware to implement this integration.

Afterward, all this data needs to be correlated with historical data from a loyalty system or relational database. This is where stream processing comes into play: This concept enables the continuous data correlation in real-time at scale. Kafka-native technologies like Kafka Streams or ksqlDB exist to build streaming ETL pipelines or business applications.

The following example correlates the gate information from the airport with the airline flight information to send a delay notification to the customer who is waiting for the connection flight:

Tons of use cases exist to leverage event streams from different systems (and partners) in real-time. Some examples from an airport perspective:

Location-based services while the customer is walking through the airport and waiting for the flight. Example: Coupons for a restaurant (with many empty seats or food reserves to thrash if not sold during the day)
Airline services such as free or points-based discounted lounge entrance (because the lounge tracking systems knows that it is almost empty right now anyway)
Partner services like notifying the airport hotel that the guest can stay longer in the room because of a long delay of the upcoming flight

The list of opportunities is almost endless. However, most use cases are only possible if all systems are integrated and data is continuously correlated in real-time. If you need some more inspiration, check out the two blogs “Kafka at the Edge in a Smart Retail Store” and “Kafka in a Train for Improved Customer Experience“. All these use cases are a perfect fit for airline, airports, and their partner ecosystem.

Slides – Apache Kafka in the Aviation, Airline and Travel Industry

The following slide deck goes into more detail:

Kafka for Improved Operations and Customer Experience in the Aviation Industry

This post explored various use cases for event streaming with Apache Kafka in the aviation industry. Airline, airports, aerospace, flight safety, manufacturing, GDS, retail, and many more partners rely on Apache Kafka.

No question: Kafka is getting mainstream these months in the aviation industry. Serverless and consumption-based offerings such as Confluent Cloud boost the adoption even more. A streaming data exchange between partners is the next step I see on the horizon. I am looking forward to Kafka-native interfaces from Open APIs of enterprises, better support for streaming interfaces in API Management tools, and COTS solution from software vendors.

What are your experiences and plans for event streaming in the aviation industry? Did you already build applications with Apache Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Airline, Aviation and Travel Industry appeared first on Kai Waehner.

A Hybrid Streaming Architecture for Smart Retail Stores with Apache Kafka

Kai Waehner — Mon, 01 Feb 2021 07:54:27 +0000

Event Streaming with Apache Kafka disrupts the retail industry. Walmart’s real-time inventory system and Target’s omnichannel distribution and logistics are two great examples. This blog post explores a concrete use case as part of the overall story: A hybrid streaming architecture to build smart retail stores for autonomous or disconnected edge computing and replication to the cloud with Apache Kafka.

Disruption of the Retail Industry with Apache Kafka

Various deployments across the globe leverage event streaming with Apache Kafka for very different use cases. Consequently, Kafka is the right choice, whether you need to optimize the supply chain, disrupt the market with innovative business models, or build a context-specific customer experience.

I explored the use cases of Apache Kafka in retail in a dedicated blog post: “The Disruption of Retail with Event Streaming and Apache Kafka“. Learn about the real-time inventory system from Walmart, omnichannel distribution and logistics at Target, context-specific customer 360 at AO.com, and much more.

This post shows a specific example: The smart retail store and its connection to cloud applications. The example uses AWS. Of course, any other cloud infrastructure can be used instead, such as Azure, GCP, or Alibaba. Walgreens is a great real-world example for building smart retail with 5G and mobile edge computing (MEC) deployments to their 9000 stores.

A Hybrid Streaming Architecture with Apache Kafka for the Smart Retail Store

Multiple Kafka clusters are the norm, not an exception! Hybrid architecture requires Kafka clusters in one or more clouds and in local data centers. In the meantime, the trend goes even further: Plenty of use cases exist for Kafka at the edge (i.e., outside the data center).

In retail, the best customer experience and increased revenue require edge processing with low latency. Often, the internet connection is bad, too. Hence, hybrid Kafka architectures make a lot of sense:

The bi-directional communication between each edge site and a central Kafka cluster is possible with Kafka-native tools such as Mirrormaker 2 or Confluent’s Cluster Linking.

The cloud is best for aggregation use cases, data lakes, data warehouses, integration with 3rd party SaaS, etc. However, many retail use cases need to run at the edge even if there is no internet connection.

Edge Processing and Analytics in the Retail Store

Many retail stores have a bad internet connection that is not stable and has low bandwidth. Hence, the digital transformation in retail requires data processing at the edge:

Kafka at the edge includes various use cases in a retail store:

Low latency transactions like payment processing at the point of sale (Kafka is not hard real-time, but able to process data milliseconds end-to-end)
Location-based services such as context-specific recommendations and advertisements (including machine learning and model inference)
Integration with other edge applications and devices (sensors, cameras, mobile apps, etc.)
Pre-processing before replication to the cloud (filtering, aggregations, etc.) to reduce costs
Buffering and backpressure handling (real decoupling between applications and between edge and cloud)

The Autonomous (or Disconnected) Edge: An Offline Retail Store

Many architectures don’t do real edge processing. They just connect the clients at the edge to the backends in the cloud. This is fine for some use cases. However, several good reasons exist to deploy Kafka at the edge beyond replication to the cloud:

Always on – process edge data even if you don’t have a (good) internet connection
Backpressure handling – decouple the edge from the cloud if there is no stable connection to the cloud
Reduced traffic costs – it does not make sense to replicate all sensor data etc. to the cloud
Low latency and edge data processing are key for some use cases – for instance, context-specific and location-based customer notifications don’t make sense if the person already walked away from a product or even out of your store already (please note that Kafka is NOT hard real-time, though!)
Analytics – Machine Learning in the cloud is great to train models (and Kafka is a key piece of the ML story, too), but the model inference at scale in real-time (with Kafka) can only happen at the edge

With Kafka at the edge, you can solve all these scenarios with a single technology, including non-real-time use cases:

Real-World Example: Swimming Retail Stores at Royal Caribbean

Royal Caribbean is a cruise line. It operates the four largest passenger ships in the world. As of January 2021, the line operates twenty-four ships and has six additional ships on order.

Royal Caribbean implemented one of the most famous use cases for Kafka at the edge. Each cruise ship has a Kafka cluster running locally for use cases such as payment processing, loyalty information, customer recommendations, etc.:

All the reasons I described above apply for Royal Caribbean:

Bad and costly connectivity to the internet
The requirement to do edge computing in real-time for a seamless customer experience and increased revenue
Aggregation of al the cruise trips in the cloud for analytics and reporting to improve the customer experience, upsell opportunities, and many other business processes

Hence, a Kafka cluster on each ship enables local processing and reliable, mission-critical workloads. The Kafka storage guarantees durability, no data loss, and guaranteed ordering of events – even though they are processed later. Only very critical data is sent directly to the cloud (if there is connectivity at all). All other data is replicated to the central Kafka cluster in the cloud when the ship arrives in a harbor for a few hours. A stable internet connection and high bandwidth are available before leaving for the next trip again.

Obviously, the same architecture can be applied to traditional retail stores on land in malls or other buildings.

Kafka at the Edge is the New Black!

The “Kafka at the edge” story is coming up more and more. Obviously, it is not just relevant for retail stores but also for bank branches, restaurants, factories, cell towers, stadiums, and hospitals.

5G will be a key reason for Kafka’s success at the edge (and edge computing in general). The better you can connect things at the edge, the more you can do with it there. The example of building a smart factory with Kafka and a private 5G campus network goes into more detail. Streaming machine learning with Apache Kafka at the edge is the new black! This is true for many use cases, including advanced planning, payment and fraud detection, or customer recommendations.

What are your experiences and plans for event streaming in the retail industry or with Kafka at the edge (outside the data center)? Did you already build applications with Apache Kafka? Check out the “Infrastructure Checklist for Apache Kafka at the Edge” if you plan to go that direction!

Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post A Hybrid Streaming Architecture for Smart Retail Stores with Apache Kafka appeared first on Kai Waehner.

Apache Kafka for Supply Chain Management (SCM) Optimization

Kai Waehner — Tue, 15 Dec 2020 09:43:46 +0000

Supply Chain Management (SCM) involves planning and coordinating all the people, processes, and technology involved in creating value for a company. This includes cross-cutting processes, including purchasing/procurement, logistics, operations/manufacturing, and others. Automation, robustness, flexibility, real-time, and hybrid deployment (edge + cloud) are key for future success, no matter what industry. This blog explores how Apache Kafka helps optimize a supply chain providing decoupled microservices, data integration, real-time analytics, and more…

The following topics are covered:

Definition of supply chain management
Challenges of current supply chains
Event streaming to optimize the supply chain
Use cases and real-world enterprise examples for Apache Kafka deployments

Supply Chain Management (SCM)

Supply chain management (SCM) covers the management of the flow of goods and services. It involves the movement and storage of raw materials, work-in-process inventory, and finished goods, and an end to end order fulfillment from the point of origin to the point of consumption. Interconnected, interrelated, or interlinked networks, channels, and node businesses combine to provide products and services required by end customers in a supply chain.

SCM is often defined as the design, planning, execution, control, and monitoring of supply chain activities to create net value, build a competitive infrastructure, leverage worldwide logistics, synchronize supply with demand, and measure performance globally.

Supply chain management is the broad range of activities required to plan, control, and execute a product’s flow from materials to production to distribution in the most economical way possible. SCM encompasses the integrated planning and execution of processes required to optimize the flow of materials, information, and capital in functions that broadly include demand planning, sourcing, production, inventory management, and logistics.

SCOR (Supply-Chain Operations Reference Model)

There are a variety of supply-chain models, which address both the upstream and downstream elements of SCM. The SCOR (Supply-Chain Operations Reference) model was developed by a consortium of industry and the non-profit Supply Chain Council (now part of APICS). SCOR became the cross-industry de facto standard defining the scope of supply-chain management.

SCOR measures total supply-chain performance. It is a process reference model for supply-chain management, spanning from the supplier’s supplier to the customer’s customer. It includes delivery and order fulfillment performance, production flexibility, warranty, returns processing costs, inventory and asset turns, and other factors in evaluating a supply chain’s overall effective performance.

Here is an example of the SCOR framework levels:

Challenges within the Evolving Supply Chain Processes in a Digital Era

The above definition of SCM and the related SCOR model shows how complex supply chain processes and solutions are. Here are the top 5 key challenges of supply chains:

Time Frames are Shorter
Rapid Change
Zoo of Technologies and Products
Historical Models are No Longer Viable
Lack of visibility

Let’s explore the challenges in more detail…

Challenge 1: Time Frames are Shorter

Challenge 2: Rapid Change

Challenge 3: Historical Models are No Longer Viable

Challenge 4: Lack of visibility

Lack of plan visibility leads to inventory and resource utilization imbalances. Imbalances mean waste (overproduction) and uncaptured revenue (underproduction). Here are some stats:

69% of companies lack complete visibility of their supply chains (BCI Supply Chain Resilience Report)
78% of companies lack a proactive response supply chain network (Logistics Bureau)
85% of companies report inadequate visibility into production operations (GEODIS Supply Chain Worldwide Survey)

Challenge 5: Zoo of Technologies and Products

A zoo of supply chain technologies and products needs to be integrated and modernized. Here are a few examples:

Check out my blog post about “integration alternatives and connectors for Apache Kafka and SAP standard software” to explore how complex such an integration environment typically looks like.

Are more detailed explanation of these supply chain challenges (and the related solutions) is discussed in the Confluent webinar recording done by me with experts from Expero: Supply Chain Optimization with Event Streaming and the Apache Kafka Ecosystem.

Consequences of the Supply Chain Challenges

The consequences of all these challenges are horrible for an enterprise supply chain:

Missed orders
Lost revenue
Expediting fees
Contract penalties
Frustrated customers
…

So let’s talk about how event streaming with Apache Kafka can help to fix these problems.

Why Apache Kafka for Supply Chain Optimization?

Solving the requirements described above usually requires various of the characteristics and features Kafka and its ecosystem provide with one single technology and infrastructure:

Real-time messaging (at scale, mission-critical)
Global Kafka (edge, data center, multi-cloud)
Cloud-native (open, flexible, elastic)
Data integration (legacy + modern protocols, applications, communication paradigms)
Data correlation (real-time + historical data, omni-channel)
Real decoupling (not just messaging, but also infinite storage + replayability of events)
Real-time monitoring
Transactional data and analytics data (MES, ERP, CRM, SCM, …)
Applied machine learning (model training and scoring)
Cybersecurity
Complementary to legacy and cutting-edge technology (Mainframe, PLCs, 3D printing, augmented reality, …)

Use Cases for Apache Kafka in the Supply Chain

The supply chain is obviously a huge topic. Plenty of different use cases leverage Apache Kafka. Here are just a few examples to give a feeling about the width of possibilities:

Examples for Supply Chain Optimization with Apache Kafka Across Industries

This section explores very different use cases at enterprises across industries from carmakers (Audi, BMW, Porsche), retailers (Walmart), and food manufacturing (Baader). All content comes directly from the public talks and blog posts which their employees created and published. Exciting to see how many different problems event streaming can solve!

Manufacturing of Food Machinery @ Baader

BAADER is a worldwide manufacturer of innovative machinery for the food processing industry. They run an IoT-based and data-driven food value chain on Confluent Cloud.

The Kafka-based infrastructure provides a single source of truth across the factories and regions across the food value chain. Business-critical operations are available 24/7 for tracking, calculations, alerts, etc.

Integrated Sales, Manufacturing, Connected Vehicles and Charging Stations @ Porsche

Kafka provides real decoupling between applications. Hence, Kafka became the defacto standard for microservices and Domain-driven Design (DDD). It allows to build independent and loosely coupled, but scalable, highly available, and reliable applications.

That’s exactly what Porsche describes for their usage of Apache Kafka through its supply chain:

“The recent rise of data streaming has opened new possibilities for real-time analytics. At Porsche, data streaming technologies are increasingly applied across a range of contexts, including warranty and sales, manufacturing and supply chain, connected vehicles, and charging stations” writes Sridhar Mamella (Platform Manager Data Streaming at Porsche).

The following picture shows the event hub which Heiko Scholtes from PorscheDev explored in one of their blog posts:

This architecture was published in the mid of 2017 already. Hence, Porsche already uses Kafka for a long time in their projects. That’s a pretty common pattern for Kafka: Build one pipeline. Then let more and more consumers use the data. Some real-time or near real-time, some others via batch processes or request-response interfaces.

The Confluent Podcast also features the story around Porsche’s event streaming platform Streamzilla, built on top of Kafka. Check out this podcast from December 2020 to hear directly from Porsche.

Real-Time Inventory System @ Walmart

A real-time inventory is a key piece of a modern supply chain. Many companies even require it to stay competitive and to provide a good customer experience. Business models such as “order online, pick up in the store” are impossible without real-time inventory and supply chain.

Walmart is a great example. They leverage Apache Kafka as the heart of their supply chain:

Let’s quote Suman Pattnaik, Big Data Architect @ Walmart:

“Retail shopping experiences have evolved to include multiple channels, both online and offline, and have added to a unique set of challenges in this digital era. Having an up to date snapshot of inventory position on every item is an essential aspect to deal with these challenges. We at Walmart have solved this at scale by designing an event-streaming-based, real-time inventory system leveraging Apache Kafka… Like any supply chain network, our infrastructure involved a plethora of event sources with all different types of data”.

The real-time infrastructure around Apache Kafka includes the whole supply chain, including distribution centers, stores, vendors, and customers:

Please find out more details about Walmart’s Kafka usage in their fantastic Kafka Summit talk.

Supply Chain Purchasing using Deep Learning @ BMW

BMW leverages real-time Natural Language Processing (NLP) in various use cases. For example, the implementation of digital contract intelligence enables the automation of the processing and analysis of legal documents.

BMW built an industry-ready NLP service framework based on Kafka for smart information extraction and search, automated risk assessment, plausibility checks and negotiation support:

Check out the details in BMW’s Kafka Summit talk about their use cases for Kafka and Deep Learning / NLP.

Connected Cars for Aftersales and Customer 360 @ Audi

Audi has built a connected car infrastructure with Apache Kafka. Their Kafka Summit keynote explored the use cases and architecture:

Use cases include:

Real-Time Data Analysis
Swarm Intelligence
Collaboration with Partners
Predictive AI
…

Depending on how you define the term and buzzword “Digital Twin”, this is a perfect example: All sensor data from the connected cars are processed in real-time and stored for historical analysis and reporting.

Track&Trace for Construction Management @ Bosch

The global supplier Bosch has another impressive use case for a “Digital Twin” leveraging Apache Kafka and Confluent Cloud: Construction site management analyzing sensors, machines, and workers. Use cases include collaborative planning, inventory and asset management, and track, manage, and locate tools and equipment anytime and anywhere:

If you are interested in more details about building a digital twin with the Apache Kafka ecosystem, check out more material here: “Apache Kafka for Building a Digital Twin IoT Infrastructure“.

Kafka and Blockchain for Supply Chain Management

If there is one use case where blockchain really makes sense, then it is supply chain management. Blockchain provides features to support cross-company interaction securely. However, blockchain is also very complex and immature. I have not seen many projects where the added value is bigger than the added cost and risk. Often, Kafka is “good enough”. But let’s be clear: Kafka is NOT a Blockchain:

Having said this, please be aware:

Many blockchain products are not really a blockchain, but just a distributed ledger.
Many projects don’t require all the features of a blockchain.
Tamper-proof storage on disk and end-to-end payload encryption (often applied on field/attribute level) are not part of Kafka but can be added with some nice add-ons).
Cross-company integration with non-trusted parties is the only real reason when a blockchain is needed and adds value.

Hence, make sure to define all requirements and then evaluate if you need Kafka, a blockchain, or a combination of both:

If you want to learn more about the relation between Apache Kafka and blockchain projects, check out this material: “Apache Kafka as Part of a Blockchain Project and its Relation to Frameworks like Hyperledger and Ethereum“.

Slides and Video Recording

Here are the slides and video recording exploring the optimization of supply chains with the Apache Kafka ecosystem in more detail:

Slide Deck

Video Recording

What are your experiences with Supply Chain Management architectures, applications, and optimization? Did you already implement a more automated, scalable, real-time supply chain? Which approach works best for you? What is your strategy? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka for Supply Chain Management (SCM) Optimization appeared first on Kai Waehner.

Kafka and XML Messages – Transformation, Connector, Middleware

Kai Waehner — Fri, 25 Sep 2020 07:20:15 +0000

XML messages and XML Schema are not very common in the Apache Kafka and Event Streaming world! Why? Many people call XML legacy. It is complex, verbose, and often associated with the ugly WS-* Hell (SOAP, WSDL, etc). On the other side, every company older than five years uses XML. It is well understood, provides a good structure, and is human- and machine-readable.

This post does not want to start another flame war between XML and other technologies such as JSON (which also provides JSON Schema now), Avro, or Protobuf. Instead, I will walk you through the three main approaches to integrate between Kafka and XML messages as there is still a vast demand for implementing this integration today (often for integrating legacy applications and middleware).

XML and XML Schema

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium’s XML 1.0 Specification of 1998 and several other related specifications – all of them free open standards – define XML.

The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services. Several schema systems exist to aid in defining XML-based languages, while programmers have developed many application programming interfaces (APIs) to assist the processing of XML data.

SOAP / WSDL Web Services – The WS-* Hell

Web Services use XML messages that follow the SOAP standard and have been popular with traditional enterprises for many years. In such systems, there is often a machine-readable description of the operations offered by the service written in the Web Services Description Language (WSDL). Web Services are one of the predominant use cases for XML integration. Some people call this the “WS-* Hell”:

Kafka for any Data Format (JSON, XML, Avro, Protobuf, …)

Kafka can store and process anything, including XML. The Kafka brokers are dumb. They don’t care about data formats. The implementation of Kafka under the hood stores and processes only byte arrays. This approach follows the design principle of dumb pipes and smart endpoints (coined by Martin Fowler for microservice architectures). Dumb brokers are one of the architectural reasons why Kafka scales and performs so well.

As Kafka supports any data format, XML is no problem at all. It accepts any serializable input format. XML is just text so that plain string serializers can be used. However, if you want additional validation before pushing messages into Kafka (like checking the content is actually XML or doing Schema Validation using XML Schema), then you need to write your own XML Serializer/Deserializer implementation.

XML mappings can be very complex, including referencing other documents and using plenty of ugly open standards. XML includes generic standards such as WS-Security or industry-specific standards such as XBRL for regulatory processing or HL7 for healthcare. It is a mess. Period. Even mature tools struggle as soon as any parts of such a standard are adjusted to their own needs (even though the standards support customization).

Kafka works with any programming language, and Confluent also provides a REST Proxy for HTTP(S) communication with Kafka. But these clients require the developer to implement all the complex XML mapping and processing. Hence, let’s now talk about the most common approaches to implement the integration between any XML-based application and Apache Kafka: 3rd party middleware (ETL / ESB tools) and Kafka-native Kafka Connect.

Kafka and XML via Middleware (ETL, ESB)

Apache Kafka and traditional middleware (ETL, ESB) are frenemies (friends and enemies). Check out this blog and video recording/slide deck for a more in-depth discussion and comparison. If these legacy integration tools such as TIBCO BusinessWorks or Software AG webMethods do one thing well, then it is graphical mappings of complex XML structures, including good and mature (but not 100%) support of the ugly WS-* web service standards.

Pros of 3rd Party Middleware for XML-Kafka Integration

Visual coding for a more straightforward mapping experience (especially crucial for very complex structures) – for all coders: Trust me, this is really easier and more time-efficient than writing, testing, and debugging source code!
Mature (10-20 years old technologies don’t have many bugs anymore – if they are still alive)
Support of complex XML Schema structures (but yet often issues with import, UI, and export)
(Often) already in place (implemented, tested, and deployed)
Kafka integration exists for any middleware which is still alive (i.e., maintained and supported by the vendor)

Cons of 3rd Party Middleware for XML-Kafka Integration

Products are legacy – as old as the source systems
Monolithic, inflexible architecture
Separate infrastructure to operate, test, maintain and pay
End-to-end integration is more challenging (from a technical and support perspective) as two systems in the middle instead of just one
Licensing
Point-to-point and tight coupling, and not event-based streaming with real decoupling
(Often) proprietary solution

Traditional middleware (such as TIBCO, IBM, Software AG, or Mulesoft) complements Kafka deployments. If you have the middleware already running and licensed (and do not plan to migrate away from it), then this is a viable approach to integrate between XML messages from legacy systems and Kafka.

Kafka and XML via Kafka Connect

The open Kafka ecosystem provides Kafka-native support for XML integration leveraging Kafka Connect. Kafka Connect is a Kafka-native tool for scalably and reliably streaming data integration between Apache Kafka and other data systems. It makes it simple to quickly define connectors that move large data sets into and out of Kafka. Think about it as ESB-on-Kafka.

Use cases include

Messaging integration (MQ)
Mainframe offloading
File outputs from batch processes
Web services (SOAP / WSDL / WS-*)
Legacy applications

Pros

Kafka-native (leveraging Kafka under the hood for scalability, throughput, high availability, exactly-once semantics, low latency, etc.)
Decoupled design (Domain-driven Microservice approach instead of tight coupling)
Open ecosystem and flexible integration with any data sources and sinks – check out Confluent Hub for hundreds of open source and commercial connectors
Cloud-native to be deployed in any edge / on-premise or cloud infrastructure such as Kubernetes

Cons

Limited support for complex XML Schemas and standards – not all ugly documents will work well
No visual coding – unfortunately, no Kafka-native visual coding tools exist in 2020. Let’s go, Confluent!

Let’s take a look at two Kafka Connect approaches in more detail: A dedicated XML Connector and an SMT (Single Message Transformation) embedded into any Kafka Connect source or sink connector.

Kafka Connect Connector for XML Files

An XML connector directly accesses the XML file to parse and transform the content:

Connect FilePulse is an open-source Kafka Connect connector built by streamthoughts to parse, transform and stream any XML file. Other file formats are also supported. But as many other tools support the modern data formats such as JSON, CSV, Avro, or Protobuf, I really think the highlight of this connector is the XML support.

Features:

Support for recursive scanning of local directories.
Reading and writing files into Kafka line by line.
Support multiple input file formats (e.g: CSV, JSON, AVRO, XML).
Parsing and transforming data using built-in or custom processing filters
Error handler definition
Monitoring files while they are written into Kafka
Support pluggable strategies to cleanup up completed files

Here is an excellent tutorial for using this XML connector for Kafka Connect: Streaming data into Kafka – Loading an XML file.

The Connect FilePulse Kafka Connector is the right choice for direct integration between XML files and Kafka.

SMT for Embedding XML Transformations into ANY Kafka Connect Connector

An SMT (Single Message Transformation) is part of the Kafka Connect framework. SMTs are applied to messages as they flow through Kafka Connect. They transform inbound messages after a source connector has produced them, but before they are written to Kafka. SMTs transform outbound messages before they are sent to a sink connector.

An SMT can be embedded into any Kafka Connect source or sink connector. Hence, the XML SMT for Kafka Connect allows direct integration with any interface and mapping XML messages without the need for storing the file or using a specific XML connector.

SMTs even allow to add or change metadata, e.g., by adding a new header in addition to the key and value of the message.

Here is an example: Receive XML messages from JMS-based messaging platforms and convert the XML payload to JSON, AVRO, or Protobuf for further processing and integration into the rest of the (modern) enterprise architecture. For instance, Confluent provides a generic JMS connector but also dedicated connectors for various legacy MQ products such as IBM MQ (often running on the mainframe), TIBCO EMS, and ActiveMQ. The XML SMT allows on-the-fly transformation of the incoming XML messages. I have seen integration and later replacement of these MQ tools across the globe in any kind of industry.

Dead Letter Queue (DLQ) for Handling Bad XML Messages

Just transforming messages is often not sufficient. A Dead Letter Queue (DLQ), aka Dead Letter Channel, is an Enterprise Integration Pattern (EIP) to handle bad messages. This design pattern is complementary for XML integration. For instance, a DLQ can store badly processed XML that didn’t fit the XSD in the transform. Here is an example of how to implement the rerouting to a DLQ using the above SMT.

Confluent Schema Registry for Data Governance

The Confluent Schema Registry is a complimentary (optional) tool. It provides a smart implementation of data format and content validation (including enforcement, versioning, and other features). I see it used in ~70% of Kafka projects across the globe. As soon as you do more than just data ingestion into a data lake like HDFS or S3, the added value is enormous. Today, Confluent Schema Registry supports JSON Schema, Avro, and Protobuf.

The Schema Registry provides an open interface and is pluggable. For example, some users have asked for Schema Registry to support XML. Now, you can add XML support to Schema Registry directly, and use the Schema Registry to store both XML and Avro at the same time. For more on how to add your schema formats, please refer to the documentation. The workaround is to do the discussed XML-to-another-format transformation first and start your event streaming data governance from that point.

Summary

XML is predominant in most enterprises and mostly used for legacy applications, batch processing, and SOAP / WSDL web services. A digital transformation can only be successful if the old world is connected well to the new world (as you might have learned in my example about how to integrate between Kafka and Mainframes).

This post explored the three most common options for integration between Kafka and XML:

XML integration with a 3rd party middleware
Kafka Connect connector for integration with XML files
Kafka Connect SMT to embed the transformation into ANY other Kafka Connect source or sink connector to transform XML messages on the flight

What are your experiences with XML integration for Kafka? Which implementation did you choose? What challenges did you face, and how did you or do you plan to solve this? What is your strategy? Let’s connect on LinkedIn and discuss it!

The post Kafka and XML Messages – Transformation, Connector, Middleware appeared first on Kai Waehner.