Enterprise Architecture Archives - Kai Waehner

The State of Data Streaming for Healthcare with Apache Kafka and Flink

Kai Waehner — Mon, 27 Nov 2023 13:52:35 +0000

This blog post explores the state of data streaming for the healthcare industry. The digital disruption combined with growing regulatory requirements and IT modernization efforts require a reliable data infrastructure, real-time end-to-end observability, fast time-to-market for new features, and integration with pioneering technologies like sensors, telemedicine, or AI/machine learning. Data streaming allows integrating and correlating legacy and modern interfaces in real-time at any scale to improve most business processes in the healthcare sector much more cost-efficiently.

I look at trends in the healthcare industry to explore how data streaming helps as a business enabler, including customer stories from Humana, Recursion, BHG (former Bankers Healthcare Group), Evernorth Health Services, and more. A complete slide deck and on-demand video recording are included.

General trends in the healthcare industry

The digitalization of the healthcare sector and disruptive use cases is exciting. Countries where healthcare is not part of the public administration innovate quickly. However, regulation and data privacy are crucial across the world. And even innovative technologies and cloud services need to comply with law and in parallel connect to legacy platforms and protocols.

Regulation and interoperability

Healthcare does often not have a choice. Regulations by the government must be implemented by a specific deadline. IT modernization, adoption of new technologies, and integration with the legacy world are mandatory. Many regulations demand Open APIs and interfaces. But even if not enforced, the public sector does itself a favour adopting open technologies for data sharing between different sectors and the members.

A concrete example: Interoperability and Patient Access final rule (CMS-9115-F), as explained by a US government, “aims to put patients first, giving them access to their health information when they need it most and, in a way, they can best use it.

Interoperability = Interoperability is the ability of two or more systems to exchange health information and use the information once it is received.
Patient Access = Patient Access refers to the ability of consumers to access their health care records.

Lack of seamless data exchange in healthcare has historically detracted from patient care, leading to poor health outcomes, and higher costs. The CMS Interoperability and Patient Access final rule establishes policies that break down barriers in the nation’s health system to enable better patient access to their health information, improve interoperability and unleash innovation while reducing the burden on payers and providers.

Patients and their healthcare providers will be more informed, which can lead to better care and improved patient outcomes, while reducing burden. In a future where data flows freely and securely between payers, providers, and patients, we can achieve truly coordinated care, improved health outcomes, and reduced costs.”

Digital disruption and automated workflows

Gartner has a few interesting insights about the evolution of the healthcare sector. The digital disruption is required to handle revenue reduction and revenue reinvention because of economic pressure, scarce and extensive talent, and supply challenges:

Gartner points out that real-time workflows and automation are critical across the entire health process to enable an optimal experience:

Therefore, data streaming is very helpful in implementing new digitalized healthcare processes.

Data streaming in the healthcare industry

Adopting healthcare trends like telemedicine, automated member service with Generative AI (GenAI), or automated claim processing are only possible if enterprises in the games sector can provide and correlate information at the right time in the proper context. Real-time, which means using the information in milliseconds, seconds, or minutes, is almost always better than processing data later:

Data streaming combines the power of real-time messaging at any scale with storage for true decoupling, data integration, and data correlation capabilities. Apache Kafka is the de facto standard for data streaming.

The following blog series about data streaming with Apache Kafka in the healthcare industry is a great starting point to learn more about data streaming in the health sector, including a few industry-specific and Kafka-powered case studies:

Data Streaming Use Cases and Architectures for Healthcare (including Slide Deck)
Legacy Modernization and Hybrid Cloud (Optum / UnitedHealth Group, Centene, Bayer)
Streaming ETL (Bayer, Babylon Health)
Real-time Analytics (Cerner, Celmatix, CDC/Centers for Disease Control and Prevention)
Machine Learning and Data Science (Recursion, Humana)
Open API and Omnichannel (Care.com, Invitae)

Architecture trends for data streaming

The healthcare industry applies various software development and enterprise architecture trends for cost, elasticity, security, and latency reasons. The three major topics I see these days at customers are:

Event-driven architectures (in combination with request-response communication) to enable domain-driven design and flexible technology choices
Data mesh for building new data products and real-time data sharing with internal platforms and partner APIs
Fully managed SaaS (whenever doable from compliance and security perspective) to focus on business logic and faster time-to-market

Let’s look deeper into some enterprise architectures that leverage data streaming for healthcare use cases.

Event-driven architecture for integration and IT modernization

IT modernization requires integration between legacy and modern applications. The integration challenges include different protocols (often proprietary and complex), various communication paradigms (asynchronous, request-response, batch), and SLAs (transactions, analytics, reporting).

Here is an example of a data integration workflow combining clinical health data and claims in EDI / EDIFACT format, data from legacy databases, and modern microservices:

One of the biggest problems in IT modernization is data consistency between files, databases, messaging platforms, and APIs. That is a sweet spot for Apache Kafka: Providing data consistency between applications no matter what technology, interface or API they use.

Data sharing across business units is important for any organization. The healthcare industry has to combine very interesting (different) data sets, like big data game telemetry, monetization and advertisement transactions, and 3rd party interfaces.

Data consistency is one of the most challenging problems in the games sector. Apache Kafka ensures data consistency across all applications and databases, whether these systems operate in real-time, near-real-time, or batch.

One sweet spot of data streaming is that you can easily connect new applications to the existing infrastructure or modernize existing interfaces, like migrating from an on-premise data warehouse to a cloud SaaS offering.

New customer stories for data streaming in the healthcare sector

The innovation is often slower in the healthcare sector. Automation and digitalization change how healthcare companies process member data, execute claim processing, integrate payment processors, or create new business models with telemedicine or sensor data in hospitals.

Most healthcare companies use a hybrid cloud approach to improve time-to-market, increase flexibility, and focus on business logic instead of operating all IT infrastructure on premises. The integration between legacy protocols like EDIFACT and modern applications is still one of the toughest challenges.

Here are a few customer stories from healthcare organizations for IT modernization and innovation with new technologies:

BHG Financial (formerly: Bankers Healthcare Group): Direct lender for healthcare professionals offering loans, credit card, insurance
Evernorth Health Services: Hybrid integration between on-premise mainframe and microservices on AWS cloud
Humana: Data integration and analytics at the point of care
Recursion: Accelerating drug discovery with a hybrid machine learning architecture

Resources to learn more

This blog post is just the starting point. Learn more about data streaming with Apache Kafka and Apache Flink in the healthcare industry in the following on-demand webinar recording, the related slide deck, and further resources, including pretty cool lightboard videos about use cases.

On-demand video recording

The video recording explores the healthcare industry’s trends and architectures for data streaming. The primary focus is the data streaming architectures and case studies.

I am excited to have presented this webinar in my interactive light board studio:

This creates a much better experience, especially in a time after the pandemic, where many people are “Zoom fatigue”.

Check out our on-demand recording:

Slides

If you prefer learning from slides, check out the deck used for the above recording:

Fullscreen Mode

Case studies and lightboard videos for data streaming in the healthcare industry

The state of data streaming for healthcare in 2023 is interesting. IT modernization is the most important initiative across most healthcare companies and organizations. This includes cost reduction by migrating from legacy infrastructure like the mainframe, hybrid cloud architectures with bi-directional data sharing, and innovative new use cases like telehealth.

We recorded lightboard videos showing the value of data streaming simply and effectively. These five-minute videos explore the business value of data streaming, related architectures, and customer stories. Here is an example of cost reduction through mainframe offloading.

Healthcare is just one of many industries that leverages data streaming with Apache Kafka and Apache Flink.. Every month, we talk about the status of data streaming in a different industry. Manufacturing was the first. Financial services second, then retail, telcos, gaming, and so on… Check out my other blog posts.

How do you modernize IT infrastructure in the healthcare sector? Do you already leverage data streaming with Apache Kafka and Apache Flink? Maybe even in the cloud as a serverless offering? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post The State of Data Streaming for Healthcare with Apache Kafka and Flink appeared first on Kai Waehner.

Data Streaming is not a Race, it is a Journey!

Kai Waehner — Thu, 13 Apr 2023 07:15:31 +0000

Data Streaming is not a race, it is a Journey! Event-driven architectures and technologies like Apache Kafka or Apache Flink require a mind shift in architecting, developing, deploying, and monitoring applications. Legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud setups are the norm, not an exception. This blog post explores success stories from data streaming journeys across industries, including banking, retail, insurance, manufacturing, healthcare, energy & utilities, and software companies.

Data Streaming is a Journey, not a Race!

Confluent’s maturity model is used across thousands of customers to analyze the status quo, deploy real-time infrastructure and applications, and plan for a strategic event-driven architecture to ensure success and flexibility in the future of the multi-year data streaming journey:

Source: Confluent

The following sections show success stories from various companies across industries that moved through the data streaming journey. Each journey looks different. Each company has different technologies, vendors, strategies, and legal requirements.

Before we begin, I must stress that these journeys are NOT just successful because of the newest technologies like Apache Kafka, Kubernetes, and public cloud providers like AWS, Azure, and GCP. Success comes with a combination of technologies and expertise.

Consulting – Expertise from Software Vendors and System Integrators

All the below success stories combined open source technologies, software vendors, cloud providers, internal business and technical experts, system integrators, and consultants of software vendors.

Long story short: Technology and expertise are required to make your data streaming journey successful.

We not only sell data streaming products and cloud services but also offer advice and essential support. Note that the bundle numbers (1 to 5) in the following diagram are related to the above data streaming maturity curve:

Source: Confluent

Other vendors have similar strategies to support you. The same is true for the system integrators. Learn together. Bring people from different companies into the room to solve your business problems.

With this background, let’s look at the fantastic data streaming journeys we heard about at past Kafka Summits, Data in Motion events, Confluent blog posts, or other similar public knowledge-sharing alternatives.

Customer Journeys across Industries powered by Apache Kafka

Apache Kafka is the DE FACTO standard for data streaming. However, each data streaming journey looks different. Don’t underestimate how much you can learn from other industries. These companies might have different legal or compliance requirements, but the overall IT challenges are often very similar.

We cover stories from various industries, including banks, retailers, insurers, manufacturers, healthcare organizations, energy providers, and software companies.

The following data streaming journeys are explored in the below sections:

AO.com: Real-Time Clickstream Analytics
Nordstrom: Event-driven Analytics Platform
Migros: End-to-End Supply Chain with IoT
NordLB: Bank-wide Data Streaming for Core Banking
Raiffeisen Bank International: Strategic Real-Time Data Integration Platform Across 13 Countries
Allianz: Legacy Modernization and Cloud-Native Innovation
Optum (UnitedHealth Group): Self-Service Data Streaming
Intel: Real-Time Cyber Intelligence at Scale with Kafka and Splunk
Bayer: Transition from On-Premise to Multi-Cloud
Tesla: Migration from a Message Broker (RabbitMQ) to Data Streaming (Kafka)
Siemens: Integration between On-Premise SAP and Salesforce CRM in the Cloud
50Hertz: Transition from Proprietary SCADA to Cloud-Native Industrial IoT

There is no specific order besides industries. If you read the stories, you see that all are different, but you can still learn a lot from them, no matter what industry your business is in.

For more information about data streaming in a specific industry, just search my blog for case studies and architectures. Recent blog posts focus on the state of data streaming in manufacturing, in financial services, and so on.

AO.com – Real-Time Clickstream Analytics

AO.com is an electrical retailer. The hyper-personalized online retail experience turns each customer visit into a one-on-one marketing opportunity. The technical implementations correlation of historical customer data with real-time digital signals.

Years ago, Apache Hadoop and Apache Spark referred to this kind of clickstream analytics as the “Hello World” example. AO.com does real-time clickstream analytics powered by data streaming.

AO.com’s journey started with relatively simple data streaming pipelines leveraging the Schema Registry for decoupling and API contracts. Over time, the platform thinking of data streaming added more business value, and operations shifted to the fully-managed Confluent Cloud to focus on implementing business applications, not operations infrastructure.

Source: AO.com

Nordstrom – Event-driven Analytics Platform

Nordstrom is an American luxury department store chain. In the meantime, 50+% of revenue comes from online sales. They built the Nordstrom Analytical Platform (NAP) as the heart of its event-driven architecture. Nordstrom can use a singular event for analytical, functional, operational, and model-building purposes.

As Nordstrom said at Kafka Summit: “If it’s not in NAP, it didn’t happen.”

Nordstrom’s journey started in 1901 with the first stores. The last 25 years brought the retailer into the digital world and online retail. NAP has been a core component since 2017 for real-time analytics. The future is all automated decision-making in real-time.

Source: Nordstrom

Migros – End-to-End Supply Chain with IoT

Migros is Switzerland’s largest retail company, the largest supermarket chain, and the largest employer. They leverage data streaming with Confluent to distribute master data across many systems.

Migros’ supply chain is optimized with a single data streaming pipeline (including replaying entire days of events). For instance, real-time transportation information is visualized with MQTT and Kafka. Afterward, more advanced business logic was implemented, like forecasting the truck arrival time and planning, respectively, rescheduling truck tours.

The data streaming journey started with a single Kafka cluster. And grew to various independent Kafka clusters and a strategic Kafka-powered enterprise integration platform.

Source: Migros

NordLB – Bank-wide Data Streaming for Core Banking

Norddeutsche Landesbank (NordLB) is one of the largest commercial banks in Germany. They implemented an enterprise-wide transformation. The new Confluent-powered core banking platform enables event-based and truly decoupled stream processing. Use cases include improved real-time analytics, fraud detection, and customer retention.

Unfortunately, NordLB’s slide is only available in German. But I guess you can still follow their data streaming journey moving over the years from on-premise big data batch processing with Hadoop to real-time data streaming and analytics in the cloud with Kafka and Confluent:

Source: Norddeutsche Landesbank

Raiffeisen Bank International – Strategic Real-Time Data Integration Platform across 13 Countries

Raiffeisen Bank International (RBI) operates as a corporate and investment bank in Austria and as a universal bank in Central and Eastern Europe (CEE).

RBI’s bank transformation across 13 countries includes various components:

Bank-wide transformation program
Realtime Integration Center of Excellence (“RICE“)
Central platform and reference architecture for self-service re-use
Event-driven integration platform (fully-managed cloud and on-premise)
Group-wide API contracts (=schemas) and data governance

While I don’t have a nice diagram of RBI’s data streaming journey over the past few years, I can show you one of the most impressive migration stories. The context is horrible had to move from Ukraine data centers to the public cloud after the Russian invasion started in early 2022. The journey was super impressive from the IT perspective, as the migration happened without downtime or data loss.

RBI’s CTO recapped:

“We did this migration in three months, because we didn’t have a choice.”

Source: McKinsey

Allianz – Legacy Modernization and Cloud-Native Innovation

Allianz is a European multinational financial services company headquartered in Munich, Germany. Its core businesses are insurance and asset management. The company is one of the largest insurers and financial services groups.

A large organization like Allianz does not have just one enterprise architecture. This section has two independent stories of two separate Allianz business units.

One team of Allianz has built a so-called Core Insurance Service Layer (CISL). The Kafka-powered enterprise architecture is flexible and evolutionary. Why? Because it needs to integrate with hundreds of old and new applications. Some are running on the mainframe or use a batch file transfer. Others are cloud-native, running in containers or as a SaaS application in the public cloud.

Data streaming ensures the decoupling of applications via events and the reliable integration of different backend applications, APIs and communication paradigms. Legacy and new applications connect but also allow running in parallel (old V1 and new V2) before migrating away and shutting down from the legacy component.

Source: Allianz

Allianz Direct is a separate Kafka deployment. This business unit started with a greenfield approach to build insurance as a service in the public cloud. The cloud-native platform is elastic, scalable, and open. Hence, one platform can be used across countries with different legal, compliance, and data governance requirements.

This data streaming journey is best described by looking at a quote from Allianz Direct’s CTO:

Source: Linkedin (November 2022)

Optum – Self-Service Data Streaming

Optum is an American pharmacy benefit manager and healthcare provider
(UnitedHealth Group subsidiary). Optum started with a single data source connecting to Kafka. Today, they provide Kafka as a Service within the UnitedHealth Group. The service is centrally managed and used by over 200 internal application teams.

The benefits of this data streaming approach are the following characteristics: repeatable, scalable, and a cost-efficient way to standardize data. Optum leverages data streaming for many use cases, from mainframe via change data capture (CDC) to modern data processing and analytics tools.

Source: Optum

Intel – Real-Time Cyber Intelligence at Scale with Kafka and Splunk

Intel Corporation (commonly known as Intel) is an American multinational corporation and technology company headquartered in Santa Clara, California. It is one of the world’s largest semiconductor chip manufacturers by revenue.

Intel’s Cyber Intelligence Platform leverages the entire Kafka-native ecosystem, including Kafka Connect, Kafka Streams, Multi-Region Clusters (MRC), and more…

Their Kafka maturity curve shows how Intel started with a few data pipelines, for instance, getting data from various sources into Splunk for situational awareness and threat intelligence use cases. Later, Intel added Kafka-native stream processing for streaming ETL (to pre-process, filter, and aggregate data) instead of ingesting raw data (with high $$$ bills) and for advanced analytics by combining streaming analytics with Machine Learning.

Source: Intel

Brent Conran (Vice President and Chief Information Security Officer, Intel) described the benefits of data streaming:

“Kafka enables us to deploy more advanced techniques in-stream, such as machine learning models that analyze data and produce new insights. This helps us reduce the meantime to detect and respond.”

Bayer AG – Transition from On-Premise to Multi-Cloud

Bayer AG is a German multinational pharmaceutical and biotechnology company and one of the largest pharmaceutical companies in the world. They adopted a cloud-first strategy and started a multi-year transition to the cloud.

While Bayer AG has various data streaming projects powered by Kafka, my favorite is the success story of their Monsanto acquisition. The Kafka-based cross-data center DataHub was created to facilitate migration and to drive a shift to real-time stream processing.

Instead of using many words, let’s look at Bayer’s impressive data streaming journey from on-premise to multi-cloud, connecting various cloud-native Kafka and non-Kafka technologies over the years.

Source: Bayer AG

Tesla – Migration from a Message Broker (RabbitMQ) to Data Streaming (Kafka)

Tesla is a carmaker, utility company, insurance provider, and more. The company processes trillions of messages per day for IoT use cases.

Tesla’s data streaming journey is fascinating because it focuses on migrating from a message broker to stream processing. The initial reasons for Kafka were the scalability and reliability of processing high volumes of IoT sensor data.

But as you can see, more use cases were added quickly, such as Kafka Connect for data integration.

Source: Tesla

Tesla’s blog post details its advanced stream processing use cases. I often call streaming analytics the “secret sauce” (as data is only valuable if you correlate it in real-time instead of just ingesting it into a batch data warehouse or data lake).

Siemens – Integration between On-Premise SAP and Salesforce CRM in the Cloud

Siemens is a German multinational conglomerate corporation and the largest industrial manufacturing company in Europe.

One strategic data-streaming use case allowed Siemens to move “from batch to faster“ data processing. For instance, Siemens connected its SAP PMD to Kafka on-premise. The SAP infrastructure is very complex, with 80% ”customized ERP”. They improved the business processes and integration workflow from daily or weekly batches to real-time communication by integrating SAP’s proprietary IDoc messages within the event streaming platform.

Siemens later migrated from self-managed on-premise Kafka to Confluent Cloud via Confluent Replicator. Integrating Salesforce CRM via Kafka Connect was the first step of Siemens’ cloud strategy. As usual, more and more projects and applications join the data streaming journey as it is super easy to tap into the event stream and connect it to your favorite tools, APIs, and SaaS products.

Source: Siemens

A few important notes on this migration:

Confluent Cloud was chosen because it enables focusing on business logic and handing over complex operations to the expert, including mission-critical SLAs and support.
Migration from one Kafka cluster to another (in this case, from open source to Confluent Cloud) is possible with zero downtime and no data loss. Confluent Replicator and Cluster Linking, in conjunction with the expertise and skills of consultants, make such a migration simple and safe.

50Hertz – Transition from Proprietary SCADA to Cloud-Native Industrial IoT

50hertz is a transmission system operator for electricity in Germany. They built a cloud-native 24/7 SCADA system with Confluent. The solution is developed and tested in the public cloud but deployed in safety-critical air-gapped edge environments. A unidirectional hardware gateway helps replicate data from the edge into the cloud safely and securely.

This data streaming journey is one of my favorites as it combines so many challenges well known in any OT/IT environment across industries like manufacturing, energy and utilities, automotive, healthcare, etc.:

From monolithic and proprietary systems to open architectures and APIs
From batch with mainframe, files, and old Windows servers to real-time data processing and analytics
From on-premise data centers (and air-gapped production facilities) to a hybrid edge and cloud strategy, often with a cloud-first focus for new use cases.

The following graphic is fantastic. It shows the data streaming journey from left to right, including the bridge in the middle (as such a project is not a big bang but usually takes years, at a minimum):

Source: 50Hertz

Technical Evolution during the Data Streaming Journey

The different data streaming journeys showed you a few things: It is not a big bang. Old and new technology and processes need to work together. Only technology AND expertise drive successful projects.

Therefore, I want to share a few more resources to think about this from a technical perspective:

The data streaming journey never ends. Apache Kafka and Apache Flink increasingly join forces to build innovative real-time stream processing applications. We are just at the beginning.

What does your data streaming journey look like? Where are you right now, and what are your plans for the next few years? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Data Streaming is not a Race, it is a Journey! appeared first on Kai Waehner.

The Heart of the Data Mesh Beats Real-Time with Apache Kafka

Kai Waehner — Thu, 28 Jul 2022 06:08:38 +0000

If there were a buzzword of the hour, it would undoubtedly be “data mesh“! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. Learn how the de facto standard for data streaming, Apache Kafka, plays a crucial role in building a data mesh.

There is no single technology or product for a data mesh!

This post explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and – complemented by many other data platforms like a data warehouse, data lake, and lakehouse – solve real business problems.

There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The key outcome of a data mesh architecture is the ability to build data products; with the right tool for the job. A good data mesh combines data streaming technology like Apache Kafka or Confluent Cloud with cloud-native data warehouse and data lake architectures from Snowflake, Databricks, Google BigQuery, et al.

What is a data mesh?

I won’t write yet another article describing the concepts of a data mesh. Zhamak Dehghani coined the term in 2019. The following data mesh architecture from 30,000-foot view explains the basic idea well:

I summarize data mesh as the following three bullet points:

An architecture paradigm with several historical influences (domain-driven design, microservices, data marts, data streaming)
Not specific to a single technology or product; no single vendor can implement a data mesh alone
Handling data as a product is a fundamental change, enabling a more flexible architecture and independent solving of separate business problems
Decentralized services, not just analytics but also transactional workloads

Why handle data as a product?

Talking about innovative technology is insufficient to introduce a new architectural paradigm. Consequently, measuring the business value of the enterprise architecture is critical, too.

McKinsey finds that “when companies instead manage data like a consumer product—be it digital or physical—they can realize near-term value from their data investments and pave the way for quickly getting more value tomorrow. Creating reusable data products and patterns for piecing together data technologies enables companies to derive value from data today and tomorrow”:

For McKinsey, the benefits of this approach can be significant:

New business use cases can be delivered as much as 90 percent faster.
The total cost of ownership, including technology, development, and maintenance, can decline by 30 percent.
The risk and data-governance burden can be reduced.

What is data streaming with Apache Kafka and its relation to data mesh?

A data mesh enables flexibility through decentralization and best-of-breed data products. The heart of data sharing requires reliable real-time data at any scale between data producers and data consumers. Additionally, true decoupling between the decentralized data products is key to the success of the data mesh paradigm. Each domain must have access to shared data but also the ability to choose the right tool (i.e., technology, API, product, or SaaS) to solve its business problems.

That’s where data streaming fits into the data mesh story:

The de facto standard for data streaming is Apache Kafka. A cloud-native data streaming infrastructure that can link clusters with each other out-of-the-box enables building a modern data mesh. No Data Mesh will use just one technology or vendor. Learn from inspiring posts from your favorite data products vendors like AWS, Snowflake, Databricks, Confluent, and many more to successfully define and build your custom Data Mesh. Data Mesh is a journey, not a big bang. A data warehouse or data lake (or in modern days, a lakehouse) cannot be the only infrastructure for data mesh and data products.

I covered how to leverage the capabilities of Apache Kafka and its ecosystem like Kafka Connect, ksqlDB, Cluster Linking, etc. to build the heart of a data mesh in a separate blog post: Streaming Data Exchange with Kafka and a Data Mesh in Motion.

Example: Real-time data fabric in hybrid cloud

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

This example shows all the characteristics discussed in the above sections for a Data Mesh:

Decentralized real-time infrastructure across domains and infrastructures
True decoupling between domains within and between the clouds
Several communication paradigms, including data streaming, RPC, and batch
Data integration with legacy and cloud-native technologies
Continuous stream processing where it adds value, and batch processing in some analytics sinks

Presentation: Building a decentralized data mesh with data streaming at its heart

The following slide deck walks you through the motivation, principles, and architectures of building a real-time data mesh powered by Apache Kafka using the Kappa architecture, hybrid cloud, and stream data sharing:

The data mesh provides flexibility and freedom of technology choice for each data product

The heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. As the de facto standard for data streaming, Apache Kafka plays a crucial role in a cloud-native data mesh architecture. Nevertheless, data mesh is not bound to a specific technology. The beauty of the decentralized architecture is the freedom of technology choice for each business unit when building its data products.

Data sharing between domains within and across organizations is another aspect where data streaming helps in a data mesh. Real-time data beats slow data. That is not just true for most business problems across industries but also for replicating data between data centers, clouds, regions, or organizations. A streaming data exchange enables data sharing in real-time to build a data mash in motion.

Did you already start building your Data Mesh? What does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Is the heart of your data mesh real-time in motion or some lakehouse at rest? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Heart of the Data Mesh Beats Real-Time with Apache Kafka appeared first on Kai Waehner.

Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure

Kai Waehner — Fri, 15 Jul 2022 06:03:28 +0000

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Data warehouse modernization: From legacy on-premise to cloud-native infrastructure

Many people talk about data warehouse modernization when they move to a cloud-native data warehouse. Though, what does data warehouse modernization mean? Why do people move away from their on-premise data warehouse? What are the benefits?

Many projects I have seen in the wild went through the following steps:

Select a cloud-native data warehouse
Get data into the new data warehouse
[Optional] Migrate from the old to the new data warehouse

Let’s explore these steps in more detail and understand the technology and architecture options.

1. Selection of a cloud-native data warehouse

Many years ago, cloud computing was a game-changer for operating infrastructure. AWS innovated by providing not just EC2 virtual machines but also storage, like AWS S3 as a service.

Cloud-native data warehouse offerings are built on the same fundamental change. Cloud providers brought their analytics cloud services, such as AWS Redshift, Azure Synapse, or GCP BigQuery. Independent vendors rolled out a cloud-native data warehouse or data lake SaaS such as Snowflake, Databricks, and many more. While each solution has its trade-offs, a few general characteristics are true for most of them:

Cloud-native: A modern data warehouse is elastic, scales for small up to extreme workloads, and automates most business processes around development, operations, and monitoring.
Fully managed: The vendor takes over the operations burden. This includes scaling, failover handling, upgrades, and performance tuning. Some offerings are truly serverless, while many services require capacity planning and manual or automated scaling up and down.
Consumption-based pricing: Pay-as-you-go enables getting started in minutes and scaling costs with broader software usage. Most enterprise deployments allow commitment to getting price discounts.
Data sharing: Replicating data sets across regions and environments is a common feature to offer data locality, privacy, lower latency, and regulatory compliance.
Multi-cloud and hybrid deployments: While cloud providers usually only offer the 1st party service on their cloud infrastructure, 3rd party vendors provide a multi-cloud strategy. Some vendors even offer hybrid environments, including on-premise and edge deployments.

Plenty of comparisons exist in the community, plus analyst research from Gartner, Forrester, et al. Looking at vendor information and trying out the various cloud products using free credits is crucial, too. Finding the right cloud-native data warehouse is its own challenge and not in this blog post.

2. Data streaming as (near) real-time and hybrid integration layer

Data ingestion into data warehouses and data lakes was already covered in part two of this blog series. The more real-time, the better for most business applications. Near real-time ingestion is possible with specific tools (like AWS Kinesis or Kafka) or as part of the data fabric (the streaming data hub where a tool like Kafka plays a bigger role than just data ingestion).

The often more challenging part is data integration. Most data warehouse and data lake pipelines require ETL to ingest data. As the next-generation analytics platform is crucial for making the right business decisions, the data ingestion and integration platform must also be cloud-native! Tools like Kafka provide the reliable and scalable integration layer to get all required data into the data warehouse.

Integration of legacy on-premise data into the cloud-native data warehouse

In a greenfield project, the project team is lucky. Data sources run in the same cloud, using open and modern APIs, and scale as well as the cloud-native data warehouse.

Unfortunately, the reality is brownfield almost always, even if all applications run in public cloud infrastructure. Therefore, the integration and replication of data from legacy and on-premise applications is a general requirement.

Data is typically consumed from legacy databases, data warehouses, applications, and other proprietary platforms. The replication into the cloud data warehouse usually needs to be near real-time and reliable.

A data streaming platform like Kafka is perfect for replicating data across data centers, regions, and environments because of its elastic scalability and true decoupling capabilities. Kafka enables connectivity to modern AND legacy systems via connectors, proprietary APIs, programming languages, or open REST interfaces:

A common scenario in such a brownfield project is the clear separation of concerns and true decoupling between legacy on-premise and modern cloud workloads. Here, Kafka is deployed on-premise to connect to legacy applications.

Tools like MirrorMaker, Replicator, or Confluent Cluster Linking replicate events in real-time into the Kafka cluster in the cloud. The Kafka brokers provide access to the incoming events. Downstream consumers read the data into the data sinks at their own pace; real-time, near real-time, batch, or request-response via API. Streaming ETL is possible at any site – where it makes the most sense from a business or security perspective and is the most cost-efficient.

Example: Confluent Cloud + Snowflake = Cloud-native Data Warehouse Modernization

Here is a concrete example of data warehouse modernization using cloud-native data streaming and data warehousing with Confluent Cloud and Snowflake:

For modernizing the data warehouse, data is ingested from various legacy and modern data sources using different communication paradigms, APIs, and integration strategies. The data is transmitted in motion from data sources via Kafka (and optional preprocessing) into the Snowflake data warehouse. The whole end-to-end pipeline is scalable, reliable, and fully managed, including the connectivity and ingestion between the Kafka and Snowflake clusters.

However, there is more to the integration and ingestion layer: The data streaming platform stores the data for true decoupling and slow downstream applications; not every consumer is or can be real-time. Most enterprise architectures do not ingest data into a single data warehouse or data lake or lakehouse. The reality is that different downstream applications need access to the same information; even though vendors of data warehouses and data lakes tell you differently, of course

By consuming events from the streaming data hub, each application domain decides by itself if it

processes events within Kafka with stream processing tools like Kafka Streams or ksqlDB
builds own downstream applications with its code and technologies (like Java, .NET, Golang, Python)
integrates with 3rd party business applications like Salesforce or SAP
ingests the raw or preprocessed and curated data from Kafka into the sink system (like a data warehouse or data lake)

3. Data warehouse modernization and migration from legacy to cloud-native

An often misunderstood concept is the buzz around data warehouse modernization: Companies rarely take the data of the existing on-premise data warehouse or data lake, write a few ETL jobs, and put the data into a cloud-native data warehouse for the sake of doing it.

If you think about a one-time lift-and-shift from an on-premise data warehouse to the cloud, then a traditional ETL tool or a replication cloud service might be the easiest. However, usually, data warehouse modernization is more than that!

What is data warehouse modernization?

A data warehouse modernization can mean many things, including replacing and migrating the existing data warehouse, building a new cloud-native data warehouse from scratch, or optimizing a legacy ETL pipeline of a cloud-native data warehouse.

In all these cases, data warehouse modernization requires business justification, for instance:

Application issues in the legacy data warehouse, such as too slow data processing with legacy batch workloads, result in wrong or conflicting information for the business people.
Scalability issues in the on-premise data warehouse as the data volume grows too much.
Cost issues because the legacy data warehouse does not offer reasonable pricing with pay-as-you-go models.
Connectivity issues as legacy data warehouses were not built with an open API and data sharing in mind. Cloud-native data warehouses run on cost-efficient and scalable object storage, separate storage from computing, and allow data consumption and sharing. (but keep in mind that Reverse ETL is often an anti-pattern!)
A strategic move to the cloud with all infrastructure. The analytics platform is no exception if all old and new applications go to the cloud.

Cloud-native applications usually come with innovation, i.e., new business processes, data formats, and data-driven decision-making. From a data warehouse perspective, the best modernization is to start from scratch. Consume data directly from the existing data sources, ETL it, and do business intelligence on top of the new data structures.

I have seen many more projects where customers use change data capture (CDC) from Oracle databases (i.e., the leading core system) instead of trying to replicate data from the legacy data warehouse (i.e., the analytics platform) as scalability, cost, and later shutdown of legacy infrastructure benefits from this approach.

Data warehouse migration: Continuous vs. cut-over

The project is usually a cut-over when you need to do a real modernization (i.e., migration) from a legacy data warehouse to a cloud-native one. This way, the first project phase integrates the legacy data sources with the new data warehouse. The old and new data warehouse platforms operate in parallel, so that old and new business processes go on. After some time (months or years later), when the business is ready to move, the old data warehouse will be shut down after legacy applications are either migrated to the new data warehouse or replaced with new applications:

My article “Mainframe Integration, Offloading and Replacement with Apache Kafka” illustrates this offloading and long-term migration process. Just scroll to the section “Mainframe Offloading and Replacement in the Next 5 Years” in that post and replace the term ‘mainframe’ with ‘legacy data warehouse’ in your mind.

A migration and cut-over is its project and can include the legacy data warehouse; or not. Data lake modernization (e.g., from a self- or partially managed Cloudera cluster running on-premise in the data center to a fully managed Databricks or Snowflake cloud infrastructure) follows the same principles. And mixing the data warehouse (reporting) and data lake (big data analytics) into a single infrastructure does not change this either.

Data warehouse modernization is NOT a big bang and NOT a single tool approach!

Most data warehouse modernization projects are ongoing efforts over a long period. You must select a cloud-native data warehouse, get data into the new data warehouse from various sources, and optionally migrate away from legacy on-premise infrastructure.

Data streaming for data ingestion, business applications, or data sharing in real-time should always be a separate component in the enterprise architecture. It has very different requirements regarding SLAs, uptime, through, latency, etc. Putting all real-time and analytical workloads into the same cluster makes little sense from a cost, risk, or business value perspective. The idea of a modern data flow and building a data mesh is the separation of concerns with domain-driven design and focusing on data products (using different, independent APIs, technologies, and cloud services).

For more details, browse other posts of this blog series:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
Lessons Learned from Building a Cloud-Native Data Warehouse

What cloud-native data warehouse(s) do you use? How does data streaming fit into your journey? Did you integrate or replace your legacy on-premise data warehouse(s); or start from greenfield in the cloud? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure appeared first on Kai Waehner.

When to use Apache Camel vs. Apache Kafka?

Kai Waehner — Fri, 28 Jan 2022 06:31:02 +0000

Should I use Apache Camel or Apache Kafka for my next integration project? The question is very valid and comes up regularly. This blog post explores both open-source frameworks and explains the difference between application integration and event streaming. The comparison discusses when to use Kafka or Camel, when to combine them, when not to use them at all. A decision tree shows how you can quickly qualify out one for the other.

The history of application integration and event streaming

My personal history and experience in application integration and event streaming are the following. It shows my background and how I see the integration and data streaming markets.

A discussion that started over a decade ago…

With my background of work in the last decade at Talend, TIBCO, and Confluent, the comparison between Camel and Kafka is very exciting as I have spent a lot of time with both open-source frameworks:

Apache Camel powered Talend ESB. Talend had a visual coding tool to design Camel routes with code generation. Unfortunately, the tool’s primary focus was Talend Data Integration (ETL and batch). The Camel-powered ESB code was integrated, but it was neither perfect nor complete.

TIBCO BusinessWorks competed with Talend ESB while TIBCO StreamBase competed with other stream processing solutions. The Kafka ecosystem came up more and more in conversations with customers.

I posted about “When to use Apache Camel” in 2011 already. In 2012, I did my first talk at an international software conference in the US. The name of the conference? CamelOne! A forum only about Apache Camel. What an exciting time. Claus Ibsen, THE Camel guy, wrote an excellent summary of CamelOne 2012 in Boston.

In my conference summary, I talked about my two talks. One of them covered a comparison between Apache Camel, Spring Integration, and Mulesoft ESB. The presentation has over 35000 views, and the number still goes up today.

… from application integration to event streaming

Over time, the buzzword “big data” came up more and more. I spent some time at Talend and TIBCO to learn new programming concepts such as Map-Reduce and Shuffling, mainly powered by Apache Hadoop and Apache Spark. The big data ecosystem snowballed with tens of frameworks such as Hive, HBase, Pig, and many more.

However, the first people realized that real-time data beats slow data in almost all use cases. The Lambda architecture was invented to separate real-time workloads from batch workloads. Event Streaming was born. Apache Kafka became the de facto standard for data streaming. Like CamelOne a decade ago, Kafka Summit is the one-stop-show for Kafka use cases, architectures, and success stories. Contrary to the small CamelOne, Kafka Summit is a global event with events across the globe, plus online events.

Data in motion with the Kappa architecture replacing Lambda

In 2014, a guy called Jay Kreps (few people knew him) was already questioning the Lambda architecture. Instead, he proposed to provide a single real-time layer to provide data for real-time and batch consumers. The Kappa architecture was born. Today, the Kappa architecture is mainstream, replacing Lambda. Various vendors adopt Kafka in the meantime.

Confluent became the clear leader in the event streaming software category. Confluent Platform is powered by Apache Kafka. The focus is on event streaming. That’s different from most other vendors like Cloudera; they focus on 10-20 frameworks or products and try to combine and integrate them somehow. Today, Confluent Cloud is a complete game-changer providing Apache Kafka and its ecosystem for application integration and stream processing as a serverless cloud offering.

This is where we are today in 2022. Application integration (= Camel) and event streaming (= Kafka) play a critical role in every modern enterprise architecture. Open-source is widely adopted and usually preferred compared to proprietary solutions for various reasons, including avoiding vendor lock-in. That’s true for self-managed and serverless cloud offerings.

Hence, the question arises: Should I use Apache Camel for application integration or Apache Kafka for event streaming? Or both? Or does one solve the other, too? These questions will be answered in the following sections, concluding with a decision tree to help you make the right choice for your project.

Let’s look at the similarities between Camel and Kafka, when to use which framework, when and how to combine them, and when not to use them at all.

Features in Apache Camel AND Apache Kafka

Camel and Kafka have many positive and negative characteristics in common. Hence, it is no surprise that people compare the two frameworks:

Open source under Apache 2.0 license
Vibrant community and adoption in the industry
Mature framework with deployments in enterprises across the globe
Fixing point-to-point spaghetti architectures with a central integration backbone
Open architecture and extensibility with custom functions and connectors
Small and big deployments possible, plus single-node deployments for non-mission-critical use cases
Re-engineered and optimized for cloud-native deployments (container, Kubernetes, cloud)
Connectivity to any technology, API, communication paradigm, and SaaS
Transformation of any data types and formats
Processes transactional and analytical workloads
Domain-specific language (DSL) for message at a time processing, with similar logic such as aggregation, filtering, conditional processing
Relative complex frameworks because of their robust feature set, hence not suitable for solving a minor problem
Not a replacement of a database, data warehouse, or data lake

Beyond the similarities, Kafka and Camel have very different sweet spots built to solve distinct problems. Hence, comparing these two tools is a bit comparison of apples and oranges. Some minor projects might use one or the other to solve the problem, but critical enterprise projects show the differences more quickly.

When to use Apache Camel?

The mission of Camel

Apache Camel is an integration framework. It solves a particular problem: Data integration between different applications, APIs, protocols, and communication paradigms. This concept is often called application integration or enterprise integration. Camel implements the famous Enterprise Integration Patterns (EIP). EIPs are based on messaging principles.

Camel’s strengths

Event-based backbone based on well-known and adopted EIP concepts
Connectivity to almost any API
Integration, processing, and routing of information with an intuitive domain-specific language (DSL) with a focus on integration; providing the ability of composability in a programming context for finer grain control in code for doing conditional logic or transformations/reformatting
Powerful routing capabilities with many built-in EIPs
Many deployment options (standalone, web container, application server, Spring, OSGi, Kubernetes via the Camel K sub-project) – okay, I guess some options are not relevant in this decade anymore
Lightweight alternative to proprietary ETL and ESB tools

Camel’s weaknesses

Only a “routing machine”, i.e., not built for long-term storage (additional cache or storage needed), for that reason, Camel is not the right choice for a central nervous system like Kafka
No stream processing (like you know it from Kafka Streams or Apache Flink)
Limited scalability, not built for massive volumes of data
No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
No serverless cloud offering, with that also not competing with other iPaaS offerings
Red Hat is the only vendor supporting it
Built to be deployed in a single data center or cloud region, not across hybrid or multi-cloud scenarios

The evolution of Apache Camel

Camel is widely adopted and has a strong community. Unfortunately, from a vendor and support perspective, the offerings declined in the last few years. One of the most significant pain points: I still don’t see a serverless cloud offering anywhere today:

Camel TL;DR

Camel is an application integration framework to connect different applications and interfaces. Camel is NOT built for processing data in motion continuously, i.e., stream processing. Hence, it should be compared to ETL and ESB tools, not data streaming technologies like Kafka, Kinesis, or Flink. If you look for a serverless cloud offering, you are out of luck. If you look for vendor support, Red Hat is the only option.

When to use Apache Kafka?

The mission of Kafka

Real-time data beats slow data at any scale. The event streaming platform enables processing data in motion. Kafka is the de facto standard for event streaming, including messaging, data integration, stream processing, and storage. Kafka provides all capabilities in one infrastructure at scale. It is reliable and allows to process analytics and transactional workloads.

Kafka’s strengths

Event-based streaming platform
A unique combination of pub/sub messaging, data processing, data integration, and storage in a single framework
Built for massive volumes of data and extreme scale from the beginning, with that a single framework can be used for transactional (low volume) and analytics (high volume) use cases
True decoupling between producers and consumers because of its storage component makes it the de facto standard for microservice architectures
Guaranteed ordering of events in the distributed commit log
Distributed data processing with fault-tolerance and recoverability built-in
Replayability of events
The de facto standard for event streaming
Built with hybrid and multi-cloud data replication in mind (with included tools like MirrorMaker and separate, more advanced, and more straightforward tools like Confluent Cluster Linking)
Support from many vendors, including Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and many more
Paradigm shift: Built to process data in motion end-to-end from source to one or more sinks

Kafka’s weaknesses

Paradigm shift: Enterprises need to learn and understand the added value of event streaming, a new software category that enables new use cases but also requires different design patterns and operations approaches
No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
Limited out-of-the-box routing capabilities (Kafka Connect SMT or Kafka Streams / ksqlDB app do the job very well, but not as simple as Camel)
Complex operations (if you run it by yourself instead of using 3rd party tools or even better a serverless cloud offering)

The evolution of Apache Kafka

Kafka was built at LinkedIn to process high volumes of data, as no other open-source framework could do this. Kafka found quick adoption after LinkedIn open-sourced it. Several vendors adopted Kafka and added it to their product portfolio. Some vendors just added Kafka for the sake of having it. Others innovated and used additional tools to make Kafka cloud-native for the next generation of event streaming. Kafka as a serverless cloud offering is a critical piece of many modern enterprise architectures today:

Kafka TL;DR

Kafka is an event streaming platform to process data in motion continuously. If you “just” need an integration framework to route data from a source to one or more sinks (= ETL / ESB), then Camel can be used, too. However, Kafka kills two birds with one stone (= integrating data AND processing it in motion where needed).

Plenty of Kafka offerings are available on the market. Check out the Apache Kafka landscape and comparison to understand the differences between offerings from Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and others.

Decision tree – Camel or Kafka?

The above sections explored when to use Camel and Kafka. So far, so good. Nevertheless, both frameworks overlap with their capabilities. Let’s get some help to choose the right one in that case.

Qualify out – the easiest way to start an evaluation!

The easiest way to decide on a specific option is to qualify out other frameworks that cannot fulfill the requirements.

Therefore, do you need

Big data processing?
A storage component for true decoupling and replayability of events?
Stateless or stateful stream processing?
A serverless cloud offering?

The above section discussed these differentiators of Kafka. In all these cases, you can qualify out Camel. It does not fulfill these requirements. These requirements are not necessarily a complete list. And you might also find a few aspects to qualify out Kafka from the beginning. Hence, you could also start from the Camel perspective and ask yourself: When should I not use Kafka. But I think it is easier the other way round.

Qualifying out solutions because of their limitations makes the decision tree and evaluation process much easier from the beginning.

Decision Tree for Camel and Kafka

Here is my decision tree to find out if Camel or Kafka is the right choice and what vendors you could evaluate:

When to use Camel and Kafka together?

It is possible to use Camel and Kafka together in a single integration architecture. Should you do that? Two options exist. One makes more sense than the other:

Kafka for event streaming and Camel for ETL

Camel and Kafka integrate well with each other. The native Kafka component of Camel is the best native integration point as a bridge between both environments:

The above architecture shows how Camel and Kafka live next to each other. Camel is used in a business domain for application integration. Kafka is the central nervous system between the Camel integration application and many other applications. I also added Kong as API Gateway to clarify that Camel or Kafka is not a silver bullet to solve every problem.

Once again, the vast advantage of Kafka as central integration layer is its unique combination of characteristics within a single infrastructure, including:

Real-time messaging at any scale
Storage for true decoupling between different applications and communication paradigms
Built-in backpressure handling and replayability of events
Data integration
Stream processing

Real-time data replication across hybrid and multi-cloud is not shown in the above picture but is also part of the enterprise architecture out-of-the-box leveraging take Kafka protocol.

With true decoupling within modern microservice architecture, each business team can decide whether they need application integration (using Camel) or event streaming (using Kafka). Often, both could be used. Additional questions around single vs. multi frameworks and APIs, vendor support, scalability needs, and other characteristics need to be evaluated to make the right choice for your business problem.

Camel connectors embedded into Kafka Connect

There is another way to combine Kafka and Camel: The “Camel Kafka Connector” sub-project of Apache Camel. Don’t get confused. This feature is not the Kafka component (= connector) of Camel! Instead, it is a relatively new initiative to deploy camel components into the Kafka Connect infrastructure.

The obvious benefit: This way, you get hundreds of new connectors “for free” within the Kafka ecosystem. This capability sounds excellent. And it is!

However, consider the total cost of ownership and the overall efforts using this approach. Application integration is one of the most challenging problems in computer science – especially if you talk about transactional data sets that require zero data loss, exactly-once semantics, and no downtime. The more components you combine in the end-to-end data flow, the harder it gets to keep your performance and reliability SLAs.

Hence, using Camel components within Kafka Connect has a considerable disadvantage: Combining two frameworks with complexities and different design concepts. Just a few examples:

Kafka world: Partitions, Offsets, Leader and Follower, Key/Value/Header, connectors (based on Kafka Connect), Bootstrap Server, ConsumerRecord, Retention Time, etc.
Camel world: Routes, RouteBuilder CamelContext, Exchange, Processor, components (Camel connectors), Endpoints, Type Converters, Registry, etc.

Please think twice before mixing two integration tools that are powerful but complex on their own. Getting this running is just one piece of the puzzle (the simple part). Don’t forget end-to-end testing, resiliency, SLAs, support across technologies and APIs. Even buying support for Camel and Kafka from Red Hat (i.e., a single vendor) does not improve this approach.

It is likely better to take the business logic and API calls out of the Camel component and copy it into a Kafka Connect connector template to run the integration natively with only Kafka code. This workaround allows a clean architecture, end-to-end integration with a single framework, a single vendor behind it, and much easier testing / debugging / monitoring.

TL;DR: I recommend only using the “Camel Kafka Connector” sub-project if the following options do not work:

Use only Apache Camel for application integration
Leverage Apache Kafka for event streaming and application integration
Choose separate deployments of Camel and Kafka and use the Camel-Kafka-Bridge

When NOT to use Camel or Kafka at all?

Once again, the easiest way for your evaluation to start is qualifying out tools that do not work to solve the problem.

Both Camel and Kafka are NOT built for the following scenarios:

A proxy for millions of clients (like mobile apps) – but native proxies (like a REST or MQTT Proxy for Kafka) exist for some use cases.
An API Management platform – but these tools are usually complementary and used to create life cycle management or monetize APIs deployed with Camel or Kafka.
A database for complex queries and batch analytics workloads
an IoT platform with features such as device management – but direct native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the approach for (some) use cases.
A technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different software than Camel or Kafka!

I wrote a very detailed post about this topic from a Kafka perspective. It maps almost 1:1 to the Camel world, too (and any related technology such as Flink, Spark, Pulsar, etc.): “When NOT to use Apache Kafka?”

Apache Camel vs. Apache Kafka – Who is the winner?

Simple answer: Both!

When you compare apples and oranges, you might become happy when you are hungry as both are good to eat. The same is true for Camel and Kafka. Both can do application integration. But they serve very different needs.

Many integration scenarios can use Camel or Kafka.

Camel is the right tool if you need to integrate data within an application context or business unit (with no need for stream processing, true decoupling, replayability, large scale, replication across data centers or cloud regions).

Kafka is the central event-based nervous system across business units, regions, and hybrid clouds. Kafka is all about event streaming. Application integration is just a piece of this puzzle. On the other side, I have seen plenty of integration projects powered by Apache Kafka. It is often replacing other middleware. That’s true for ETL/ESB legacy modernization and in discussions about using a cloud-native iPaaS.

Do you use Camel or Kafka today? What use cases? How do you decide which one to choose? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to use Apache Camel vs. Apache Kafka? appeared first on Kai Waehner.

When NOT to use Apache Kafka?

Kai Waehner — Tue, 04 Jan 2022 07:24:59 +0000

Apache Kafka is the de facto standard for event streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post explores the DOs and DONTs. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka.

Market Trends – A Connected World

Let’s begin with understanding why Kafka comes up everywhere in the meantime. This clarifies the huge market demand for event streaming but also shows that there is no silver bullet solving all problems. Kafka is NOT the silver bullet for a connected world, but a crucial component!

The world gets more and more connected. Vast volumes of data are generated and need to be correlated in real-time to increase revenue, reduce costs, and reduce risks. I could pick almost any industry. Some are faster. Others are slower. But the connected world is coming everywhere. Think about manufacturing, smart cities, gaming, retail, banking, insurance, and so on. If you look at my past blogs, you can find relevant Kafka use cases for any industry.

I picked two market trends that show this insane growth of data and the creation of innovation and new cutting-edge use cases (and why Kafka’s adoption is insane across industries, too).

Connected Cars – Insane volume of telemetry data and aftersales

Here is the “Global Opportunity Analysis and Industry Forecast, 2020–2027” by Allied Market Research:

The Connected Car market includes a much wider variety of use cases and industries than most people think. A few examples: Network infrastructure and connectivity, safety, entertainment, retail, aftermarket, vehicle insurance, 3rd party data usage (e.g., smart city), and so much more.

Gaming – Billions of players and massive revenues

The gaming industry is already bigger than all other media categories combined, and this is still just the beginning of a new era – as Bitkraft depicts:

Millions of new players join the gaming community every month across the globe. Connectivity and cheap smartphones are sold in less wealthy countries. New business models like “play to earn” change how the next generation of gamers plays a game. More scalable and low latency technologies like 5G enable new use cases. Blockchain and NFT (Non-Fungible Token) are changing the monetization and collection market forever.

These market trends across industries clarify why the need for real-time data processing increases significantly quarter by quarter. Apache Kafka established itself as the de facto standard for processing analytical and transactional data streams at scale. However, it is crucial to understand when (not) to use Apache Kafka and its ecosystem in your projects.

What is Apache Kafka, and what is it NOT?

Kafka is often misunderstood. For instance, I still hear way too often that Kafka is a message queue. Part of the reason is that some vendors only pitch it for a specific problem (such as data ingestion into a data lake or data warehouse) to sell their products. So, in short:

Kafka is…

a scalable real-time messaging platform to process millions of messages per second.
an event streaming platform for massive volumes of big data analytics and small volumes of transactional data processing.
a distributed storage provides true decoupling for backpressure handling, support of various communication protocols, and replayability of events with guaranteed ordering.
a data integration framework for streaming ETL.
a data processing framework for continuous stateless or stateful stream processing.

This combination of characteristics in a single platform makes Kafka unique (and successful).

Kafka is NOT…

a proxy for millions of clients (like mobile apps) – but Kafka-native proxies (like REST or MQTT) exist for some use cases.
an API Management platform – but these tools are usually complementary and used for the creation, life cycle management, or the monetization of Kafka APIs.
a database for complex queries and batch analytics workloads – but good enough for transactional queries and relatively simple aggregations (especially with ksqlDB).
an IoT platform with features such as device management – but direct Kafka-native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the appropriate approach for (some) use cases.
a technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different software!

For these reasons, Kafka is complementary, not competitive, to these other technologies. Choose the right tool for the job and combine them!

Case studies for Apache Kafka in a connected world

This section shows a few examples of fantastic success stories where Kafka is combined with other technologies because it makes sense and solves the business problem. The focus here is case studies that need more than just Kafka for the end-to-end data flow.

No matter if you follow my blog, Kafka Summit conferences, online platforms like Medium or Dzone, or any other tech-related news. You find plenty of success stories around real-time data streaming with Apache Kafka for high volumes of analytics and transactional data from connected cars, IoT edge devices, or gaming apps on smartphones.

A few examples across industries and use cases:

Audi: Connected car platform rolled out across regions and cloud providers
BMW: Smart factories for the optimization of the supply chain and logistics
SolarPower: Complete solar energy solutions and services across the globe
Royal Caribbean: Entertainment on cruise ships with disconnected edge services and hybrid cloud aggregation
Disney+ Hotstar: Interactive media content and gaming/betting for millions of fans on their smartphone
The list goes on and on and on.

So what is the problem with all these great IoT success stories? Well, there is no problem. But some clarification is needed to explain when to use event streaming with the Apache Kafka ecosystem and where other complementary solutions usually complement it.

When to use Apache Kafka?

Before we discuss when NOT to use Kafka, let’s understand where to use it to get more clear how and when to complement it with other technologies if needed.

I will add real-world examples to each section. In my experience, this makes it much easier to understand the added value.

Kafka consumes and processes high volumes of IoT and mobile data in real-time

Processing massive volumes of data in real-time is one of the critical capabilities of Kafka.

Tesla is not just a car maker. Tesla is a tech company writing a lot of innovative and cutting-edge software. They provide an energy infrastructure for cars with their Tesla Superchargers, solar energy production at their Gigafactories, and much more. Processing and analyzing the data from their vehicles, smart grids, and factories and integrating with the rest of the IT backend services in real-time is a crucial piece of their success.

Tesla has built a Kafka-based data platform infrastructure “to support millions of devices and trillions of data points per day”. Tesla showed an exciting history and evolution of their Kafka usage at a Kafka Summit in 2019:

Keep in mind that Kafka is much more than just messaging. I repeat this in almost every blog post as too many people still don’t get it. Kafka is a distributed storage layer that truly decouples producers and consumers. Additionally, Kafka-native processing tools like Kafka Streams and ksqlDB enable real-time processing.

Kafka correlates IoT data with transactional data from the MES and ERP systems

Data integration in real-time at scale is relevant for analytics and the usage of transactional systems like an ERP or MES system. Kafka Connect and non-Kafka middleware complement the core of event streaming for this task.

BMW operates mission-critical Kafka workloads across the edge (i.e., in the smart factories) and public cloud. Kafka enables decoupling, transparency, and innovation. The products and expertise from Confluent add stability. The latter is vital for success in manufacturing. Each minute of downtime costs a fortune. Read my related article “Apache Kafka as Data Historian – an IIoT / Industry 4.0 Real-Time Data Lake” to understand how Kafka improves the Overall Equipment Effectiveness (OEE) in manufacturing.

BMW optimizes its supply chain management in real-time. The solution provides information about the right stock in place, both physically and in transactional systems like BMW’s ERP powered by SAP. “Just in time, just in sequence” is crucial for many critical applications. The integration between Kafka and SAP is required for almost 50% of customers I talk to in this space. Beyond the integration, many next-generation transactional ERP and MES platforms are powered by Kafka, too.

Kafka integrates with all the non-IoT IT in the enterprise at the edge and hybrid or multi-cloud

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Learn about several scenarios that may require multi-cluster solutions and see real-world examples with their specific requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments, and global Kafka.

The true decoupling between different interfaces is a unique advantage of Kafka vs. other messaging platforms such as IBM MQ, RabbitMQ, or MQTT brokers. I also explored this in detail in my article about Domain-driven Design (DDD) with Kafka.

Infrastructure modernization and hybrid cloud architectures with Apache Kafka are typical across industries.

One of my favorite examples is the success story from Unity. The company provides a real-time 3D development platform focusing on gaming and getting into other industries like manufacturing with their Augmented Reality (AR) / Virtual Reality (VR) features.

The data-driven company already had content installed 33 billion times in 2019, reaching 3 billion devices worldwide. Unity operates one of the largest monetization networks in the world. They migrated this platform from self-managed Kafka to fully-managed Confluent Cloud. The cutover was executed by the project team without downtime or data loss. Read Unity’s post on the Confluent Blog: “How Unity uses Confluent for real-time event streaming at scale “.

Kafka is the scalable real-time backend for mobility services and gaming/betting platforms

Many gaming and mobility services leverage event streaming as the backbone of their infrastructure. Use cases include the processing of telemetry data, location-based services, payments, fraud detection, user/player retention, loyalty platform, and so much more. Almost all innovative applications in this sector require real-time data streaming at scale.

A few examples:

Mobility services: Uber, Lyft, FREE NOW, Grab, Otonomo, Here Technologies, …
Gaming services: Disney+ Hotstar, Sony Playstation, Tencent, Big Fish Games, …
Betting services: William Hill, Sky Betting, …

Just look at the job portals of any mobility or gaming service. Not everybody is talking about their Kafka usage in public. But almost everyone is looking for Kafka experts to develop and operate their platform.

These use cases are just as critical as a payment process in a core banking platform. Regulatory compliance and zero data loss are crucial. Multi-Region Clusters (i.e., a Kafka cluster stretched across regions like US East, Central, and West) enable high availability with zero downtime and no data loss even in the case of a disaster.

Vehicles, machines, or IoT devices embed a single Kafka broker

The edge is here to stay and grow. Some use cases require the deployment of a Kafka cluster or single broker outside a data center. Reasons for operating a Kafka infrastructure at the edge include low latency, cost efficiency, cybersecurity, or no internet connectivity.

Examples for Kafka at the edge:

Disconnected edge in logistics to store logs, sensor data, and images while offline (e.g., a truck on the street or a drone flying around a ship) until a good internet connection is available in the distribution center
Vehicle-to-Everything (V2X) communication in a local small data center like AWS Outposts (via a gateway like MQTT if large area, a considerable number of vehicles, or lousy network), or via direct Kafka client connection for a few hundreds of machines, e.g., in a smart factory )
Offline mobility services like integrating a car infrastructure with gaming, maps, or a recommendation engine with locally processed partner services (e.g., the next Mc Donalds comes in 10 miles, here is a coupon).

The cruise line Royal Caribbean is a great success story for this scenario. It operates the four largest passenger ships in the world. As of January 2021, the line operates twenty-four ships and has six additional ships on order.

Royal Caribbean implemented one of Kafka’s most famous use cases at the edge. Each cruise ship has a Kafka cluster running locally for use cases such as payment processing, loyalty information, customer recommendations, etc.:

I covered this example and other Kafka edge deployments in various blogs. I talked about use cases for Kafka at the edge, showed architectures for Kafka at the edge, and explored low latency 5G deployments powered by Kafka.

When NOT to use Apache Kafka?

Finally, we are coming to the section everybody was looking for, right? However, it is crucial first to understand when to use Kafka. Now, it is easy to explain when NOT to use Kafka.

For this section, let’s assume that we talk about production scenarios, not some ugly (?) workarounds to connect Kafka to something for a proof of concept directly; there is always a quick and dirty option to test something – and that’s fine for that goal. But things change when you need to scale and roll out your infrastructure globally, be compliant to law, and guarantee no data loss for transactional workloads.

With this in mind, it is relatively easy to qualify out Kafka as an option for some use cases and problems:

Kafka is NOT hard real-time

The definition of the term “real-time” is difficult. It is often a marketing term. Real-time programs must guarantee a response within specified time constraints.

Kafka – and all other frameworks, products, and cloud services used in this context – is only soft real-time and built for the IT world. Many OT and IoT applications require hard real-time with zero latency spikes.

Soft real-time is used for applications such as

Point-to-point messaging between IT applications
Data ingestion from various data sources into one or more data sinks
Data processing and data correlation (often called event streaming or event stream processing)

If your application requires sub-millisecond latency, Kafka is not the right technology. For instance, high-frequency trading is usually implemented with purpose-built proprietary commercial solutions.

Always keep in mind: The lowest latency would be to not use a messaging system at all and just use shared memory. In a race to the lowest latency, Kafka will lose every time. However, for the audit log and transaction log or persistence engine parts of the exchange, it is no data loss that becomes more important than latency and Kafka wins.

Most real-time use cases “only” require data processing in the millisecond to the second range. In that case, Kafka is a perfect solution. Many FinTechs, such as Robinhood, rely on Kafka for mission-critical transactional workloads, even financial trading. Multi-access edge computing (MEC) is another excellent example of low latency data streaming with Apache Kafka and cloud-native 5G infrastructure.

Kafka is NOT deterministic for embedded and safety-critical systems

This one is pretty straightforward and related to the above section. Kafka is not a deterministic system. Safety-critical applications cannot use it for a car engine control system, a medical system such as a heart pacemaker, or an industrial process controller.

A few examples where Kafka CANNOT be used for:

Safety-critical data processing in the car or vehicle. That’s Autosar / MINRA C / Assembler and similar technologies.
CAN Bus communication between ECUs.
Robotics. That’s C / C++ or similar low-level languages combined with frameworks such as Industrial ROS (Robot Operating System).
Safety-critical machine learning / deep learning (e.g., for autonomous driving)
Vehicle-to-Vehicle (V2V) communication. That’s 5G sidelink without an intermediary like Kafka.

My post “Apache Kafka is NOT Hard Real-Time BUT Used Everywhere in Automotive and Industrial IoT” explores this discussion in more detail.

TL;DR: Safety-related data processing must be implemented with dedicated low-level programming languages and solutions. That’s not Kafka! The same is true for any other IT software, too. Hence, don’t replace Kafka with IBM MQ, Flink, Spark, Snowflake, or any other similar IT software.

Kafka is NOT built for bad networks

Kafka requires good stable network connectivity between the Kafka clients and the Kafka brokers. Hence, if the network is unstable and clients need to reconnect to the brokers all the time, then operations are challenging, and SLAs are hard to reach.

There are some exceptions, but the basic rule of thumb is that other technologies are built specifically to solve the problem of bad networks. MQTT is the most prominent example. Hence, Kafka and MQTT are friends, not enemies. The combination is super powerful and used a lot across industries. For that reason, I wrote a whole blog series about Kafka and MQTT.

We built a connected car infrastructure that processes 100,000 data streams for real-time predictions using MQTT, Kafka, and TensorFlow in a Kappa architecture.

Kafka does NOT provide connectivity to tens of thousands of client applications

Another specific point to qualify Kafka out as an integration solution is that Kafka cannot connect to tens of thousands of clients. If you need to build a connected car infrastructure or gaming platform for mobile players, the clients (i.e., cars or smartphones) will not directly connect to Kafka.

A dedicated proxy such as an HTTP gateway or MQTT broker is the right intermediary between thousands of clients and Kafka for real-time backend processing and the integration with further data sinks such as a data lake, data warehouse, or custom real-time applications.

Where are the limits of Kafka client connections? As so often, this is hard to say. I have seen customers connect directly from their shop floor in the plant via .NET and Java Kafka clients via a direct connection to the cloud where the Kafka cluster is running. Direct hybrid connections usually work well if the number of machines, PLCs, IoT gateways, and IoT devices is in the hundreds. For higher numbers of client applications, you need to evaluate if you a) need a proxy in the middle or b) deploy “edge computing” with or without Kafka at the edge for lower latency and cost-efficient workloads.

When to MAYBE use Apache Kafka?

The last section covered scenarios where it is relatively easy to quality Kafka out as it simply cannot provide the required capabilities. I want to explore a few less apparent topics, and it depends on several things if Kafka is a good choice or not.

Kafka does (usually) NOT replace another database

Apache Kafka is a database. It provides ACID guarantees and is used in hundreds of companies for mission-critical deployments. However, most times, Kafka is not competitive with other databases. Kafka is an event streaming platform for messaging, storage, processing, and integration at scale in real-time with zero downtime or data loss.

Kafka is often used as a central streaming integration layer with these characteristics. Other databases can build materialized views for their specific use cases like real-time time-series analytics, near real-time ingestion into a text search infrastructure, or long-term storage in a data lake.

In summary, when you get asked if Kafka can replace a database, then there are several answers to consider:

Kafka can store data forever in a durable and high available manner providing ACID guarantees
Further options to query historical data are available in Kafka
Kafka-native add-ons like ksqlDB or Tiered Storage make Kafka more potent than ever before for data processing and event-based long-term storage
Stateful applications can be built leveraging Kafka clients (microservices, business applications) with no other external database
Not a replacement for existing databases, data warehouses, or data lakes like MySQL, MongoDB, Elasticsearch, Hadoop, Snowflake, Google BigQuery, etc.
Other databases and Kafka complement each other; the right solution has to be selected for a problem; often, purpose-built materialized views are created and updated in real-time from the central event-based infrastructure
Different options are available for bi-directional pull and push-based integration between Kafka and databases to complement each other

My blog post “Can Apache Kafka replace a database, data warehouse, or data lake?” discusses the usage of Kafka as a database in much more detail.

Kafka does (usually) NOT process large messages

Kafka was not built for large messages. Period.

Nevertheless, more and more projects send and process 1Mb, 10Mb, and even much bigger files and other large payloads via Kafka. One reason is that Kafka was designed for large volume/throughput – which is required for large messages. A very common example that comes up regularly is the ingestion and processing of large files from legacy systems with Kafka before ingesting the processed data into a Data Warehouse.

However, not all large messages should be processed with Kafka. Often you should use the right storage system and just leverage Kafka for the orchestration. Reference-based messaging (i.e. storing the file in another storage system and sending the link and metadata) is often the better design pattern:

Know the different design patterns and choose the right technology for your problem.

For more details and use cases about handling large files with Kafka, check out this blog post: “Handling Large Messages with Apache Kafka (CSV, XML, Image, Video, Audio, Files)“.

Kafka is (usually) NOT the IoT gateway for the last-mile integration of industrial protocols…

The last-mile integration with IoT interfaces and mobile apps is a tricky space. As discussed above, Kafka cannot connect to thousands of Kafka clients. However, many IoT and mobile applications only require tens or hundreds of connections. In that case, a Kafka-native connection is straightforward using one of the various Kafka clients available for almost any programming language on the planet.

Suppose a connection on TCP level with a Kafka client makes little sense or is not possible. In that case, a very prevalent workaround is the REST Proxy as the intermediary between the clients and the Kafka cluster. The clients communicate via synchronous HTTP(S) with the streaming platform.

Use cases for HTTP and REST APIs with Apache Kafka include the control plane (= management), the data plane (= produce and consume messages), and automation, respectively DevOps tasks.

Unfortunately, many IoT projects require much more complex integrations. I am not just talking about a relatively straightforward integration via an MQTT or OPC-UA connector. Challenges in Industrial IoT projects include:

The automation industry does often not use open standards but is slow, insecure, not scalable, and proprietary.
Product Lifecycles are very long (tens of years), with no simple changes or upgrades.
IIoT usually uses incompatible protocols, typically proprietary and built for one specific vendor.
Proprietary and expensive monoliths that are not scalable and not extendible.

Therefore, many IoT projects complement Kafka with a purpose-built IoT platform. Most IoT products and cloud services are proprietary but provide open interfaces and architectures. The open-source space is small in this industry. A great alternative (for some use cases) is Apache PLC4X. The framework integrates with many proprietary legacy protocols, such as Siemens S7, Modbus, Allen Bradley, Beckhoff ADS, etc. PLC4X also provides a Kafka Connect connector for native and scalable Kafka integration.

A modern data historian is open and flexible. The foundation of many strategic IoT modernization projects across the shop floor and hybrid cloud is powered by event streaming:

Kafka is NOT a blockchain (but relevant for web3, crypto trading, NFT, off-chain, sidechain, oracles)

Kafka is a distributed commit log. The concepts and foundations are very similar to a blockchain. I explored this in more detail in my post “Apache Kafka and Blockchain – Comparison and a Kafka-native Implementation“.

A blockchain should be used ONLY if different untrusted parties need to collaborate. For most enterprise projects, a blockchain is unnecessary added complexity. A distributed commit log (= Kafka) or a tamper-proof distributed ledger (= enhanced Kafka) is sufficient.

Having said this, more interestingly, I see more and more companies using Kafka within their crypto trading platforms, market exchanges, and NFT token trading marketplaces.

To be clear: Kafka is NOT the blockchain on these platforms. The blockchain is a cryptocurrency like Bitcoin or a platform providing smart contracts like Ethereum where people build new distributed applications (dApps) like NFTs for the gaming or art industry. Kafka is the streaming platform to connect these blockchains with other Oracles (= the non-blockchain apps) like the CRM, data lake, data warehouse, and so on:

TokenAnalyst is an excellent example that leverages Kafka to integrate blockchain data from Bitcoin and Ethereum with their analytics tools. Kafka Streams provides a stateful streaming application to prevent using invalid blocks in downstream aggregate calculations. For example, TokenAnalyst developed a block confirmer component that resolves reorganization scenarios by temporarily keeping blocks, and only propagates them when a threshold of a number of confirmations (children to that block are mined) is reached.

In some advanced use cases, Kafka is used to implementing a sidechain or off-chain platform as the original blockchain does not scale well enough (blockchain is known as on-chain data). Not just Bitcoin has the problem of only processing single-digit (!) transactions per second. Most modern blockchain solutions cannot scale even close to the workloads Kafka processes in real-time.

From DAOs to blue chip companies, measuring the health of blockchain infrastructure and IOT components is still necessary even in a distributed network to avoid downtime, secure the infrastructure, and make the blockchain data accessible. Kafka provides an agentless and scalable way to present that data to the parties involved and make sure that the relevant data is exposed to the right teams before a node is lost. This is relevant for cutting-edge Web3 IoT projects like Helium, or simpler closed distributed ledgers (DLT) like R3 Corda.

My recent post about live commerce powered by event streaming and Kafka transforming the retail metaverse shows how the retail and gaming industry connects virtual and physical things. The retail business process and customer communication happen in real-time; no matter if you want to sell clothes, a smartphone, or a blockchain-based NFT token for your collectible or video game.

TL;DR: Kafka is NOT…

… a replacement for your favorite database or data warehouse.

… hard real-time for safety-critical embedded workloads.

… a proxy for thousands of clients in bad networks.

… an API Management solution.

… an IoT gateway.

… a blockchain.

It is easy to qualify Kafka out for some use cases and requirements.

However, analytical and transactional workloads across all industries use Kafka. It is the de-facto standard for event streaming everywhere. Hence, Kafka is often combined with other technologies and platforms.

Where do you (not) use Apache Kafka? What other technologies do you combine Kafka with? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When NOT to use Apache Kafka? appeared first on Kai Waehner.

Apache Kafka in the Public Sector – Part 2: Smart City

Kai Waehner — Tue, 12 Oct 2021 07:48:48 +0000

The public sector includes many different areas. Some groups leverage cutting-edge technology, like military leverage. Others like the public administration are years or even decades behind. This blog series explores how the public sector leverages data in motion powered by Apache Kafka to add value for innovative new applications and modernizing legacy IT infrastructures. This post is part 2: Use cases and architectures for a Smart City.

Blog series: Apache Kafka in the Public Sector and Government

This blog series explores why many governments and public infrastructure sectors leverage event streaming for various use cases. Learn about real-world deployments and different architectures for Kafka in the public sector:

Subscribe to my newsletter to get updates immediately after the publication. Besides, I will also update the above list with direct links to this blog series’s posts once published.

As a side note: If you wonder why healthcare is not on the above list. Healthcare is another blog series on its own. While the government can provide public health care through national healthcare systems, it is part of the private sector in many other cases.

Real-time is Mandatory for a Smart City Everywhere

I wrote a lot about event streaming and Apache Kafka for smart city infrastructure and use cases. I won’t repeat myself. Check out the following event Streaming with Kafka as Foundation for a Smart City and Apache Kafka and MQTT for the Last Mile IoT integration in a Smart City.

This post dives deeper into architectural questions and how collaboration with 3rd party services can look from the government’s perspective and public administration of a smart city.

The Need for Real-time Data Processing Everywhere in a Smart City and how Kafka helps

A smart city is a very complex beast. I am glad that I only cover technology and not regulatory or political discussions. However, even the technology standpoint is not straightforward. A smart city needs to correlate data across data centers, devices, vehicles, and many other things. This scenario is an actual internet of things (IoT) and therefore includes plenty of different technologies, communication paradigms, and infrastructures:

Smart city projects require the integration of various 1st party and 3rd party services. Most use cases only work well if that data is correlated in real-time; think about traffic routing, emergency alerts, predictive monitoring and maintenance, mobility services such as ride-hailing, and other fancy smart city use cases. Without real-time data processing, the use case is either a bad user experience or not cost-efficient. Hence, Kafka is adopted more and more for these scenarios.

Low Latency and 5G Networks for (some) Data Streaming Use Cases

The term “real-time” needs to be defined. Processing data in a few seconds is good enough in most use cases and a significant game-changer compared to hourly, daily, or weekly batch processing.

Having said this, some use cases like location-based upselling in retail or condition monitoring in equipment and manufacturing require lower latency, meaning sub-second end-to-end data processing.

Here is an example of leveraging 5G networks for low latency. The demo was built by the AWS Wavelength team, Verizon, and Confluent:

Most real-world deployments use separation of concerns: Low-latency use cases run at the edge and everything else in the regular data center or public cloud region. Read the article “Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure” for more details.

At this point, it is important to remind everybody that Kafka (and any IT software) is not hard real-time and not built for the OT world and embedded systems. Learn more in the article “Kafka is NOT hard real-time but soft real-time“. Also, (soft) real-time is not competitive to batch processing and data warehouse/data lake architecture. As you can learn in “Serverless Kafka in a Cloud-native Data Lake Architecture” it is complimentary.

Collaboration between Government, City, and 3rd Party via Open API

Real-time data processing is crucial in implementing smart city use cases. Additionally, most smart city projects require collaboration between different teams, infrastructures, and 3rd party services.

Let’s take a look at three very different real-world event streaming deployments to see the broad spectrum of use cases and integration challenges:

Ohio Department of Transportation’s government-owned event streaming platform
Deutsche Bahn’s single source of truth for customer communication in real-time and 3rd party integration with the Google Maps API
Free Now’s mobility service in the cloud for real-time data correlation in compliance with regional laws and independent vehicles/drivers.

Ohio Department of Transportation (ODOT) – A Government-Owned Event Streaming Platform

Ohio Department of Transportation (ODOT) has an exciting initiative: DriveOhio. It aims to organize and accelerate smart vehicle and connected vehicle projects in the State of Ohio. DriveOhio offers to be the single point of contact for policymakers, agencies, researchers, and private companies to collaborate with one another on intelligent transportation efforts around the state.

ODOT presented their real-time data transportation data platform at the last Kafka Summit Americas:

The whole Kafka ecosystem powers ODOT’s cloud-native Event Streaming Platform (ESP). The platform enables continuous data integration and stream processing for transactional and analytical workloads. The ESP runs on Kubernetes to provide an elastic, flexible, and scalable infrastructure for real-time data processing.

Deutsche Bahn – Single Source of Truth and Google Maps Integration in Real-time

Deutsche Bahn is a German railway company. It is a private joint-stock company (AG), with the Federal Republic of Germany being its single shareholder. I already talked about their real-time traveler information system in another blog post: “Mobility Services and Transportation powered by Apache Kafka“.

They leverage the Apache Kafka ecosystem powered by Confluent because it combines several characteristics that you would have to integrate with different technologies otherwise:

Real-time messaging
Data integration
Data correlation
Storage and caching
Replication and high availability
Elastic scalability

This example is excellent for this blog. It shows how an existing solution needs connectivity to other internal applications and 3rd party services to provide a better customer experience and expand the customer base.

Recently, Deutsche Bahn integrated its platform with Google Maps via Google’s Open API. In addition to a better customer experience, the railway company can reach out to many new end-users to expand their business. The Railway-News has a good article about this integration. Here is my summary:

Free Now – Mobility Service in the Cloud Connected to Regional Laws and Vehicles

Free Now (former MyTaxi) is a mobility service. Their app uses mobile and GPS technology to match taxi drivers with passengers based on availability and proximity. Mobility services need to integrate with other 3rd party services for routing, payment, tax implications, and many different use cases.

Here is one example from Free Now’s Kafka Summit talk where they explain the added value of continuous stream processing for calculating context-specific dynamic pricing:

The public administration is always involved when a new mobility service is released to the public. While some cities build their mobility services, the reality is that most governments provide the infrastructure together with the Telco providers, and 3rd party vendors provide the mobility service. The specific relationship between the government, city, and mobility service provider differs across regions, countries, and continents.

Almost every mobility service uses Kafka as its backbone. Google for your favorite mobility service across the globe and add “Kafka” to the search. Chances are very high that you find some excellent blog posts, conferences talks, or at least job offers from the mobility service’s recruiting page. Here are just a few examples that posted great content about their Kafka usage: Uber, Lyft, Grab, Otonomo, Here Technologies, and many more.

Data in Motion with Kafka for a Connected and Innovative Smart City

Smart City is a vast topic. Many stakeholders are involved. Collaboration and Open APIs are critical for success. In most cases, governments work together with telco providers, infrastructure providers such as the cloud hyperscalers, and software vendors (including an event streaming platform like Kafka).

Most valuable and innovative smart city use cases require data processing in real-time. The use cases require data integration, storage, and backpressure handling, and data correlation. Event Streaming is the ideal technology for these use cases. Examples from the Ohio Department of Transportation, Deutsche Bahn and its Google Maps integration, and Free Now showed a few different angles to realize successful smart city projects.

How do you leverage event streaming in the public sector? Are you working on smart city projects? What technologies and architectures do you use? What projects did you already work on or are in the planning? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Public Sector – Part 2: Smart City appeared first on Kai Waehner.

When to Use Reverse ETL and when it is an Anti-Pattern

Kai Waehner — Thu, 30 Sep 2021 16:50:21 +0000

Most enterprises store their massive volumes of transactional and analytics data at rest in data warehouses or data lakes. Sales, marketing, and customer success teams require access to these data sets. Reverse ETL is a buzzword that defines the concept of collecting data from existing data stores to provide it easy and quick for business teams.

This blog post explores why software vendors (try to) introduce new solutions for Reverse ETL, when it is needed, and how it fits into the enterprise architecture. The involvement of event streaming with tools like Apache Kafka to process data in motion is a crucial piece of Reverse ETL for real-time use cases.

What are ETL and Reverse ETL?

Let’s begin with the terms. What do ETL and Reverse ETL mean?

ETL (Extract-Transform-Load)

Extract-Transform-Load (ETL) is a common term for data integration. Vendors like Informatica or Talend provide visual coding to implement robust ETL pipelines. The cloud brought new SaaS players and the term Integration Platform as a Service (iPaaS) into the ETL market with vendors such as Boomi, SnapLogic, or Mulesoft Anypoint.

Most ETL tools operate in batch processes for big data workloads or use SOAP/REST web services and APIs for non-scalable real-time communication. ETL pipelines consume data from various data sources, transform or aggregate it, and store the processed data at rest in data sinks such as databases, data warehouses, or data lakes:

ELT (Extract-Load-Transform)

Extract-Load-Transform (ELT) is a very similar approach. However, the transformations and aggregations happen after the ingestion into the datastore:

It is no surprise that modern data storage and analytics vendors such as Databricks and Snowflake promote the ELT approach. For instance, Snowflake pitches the “internal dash mesh” where all the domains and data products are built within their cloud service.

Reverse ETL

As the name says, Reverse ETL turns the story from ETL around. It means the process of moving data from a data store into third-party systems to “make data operational”, as the marketing of these solutions says:

The data is consumed from long-term storage systems (data warehouse, data lake). The data is then pushed into business applications such as Salesforce (CRM), Marketo (marketing), or Service Now (customer success) to leverage it for pipeline generation, marketing campaigns, or customer communication.

Products and SaaS Offerings for Reverse ETL

Just google for “Reverse ETL” to find vendors specifically pitching their solutions. They also pay ads for the “normal data integration terms”. Therefore, the chances are high that you already saw them even if you did not search for them.

Most of these companies are young companies and startups building a new business around Reverse ETL products. Software vendors I found in my research include Hightouch, Census, Grouparoo (open source), Rudderstack, Omnata, and Seekwell.

Fun fact: If you search for Snowflake’s Reverse ETL, you will not find any google hit as they want to keep the data in their data warehouse.

A key strength and selling point of all ETL tools is visual coding, and therefore time to market for the development and maintenance of ETL pipelines. Some solutions target the citizen integrator (a term coined by Gartner), i.e., businesspeople building their integrations.

Reverse ETL == Real-Time Data for Sales, Marketing, Customer Success

Most Reverse ETL success stories talk about focus on sales, marketing, or customer success. These use cases attract business divisions. These teams do not want to buy a technical ETL tool like Informatica or Talend. Business people expect straightforward and intuitive user interfaces, like a citizen integrator.

The vendors target the businesspeople and promise a simplified technical infrastructure. For instance, one vendor promotes “Cut out legacy middleware and reduce ETL jobs”. My first thought: Welcome, shadow IT!

Nevertheless, let’s take a look at the use cases for Reverse ETL:

Identify customers at-risk and potential customer churn before it happens
Drive new sales by correlating data from the CRM and other interfaces
Hyper-personalized marketing for cross-selling and upselling to existing customers
Operational analytics to monitor the changes in business applications and data faster
Data replication to modern cloud applications for better reporting capabilities and finding insights

Additionally, all of the vendors also talk about real-time data for the above use cases. That’s great. BUT: Unfortunately, Reverse ETL is a huge ANTI-PATTERN to build real-time use cases. Let’s explore in more detail why.

Reverse ETL + Data Lake + Real-Time == Myth

Those use cases described above are great with tremendous business value. If you follow my blog or presentations, you have probably seen precisely the same real-time use cases built natively with event streaming processing data in motion.

If you store data in a data warehouse or data lake, you cannot process it in real-time anymore as it is already stored at rest. These data stores are built for indexing, search, batch processing, reporting, model training, and other use cases that make sense in the storage system. But you cannot consume the data in real-time in motion from storage at rest:

That’s where event streaming comes into play. Platforms like Apache Kafka enable processing data in motion in real-time for transactional and analytical workloads.

So, let’s take a look at a modern enterprise architecture that leverages event streaming for data processing in motion AND a data warehouse or data lake for data processing at rest.

Reverse ETL in the Enterprise Architecture

Let’s explore how Reverse ETL fits into the enterprise architecture and when you need a separate tool for this. For this, let’s go one step back first. What does Reverse ETL do? It takes data out of the storage, transforms or aggregates the data, and then ingests it into business applications.

Two options exist for Reverse ETL: SQL queries and Change Data Capture (CDC).

Reverse ETL == SQL Queries vs Change Data Capture

If a Reverse ETL tool uses SQL, then it is usually a query to data at rest. This use case enables businesspeople to create queries in intuitive user interfaces. Use cases include the creation of new marketing campaigns or analyze the customer success journey. SQL-based Reverse ETL requires intuitive tools that are simple to use.

If a Reverse ETL tool provides real-time data correlation and push notifications, it uses change data capture (CDC). CDC is automated and enables acting on changes in the data storage in real-time. The pipeline includes data correlation from different data sources and sending real-time push messages into business applications. CDC-based Reverse ETL requires a scalable, reliable event streaming infrastructure.

As you can see, both SQL and CDC approaches have their use cases and sometimes overlap in tooling and infrastructure. Change-log-based CDC is often the preferred technical approach instead of synchronizing data on a recurring schedule with SQL or when triggering by calling an API, no matter if you use “just” an event streaming platform or a particular Revere ETL product.

However, the more important question is how to design an enterprise architecture to AVOID the need for Reverse ETL.

Event-driven Architecture + Streaming ETL == Reverse ETL built-in

Real-time data beats slow data. That’s true for most use cases. Hence, the rise of event-driven architectures is unstoppable:

Reverse ETL is not needed in modern event-driven architecture! It is “built-in” into the architecture out-of-the-box. Each consumer directly consumes the data in real-time if it is appropriate and technically feasible. And data warehouses or data lakes still consume it in their own pace in near-real-time or batch:

The Kafka-native streaming SQL engine ksqlDB provides CDC capabilities and continuous stream processing. Therefore, you could even call ksqlDB a Reverse ETL tool if your marketing asks for it.

If you want to learn more about building real-time data platforms, check out the article “Kappa architecture is mainstream replacing Lambda“. It explores how companies like Uber, Shopify, and Disney built an event-driven Kappa architecture for any use case, including real-time, near-real-time, batch, and request-response.

When do you need Reverse ETL?

A greenfield architecture built from the ground up with an event streaming platform at its heart does not need Reverse ETL to consume data from a data warehouse or data lake as every consumer can already consume the data in real-time.

However, providing an interface for business users is NOT solved out-of-the-box with an event streaming platform like Apache Kafka. You need to add additional tools like Kafka CDC connectors, or 3rd party tools with intuitive user interfaces.

Hence, Reverse ETL can be helpful in two scenarios: Brownfield integration and simple tools for business users.

Brownfield architectures where data is stored at rest and businesspeople need to consume it in business applications. Data needs to be pushed out of the data storage for sales, marketing, or customer success use cases:

Simple integration tools for business people are much more intuitive and easy to use than traditional ETL and iPaaS solutions. Even in a greenfield approach, Reverse ETL tools might still be the easiest solution and provide the best time to market.

Also, keep in mind that modern tools such as Salesforce or SAP provide event-based interfaces already. Data storage vendors such as Elastic, Splunk, or Snowflake also heavily invest in streaming layers to natively integrate with tools such as Apache Kafka. The integration with business applications is possible via event streaming in real-time instead of integration via Reverse ETL from the data store.

For these reasons, evaluate your business problem and if you need an event streaming platform, a Reverse ETL tool, or a combination of both.

Kafka Examples for Reverse ETL

Let’s take a look at two concrete examples.

Apache Kafka + Salesforce + Oracle CDC + Snowflake

The following architecture combines real-time data streaming, change data capture, data lake, and a Reverse ETL cloud service:

A few notes on this architecture:

The central nervous system is an event streaming platform (Confluent Cloud) that provides scalable real-time data streaming and true decoupling between any data source and sink.
A SaaS cloud service (Salesforce) natively provides an asynchronous API for event-based real-time integration.
A traditional relational database (Oracle) is integrated with Reverse ETL via change data capture using Confluent’s Oracle CDC connector for Kafka Connect.
Data from all the data sources are processed continuously with stream processing tools such as Kafka Streams and ksqlDB.
Data ingestion into a data warehouse (Snowflake) configured as part of Confluent Cloud’s fully managed Kafka Connect connector.
A business user leverages a dedicated Reverse ETL solution (Seekwell) for getting data out of the data warehouse (Snowflake) into a business application (Google Sheets).

The whole infrastructure provides an event-based, scalable, reliable real-time nervous system. Each application can consume and process data in motion in real-time (if needed). Data storage at rest is complementary for batch use cases and integrated with the event-based platform.

TL;DR: This architecture truly decouples applications, avoids point-to-point spaghetti communication, and supports all technologies, cloud services, and communication paradigms.

Tapping into the Splunk Ingestion Layer in Motion with Kafka

Another option of avoiding the need for Reverse ETL from a storage system is tapping into the existing storage ingestion layer.

Confluent’s Splunk S2S connector is a great example. Suppose organizations already have hundreds or thousands of universal forwarders (UF) and heavy forwarders (HF). In that case, this approach allows users to cost-effectively and reliably read data from Splunk Forwarders to Kafka. It enables users to forward data from universal forwarders into a Kafka topic to unlock the analytical capabilities of the data:

For more details about this use case, check out my blog “Apache Kafka in Cybersecurity for SIEM / SOAR Modernization“.

Don’t Design for Data at Rest to Reverse it!

Good enterprise architecture should never have the goal to plan for reverse ETL from the beginning! It is only needed in brownfield architecture where the data is stored at rest instead of building an event-based architecture for real-time and batch data sinks. Reverse ETL enables Shadow IT and spaghetti architectures. Event streaming enables data integration in real-time by nature.

Nevertheless, Reverse ETL tooling is appropriate for brownfield approaches (ideally via continuous change data capture, not recurring SQL) or if business users need a simple, intuitive user interface. Hence, event streaming and Reverse ETL are complementary. In the same way, event streaming and data warehouses/data lakes are complementary. Read this if you want to learn more: “Serverless Kafka in a Cloud-native Data Lake Architecture“.

What is your point of view on this new ETL buzzword? How do you integrate it into an enterprise architecture? What are your experiences and opinions? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to Use Reverse ETL and when it is an Anti-Pattern appeared first on Kai Waehner.

Enterprise Architecture Archives - Kai Waehner

The State of Data Streaming for Healthcare with Apache Kafka and Flink

General trends in the healthcare industry

Regulation and interoperability

Digital disruption and automated workflows

Data streaming in the healthcare industry

Architecture trends for data streaming

Event-driven architecture for integration and IT modernization

Data mesh for real-time data sharing and consistency

New customer stories for data streaming in the healthcare sector

Resources to learn more

On-demand video recording

Slides

Case studies and lightboard videos for data streaming in the healthcare industry

Data Streaming is not a Race, it is a Journey!

Data Streaming is a Journey, not a Race!

Consulting – Expertise from Software Vendors and System Integrators

Customer Journeys across Industries powered by Apache Kafka

AO.com – Real-Time Clickstream Analytics

Nordstrom – Event-driven Analytics Platform

Migros – End-to-End Supply Chain with IoT

NordLB – Bank-wide Data Streaming for Core Banking

Raiffeisen Bank International – Strategic Real-Time Data Integration Platform across 13 Countries

Allianz – Legacy Modernization and Cloud-Native Innovation

Optum – Self-Service Data Streaming

Intel – Real-Time Cyber Intelligence at Scale with Kafka and Splunk

Bayer AG – Transition from On-Premise to Multi-Cloud

Tesla – Migration from a Message Broker (RabbitMQ) to Data Streaming (Kafka)

Siemens – Integration between On-Premise SAP and Salesforce CRM in the Cloud

50Hertz – Transition from Proprietary SCADA to Cloud-Native Industrial IoT

Technical Evolution during the Data Streaming Journey

The Heart of the Data Mesh Beats Real-Time with Apache Kafka

There is no single technology or product for a data mesh!

What is a data mesh?

Why handle data as a product?

What is data streaming with Apache Kafka and its relation to data mesh?

Example: Real-time data fabric in hybrid cloud

Presentation: Building a decentralized data mesh with data streaming at its heart

The data mesh provides flexibility and freedom of technology choice for each data product

Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

Data warehouse modernization: From legacy on-premise to cloud-native infrastructure

1. Selection of a cloud-native data warehouse

2. Data streaming as (near) real-time and hybrid integration layer

Integration of legacy on-premise data into the cloud-native data warehouse

Example: Confluent Cloud + Snowflake = Cloud-native Data Warehouse Modernization

3. Data warehouse modernization and migration from legacy to cloud-native

What is data warehouse modernization?

Data warehouse migration: Continuous vs. cut-over

Data warehouse modernization is NOT a big bang and NOT a single tool approach!

When to use Apache Camel vs. Apache Kafka?

The history of application integration and event streaming

A discussion that started over a decade ago…

… from application integration to event streaming

Data in motion with the Kappa architecture replacing Lambda

Features in Apache Camel AND Apache Kafka

When to use Apache Camel?

The mission of Camel

Camel’s strengths

Camel’s weaknesses

The evolution of Apache Camel

Camel TL;DR

When to use Apache Kafka?

The mission of Kafka

Kafka’s strengths

Kafka’s weaknesses

The evolution of Apache Kafka

Kafka TL;DR

Decision tree – Camel or Kafka?

Qualify out – the easiest way to start an evaluation!

Decision Tree for Camel and Kafka

When to use Camel and Kafka together?

Kafka for event streaming and Camel for ETL

Camel connectors embedded into Kafka Connect

When NOT to use Camel or Kafka at all?

Apache Camel vs. Apache Kafka – Who is the winner?

When NOT to use Apache Kafka?

Market Trends – A Connected World

Connected Cars – Insane volume of telemetry data and aftersales

Gaming – Billions of players and massive revenues