Architecture Archives - Kai Waehner

Apache Kafka Cluster Type Deployment Strategies

Kai Waehner — Mon, 29 Jul 2024 06:34:49 +0000

Organizations start their data streaming adoption with a single Apache Kafka cluster to deploy the first use cases. The need for group-wide data governance and security, combined with differing SLAs, latency, and infrastructure requirements, introduces new Kafka clusters. Multiple Kafka clusters are the norm, not an exception. Use cases include hybrid integration, aggregation, migration, and disaster recovery. This blog post explores real-world success stories and cluster strategies for different Kafka deployments across industries.

Apache Kafka – The De Facto Standard for Event-Driven Architectures and Data Streaming

Apache Kafka is an open-source, distributed event streaming platform designed for high-throughput, low-latency data processing. It allows you to publish, subscribe to, store, and process streams of records in real time.

Kafka serves as a popular choice for building real-time data pipelines and streaming applications. The Kafka protocol became the de facto standard for event streaming across various frameworks, solutions, and cloud services. It supports operational and analytical workloads with features like persistent storage, scalability, and fault tolerance. Kafka includes components like Kafka Connect for integration and Kafka Streams for stream processing, making it a versatile tool for various data-driven use cases.

While Kafka is famous for real-time use cases, many projects leverage the data streaming platform for data consistency across the entire enterprise architecture, including databases, data lakes, legacy systems, Open APIs, and cloud-native applications.

Different Apache Kafka Cluster Types

Kafka is a distributed system. A production setup usually requires at least four brokers. Hence, most people automatically assume that all you need is a single distributed cluster you scale up when you add throughput and use cases. This is not wrong in the beginning. But…

One Kafka cluster is NOT the right answer for every use case. Various characteristics influence the architecture of a Kafka cluster:

Availability: Zero downtime? 99.99% uptime SLA? Non-critical analytics?
Latency: End-to-end processing in <100ms (including processing)? 10-minute end-to-end data warehouse pipeline? Time travel for re-processing historical events?
Cost: Value vs. cost? Total Cost of Ownership (TCO) matters! For instance, in the public cloud, networking can be up to 80% of the total Kafka cost!
Security and Data Privacy: Data privacy (PCI data, GDPR, etc.)? Data governance and compliance? End-to-end encryption on the attribute level? Bring your own key? Public access and data sharing? Air-gapped edge environment?
Throughput and Data Size: Critical transactions (typically low volume)? Big data feeds (clickstream, IoT sensors, security logs, etc.)?
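To make the availability question concrete, it helps to translate an uptime SLA into a yearly downtime budget. The following is a minimal sketch of that arithmetic; the function name is illustrative, and it assumes a non-leap year and that the SLA is measured over the full year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability SLA."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

A 99.99% SLA leaves roughly 52.6 minutes per year, which is why such a target usually rules out long manual fail-over procedures.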

Related topics like on-premise vs. public cloud, regional vs. global, and many other requirements also affect the Kafka architecture.

Apache Kafka Cluster Strategies and Architectures

A single Kafka cluster is often the right starting point for your data streaming journey. It can onboard multiple use cases from different business domains and process gigabytes per second (if operated and scaled the right way). However, depending on your project requirements, you need an enterprise architecture with multiple Kafka clusters. Here are a few common examples:

Hybrid Architecture: Data integration and uni- or bi-directional data synchronization between multiple data centers. Often, connectivity between an on-premise data center and a public cloud service provider. Offloading from legacy into cloud analytics is one of the most common scenarios. But command & control communication is also possible, i.e., sending decisions/recommendations/transactions into a regional environment (e.g., storing a payment or order from a mobile app in the mainframe).
Multi-Region / Multi-Cloud: Data replication for compliance, cost, or data privacy reasons. Data sharing usually only includes a fraction of the events, not all Kafka Topics. Healthcare is one of many industries that go in this direction.
Disaster Recovery: Replication of critical data in active-active or active-passive mode between different data centers or cloud regions. Includes strategies and tooling for fail-over and fallback mechanisms in the case of a disaster to guarantee business continuity and compliance.
Aggregation: Regional clusters for local processing (e.g., pre-processing, streaming ETL, stream processing business applications) and replication of curated data to the big data center or cloud. Retail stores are an excellent example.
Migration: IT modernization with a migration from on-premise into the cloud or from self-managed open source into a fully managed SaaS. Such migrations can be done with zero downtime or data loss while the business continues during the cut-over.
Edge (Disconnected / Air-Gapped): Security, cost, or latency require edge deployments, e.g., in a factory or retail store. Some industries deploy in safety-critical environments with unidirectional hardware gateways and data diodes.
Single Broker: Not resilient, but sufficient for scenarios like embedding a Kafka broker into a machine or on an Industrial PC (IPC) and replicating aggregated data into a large cloud analytics Kafka cluster. One nice example is the installation of data streaming (including integration and processing) on a computer of a soldier on the battlefield.

Bridging Hybrid Kafka Clusters

These options can be combined. For instance, a single broker at the edge typically replicates some curated data to a remote data center. Hybrid clusters also differ widely in architecture depending on how they are bridged: connections over the public internet, private link, VPC peering, transit gateway, etc.

Having seen the development of Confluent Cloud over the years, I totally underestimated how much engineering time needs to be spent on security and connectivity. However, missing security bridges are the main blocker for the adoption of a Kafka cloud service. So, there is no way around providing various security bridges between Kafka clusters beyond just public internet.

There are even use cases where organizations need to replicate data from the data center to the cloud but the cloud service is NOT allowed to initiate the connection. Confluent built a specific feature, “source-initiated link”, for such security requirements where the source (i.e., the on-premise Kafka cluster) always initiates the connection – even though the cloud Kafka cluster is consuming the data:

Source: Confluent

As you see, it gets complex quickly. Find the right experts to help you from the beginning; not after you already deployed the first clusters and applications.

A long time ago, I described the architecture patterns for distributed, hybrid, edge, and global Apache Kafka deployments in a detailed presentation. Look at that slide deck and video recording for more details about the deployment options and trade-offs.

RPO vs. RTO = Data Loss vs. Downtime

RPO and RTO are two critical KPIs you need to discuss before deciding on a Kafka cluster strategy:

RPO (Recovery Point Objective) is the maximum acceptable amount of data loss measured in time, indicating how frequently backups should occur to minimize data loss.
RTO (Recovery Time Objective) is the maximum acceptable duration of time it takes to restore business operations after a disruption. Together, they help organizations plan their data backup and disaster recovery strategies to balance cost and operational impact.

While people often start with the goal of RPO = 0 and RTO = 0, they quickly realize how hard (but not impossible) it is to achieve this. You need to decide how much data you are okay to lose in a disaster. You need a disaster recovery plan for when disaster strikes. The legal and compliance teams will have to tell you whether it is okay to lose a few data sets in case of disaster or not. These and many other challenges need to be discussed when evaluating your Kafka cluster strategy.

The replication between Kafka clusters with tools like MirrorMaker or Cluster Linking is asynchronous, so the RPO is > 0. Only a stretched Kafka cluster provides RPO = 0.
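The link between asynchronous replication and RPO > 0 can be illustrated with a small model. The sketch below is not how MirrorMaker or Cluster Linking work internally; the `Record` type and the function names are hypothetical, and the model simply assumes that every record written to the source after the last replicated offset is lost when the source cluster fails.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    offset: int
    timestamp_ms: int


def unreplicated(source_log: List[Record], last_replicated_offset: int) -> List[Record]:
    """Records that exist on the source but have not reached the target yet."""
    return [r for r in source_log if r.offset > last_replicated_offset]


def data_loss_window_ms(source_log: List[Record], last_replicated_offset: int,
                        failure_time_ms: int) -> int:
    """Time span of data that is lost if the source fails at failure_time_ms."""
    lost = unreplicated(source_log, last_replicated_offset)
    if not lost:
        return 0
    oldest_lost = min(r.timestamp_ms for r in lost)
    return failure_time_ms - oldest_lost


def meets_rpo(loss_ms: int, rpo_ms: int) -> bool:
    """True if the observed loss window stays within the agreed RPO."""
    return loss_ms <= rpo_ms


if __name__ == "__main__":
    log = [Record(offset=i, timestamp_ms=1000 * (i + 1)) for i in range(5)]
    loss = data_loss_window_ms(log, last_replicated_offset=2, failure_time_ms=5500)
    print(f"loss window: {loss} ms, meets a 1 s RPO: {meets_rpo(loss, 1000)}")
```

In this model the loss window only reaches zero when the target has every record at the moment of failure, which asynchronous replication cannot promise.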

Stretched Kafka Cluster – Zero Data Loss with Synchronous Replication across Data Centers

Most deployments with multiple Kafka clusters use asynchronous replication across data centers or clouds via tools like MirrorMaker or Confluent Cluster Linking. This is good enough for most use cases. But in case of a disaster, you lose a few messages. The RPO is > 0.

A stretched Kafka cluster deploys Kafka brokers of ONE SINGLE CLUSTER across three data centers. The replication is synchronous (as this is how Kafka replicates data within one cluster) and guarantees zero data loss (RPO = 0) – even in the case of a disaster!

Why shouldn’t you always do stretched clusters?

Low latency (<~50ms) and stable connection required between data centers
Three (!) data centers are needed; two are not enough because a majority (quorum) must acknowledge writes and reads to ensure the system’s reliability
Hard to set up, operate, and monitor – much harder than a cluster running in one data center
Cost vs. value is not worth it in many use cases – during a real disaster, most organizations and use cases have bigger problems than losing a few messages (even if it is critical data like a payment or order).
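The three-data-center requirement follows from simple majority arithmetic. The sketch below is a generic quorum calculation, not Kafka code; it assumes the coordination layer (ZooKeeper or KRaft controllers) needs a strict majority of its voters to stay available, and the function names are illustrative.

```python
from typing import Sequence


def majority(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def survives_any_site_loss(voters_per_site: Sequence[int]) -> bool:
    """True if a quorum remains after losing any single site completely."""
    total = sum(voters_per_site)
    needed = majority(total)
    return all(total - lost >= needed for lost in voters_per_site)


if __name__ == "__main__":
    for layout in ([1, 1], [2, 1], [2, 2], [1, 1, 1], [2, 2, 1]):
        verdict = "survives" if survives_any_site_loss(layout) else "does NOT survive"
        print(f"voters per data center {layout}: {verdict} the loss of one site")
```

With two sites, one of them always holds at least half of the voters, so losing that site removes the majority regardless of how the voters are distributed.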

To be clear: In the public cloud, a region usually has three data centers (= availability zones). Hence, in the cloud, it depends on your SLAs if one cloud region counts as a stretched cluster or not. Most SaaS Kafka offerings deploy in a stretched cluster here. However, many compliance scenarios do NOT see a Kafka cluster in one cloud region as good enough for guaranteeing SLAs and business continuity if a disaster strikes.

Confluent built a dedicated product to solve (some of) these challenges: Multi-Region Clusters (MRC). It provides capabilities to do synchronous and asynchronous replication within a stretched Kafka cluster.

For example, in a financial services scenario, MRC replicates low-volume critical transactions synchronously, but high-volume logs asynchronously:

‘Payment’ transactions enter from us-east and us-west with fully synchronous replication
‘Log’ and ‘Location’ information in the same cluster use async – optimized for latency
Automated disaster recovery (zero downtime, zero data loss)
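The trade-off behind mixing synchronous and asynchronous replication per topic can be sketched as a small policy table. This is a hypothetical model and not the MRC configuration format: the `TopicPolicy` type, the topic names, the region assignments, and the latency numbers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TopicPolicy:
    sync_regions: Tuple[str, ...]   # must acknowledge before a write succeeds
    async_regions: Tuple[str, ...]  # catch up in the background


# Hypothetical policies: payments replicate synchronously to both regions,
# logs and locations only wait for the local region.
POLICIES: Dict[str, TopicPolicy] = {
    "payment": TopicPolicy(sync_regions=("us-east", "us-west"), async_regions=()),
    "log": TopicPolicy(sync_regions=("us-east",), async_regions=("us-west",)),
    "location": TopicPolicy(sync_regions=("us-east",), async_regions=("us-west",)),
}


def ack_latency_ms(topic: str, region_latency_ms: Dict[str, int]) -> int:
    """A write is acknowledged once the slowest synchronous region has it."""
    return max(region_latency_ms[r] for r in POLICIES[topic].sync_regions)


def regions_that_may_lag(topic: str) -> Tuple[str, ...]:
    """Regions that can miss the latest records if a disaster strikes."""
    return POLICIES[topic].async_regions


if __name__ == "__main__":
    latency = {"us-east": 2, "us-west": 70}  # assumed round-trip times in ms
    for topic in POLICIES:
        print(topic, ack_latency_ms(topic, latency), regions_that_may_lag(topic))
```

The model shows the design choice: the payment topic pays the cross-region latency on every write in exchange for zero data loss, while the high-volume topics stay fast and accept that the remote region may lag.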

More details about stretched Kafka clusters vs. active-active / active-passive replication between two Kafka clusters in my global Kafka presentation.

Pricing of Kafka Cloud Offerings (vs. Self-Managed)

The above sections explain why you need to consider different Kafka architectures depending on your project requirements. Self-managed Kafka clusters can be configured the way you need. In the public cloud, fully managed offerings look different (the same way as any other fully managed SaaS). Pricing is different because SaaS vendors need to configure reasonable limits. The vendor has to provide specific SLAs.

The data streaming landscape includes various Kafka cloud offerings. Here is an example of Confluent’s current cloud offerings, including multi-tenant and dedicated environments with different SLAs, security features, and cost models.

Source: Confluent

Make sure to evaluate and understand the various cluster types from different vendors available in the public cloud, including TCO, provided uptime SLAs, replication costs across regions or cloud providers, and so on. The gaps and limitations are often intentionally hidden in the details.

For instance, if you use Amazon Managed Streaming for Apache Kafka (MSK), you should be aware that the terms and conditions tell you that “The service commitment does not apply to any unavailability, suspension or termination … caused by the underlying Apache Kafka or Apache Zookeeper engine software that leads to request failures”.

But pricing and support SLAs are just one critical piece of such a comparison. There are lots of “build vs. buy” decisions you have to make as part of evaluating a data streaming platform, as I pointed out in my detailed article comparing Confluent to Amazon MSK Serverless.

Kafka Storage – Tiered Storage and Iceberg Table Format to Store Data Only Once

Apache Kafka added Tiered Storage to separate compute and storage. The capability enables more scalable, reliable, and cost-efficient enterprise architectures. Tiered Storage for Kafka enables a new Kafka cluster type: Storing Petabytes of data in the Kafka commit log in a cost-efficient way (like in your data lake) with timestamps and guaranteed ordering to travel back in time for re-processing historical data. KOR Financial is a nice example of using Apache Kafka as a database for long-term persistence.
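Travelling back in time for re-processing boils down to mapping a timestamp to an offset and consuming from there. The sketch below models that lookup in plain Python; it mirrors the semantics of the Kafka consumer’s `offsetsForTimes` lookup (the earliest offset whose timestamp is at or after the target), but the in-memory log, the function names, and the assumption that timestamps are sorted are all illustrative.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

LogEntry = Tuple[int, int]  # (timestamp_ms, offset), sorted by timestamp


def offset_for_timestamp(log: List[LogEntry], target_ms: int) -> Optional[int]:
    """Earliest offset whose timestamp is >= target_ms, or None if there is none."""
    timestamps = [ts for ts, _ in log]
    index = bisect_left(timestamps, target_ms)
    return log[index][1] if index < len(log) else None


def replay_from(log: List[LogEntry], target_ms: int) -> List[int]:
    """Offsets a consumer would re-process when starting at target_ms."""
    start = offset_for_timestamp(log, target_ms)
    if start is None:
        return []
    return [offset for _, offset in log if offset >= start]


if __name__ == "__main__":
    log = [(100, 0), (200, 1), (200, 2), (300, 3)]
    print(offset_for_timestamp(log, 200))
    print(replay_from(log, 250))
```

Because ordering is guaranteed per partition, a consumer that seeks to the returned offset sees the historical events in the same order as the original consumers did.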

Kafka enables a Shift Left Architecture to store data only once for operational and analytical datasets:

With this in mind, think again about the use cases I described above for multiple Kafka clusters. Should you still replicate data in batch at rest in the database, data lake, or lakehouse from one data center or cloud region to another? No. You should synchronize data in real-time, store the data once (usually in an object store like Amazon S3), and then connect all analytical engines like Snowflake, Databricks, Amazon Athena, Google Cloud BigQuery, and so on to a standard table format such as Apache Iceberg.

Learn more about the unification of operational and analytical data in my article “Apache Iceberg – The Open Table Format for Lakehouse AND Data Streaming”.

Real-World Success Stories for Multiple Kafka Clusters

Most organizations have multiple Kafka clusters. This section explores four success stories across different industries:

PayPal (Financial Services) – US: Instant payments, fraud prevention.
JioCinema (Telco/Media) – APAC: Data integration, clickstream analytics, advertisement, personalization.
Audi (Automotive/Manufacturing) – EMEA: Connected cars with critical and analytical requirements.
New Relic (Software/Cloud) – US: Observability and application performance management (APM) across the world.

PayPal – Separation by Security Zone

PayPal is a digital payment platform that allows users to send and receive money online securely and conveniently around the world in real time. This requires a scalable, secure and compliant Kafka infrastructure.

During the 2022 Black Friday, Kafka traffic volume peaked at about 1.3 trillion messages daily! At present, PayPal has 85+ Kafka clusters, and every holiday season they flex up their Kafka infrastructure to handle the traffic surge. The Kafka platform continues to seamlessly scale to support this traffic growth without any impact on their business.

Today, PayPal’s Kafka fleet consists of over 1,500 brokers that host over 20,000 topics. The events are replicated among the clusters, offering 99.99% availability.

Kafka cluster deployments are separated into different security zones within a data center:

Source: PayPal

The Kafka clusters are deployed across these security zones, based on data classification and business requirements. Real-time replication with tools such as MirrorMaker (in this example, running on Kafka Connect infrastructure) or Confluent Cluster Linking (using a simpler and less error-prone approach directly using the Kafka protocol for replication) is used to mirror the data across the data centers, which helps with disaster recovery and to achieve inter-security zone communication.
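When topics are mirrored between clusters, it must stay clear where a topic originated, otherwise records can loop between clusters. The sketch below models the default naming convention of MirrorMaker 2, which prefixes a replicated topic with the alias of its source cluster. It is an illustration in plain Python and not PayPal’s setup; the cluster aliases, topic names, and function names are hypothetical.

```python
from typing import Optional, Set

SEPARATOR = "."  # default separator between source alias and topic name


def remote_topic_name(source_alias: str, topic: str) -> str:
    """Name of the mirrored topic on the target cluster."""
    return f"{source_alias}{SEPARATOR}{topic}"


def origin_cluster(topic: str, known_aliases: Set[str]) -> Optional[str]:
    """Alias of the cluster a mirrored topic came from, or None for a local topic."""
    prefix, separator, _ = topic.partition(SEPARATOR)
    return prefix if separator and prefix in known_aliases else None


def should_replicate(topic: str, target_alias: str, known_aliases: Set[str]) -> bool:
    """Do not send a topic back to the cluster it originally came from."""
    return origin_cluster(topic, known_aliases) != target_alias


if __name__ == "__main__":
    aliases = {"secure-zone", "dmz"}  # hypothetical security zones
    print(remote_topic_name("dmz", "orders"))
    print(should_replicate("dmz.orders", "dmz", aliases))
    print(should_replicate("payments.authorized", "dmz", aliases))
```

The prefix check only treats a topic as mirrored if the prefix is a known cluster alias, so ordinary topic names that contain a dot are still replicated.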

JioCinema – Separation by Use Case and SLA

JioCinema is a rapidly growing video streaming platform in India. The telco OTT service is known for its expansive content offerings, including live sports like the Indian Premier League (IPL) for cricket, a newly launched Anime Hub, and comprehensive plans for covering major events like the Paris 2024 Olympics.

The data architecture leverages Apache Kafka, Flink, and Spark for data processing, as presented at Kafka Summit India 2024 in Bangalore:

Source: JioCinema

Data streaming plays a pivotal role in various use cases to transform user experiences and content delivery. Over ten million messages per second enhance analytics, user insights, and content delivery mechanisms.

JioCinema’s use cases include:

Inter Service Communication
Clickstream / Analytics
Ad Tracker
Machine Learning and Personalization

Kushal Khandelwal, Head of Data Platform, Analytics, and Consumption at JioCinema, explained that not all data is equal and the priorities and SLAs differ per use case:

Source: JioCinema

Data streaming is a journey. Like so many other organizations worldwide, JioCinema started with one large Kafka cluster using 1000+ Kafka Topics and 100,000+ Kafka Partitions for various use cases. Over time, a separation of concerns regarding use cases and SLAs developed into multiple Kafka clusters:

Source: JioCinema

The success story of JioCinema shows the common evolution of a data streaming organization. Let’s now explore another example where two very different Kafka clusters were deployed from the beginning for one use case.

Audi – Operations vs. Analytics for Connected Cars

The car manufacturer Audi provides connected cars featuring advanced technology that integrates internet connectivity and intelligent systems. Audi’s cars enable real-time navigation, remote diagnostics, and enhanced in-car entertainment. These vehicles are equipped with Audi Connect services. Features include emergency calls, online traffic information, and integration with smart home devices, to enhance convenience and safety for drivers.

Source: Audi

Audi presented their connected car architecture in the keynote of Kafka Summit in 2018. The Audi enterprise architecture relies on two Kafka clusters with very different SLAs and use cases.

Source: Audi

The Data Ingestion Kafka cluster is very critical. It needs to run 24/7 at scale. It provides last-mile connectivity to millions of cars using Kafka and MQTT. Backchannels from the IT side to the vehicle help with service communication and over-the-air updates (OTA).

ACDC Cloud is the analytics Kafka cluster of Audi’s connected car architecture. The cluster is the foundation of many analytical workloads. These process enormous volumes of IoT and log data at scale with batch processing frameworks, like Apache Spark.

This architecture was already presented in 2018. Audi’s slogan “Progress through Technology” shows how the company applied new technology for innovation long before most car manufacturers deployed similar scenarios. All sensor data from the connected cars is processed in real time and stored for historical analysis and reporting.

New Relic – Worldwide Multi-Cloud Observability

New Relic is a cloud-based observability platform that provides real-time performance monitoring and analytics for applications and infrastructure to customers around the world.

Andrew Hartnett, VP of Software Engineering at New Relic, explains how data streaming is crucial for the entire business model of New Relic:

“Kafka is our central nervous system. It is a part of everything that we do. Most services across 110 different engineering teams with hundreds of services touch Kafka in some way, shape, or form in our company, so it really is mission-critical. What we were looking for is the ability to grow, and Confluent Cloud provided that.”

New Relic ingested up to 7 billion data points per minute; on track to ingest 2.5 exabytes of data in 2023. As New Relic expands its multi-cloud strategies, teams will use Confluent Cloud for a single pane of glass view across all environments.

“New Relic is multi-cloud. We want to be where our customers are. We want to be in those same environments, in those same regions, and we wanted to have our Kafka there with us,” says Hartnett in a Confluent case study.

Multiple Kafka Clusters are the Norm; Not an Exception!

Event-driven architectures and stream processing have existed for decades. The adoption grows with open source frameworks like Apache Kafka and Flink in combination with fully managed cloud services. More and more organizations struggle with their Kafka scale. Enterprise-wide data governance, center of excellence, automation of deployment and operations, and enterprise architecture best practices help to successfully provide data streaming with multiple Kafka clusters for independent or collaborating business domains.

Multiple Kafka clusters are the norm, not an exception. Use cases such as hybrid integration, disaster recovery, migration or aggregation enable real-time data streaming everywhere with the needed SLAs.

What does your enterprise architecture look like? How many Kafka clusters do you have? And how do you decide about data governance, separation of concerns, multi-tenancy, security, and similar challenges in your data streaming organization? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka Cluster Type Deployment Strategies appeared first on Kai Waehner.

Apache Kafka in the Public Sector – Part 2: Smart City

Kai Waehner — Tue, 12 Oct 2021 07:48:48 +0000

The public sector includes many different areas. Some groups, like the military, leverage cutting-edge technology. Others, like the public administration, are years or even decades behind. This blog series explores how the public sector leverages data in motion powered by Apache Kafka to add value for innovative new applications and to modernize legacy IT infrastructures. This post is part 2: Use cases and architectures for a Smart City.

Blog series: Apache Kafka in the Public Sector and Government

This blog series explores why many governments and public infrastructure sectors leverage event streaming for various use cases. Learn about real-world deployments and different architectures for Kafka in the public sector:

Subscribe to my newsletter to get updates immediately after the publication. Besides, I will also update the above list with direct links to this blog series’s posts once published.

As a side note, if you wonder why healthcare is not on the above list: healthcare is a blog series of its own. While the government can provide public health care through national healthcare systems, it is part of the private sector in many other cases.

Real-time is Mandatory for a Smart City Everywhere

I wrote a lot about event streaming and Apache Kafka for smart city infrastructure and use cases. I won’t repeat myself. Check out the following posts: “Event Streaming with Kafka as Foundation for a Smart City” and “Apache Kafka and MQTT for the Last Mile IoT integration in a Smart City”.

This post dives deeper into architectural questions and how collaboration with 3rd party services can look from the perspective of the government and the public administration of a smart city.

The Need for Real-time Data Processing Everywhere in a Smart City and how Kafka helps

A smart city is a very complex beast. I am glad that I only cover technology and not regulatory or political discussions. However, even the technology standpoint is not straightforward. A smart city needs to correlate data across data centers, devices, vehicles, and many other things. This scenario is an actual internet of things (IoT) and therefore includes plenty of different technologies, communication paradigms, and infrastructures:

Smart city projects require the integration of various 1st party and 3rd party services. Most use cases only work well if that data is correlated in real-time; think about traffic routing, emergency alerts, predictive monitoring and maintenance, mobility services such as ride-hailing, and other fancy smart city use cases. Without real-time data processing, the use case is either a bad user experience or not cost-efficient. Hence, Kafka is adopted more and more for these scenarios.

Low Latency and 5G Networks for (some) Data Streaming Use Cases

The term “real-time” needs to be defined. Processing data in a few seconds is good enough in most use cases and a significant game-changer compared to hourly, daily, or weekly batch processing.

Having said this, some use cases like location-based upselling in retail or condition monitoring in equipment and manufacturing require lower latency, meaning sub-second end-to-end data processing.

Here is an example of leveraging 5G networks for low latency. The demo was built by the AWS Wavelength team, Verizon, and Confluent:

Most real-world deployments use separation of concerns: Low-latency use cases run at the edge and everything else in the regular data center or public cloud region. Read the article “Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure” for more details.

At this point, it is important to remind everybody that Kafka (and any IT software) is not hard real-time and not built for the OT world and embedded systems. Learn more in the article “Kafka is NOT hard real-time but soft real-time”. Also, (soft) real-time does not compete with batch processing and data warehouse/data lake architectures. As you can learn in “Serverless Kafka in a Cloud-native Data Lake Architecture”, it is complementary.

Collaboration between Government, City, and 3rd Party via Open API

Real-time data processing is crucial in implementing smart city use cases. Additionally, most smart city projects require collaboration between different teams, infrastructures, and 3rd party services.

Let’s take a look at three very different real-world event streaming deployments to see the broad spectrum of use cases and integration challenges:

Ohio Department of Transportation’s government-owned event streaming platform
Deutsche Bahn’s single source of truth for customer communication in real-time and 3rd party integration with the Google Maps API
Free Now’s mobility service in the cloud for real-time data correlation in compliance with regional laws and independent vehicles/drivers.

Ohio Department of Transportation (ODOT) – A Government-Owned Event Streaming Platform

Ohio Department of Transportation (ODOT) has an exciting initiative: DriveOhio. It aims to organize and accelerate smart vehicle and connected vehicle projects in the State of Ohio. DriveOhio offers to be the single point of contact for policymakers, agencies, researchers, and private companies to collaborate with one another on intelligent transportation efforts around the state.

ODOT presented their real-time transportation data platform at the last Kafka Summit Americas:

The whole Kafka ecosystem powers ODOT’s cloud-native Event Streaming Platform (ESP). The platform enables continuous data integration and stream processing for transactional and analytical workloads. The ESP runs on Kubernetes to provide an elastic, flexible, and scalable infrastructure for real-time data processing.

Deutsche Bahn – Single Source of Truth and Google Maps Integration in Real-time

Deutsche Bahn is a German railway company. It is a private joint-stock company (AG), with the Federal Republic of Germany being its single shareholder. I already talked about their real-time traveler information system in another blog post: “Mobility Services and Transportation powered by Apache Kafka”.

They leverage the Apache Kafka ecosystem powered by Confluent because it combines several characteristics that you would have to integrate with different technologies otherwise:

Real-time messaging
Data integration
Data correlation
Storage and caching
Replication and high availability
Elastic scalability

This example is excellent for this blog. It shows how an existing solution needs connectivity to other internal applications and 3rd party services to provide a better customer experience and expand the customer base.

Recently, Deutsche Bahn integrated its platform with Google Maps via Google’s Open API. In addition to a better customer experience, the railway company can reach out to many new end-users to expand their business. The Railway-News has a good article about this integration. Here is my summary:

Free Now – Mobility Service in the Cloud Connected to Regional Laws and Vehicles

Free Now (formerly MyTaxi) is a mobility service. Their app uses mobile and GPS technology to match taxi drivers with passengers based on availability and proximity. Mobility services need to integrate with other 3rd party services for routing, payment, tax implications, and many different use cases.

Here is one example from Free Now’s Kafka Summit talk where they explain the added value of continuous stream processing for calculating context-specific dynamic pricing:
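Free Now’s actual pricing logic is not reproduced here. As a rough illustration of the general idea, the following sketch computes a price multiplier from the ratio of ride requests to available drivers within a sliding time window. The class name, the window size, the base fare, and the cap are all assumptions made for this example.

```python
from collections import deque
from typing import Deque


class SurgePricer:
    """Hypothetical sliding-window pricing: more demand than supply raises the fare."""

    def __init__(self, window_s: int = 300, base_fare: float = 10.0,
                 max_multiplier: float = 3.0) -> None:
        self.window_s = window_s
        self.base_fare = base_fare
        self.max_multiplier = max_multiplier
        self._requests: Deque[int] = deque()
        self._drivers: Deque[int] = deque()

    def on_ride_request(self, timestamp_s: int) -> None:
        self._requests.append(timestamp_s)

    def on_driver_available(self, timestamp_s: int) -> None:
        self._drivers.append(timestamp_s)

    def _evict(self, events: Deque[int], now_s: int) -> None:
        # Drop events that have fallen out of the sliding window.
        while events and events[0] <= now_s - self.window_s:
            events.popleft()

    def fare(self, now_s: int) -> float:
        self._evict(self._requests, now_s)
        self._evict(self._drivers, now_s)
        demand = len(self._requests)
        supply = max(len(self._drivers), 1)  # avoid division by zero
        multiplier = min(max(demand / supply, 1.0), self.max_multiplier)
        return round(self.base_fare * multiplier, 2)


if __name__ == "__main__":
    pricer = SurgePricer()
    for t in range(4):
        pricer.on_ride_request(t)
    for t in range(2):
        pricer.on_driver_available(t)
    print(pricer.fare(now_s=10))   # demand 4, supply 2
    print(pricer.fare(now_s=400))  # window has expired, back to the base fare
```

The point of processing this continuously is that the fare reflects the situation of the last few minutes instead of the result of an hourly batch job.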

The public administration is always involved when a new mobility service is released to the public. While some cities build their own mobility services, the reality is that most governments provide the infrastructure together with the telco providers, and 3rd party vendors provide the mobility service. The specific relationship between the government, city, and mobility service provider differs across regions, countries, and continents.

Almost every mobility service uses Kafka as its backbone. Google for your favorite mobility service across the globe and add “Kafka” to the search. Chances are very high that you find some excellent blog posts, conference talks, or at least job offers from the mobility service’s recruiting page. Here are just a few examples of companies that posted great content about their Kafka usage: Uber, Lyft, Grab, Otonomo, Here Technologies, and many more.

Data in Motion with Kafka for a Connected and Innovative Smart City

Smart City is a vast topic. Many stakeholders are involved. Collaboration and Open APIs are critical for success. In most cases, governments work together with telco providers, infrastructure providers such as the cloud hyperscalers, and software vendors (including an event streaming platform like Kafka).

Most valuable and innovative smart city use cases require data processing in real-time. The use cases require data integration, storage, backpressure handling, and data correlation. Event Streaming is the ideal technology for these use cases. Examples from the Ohio Department of Transportation, Deutsche Bahn and its Google Maps integration, and Free Now showed a few different angles to realize successful smart city projects.

How do you leverage event streaming in the public sector? Are you working on smart city projects? What technologies and architectures do you use? Which projects have you already worked on, and which are you planning? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Public Sector – Part 2: Smart City appeared first on Kai Waehner.

Apache Kafka in the Public Sector – Blog Series about Use Cases and Architectures

Kai Waehner — Thu, 07 Oct 2021 14:13:24 +0000

The public sector includes many different areas. Some groups, like the military, leverage cutting-edge technology. Others, like the public administration, are years or even decades behind. This blog series explores how the public sector leverages data in motion powered by Apache Kafka to add value for innovative new applications and to modernize legacy IT infrastructures. Life is a stream of events. Therefore, examples include a broad spectrum of use cases across smart cities, citizen services, energy and utilities, and national security deployed across the edge, hybrid, and multi-cloud scenarios.

Blog series: Apache Kafka in the Public Sector and Government

Life is a Stream of Events (THIS POST)
Smart City
Citizen Services
Energy and Utilities
National Security

Subscribe to my newsletter to get updates immediately after the publication. Besides, I will also update the above list with direct links to this blog series’s posts once published.

The Public Sector is a Broad Spectrum of Use Cases

The public sector covers so many different areas. Examples include defense, law enforcement, national security, healthcare, public administration, police, judiciary, finance and tax, research, aerospace, agriculture, etc. Many of these terms and sectors overlap. In many countries, some of these sectors are private or a combination of public and private. For these reasons, my blog series does not cover specific sectors. Instead, I focus on use cases. Many of these are applicable across many sectors.

Real-time Data Beats Slow Data in the Public Sector

I won’t do yet another long introduction about the added value of real-time data. Check out my blog about “Use Cases across Industries for Data in Motion powered by Apache Kafka” to understand the broad spectrum and benefits. The public sector is not different: Real-time data beats slow data in almost every use case! Here are a few examples:

But think about your use cases! How often can you say that getting data late (like in one hour or the following day) is better than getting data when it happens (now, in a few milliseconds or seconds)? Probably not very often.

An important fact is that the added business value comes from correlating the events from different data sources. As an example, let’s look at the processes in a smart city:

The sensor data from the car is only valuable if an application correlates it with data from other vehicles in the traffic planning system. Intelligent parking is only reasonable if it integrates with the overall city planning. Emergency service needs to receive an alert in real-time if a crash happens. All of that needs to happen in real-time! It does not matter if the use case is about transactional workloads (usually smaller data sets) or analytical workloads (usually more extensive data sets).

Open API and Partnerships are Mandatory

Governments can build great applications. At least in theory. In practice, they rely on external data from partners and 3rd party applications for many potential use cases:

Governments and cities need to work with several other stakeholders, including carmakers, suppliers, telcos, mobility services, cloud providers, software providers, etc. Standards and open APIs are mandatory for successful cross-cutting projects. The foundation of such an enterprise architecture is an open, reliable, scalable platform that can process data in real-time. Apache Kafka became the de facto standard for event streaming.

An example that shows the added value of data integration across stakeholders and processing the data in real-time: Transportation Services. A mobile app needs context. Think about hailing a taxi ride. It doesn’t help you if you see the position of each taxi on the city map in real-time. You want to know the estimated time of arrival, the estimated cost, the estimated time of arrival at your destination, the car model that will pick you up, and so much more.

This use case – like many others – is only possible if you integrate and correlate the data from many different interfaces like a mapping service, all taxi drivers, all customers in a city, the weather service, backend analytics services, and much more:

The left side of the picture shows a dashboard built with a real-time message queue like RabbitMQ. The right side shows the correlation of data from different sources in real-time with an event streaming platform like Apache Kafka.

I hope you agree on the added value of the event streaming platform. Just sending data from A to B in real-time is not enough. Only the data processing in real-time adds true value.
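To make the difference between forwarding and correlating concrete, here is a minimal in-memory sketch of the taxi example. It is not Kafka code: the event fields, the grid-based distance, and the constant AVG_MINUTES_PER_BLOCK are assumptions made purely for illustration. The point is that a ride request only becomes useful once it is joined with the latest state from another stream.

```python
# Latest known position per taxi, updated by incoming position events.
# All field names below are hypothetical.
taxi_positions = {}  # taxi_id -> (x, y) on a simple city grid

# Assumed constant; a real system would correlate traffic and weather data instead.
AVG_MINUTES_PER_BLOCK = 2


def on_taxi_position(event):
    """Keep only the most recent position for each taxi."""
    taxi_positions[event["taxi_id"]] = (event["x"], event["y"])


def on_ride_request(event):
    """Correlate a ride request with the latest taxi positions to estimate pickup time."""
    if not taxi_positions:
        return None

    def blocks_away(pos):
        return abs(pos[0] - event["x"]) + abs(pos[1] - event["y"])

    # Pick the nearest taxi; break ties by taxi id so the result is deterministic.
    taxi_id, pos = min(
        taxi_positions.items(), key=lambda item: (blocks_away(item[1]), item[0])
    )
    return {
        "rider": event["rider"],
        "taxi_id": taxi_id,
        "eta_minutes": blocks_away(pos) * AVG_MINUTES_PER_BLOCK,
    }


for position in [
    {"taxi_id": "t1", "x": 0, "y": 0},
    {"taxi_id": "t2", "x": 5, "y": 5},
    {"taxi_id": "t1", "x": 4, "y": 4},  # t1 moved; only the latest position counts
]:
    on_taxi_position(position)

offer = on_ride_request({"rider": "r1", "x": 3, "y": 4})
```

In a Kafka deployment, the dictionary would typically be replaced by a state store or table maintained by a stream processing application, but the correlation logic follows the same idea.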

Data in Motion as Paradigm Shift in the Public Sector

Real-time beats slow data. No matter if you think about cutting-edge use cases in national security or modernizing the IT infrastructure in the public administration. Event Streaming is the foundation of this paradigm shift moving towards real-time data processing in the public sector. The upcoming posts of this blog series explore many different use cases and architectures. If you also want to learn more about Apache Kafka offerings on the market, check out my comparison of Apache Kafka products and cloud services.

How do you leverage event streaming in the public sector? What technologies and architectures do you use? What projects have you already worked on or are you planning? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Public Sector – Blog Series about Use Cases and Architectures appeared first on Kai Waehner.

When to Use Reverse ETL and when it is an Anti-Pattern

Kai Waehner — Thu, 30 Sep 2021 16:50:21 +0000

Most enterprises store their massive volumes of transactional and analytics data at rest in data warehouses or data lakes. Sales, marketing, and customer success teams require access to these data sets. Reverse ETL is a buzzword that defines the concept of collecting data from existing data stores to make it easily and quickly available to business teams.

This blog post explores why software vendors (try to) introduce new solutions for Reverse ETL, when it is needed, and how it fits into the enterprise architecture. The involvement of event streaming with tools like Apache Kafka to process data in motion is a crucial piece of Reverse ETL for real-time use cases.

What are ETL and Reverse ETL?

Let’s begin with the terms. What do ETL and Reverse ETL mean?

ETL (Extract-Transform-Load)

Extract-Transform-Load (ETL) is a common term for data integration. Vendors like Informatica or Talend provide visual coding to implement robust ETL pipelines. The cloud brought new SaaS players and the term Integration Platform as a Service (iPaaS) into the ETL market with vendors such as Boomi, SnapLogic, or Mulesoft Anypoint.

Most ETL tools operate in batch processes for big data workloads or use SOAP/REST web services and APIs for non-scalable real-time communication. ETL pipelines consume data from various data sources, transform or aggregate it, and store the processed data at rest in data sinks such as databases, data warehouses, or data lakes:

ELT (Extract-Load-Transform)

Extract-Load-Transform (ELT) is a very similar approach. However, the transformations and aggregations happen after the ingestion into the datastore:
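The difference between ETL and ELT is mostly a question of where the transformation runs. The following sketch uses Python's built-in sqlite3 module as a stand-in for a data warehouse; the table names and the order records are hypothetical. Both functions end up with the same totals, but run_etl aggregates in the pipeline before loading, while run_elt loads the raw rows and aggregates inside the data store with SQL.

```python
import sqlite3

# Hypothetical raw order events; field names are illustrative only.
RAW_ORDERS = [
    {"customer": "alice", "amount_cents": 1250},
    {"customer": "bob", "amount_cents": 800},
    {"customer": "alice", "amount_cents": 2750},
]


def run_etl(conn, rows):
    """ETL: aggregate in the pipeline, then load only the result."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount_cents"]
    conn.execute("CREATE TABLE etl_totals (customer TEXT PRIMARY KEY, total_cents INTEGER)")
    conn.executemany("INSERT INTO etl_totals VALUES (?, ?)", sorted(totals.items()))


def run_elt(conn, rows):
    """ELT: load the raw rows first, then transform inside the data store."""
    conn.execute("CREATE TABLE raw_orders (customer TEXT, amount_cents INTEGER)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [(r["customer"], r["amount_cents"]) for r in rows],
    )
    conn.execute(
        "CREATE TABLE elt_totals AS "
        "SELECT customer, SUM(amount_cents) AS total_cents "
        "FROM raw_orders GROUP BY customer"
    )


def read_totals(conn, table):
    """Return (customer, total_cents) rows from one of the two result tables."""
    query = f"SELECT customer, total_cents FROM {table} ORDER BY customer"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn, RAW_ORDERS)
run_elt(conn, RAW_ORDERS)
```

One practical consequence: with ELT, the raw rows remain available in the store, so new transformations can be added later without re-extracting the data from the sources.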

It is no surprise that modern data storage and analytics vendors such as Databricks and Snowflake promote the ELT approach. For instance, Snowflake pitches the “internal data mesh” where all the domains and data products are built within their cloud service.

Reverse ETL

As the name says, Reverse ETL turns the story from ETL around. It means the process of moving data from a data store into third-party systems to “make data operational”, as the marketing of these solutions says:

The data is consumed from long-term storage systems (data warehouse, data lake). The data is then pushed into business applications such as Salesforce (CRM), Marketo (marketing), or Service Now (customer success) to leverage it for pipeline generation, marketing campaigns, or customer communication.

Products and SaaS Offerings for Reverse ETL

Just google for “Reverse ETL” to find vendors specifically pitching their solutions. They also pay for ads on the “normal data integration terms”. Therefore, the chances are high that you already saw them even if you did not search for them.

Most of these companies are young companies and startups building a new business around Reverse ETL products. Software vendors I found in my research include Hightouch, Census, Grouparoo (open source), Rudderstack, Omnata, and Seekwell.

Fun fact: If you search for Snowflake’s Reverse ETL, you will not find any google hit as they want to keep the data in their data warehouse.

A key strength and selling point of all ETL tools is visual coding, and therefore time to market for the development and maintenance of ETL pipelines. Some solutions target the citizen integrator (a term coined by Gartner), i.e., businesspeople building their integrations.

Reverse ETL == Real-Time Data for Sales, Marketing, Customer Success

Most Reverse ETL success stories focus on sales, marketing, or customer success. These use cases attract business divisions. These teams do not want to buy a technical ETL tool like Informatica or Talend. Businesspeople expect straightforward and intuitive user interfaces, like a citizen integrator.

The vendors target the businesspeople and promise a simplified technical infrastructure. For instance, one vendor promotes “Cut out legacy middleware and reduce ETL jobs”. My first thought: Welcome, shadow IT!

Nevertheless, let’s take a look at the use cases for Reverse ETL:

Identify customers at-risk and potential customer churn before it happens
Drive new sales by correlating data from the CRM and other interfaces
Hyper-personalized marketing for cross-selling and upselling to existing customers
Operational analytics to monitor the changes in business applications and data faster
Data replication to modern cloud applications for better reporting capabilities and finding insights

Additionally, all of the vendors also talk about real-time data for the above use cases. That’s great. BUT: Unfortunately, Reverse ETL is a huge ANTI-PATTERN to build real-time use cases. Let’s explore in more detail why.

Reverse ETL + Data Lake + Real-Time == Myth

Those use cases described above are great with tremendous business value. If you follow my blog or presentations, you have probably seen precisely the same real-time use cases built natively with event streaming processing data in motion.

If you store data in a data warehouse or data lake, you cannot process it in real-time anymore as it is already stored at rest. These data stores are built for indexing, search, batch processing, reporting, model training, and other use cases that make sense in the storage system. But you cannot consume the data in real-time in motion from storage at rest:

That’s where event streaming comes into play. Platforms like Apache Kafka enable processing data in motion in real-time for transactional and analytical workloads.

So, let’s take a look at a modern enterprise architecture that leverages event streaming for data processing in motion AND a data warehouse or data lake for data processing at rest.

Reverse ETL in the Enterprise Architecture

Let’s explore how Reverse ETL fits into the enterprise architecture and when you need a separate tool for this. For this, let’s go one step back first. What does Reverse ETL do? It takes data out of the storage, transforms or aggregates the data, and then ingests it into business applications.

Two options exist for Reverse ETL: SQL queries and Change Data Capture (CDC).

Reverse ETL == SQL Queries vs Change Data Capture

If a Reverse ETL tool uses SQL, then it is usually a query to data at rest. This use case enables businesspeople to create queries in intuitive user interfaces. Use cases include creating new marketing campaigns or analyzing the customer success journey. SQL-based Reverse ETL requires intuitive tools that are simple to use.
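A minimal sketch of what such a recurring SQL-based sync could look like, using sqlite3 as a stand-in for the data warehouse and a plain dictionary as a stand-in for the business application. The table customer_scores, its columns, the 0.8 threshold, and the function sync_at_risk_customers are all hypothetical and do not reflect any vendor's product.

```python
import sqlite3


def sync_at_risk_customers(conn, crm, last_sync):
    """Query rows changed since the last run and push them to the business app.

    Returns the new watermark to pass into the next scheduled run.
    """
    rows = conn.execute(
        "SELECT customer_id, churn_score, updated_at FROM customer_scores "
        "WHERE churn_score >= ? AND updated_at > ? ORDER BY updated_at",
        (0.8, last_sync),
    ).fetchall()
    for customer_id, churn_score, updated_at in rows:
        # Stand-in for an API call into a CRM or marketing tool.
        crm[customer_id] = {"at_risk": True, "churn_score": churn_score}
        last_sync = max(last_sync, updated_at)
    return last_sync


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_scores (customer_id TEXT, churn_score REAL, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_scores VALUES (?, ?, ?)",
    [("c1", 0.91, 100), ("c2", 0.35, 110), ("c3", 0.85, 120)],
)

crm = {}
watermark = sync_at_risk_customers(conn, crm, last_sync=0)
```

The business application only learns about a change when the next scheduled run picks it up, so the freshness of the data is bounded by the polling interval.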

If a Reverse ETL tool provides real-time data correlation and push notifications, it uses change data capture (CDC). CDC is automated and enables acting on changes in the data storage in real-time. The pipeline includes data correlation from different data sources and sending real-time push messages into business applications. CDC-based Reverse ETL requires a scalable, reliable event streaming infrastructure.
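The CDC variant can be sketched as a handler that reacts to each change event instead of polling. The event shape below (op, before, after) is a simplified assumption loosely modeled on common CDC envelopes, not the exact format of any specific connector; the threshold and the action name are hypothetical as well.

```python
AT_RISK = 0.8  # assumed churn threshold


def on_change(event, outbox):
    """React to a single change event as soon as it arrives."""
    if event["op"] == "delete":
        return
    before = event.get("before") or {}
    after = event["after"]
    was_at_risk = before.get("churn_score", 0.0) >= AT_RISK
    is_at_risk = after["churn_score"] >= AT_RISK
    # Notify only when a customer crosses the threshold, not on every update.
    if is_at_risk and not was_at_risk:
        outbox.append(
            {"customer_id": after["customer_id"], "action": "notify_account_manager"}
        )


change_log = [
    {"op": "insert", "before": None,
     "after": {"customer_id": "c1", "churn_score": 0.40}},
    {"op": "update", "before": {"customer_id": "c1", "churn_score": 0.40},
     "after": {"customer_id": "c1", "churn_score": 0.86}},
    {"op": "update", "before": {"customer_id": "c1", "churn_score": 0.86},
     "after": {"customer_id": "c1", "churn_score": 0.90}},
    {"op": "delete", "before": {"customer_id": "c1", "churn_score": 0.90},
     "after": None},
]

outbox = []
for change in change_log:
    on_change(change, outbox)
```

Because every change carries its previous and new state, the handler can act on transitions, which a recurring SQL query over the current table state cannot see.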

As you can see, both SQL and CDC approaches have their use cases and sometimes overlap in tooling and infrastructure. Change-log-based CDC is often the preferred technical approach over synchronizing data on a recurring schedule with SQL or triggering a sync by calling an API, no matter if you use “just” an event streaming platform or a particular Reverse ETL product.

However, the more important question is how to design an enterprise architecture to AVOID the need for Reverse ETL.

Event-driven Architecture + Streaming ETL == Reverse ETL built-in

Real-time data beats slow data. That’s true for most use cases. Hence, the rise of event-driven architectures is unstoppable:

Reverse ETL is not needed in modern event-driven architecture! It is “built-in” into the architecture out-of-the-box. Each consumer directly consumes the data in real-time if it is appropriate and technically feasible. And data warehouses or data lakes still consume it at their own pace in near-real-time or batch:
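The idea that every consumer reads the same events at its own pace can be illustrated with a tiny in-memory append-only log. This is a conceptual sketch only, not the Kafka client API: the class EventLog and the consumer names are hypothetical, and a real Kafka topic adds partitions, persistence, and consumer groups.

```python
class EventLog:
    """Append-only log; each consumer tracks its own read offset."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer name -> next offset to read

    def append(self, event):
        self._events.append(event)

    def poll(self, consumer, max_records=None):
        """Return unread events for this consumer and advance only its offset."""
        start = self._offsets.get(consumer, 0)
        end = len(self._events)
        if max_records is not None:
            end = min(start + max_records, end)
        self._offsets[consumer] = end
        return self._events[start:end]


log = EventLog()
realtime_seen = []
warehouse_batches = []

for order_id in range(1, 5):
    log.append({"order_id": order_id})
    # The real-time consumer reads every event right after it is written.
    realtime_seen.extend(log.poll("crm-alerts"))

# The warehouse loader reads the same four events later, in a single batch.
warehouse_batches.append(log.poll("warehouse-loader"))
```

Neither consumer affects the other: the slow batch reader does not hold back the real-time one, and the data never has to be pulled back out of the warehouse to reach the business application.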

The Kafka-native streaming SQL engine ksqlDB provides CDC capabilities and continuous stream processing. Therefore, you could even call ksqlDB a Reverse ETL tool if your marketing asks for it.

If you want to learn more about building real-time data platforms, check out the article “Kappa architecture is mainstream replacing Lambda“. It explores how companies like Uber, Shopify, and Disney built an event-driven Kappa architecture for any use case, including real-time, near-real-time, batch, and request-response.

When do you need Reverse ETL?

A greenfield architecture built from the ground up with an event streaming platform at its heart does not need Reverse ETL to consume data from a data warehouse or data lake as every consumer can already consume the data in real-time.

However, providing an interface for business users is NOT solved out-of-the-box with an event streaming platform like Apache Kafka. You need to add additional tools like Kafka CDC connectors, or 3rd party tools with intuitive user interfaces.

Hence, Reverse ETL can be helpful in two scenarios: Brownfield integration and simple tools for business users.

Brownfield architectures where data is stored at rest and businesspeople need to consume it in business applications. Data needs to be pushed out of the data storage for sales, marketing, or customer success use cases:

Simple integration tools for business people are much more intuitive and easy to use than traditional ETL and iPaaS solutions. Even in a greenfield approach, Reverse ETL tools might still be the easiest solution and provide the best time to market.

Also, keep in mind that modern tools such as Salesforce or SAP provide event-based interfaces already. Data storage vendors such as Elastic, Splunk, or Snowflake also heavily invest in streaming layers to natively integrate with tools such as Apache Kafka. The integration with business applications is possible via event streaming in real-time instead of integration via Reverse ETL from the data store.

For these reasons, evaluate your business problem and whether you need an event streaming platform, a Reverse ETL tool, or a combination of both.

Kafka Examples for Reverse ETL

Let’s take a look at two concrete examples.

Apache Kafka + Salesforce + Oracle CDC + Snowflake

The following architecture combines real-time data streaming, change data capture, data lake, and a Reverse ETL cloud service:

A few notes on this architecture:

The central nervous system is an event streaming platform (Confluent Cloud) that provides scalable real-time data streaming and true decoupling between any data source and sink.
A SaaS cloud service (Salesforce) natively provides an asynchronous API for event-based real-time integration.
A traditional relational database (Oracle) is integrated with Reverse ETL via change data capture using Confluent’s Oracle CDC connector for Kafka Connect.
Data from all the data sources are processed continuously with stream processing tools such as Kafka Streams and ksqlDB.
Data ingestion into a data warehouse (Snowflake) is configured as part of Confluent Cloud’s fully managed Kafka Connect connector.
A business user leverages a dedicated Reverse ETL solution (Seekwell) for getting data out of the data warehouse (Snowflake) into a business application (Google Sheets).

The whole infrastructure provides an event-based, scalable, reliable real-time nervous system. Each application can consume and process data in motion in real-time (if needed). Data storage at rest is complementary for batch use cases and integrated with the event-based platform.

TL;DR: This architecture truly decouples applications, avoids point-to-point spaghetti communication, and supports all technologies, cloud services, and communication paradigms.

Tapping into the Splunk Ingestion Layer in Motion with Kafka

Another option for avoiding the need for Reverse ETL from a storage system is tapping into the existing storage ingestion layer.

Confluent’s Splunk S2S connector is a great example. Suppose organizations already have hundreds or thousands of universal forwarders (UF) and heavy forwarders (HF). In that case, this approach allows users to cost-effectively and reliably read data from Splunk Forwarders to Kafka. It enables users to forward data from universal forwarders into a Kafka topic to unlock the analytical capabilities of the data:

For more details about this use case, check out my blog “Apache Kafka in Cybersecurity for SIEM / SOAR Modernization“.

Don’t Design for Data at Rest to Reverse it!

Good enterprise architecture should never have the goal to plan for reverse ETL from the beginning! It is only needed in brownfield architecture where the data is stored at rest instead of building an event-based architecture for real-time and batch data sinks. Reverse ETL enables Shadow IT and spaghetti architectures. Event streaming enables data integration in real-time by nature.

Nevertheless, Reverse ETL tooling is appropriate for brownfield approaches (ideally via continuous change data capture, not recurring SQL) or if business users need a simple, intuitive user interface. Hence, event streaming and Reverse ETL are complementary. In the same way, event streaming and data warehouses/data lakes are complementary. Read this if you want to learn more: “Serverless Kafka in a Cloud-native Data Lake Architecture“.

What is your point of view on this new ETL buzzword? How do you integrate it into an enterprise architecture? What are your experiences and opinions? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to Use Reverse ETL and when it is an Anti-Pattern appeared first on Kai Waehner.

Apache Kafka in the Insurance Industry

Kai Waehner — Mon, 07 Jun 2021 12:54:20 +0000

The rise of data in motion in the insurance industry is visible across all lines of business, including life, healthcare, travel, vehicle, and others. Apache Kafka changes how enterprises rethink data. This blog post explores use cases and architectures for event streaming. Real-world examples from Generali, Centene, Humana, and Tesla show innovative insurance-related data integration and stream processing in real-time.

Digital Transformation in the Insurance Industry

Most insurance companies have similar challenges:

Challenging market environments
Stagnating economy
Regulatory pressure
Changes in customer expectations
Proprietary and monolithic legacy applications
Emerging competition from innovative insurtechs
Emerging competition from other verticals that add insurance products

Only a good transformation strategy guarantees a successful future for traditional insurance companies. Nobody wants to become the next Nokia (mobile phone), Kodak (photo camera), or BlockBuster (video rental). If you fail to innovate in time, you are done.

Real-time beats slow data. Automation beats manual processes. The combination of these two game changers creates completely new business models in the insurance industry. Some examples:

Claims processing including review, investigation, adjustment, remittance or denial of the claim
Claim fraud detection by leveraging analytic models trained with historical data
Omnichannel customer interactions including a self-service portal and automated tools like NLP-powered chatbots
Risk prediction based on lab testing, biometric data, claims data, patient-generated health data (depending on the laws of a specific country)

These are just a few examples.

The shift to real-time data processing and automation is key for many other use cases, too. Machine learning and deep learning enable the automation of many manual and error-prone processes like document and text processing.

The Need for Brownfield Integration

Traditional insurance companies usually (have to) start with brownfield integration before building new use cases. The integration of legacy systems with modern application infrastructures and the replication between data centers and public or private cloud-native infrastructures are a key piece of the puzzle.

Common integration scenarios use traditional middleware that is already in place. This includes MQ, ETL, ESB, and API tools. Kafka is complementary to these middleware tools:

More details about this topic are available in the following two posts:

Greenfield Applications at Insurtech Companies

Insurtechs have a huge advantage: They can start greenfield. There is no need to integrate with legacy applications and monolithic architectures. Hence, some traditional insurance companies go the same way. They start from scratch with new applications instead of trying to integrate old and new systems.

This setup has a huge architectural advantage: There is no need for traditional middleware as only modern protocols and APIs need to be integrated. No monolithic and proprietary interfaces such as Cobol, EDI, or SAP BAPI/iDoc exist in this scenario. Kafka makes new applications agile, scalable, and flexible with open interfaces and real-time capabilities.

Here is an example of an event streaming architecture for claim processing and fraud detection with the Kafka ecosystem:

Real-World Deployments of Kafka in the Insurance Industry

Let’s take a look at a few examples of real-world deployments of Kafka in the insurance industry.

Generali – Kafka as Integration Platform

Generali is one of the top ten largest insurance companies in the world. The digital transformation from Generali Switzerland started with Confluent as a strategic integration platform. They started their journey by integrating with hundreds of legacy systems like relational databases. Change Data Capture (CDC) pushes changes into Kafka in real-time. Kafka is the central nervous system and integration platform for the data. Other real-time and batch applications consume the events.

From here, other applications consume the data for further processing. All applications are decoupled from each other. This is one of the unique benefits of Kafka compared to other messaging systems. Real decoupling and domain-driven design (DDD) are not possible with traditional MQ systems or SOAP / REST web services.

Design Principles of Generali’s Cloud-Native Architecture

The key design principles for the next-generation platform at Generali include agility, scalability, cloud-native, governance, data, and event processing. Hence, Generali’s architecture is powered by a cloud-native infrastructure leveraging Kubernetes and Apache Kafka:

The following integration flow shows the scalable microservice architecture of Generali. The streaming ETL process includes data integration and data processing in decoupled environments:

Centene – Integration and Data Processing at Scale in Real-Time

Centene is the largest Medicaid and Medicare Managed Care Provider in the US. Their mission statement is “transforming the health of the community, one person at a time”. The healthcare insurer acts as an intermediary for both government-sponsored and privately insured health care programs.

Centene’s key challenge is growth. Many mergers and acquisitions require a scalable and reliable data integration platform. Centene chose Kafka due to the following capabilities:

highly scalable
high autonomy and decoupling
high availability and data resiliency
real-time data transfer
complex stream processing

Centene’s architecture uses Kafka for data integration and orchestration. Legacy databases, MongoDB, and other applications and APIs leverage the data in real-time, batch, and request-response:

Swiss Mobiliar – Decoupling and Orchestration

Swiss Mobiliar (Schweizerische Mobiliar aka Die Mobiliar) is the oldest private insurer in Switzerland.

Event Streaming with Kafka supports various use cases at Swiss Mobiliar:

Orchestrator application to track the state of a billing process
Kafka as database and Kafka Streams for data processing
Complex stateful aggregations across contracts and re-calculations
Continuous monitoring in real-time

Their architecture shows the decoupling of applications and orchestration of events:

Also, check out the on-demand webinar with Mobiliar and Spoud to learn more about their Kafka usage.

Humana – Real-Time Integration and Analytics

Humana Inc. is a for-profit American health insurance company. In 2020, the company ranked 52 on the Fortune 500 list.

Humana leverages Kafka for real-time integration and analytics. They built an interoperability platform to transition from an insurance company with elements of health to truly a health company with elements of insurance.

Here are the key characteristics of their Kafka-based platform:

Consumer-centric
Health plan agnostic
Provider agnostic
Cloud resilient and elastic
Event-driven and real-time

Kafka integrates conversations between the users and the AI platform powered by IBM Watson. The platform captures conversational flows and processes them with natural language processing (NLP) – a deep learning concept.

Some benefits of the platform:

Adoption of open standards
Standardized integration partners
In-workflow integration
Event-driven for real-time patient interactions
Highly scalable

freeyou – Stateful Streaming Analytics

freeyou is an insurtech for vehicle insurance. Streaming analytics for real-time price adjustments powered by Kafka and ksqlDB enable new business models. Their marketing slogan shows how they innovate and differentiate from traditional competitors:

“We make insurance simple. With our car insurance, we make sure that you stay mobile in everyday life – always and everywhere. You can take out the policy online in just a few minutes and manage it easily in your freeyou customer account. And if something should happen to your vehicle, we’ll take care of it quickly and easily.”

A key piece of freeyou’s strategy is a great user experience and automatic price adjustments in real-time in the backend. Obviously, Kafka and its stream processing ecosystem are a perfect fit here.

As discussed above, the huge advantage of an insurtech is the possibility to start from the greenfield. No surprise that freeyou’s architectures leverage cutting-edge design and technology. Kafka and ksqlDB enable streaming analytics within the pricing engine, recalculation modules, and other applications:

Tesla – Carmaker and Utility Company, now also Car Insurer

Everybody knows: Tesla is an automotive company that sells cars, maintenance, and software upgrades.

More and more people know: Tesla is a utility company that sells energy infrastructure, solar panels, and smart home integration.

Almost nobody knows: Tesla is a car insurer for their car fleet (limited to a few regions in the early phase). That is the obvious next step if you already collect all the telemetry data from all your cars on the street.

Tesla has built a Kafka-based data platform infrastructure “to support millions of devices and trillions of data points per day”. Tesla showed an interesting history and evolution of their Kafka usage at a Kafka Summit in 2019:

Tesla’s infrastructure heavily relies on Kafka.

There is no public information about Tesla using Kafka for their specific insurance applications. But at a minimum, the data collection from the cars and parts of the data processing relies on Kafka. Hence, I thought this is a great example to think about innovation in car insurance.

Tesla: “Much Better Feedback Loop”

Elon Musk made clear: “We have a much better feedback loop” instead of being statistical like other insurers. This is a key differentiator!

There is no doubt that many vehicle insurers will use fleet data to calculate insurance quotes and provide better insurance services. For sure, some traditional insurers will partner with vehicle manufacturers and fleet providers. This is similar to smart city development, where several enterprises partner to build new innovative use cases.

Connected vehicles and V2X (Vehicle to X) integrations are the starting point for many new business models. No surprise: Kafka plays a key role in the connected vehicles space (not just for Tesla).

Many benefits are created by a real-time integration pipeline:

Shift from human experts to automation driven by big data and machine learning
Real-time telematics data from all its drivers’ behavior and the performance of its vehicle technology (cameras, sensors, …)
Better risk estimation of accidents and repair costs of vehicles
Evaluation of risk reduction through new technologies (autopilot, stability control, anti-theft systems, bullet-resistant steel)

For these reasons, event streaming should be a strategic component of any next-generation insurance platform.

Slide Deck: Kafka in the Insurance Industry

The following slide deck covers the above discussion in more detail:

Kafka Changes How to Think Insurance

Apache Kafka changes how enterprises rethink data in the insurance industry. This includes brownfield data integration scenarios and greenfield cutting-edge applications. The success stories from traditional insurance companies such as Generali and insurtechs such as freeyou prove that Kafka is the right choice everywhere.

Kafka and its ecosystem enable data processing at scale in real-time. Real decoupling allows the integration between monolith legacy systems and modern cloud-native infrastructure. Kafka runs everywhere, from edge deployments to multi-cloud scenarios.

What are your experiences and plans for low latency use cases? What use case and architecture did you implement? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Insurance Industry appeared first on Kai Waehner.

Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure

Kai Waehner — Sun, 23 May 2021 08:06:59 +0000

Many mission-critical use cases require low latency data processing. Running these workloads close to the edge is mandatory if the applications cannot run in the cloud. This blog post explores architectures for low latency deployments leveraging a combination of cloud-native infrastructure at the edge, such as AWS Wavelength, 5G networks from Telco providers, and event streaming with Apache Kafka to integrate and process data in motion.

The blog post is structured as follows:

Definition of “low latency data processing” and the relation to Apache Kafka
Cloud-native infrastructure for low latency computing
Low latency mission-critical use cases for Apache Kafka and its relation to analytical workloads
Example for a hybrid architecture with AWS Wavelength, Verizon 5G, and Confluent

Low Latency Data Processing

Let’s begin with a definition. “Real-time” and “low latency” are terms that different industries, vendors, and consultants use very differently.

What is real-time and low latency data processing?

For the context of this blog, real-time data processing with low latency means processing low or high volumes of data in ~5 to 50 milliseconds end-to-end. On a high level, this includes three parts:

Consume events from one or more data sources, either directly from a Kafka client or indirectly via a gateway or proxy.
Process and correlate events from one or more data sources, either stateless or stateful, with the internal state in the application and stream processing features like sliding windows.
Produce events to one or more data sinks, either directly from a Kafka client or indirectly via a gateway or proxy. The data sinks can include the data sources and/or other applications.
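The three parts above can be sketched as a small consume, process, produce loop. This is an in-memory illustration, not Kafka Streams or ksqlDB code: the source and sink are plain Python lists, events are assumed to arrive in timestamp order, and the card-payment fields, the one-second window, and the threshold of three events are hypothetical values chosen for the example.

```python
from collections import defaultdict, deque

WINDOW_MS = 1000  # assumed sliding window size
THRESHOLD = 3     # assumed number of events per window that triggers an alert


class SlidingWindowCounter:
    """Counts events per key within the last window_ms milliseconds."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self._timestamps = defaultdict(deque)  # internal state of the application

    def add(self, key, ts_ms):
        window = self._timestamps[key]
        window.append(ts_ms)
        # Evict timestamps that have fallen out of the window.
        while window and window[0] <= ts_ms - self.window_ms:
            window.popleft()
        return len(window)


def run_pipeline(source, sink):
    counter = SlidingWindowCounter(WINDOW_MS)
    for event in source:                                    # 1. consume
        count = counter.add(event["card"], event["ts_ms"])  # 2. process (stateful)
        if count >= THRESHOLD:
            sink.append(                                    # 3. produce
                {"card": event["card"], "ts_ms": event["ts_ms"], "count": count}
            )


payments = [
    {"card": "A", "ts_ms": 0},
    {"card": "B", "ts_ms": 100},
    {"card": "A", "ts_ms": 400},
    {"card": "A", "ts_ms": 800},
    {"card": "A", "ts_ms": 1900},
]
alerts = []
run_pipeline(payments, alerts)
```

Keeping the window state inside the processing application, rather than querying an external database for each event, is one reason a single streaming infrastructure can stay within a tight latency budget.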

These parts are the same as for “traditional event streaming use cases”. However, for low latency use cases with zero downtime and zero data loss, the architecture often looks different to reach the defined goals and SLAs. A single infrastructure is usually a better choice than a best-of-breed approach with many different frameworks or products. That’s where the Kafka ecosystem shines! The Kafka vs. MQ/ETL/ESB/API blog explores this discussion in more detail.

Low latency = soft real-time; NOT hard real-time

Make sure to understand that real-time in the IT world (that includes Kafka) is not hard real-time. Latency spikes and non-deterministic network behavior exist. The chosen software or framework does not matter. Hence, in the IT world, real-time means soft real-time. Contrarily, in the OT world and Industrial IoT, real-time means zero latency and deterministic networks. This is embedded software for sensors, robots, or cars.

For more details, read the blog post “Kafka is NOT hard-real-time“.

Kafka support for low latency processing

Apache Kafka provides very low end-to-end latency for large volumes of data. This means the amount of time it takes for a record that is produced to Kafka to be fetched by the consumer is short.

For example, detecting fraud for online banking transactions has to happen in real-time to deliver business value without adding more than 50 to 100 ms of overhead to each transaction to maintain a good customer experience.

Here is the technical architecture for end-to-end latency with Kafka:

Latency objectives are expressed as both target latency and the importance of meeting this target. For instance, a latency objective says: “I would like to get 99th percentile end-to-end latency of 50 ms from Kafka.” The right Kafka configuration options need to be optimized to achieve this. The blog post “99th Percentile Latency at Scale with Apache Kafka” shares more details.
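To make such a latency objective concrete, here is a small sketch that checks measured end-to-end latencies against a percentile target. It uses the nearest-rank percentile definition. The function names and the sample data are assumptions for illustration, and collecting the samples (for instance, by comparing produce and consume timestamps) is left out.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_latency_objective(
    samples_ms: Sequence[float], target_ms: float = 50.0, pct: float = 99.0
) -> bool:
    """True if the pct-th percentile of the samples is within the target."""
    return percentile(samples_ms, pct) <= target_ms


if __name__ == "__main__":
    # 99 fast records and one slow outlier: p99 still meets a 50 ms target.
    samples = [10.0] * 99 + [200.0]
    print(percentile(samples, 99.0), meets_latency_objective(samples))
```

The example shows why the percentile matters: a single outlier does not break a p99 objective, while a p100 (maximum) objective on the same data would fail.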

After exploring what low latency and real-time data processing mean in Kafka’s context, let’s now discuss the infrastructure options.

Infrastructure for Low Latency Data Processing

Low latency always requires a short distance between data sources, data processing platforms, and data sinks due to physics. Latency optimization is relatively straightforward if all your applications run in the same public cloud. Low end-to-end latency gets much more difficult as soon as some software, mobile apps, sensors, machines, etc., run elsewhere. Think about connected cars, mobile apps for mobility services like ride-hailing, location-based services in retail, machines/robots in factories, etc.
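The physics argument can be put into numbers with a back-of-the-envelope sketch. It assumes that light travels through optical fiber at roughly 200 km per millisecond (about two thirds of the speed of light in a vacuum) and ignores routing, queuing, and processing time, so real round trips are slower. The function names are hypothetical.

```python
# Approximation: light in optical fiber covers roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound for a network round trip over fiber, propagation delay only."""
    if distance_km < 0:
        raise ValueError("distance must not be negative")
    return 2 * distance_km / FIBER_KM_PER_MS


def fits_latency_budget(distance_km: float, budget_ms: float, processing_ms: float) -> bool:
    """True if propagation delay plus processing time stays within the budget."""
    return min_round_trip_ms(distance_km) + processing_ms <= budget_ms


if __name__ == "__main__":
    for km in (50, 1000, 6000):
        print(f"{km} km -> at least {min_round_trip_ms(km)} ms round trip")
```

A data center 1,000 km away already costs at least 10 ms per round trip before any processing happens. With a budget in the range of ~5 to 50 ms end-to-end, a few such hops consume most of it.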

The remote data center or remote cloud region cannot provide low latency data processing! The focus of this post is software that has to provide low end-to-end latency outside a central data center or public cloud. This is where edge computing and 5G networks come into play.

Edge infrastructure for low latency data processing

As for real-time and low latency, we need to define the term first, as everyone uses it differently. When I talk about the edge in the context of Kafka, it means:

Edge is NOT a regular data center or cloud region, but limited compute, storage, network bandwidth.
Edge can be a regional cloud-native infrastructure enabled for low-latency use cases – often provided by Telco enterprises in conjunction with cloud providers.
Kafka clients AND the Kafka broker(s) are deployed here, not just the client applications.
Often 100+ locations, like restaurants, coffee shops, or retail stores, or even embedded into 1000s of devices or machines.
Offline business continuity, i.e., the workloads continue to work even if there is no connection to the cloud.
Low-footprint and low-touch, i.e., Kafka can run as a normal highly available cluster or as a single broker (no cluster, no high availability); often shipped “as a preconfigured box” in OEM hardware (e.g., Hivecell).
Hybrid integration, i.e., most use cases require uni- or bidirectional communication with a remote Kafka cluster in a data center or the cloud.

Check out my infrastructure checklist for Apache Kafka at the edge and use cases for Kafka at the edge across industries for more details.

Mobile Edge Compute / Multi-access Edge Compute (MEC)

In addition to edge computing, a few industries (especially everyone related to the Telco sector) use the terms Mobile Edge Compute / Multi-access Edge Compute (MEC) to describe use cases around edge computing, low latency, 5G, and data processing.

MEC is an ETSI-defined network architecture concept that enables cloud computing capabilities and an IT service environment at the edge of the cellular network and, more generally, at the edge of any network. The basic idea behind MEC is that by running applications and performing related processing tasks closer to the cellular customer, network congestion is reduced, and applications perform better.

MEC technology is designed to be implemented at the cellular base stations or other edge nodes. It enables flexible and rapid deployment of new applications and services for customers. Combining elements of information technology and telecommunications networking, MEC also allows cellular operators to open their radio access network (RAN) to authorized third parties, such as application developers and content providers.

5G and cloud-native Infrastructure are a key piece of a MEC infrastructure!

Low-latency data processing outside a cloud region requires a cloud-native infrastructure and 5G networks. Let’s explore this combination in more detail.

5G infrastructure for low latency and high throughput SLAs

On a high level from a use case perspective, it is important to understand that 5G is much more than just higher speed and lower latency:

Public 5G telco infrastructure: That’s what Verizon, AT&T, T-Mobile, Dish, Vodafone, Telefonica, and all the other telco providers talk about in their TV spots. The end consumer gets higher download speeds and lower latency (at least in theory). This infrastructure integrates vehicles (e.g., cars) and devices (e.g., mobile phones) to the 5G network (V2N).
Private 5G campus networks: That’s what many enterprises are most interested in. The enterprise can set up private 5G networks with guaranteed quality of service (QoS) using acquired 5G slices from the 5G spectrum. Enterprises work with telco providers, telco hardware vendors, and sometimes also with cloud providers to provide cloud-native infrastructure (e.g., AWS Outposts, Azure Edge Zones, Google Anthos). This infrastructure is used similarly to the public 5G but deployed, e.g., in a factory or hospital. The trade-offs are guaranteed SLAs and increased security vs. higher cost. Lufthansa Technik and Vodafone’s standalone private 5G campus network at the aircraft hangar is a great example of various use cases like maintenance via video streaming and augmented reality.
Direct connection between devices: That’s for interlinking the communication between two or more vehicles (V2V) or vehicles and infrastructure (V2I) via unicast or multicast. There is no need for a network hop to the cell tower due to using a 5G technique called 5G sidelink communications. This enables new use cases, especially in safety-critical environments (e.g., autonomous driving) where Bluetooth, Wi-Fi, and similar network communications do not work well for different reasons.

Cloud-native infrastructure

Cloud-native infrastructure provides capabilities to build applications in an elastic, scalable, and automated way. Software development concepts like microservices, DevOps, and containers usually play a crucial role here.

A fantastic example is Dish Network in the US. Dish builds a brand new 5G network completely on cloud-native AWS infrastructure with cloud-native 1st and 3rd party software. Thus, even the network providers – where enterprises build their applications – build the underlying infrastructure this way.

Cloud-native infrastructure is required in the public cloud (where it is the norm) and at the edge. Flexibility for agile development and deployment of applications is only possible this way. Hence, technologies such as Kubernetes and on-premise solutions from cloud providers are adopted more and more to achieve this goal.

The combination of 5G and cloud-native infrastructure enables building low latency applications for data processing everywhere.

Software for Low Latency Data Processing

5G and cloud-native infrastructure provide the foundation for building mission-critical low latency applications everywhere. Let’s now talk about the software part, which means event streaming with Kafka.

Why event streaming with Apache Kafka for low latency?

Apache Kafka provides a complete software stack for real-time data processing, including:

Messaging (real-time pub/sub)
Storage (caching, backpressure handling, decoupling)
Data integration (IoT data, legacy platforms, modern microservices, and databases)
Stream processing (stateless/stateful correlation of data).

This is super important because simplicity and cost-efficient operations matter much more at the edge than in a public cloud infrastructure where various SaaS services can be glued together.

Hence, Kafka is uniquely positioned to run mission-critical and analytics workloads at the edge on cloud-native infrastructure via 5G networks. Bi-directional replication to “regular” data centers or public clouds for integration with other systems is also possible via the Kafka protocol.

Use Cases for Low Latency Data processing with Apache Kafka

Low latency and real-time data processing are crucial for many use cases across industries. Hence, it is no surprise that Kafka plays a key role in many architectures – whether the infrastructure runs at the edge or in a nearby data center or cloud.

Mobile Edge Compute / Multi-access Edge Compute (MEC) use cases for Kafka across industries

Let’s take a look at a few examples:

Telco: Infrastructure like cloud-native 5G networks, OSS applications, and integration with BSS and OTT services needs to integrate, orchestrate, and correlate huge volumes of data in real-time.
Manufacturing: Predictive maintenance, quality assurance, real-time locating systems (RTLS), and other shop floor applications are only effective and valuable with stable, continuous data processing.
Mobility Services: Ride-hailing, car sharing, or parking services can only provide a great customer experience if the events from thousands of regional end-users are processed in real-time.
Smart City: Cars from various carmakers, infrastructures such as traffic lights, smart buildings, and many other things need to get real-time information from a central data hub to improve safety and enable new, innovative customer experiences.
Media: Interactive live video streams, real-time interactions, a hyper-personalized experience, augmented reality (AR) and virtual reality (VR) applications for training/maintenance/customer experience, and real-time gaming can only work well with stable, high throughput, and low latency.
Energy: Utilities, oil rigs, solar parks, and other energy upstream/distribution/downstream infrastructures are supercritical environments and very expensive. Every second counts for safety and efficiency/cost reasons. Optimizations combine data from all machines in a plant to achieve greater efficiency – not just optimizing one unit but for the entire system.
Retail: Location-based services for better customer experience and cross-/upselling need notifications while customers are looking at a product or in front of the checkout.
Military: Border control, surveillance, and other location-based applications only work efficiently with low latency.
Cybersecurity: Continuous monitoring and signal processing for threat detection and proactive prevention are fundamental for any security operation center (SOC) and SIEM/SOAR implementation.

For a concrete example, check out my blog “Building a Smart Factory with Apache Kafka and 5G Campus Networks“.

NOT every use case requires low latency or real-time

Real-time data in motion beats data at rest in databases or data lakes in most scenarios. However, not every use case can be or needs to be real-time. In those cases, low latency networks and communication are not required. A few examples:

Reporting (traditional business intelligence)
Batch analytics (processing high volumes of data in a bundle, for instance, Hadoop and Spark’s map-reduce, shuffling, and other data processing only make sense in batch mode)
Model training as part of a machine learning infrastructure (while model scoring and monitoring often require real-time predictions, the model training is batch in almost all currently available ML algorithms).

These use cases can be outsourced to a remote data center or public cloud. Low latency networking in terms of milliseconds does not matter and likely increases the infrastructure cost. For that reason, most architectures are hybrid to separate low latency from analytics workloads.

Let’s now take a concrete example after all the theory in the last sections.

Hybrid Architecture for Critical Low Latency and Analytical Batch Workloads

Many enterprises I talk to don’t have and don’t want to build their own infrastructure at the edge. Cloud providers understand this pain and started rolling out offerings to provide cloud-native infrastructure close to the customer’s sites. AWS Outposts, Azure Edge Zones, Google Anthos exist for this reason. This solves the problem of providing cloud-native infrastructure.

But what about low latency?

AWS is once again the first to build a new product category: AWS Wavelength is a service that enables you to deliver ultra-low latency applications for 5G devices. It is built on top of AWS Outposts. AWS works with Telco providers like Verizon, Vodafone, KDDI, or SK Telecom to build this offering. A win-win-win: Cloud-native + low latency + no need to build your own data centers at the edge.

This is the foundation for building low latency applications at the edge for mission-critical workloads, plus bi-directional integration with the regular public cloud region for analytics workloads and integration with other cloud applications.

Let’s see what this looks like in a real example.

Use case: Energy Production and distribution

Energy production and distribution are perfect examples. They require reliability, flexibility, sustainability, efficiency, security, and safety. These are perfect ingredients for a hybrid architecture powered by cloud-native infrastructure, 5G networks, and event streaming.

The energy sector usually separates analytical capabilities (in the data center or cloud) and low-latency computing for mission-critical workloads (at the edge). Kafka became a critical component for various energy use cases.

For more details, check out the blog post “Apache Kafka for Smart Grid, Utilities and Energy Production” which also covers real-world examples from EON, Tesla, and Devon Energy.

Architecture with AWS Wavelength, Verizon 5G, and Confluent

The concrete example uses:

AWS Public Cloud for analytics workloads
Confluent Cloud for event streaming in the cloud and integration with 1st party (e.g., AWS S3 and Amazon Redshift) and 3rd party SaaS (e.g., MongoDB Atlas, Snowflake, Salesforce CRM)
AWS Wavelength with Verizon 5G for low latency workloads
Confluent Platform with Kafka Connect and ksqlDB for low latency computing in the Wavelength 5G zone
Confluent Cluster Linking to glue together the Wavelength zone and the public AWS region using the native Kafka protocol for bi-directional replication in real-time

The following diagram shows the same architecture from the perspective of the Wavelength zone where the low latency processing happens:

Implementation: Hybrid data processing with Kafka/Confluent, AWS Wavelength, and Verizon 5G

Diagrams are nice. But a real implementation is even better to demonstrate the value of low latency computing close to the edge, plus the integration with the edge devices and public cloud. My colleague Joseph Morais had the lead in implementing a low-latency Kafka scenario with infrastructure provided by AWS and Verizon:

We implemented a use case around real-time analytics with Machine Learning. A single data pipeline provides end-to-end integration in real-time across locations. The data comes from edge locations. The low latency processing happens in the AWS Wavelength zone. This includes data integration, preprocessing like filtering/aggregations, and model scoring for anomaly detection.
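To illustrate the kind of processing that runs in the low latency zone, here is a minimal sketch of the filter and scoring steps. It is not the demo's actual code or model: a simple z-score check stands in for the trained anomaly detection model, and the function names, the threshold, and the rule that negative readings are invalid are assumptions for this example.

```python
from statistics import mean, pstdev
from typing import Iterable, List, Optional


def preprocess(readings: Iterable[Optional[float]]) -> List[float]:
    """Filter step: drop missing values and (assumed invalid) negative readings."""
    return [r for r in readings if r is not None and r >= 0]


def score_anomalies(values: List[float], threshold: float = 2.0) -> List[bool]:
    """Stand-in 'model': flag values more than `threshold` standard deviations
    away from the mean of the batch."""
    if len(values) < 2:
        return [False] * len(values)
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


if __name__ == "__main__":
    raw = [10.0, None, 10.0, -1.0] + [10.0] * 7 + [100.0]
    clean = preprocess(raw)
    for value, is_anomaly in zip(clean, score_anomalies(clean)):
        if is_anomaly:
            print(f"anomaly detected: {value}")
```

In a hybrid setup like the one described here, only the scoring has to run at the edge. Training the model that replaces the z-score stand-in is a batch job that can run in the cloud.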

Cluster Linking (a Kafka-native built-in replication feature) replicates the relevant data to Confluent Cloud in the local AWS region. The cloud is used for batch use cases such as model training with Amazon SageMaker.

This demo demonstrates a realistic hybrid end-to-end scenario to combine mission-critical low latency and analytics batch workloads.

Curious about the relation between Kafka and Machine Learning? I wrote various blogs. One good starter: “Machine Learning and Real-Time Analytics in Apache Kafka Applications“.

Last mile integration: Direct Kafka connection vs gateway / bridge (MQTT / HTTP)?

The last mile integration is an important aspect. How do you integrate “the last mile”? Examples include mobile apps (e.g., ride-hailing), connected vehicles (e.g., predictive maintenance), or machines (e.g., quality assurance for the production line).

This is worth a longer discussion in its own blog post, but let’s do a summary here:

Kafka was not built for bad networks. And Kafka was not built for tens of thousands of connections. Hence, it is pretty straightforward to decide. Option 1 is a direct connection with a Kafka client (using Kafka client APIs for Java, C++, Go, etc.). Option 2 is a scalable gateway or bridge (like MQTT or HTTP Proxy). When to use which one?

Use a direct connection via a Kafka client API if you have a stable network and only a limited number of connections (usually not higher than 1000 or so).
Use a gateway or bridge if you have a bad network infrastructure and/or tens of thousands of connections.
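The rule of thumb above can be written down as a tiny decision helper. This is only a sketch: the function name is hypothetical, and the default limit of 1,000 connections mirrors the rough guidance in this post rather than a hard Kafka limit. The right threshold depends on your cluster sizing and client behavior.

```python
def choose_last_mile_integration(
    stable_network: bool,
    connections: int,
    direct_connection_limit: int = 1000,  # rough guidance, not a hard limit
) -> str:
    """Pick an integration style for the last mile based on network and scale."""
    if connections < 0:
        raise ValueError("connections must not be negative")
    if stable_network and connections <= direct_connection_limit:
        return "direct Kafka client"
    return "gateway or bridge (e.g., MQTT or HTTP proxy)"


if __name__ == "__main__":
    # A few hundred machines on a factory network vs. a fleet of connected cars.
    print(choose_last_mile_integration(stable_network=True, connections=300))
    print(choose_last_mile_integration(stable_network=False, connections=50_000))
```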

The blog series “Use Case and Architectures for Kafka and MQTT” gives you some ideas about use cases that require a bridge or gateway, for instance, connected cars and mobility services. But keep it as simple as possible. If a direct connection works for your use case, why add yet another technology with all its implications regarding complexity and cost?

Low Latency Data Processing Requires the Right Architecture

Low latency data processing is crucial for many use cases across industries. Processing data close to the edge is necessary if the applications cannot run in the cloud. Dedicated cloud-native infrastructure such as AWS Wavelength leverages 5G networks to provide the infrastructure. Event streaming with Apache Kafka provides the capabilities to implement edge computing and the integration with the cloud.

The post Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure appeared first on Kai Waehner.

Comparison of Open Source Apache Kafka vs Vendors including Confluent, Cloudera, Red Hat, Amazon MSK

Kai Waehner — Tue, 20 Apr 2021 07:49:39 +0000

Apache Kafka became the de facto standard for event streaming. The open-source community is huge. Various vendors added Kafka and related tooling to their offerings or provide a Kafka cloud service. This blog post uses the car analogy – from the motor engine to the self-driving car – to explore the different Kafka offerings available on the market. I also cover a few other vehicles, meaning (partly) Kafka-compatible technologies. The goal is not a feature-by-feature comparison (that would be outdated the day after the publication). Instead, the intention is to educate about the different deployment models, product strategies, and trade-offs from the available options.

Disclaimer: I work for Confluent. However, the post is not about comparing features but explaining the concepts behind the alternatives. I talk to enterprises across the globe every week. I can assure you that many people I talk to are not aware of or are misled about what you read in the following sections. Hence, I hope that the following helps you to make the right decision. Either choose to run open-source Apache Kafka or one of the various commercial Kafka offerings, or even a combination of both.

UPDATE (August 2022): This blog post was written before Amazon MSK Serverless was released. The below is still accurate and worth a read for comparing Kafka products and cloud services. Additionally, please check out the article “When NOT to choose Amazon MSK Serverless for Apache Kafka?“

Apache Kafka Components and Use Cases

The goal is not to introduce Kafka here. The minimum you should know is that Kafka is NOT just a messaging layer for data ingestion into a data lake. This is just a fraction of today’s usages.

Kafka is an open-source framework under Apache 2.0 license. It provides a combination of messaging, storage, processing, and integration of high volumes of data at scale, in real-time, and in a fault-tolerant way. That’s what makes Kafka unique compared to other MQ, ETL, ESB, and API platforms.

Kafka is deployed in production for various use cases across industries. This includes analytical and mission-critical workloads. Different deployments require different SLAs. You should always ask yourself what happens if the Kafka infrastructure is in trouble. What are your RTO (Recovery Time Objective) and RPO (Recovery Point Objective)? Or in other words: How much data is okay to lose? How much downtime is acceptable? Keep these questions in mind when you start comparing the options for your Kafka project!
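The two questions translate into a simple check you can run against any candidate setup. The sketch below is an illustration only: the class and function names are hypothetical, and the worst-case replication lag and failover time are numbers you would have to measure or get from your vendor's SLA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo_seconds: float  # how much data (measured in time) is okay to lose
    rto_seconds: float  # how much downtime is acceptable


def meets_objectives(
    objectives: RecoveryObjectives,
    worst_case_replication_lag_seconds: float,
    worst_case_failover_seconds: float,
) -> bool:
    """True if a setup's worst-case data loss and downtime fit RPO and RTO."""
    return (
        worst_case_replication_lag_seconds <= objectives.rpo_seconds
        and worst_case_failover_seconds <= objectives.rto_seconds
    )


if __name__ == "__main__":
    payments = RecoveryObjectives(rpo_seconds=0, rto_seconds=60)
    # Asynchronous replication with a few seconds of lag cannot give RPO = 0.
    print(meets_objectives(payments, worst_case_replication_lag_seconds=5,
                           worst_case_failover_seconds=30))
```

An analytics pipeline might accept minutes of data loss and hours of downtime, while a payment platform might require zero for both. The same check with different objectives leads to very different deployment choices.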

Kafka is the De Facto Standard API for Event Streaming like S3 API for Object Storage

Apache Kafka is mainstream! The latest proof: Check out the new ThoughtWorks Technology Radar: “Kafka API without Kafka“:

Kafka became the de facto event streaming API, similar to how the S3 API became the de facto standard for object storage. Actually, the situation is even better for the Kafka API as the S3 API is a proprietary protocol from AWS. In contrast, the Kafka API and protocol are open source under Apache 2.0 license.

Check out the blog “Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage” for more details.

Let’s take a look at a few very different Kafka alternatives available today:

Open-source Apache Kafka from the Apache website under Apache 2.0 license
Self-managed vendor offerings from Confluent, Cloudera, Red Hat, Amazon MSK, and many more
Fully-managed cloud offerings such as Confluent Cloud
Partly protocol-compatible products such as RedPanda for embedded / WebAssembly (WASM) use cases
Partly protocol-compatible SaaS offerings like Azure EventHubs

That’s a lot of options. So, how do you make a Kafka comparison to choose the right one? Before we go into more detail, let’s explore how complex Kafka actually is and when you do have to care about this at all.

Should you care how complex or heavyweight your event streaming technology is?

Complexity matters (only) if you need to operate the infrastructure by yourself. The beauty of SaaS is that you just consume the service and focus on your business problems. For instance, the AWS S3 object storage is a simple API with a fully managed service under the hood. You do not need to worry about operations or monitoring. You just use the cloud service.

Having said this, it is a little bit strange that ThoughtWorks mentions the barriers and complexity of Kafka but then refers to the Pulsar wrapper. That is an (immature) single class mapping implementation that only maps a small part of Kafka’s protocol (here is the producer wrapper as an example). Developers can use that wrapper to move data between Kafka clients and Pulsar brokers. However, Pulsar has a much more complex three-tier distributed architecture with ZooKeeper, BookKeeper, and Pulsar clusters. What is the benefit here? Is this really what you want to do in mission-critical workloads? Why? Please let me know if you seriously consider using such a wrapper architecture. Also, please read my post about the “Myths of Kafka vs. Pulsar“. A lot of arguments like the Kafka wrapper are simply just marketing and not usable for real-world projects!

Therefore, when I think about using the Kafka API without operating Kafka, then I have fully managed SaaS offerings such as Confluent Cloud or Azure Event Hubs in my mind.

Having said this, a fully managed cloud service is not always an option. For instance, Kafka at the edge is the new black. Plenty of use cases exist for a single broker or highly-available Kafka clusters at the edge.

But even if you want or need to operate Kafka by yourself: With KIP-500 and the removal of ZooKeeper, it gets easier and more lightweight than ever before. Many of the arguments for moving to a more “lightweight alternative” no longer apply. There might be good reasons to choose something like RedPanda. But the main argument of having a simpler and more lightweight deployment no longer holds. Check out this video showing Kafka without ZooKeeper.

How to Choose the Right Kafka Distribution or Cloud Service?

So, how to make a comparison to find out which Kafka distribution or cloud service is the right one for your project?

The answer is simpler than you might think: The ultimate goal is to focus on and solve your business problem.

How do you do that? By implementing business logic. Ideally, you don’t have to worry about infrastructure, operations, security, scalability, reliability, and non-business characteristics. Hence, SaaS with fully managed Kafka should be the first choice.

Unfortunately, SaaS is not always possible or the best option for many reasons:

Missing features
Technical limitations
Cost
Security requirements
On-premise or edge use cases

Therefore, we need to go one step back and understand what options you have to deploy and operate Kafka. Understanding these concepts without all the marketing fluff from the vendors is crucial to make the right decision!

The Kafka Car: An Analogy for a Product Comparison

It is often easier to compare technology by using an analogy from real life. Something everybody understands. No matter what industry you are in. No matter how technical you are. At first, I thought I would use the analogy of pizza, including self-made pizza, pizza ingredients, restaurants, delivery services, and other related topics. But pizza is used so often in the IT world. This originated in the early days of Amazon. Jeff Bezos instituted a rule: Every internal team should be small enough to be fed with two pizzas.

In the end, I chose to use the analogy of a car because I think many of the arguments are less debatable this way. I guess we could never agree on what would be the best pizza option for most people…

Hence, let’s talk about car engines, car brands, self-driving, connected fleets, and vintage vehicles in the following sections.

Give me a self-driving car, please!

Obviously, most people would prefer a self-driving car (if the price is right). It is safe, cost-efficient, and comfortable.

In the Kafka context, this means that the Kafka infrastructure would be

cloud-native (= elastic, scalable and automated, ideally fully managed)
complete (= entire set of security and operational features that enterprises require)
everywhere (= available in multiple public clouds, private cloud, on-premise, edge outside the data center)

Unfortunately, not every Kafka setup can be self-driving. We need to disassemble a car into its parts to understand what’s going on under the hood. Then we can choose the right car for our business problem.

Car Brands: Comparison of Confluent, Cloudera, Red Hat, Amazon MSK

Competition creates innovation. Hence, it is great to see many car brands and car models on the streets. Similarly, many competing companies fight for market share around Kafka business. Let’s quickly think about the available car brands (= Kafka vendors) on the market.

I only focus on the most relevant ones that either care about the Kafka project and community, have a lot of market power, or ideally both.

The car brands are Confluent, Cloudera, Red Hat, and Amazon MSK. I have a section on other Kafka and non-Kafka streaming vendors at the end of the blog post to provide a more detailed comparison.

Again, the idea is NOT to have a feature-by-feature comparison or flame war. The following are a few facts about each vendor. I only focus on Kafka-related points. Hence, it is no surprise that Confluent looks best in the following list as they only focus on event streaming. But obviously, each vendor has strengths and weaknesses. For instance, if you want to discuss the overall cloud infrastructure capabilities and strategy, well, then AWS would look much stronger than all the other Kafka vendors…

Confluent – The Leading Apache Kafka Vendor

A few facts about Confluent:

Focus on event streaming
Original creators of Kafka
The main contributor to the Apache Kafka project with 80% of Kafka commits
Always the latest Kafka version (without limitations) and full support
Rich Kafka ecosystem (connectors, governance, security, etc.)
Hybrid architectures (including the only true fully-managed and complete Kafka service)
Partnership and 1st party integration into cloud providers (AWS, GCP, Azure) – e.g., you can use your cloud provider credits and account to consume Confluent Cloud
Certified for self-managed operations on cloud providers’ edge offerings (e.g., AWS Outposts including Wavelength, Google’s Anthos)

Cloudera – Big Data Analytics Suite

A few facts about Cloudera:

Focus on big data analytics
Provides a platform around tens of different big data frameworks for storage, batch, and real-time analytics
Kafka is part of the platform (Hadoop, Spark, Flume, Flink, many more) with tooling and support for the whole platform
Hybrid architectures (but no fully-managed Kafka service)
Partnership and 3rd party integration into cloud providers (AWS, GCP, Azure)

Red Hat (IBM) – Cloud-native PaaS Infrastructure

A few facts about Red Hat (IBM):

Focus on infrastructure (mainly around Linux and Kubernetes)
Kafka is available as part of the Red Hat AMQ product portfolio, combined with other open-source frameworks like ActiveMQ or Camel
OpenShift Streams for Apache Kafka provides integration with Kubernetes
Focus on open source frameworks; working actively with the community (for Kafka, Red Hat contributes, for example, to Debezium for CDC and the Strimzi Kubernetes Operator)
Hybrid Architectures (but no fully-managed Kafka service)
Partnership and 3rd party integration into cloud providers (AWS, GCP, Azure)

Interesting side notes for the relationship between Confluent, Red Hat, and IBM:

IBM acquired Red Hat in 2019.
Confluent and IBM announced a strategic partnership in 2020.
IBM deprecated its own Kafka offering (IBM Streams) in March 2021.
Confluent is the way to go with IBM as part of the IBM Cloud Pak for Integration. Even IBM’s salespeople sell Confluent.

Amazon Web Services (AWS) – The Leading Cloud Provider

AWS focuses on cloud infrastructure and 1st party fully managed cloud services (S3, Kinesis, Lambda, etc.)

A few facts about Amazon MSK, the AWS offering for Kafka:

MSK misses several key Kafka features, including Kafka Connect or Kafka Streams
Cloud-only (but only self-managed, not fully managed)
MSK is not cloud-native (like S3 or Kinesis) but just provisioned infrastructure
Obviously only available on AWS
For on-premise deployments (like AWS Outposts or AWS Wavelength), the recommended Kafka product is the certified Confluent Platform

Interesting side note about the commercial support and SLAs of AWS’s Kafka offering: Kafka is excluded from MSK support! Quote from the MSK SLAs: “The Service Commitment DOES NOT APPLY to any unavailability, suspension, or termination… caused by the underlying Apache Kafka or Apache ZooKeeper engine software that leads to request failures…”

Event Streaming Technology and Cloud-native Infrastructure are Complementary!

The above showed a few facts for the main Kafka vendors: Confluent, Cloudera, Red Hat, AWS. However, it is worth explicitly pointing out that these vendors are often complementary. For instance, most Confluent Platform deployments I see on Kubernetes on-premise are actually on Red Hat OpenShift. And with AWS’s huge market share, most self-managed Confluent deployments in the cloud are on AWS.

Also, Confluent Platform is certified on AWS Outposts and Google Anthos. Hence, you can even combine cloud-native technologies at the edge. A great example is smart factory 5G use cases leveraging Confluent Platform on AWS Wavelength. Consequently, a Kafka comparison typically does not eliminate all the other Kafka vendors from the project.

The following architecture depicts the combination of Confluent Cloud in AWS plus Confluent Platform on AWS Wavelength leveraging 5G Carrier networks:

This is not just theory. The joint teams from AWS and Confluent are working on this example in the real world while I am writing this blog post.

Cloud-native? Complete? Everywhere? What Kafka should I buy?

After exploring different vendors, let’s now walk through the different deployment options and commercial offerings.

Again, I will not make a feature-by-feature comparison. Way more important is to understand the different concepts and architecture principles: First of all, you need to decide if a self-driving car (= fully managed Kafka) works for you. In that case, why bother at all about Kafka operations? Otherwise, project teams must evaluate partially managed (= complete car) or self-managed (car engine) Kafka offerings.

Here is an overview showing the event streaming landscape. It contains native Kafka offerings, (partly) Kafka-protocol compatible products, and a few relevant non-Kafka solutions:

Let’s now take a deeper look into these alternatives to find out how to choose the right one for your next project.

Car Engine: Self-managed Open Source Apache Kafka

The car engine is the heart of the car. It provides the power. It brings you from your source to your destination. However, a lot of work is needed around the engine. Tires, steering wheel, brakes, and much more are required. Hence, this is a great solution for playing around, learning how a car works, or for enthusiastic car fanatics who want to build a car themselves.

If you download open-source Apache Kafka from the Apache website or related Docker images, then you can use it for free in all your projects. No limitations. You should be able to get it running quickly. However, be aware that, similar to the engine of a car, there is much more to do: operating and monitoring the ZooKeeper and Kafka clusters, rebalancing partitions, scaling up and down, managing storage, securing and encrypting the end-to-end communication between producers, the Kafka cluster, and consumers, and so much more.

If you can handle the operations burden and risk of downtime, open-source Apache Kafka might be a good option. Some tech giants from Silicon Valley do exactly this. They have hired masses of tech experts (or car fanatics to keep the analogy) to run huge Kafka clusters to process trillions of messages and gigabytes of data per second.

Free Kafka add-ons to build your car

Plenty of open-source Kafka add-ons originated through this. Just to name a few tools for very different purposes: kafkacat, Kafka Manager, Kafdrop, burrow, cruise control, and so many more. Some are maintained well, others not at all. Of course, you will never get a guarantee of a version upgrade or bug fix. Many were built by a tech giant for its specific scenario and are not easily usable outside that organization or without a big community.

Alternatively, there are well-maintained community projects like Confluent’s Schema Registry, REST Proxy, and ksqlDB, all under Confluent Community License (CCL). This is not open source but free to use if you are not a cloud provider like AWS. Confluent also provides some components under Apache 2.0 license, such as the widely used non-Java Kafka clients based on librdkafka or the parallel-consumer to integrate with non-scalable interfaces like web services in a scalable and performant way.

Tuned Car Engine: Self-managed Kafka Product

If you want or need to self-manage your Kafka infrastructure, then you still have more options than just using open-source Apache Kafka and (well or not so well maintained) open-source add-ons:

Open-source Apache Kafka with additional commercial tools for operations and monitoring. For instance, Lenses or Conduktor.
Complete commercial platforms. For instance, Confluent Platform, Red Hat AMQ, Cloudera DataFlow.

These “tuned car engines” are built on top of Apache Kafka (or at least parts of it) and provide additional tooling for development, operations, monitoring, security, etc. Maturity of the tools, support SLAs, expertise, and consulting vary a lot between vendors. I recommend talking to your potential vendors. Ask the right questions to find out if they really understand what they seem to sell and support.

Shiny user interfaces attract many people. Just be careful. The underlying technology needs to work reliably and scale for your needs. The UI is nice to have on top of the infrastructure. Nevertheless, a good UI can improve the developer experience, shorten time to market, and bring other benefits.

Should I use a (Tuned) Car Engine and Build my own Car?

All the explored options above are still self-managed. If you consider building your own car with a car engine, always evaluate the cost-benefit equation.

Remember, at the beginning of this post, I talked about solving business problems. Hence, don’t forget to consider all the impacts on:

Total Cost of Ownership (TCO)
Risk (downtime, data loss, security, governance, etc.)
Return on Investment (ROI)
Time-to-market (Increased developer velocity and increased business agility)

At Confluent, we do TCO assessments with our prospects and customers so that they understand the complete costs and risks of a Kafka project. Such an assessment should be part of every Kafka comparison!

Hence, don’t forget to evaluate other alternatives to open-source Apache Kafka seriously. If self-managed Apache Kafka still works for you after the evaluation, then do it! But be aware that even the tech giants from Silicon Valley consider and buy other options today. Many had to build Kafka infrastructure because there was no other alternative when they built it years ago.

Does my favorite open-source vendor really provide open-source?

Also, be careful: Some open source solutions don’t provide an easy way to build the product. So, evaluate what exactly is available from a so-called “open source offering”: Only a binary download? Docker images? Or can you also build and deploy everything from scratch easily and documented (not just in theory !!!) using Maven, Gradle, Terraform, Ansible, or similar build and automation tools?

Before building your own car with an available car engine, why not buy a car? Let’s consider next if a complete car might make more sense for you.

Complete Car: Kafka Products and Kubernetes-based PaaS

I am not a car fanatic. I want to buy a complete car that I can drive everywhere. You should also at least consider this option!

In Kafka terms, this means you get help from the product for running the Kafka infrastructure. Buzzwords from vendors include terms like “platform as a service”, “private cloud”, “fully managed”, and “cloud-native”. In the end, the products help you with provisioning, operating, and monitoring everything.

The main benefit compared to the (tuned) car engines is that these products give you a more elastic, scalable, and automated infrastructure. Two options exist today:

Kubernetes-based products that can run everywhere on-premise and across multiple cloud providers. Examples: Confluent Platform, Cloudera DataFlow (CDF), Red Hat AMQ or Red Hat OpenShift Streams for Apache Kafka.
Proprietary cloud offerings that are typically tied to the related hyperscaler. Example: Amazon MSK.

This is similar to buying a car: It comes preassembled. Although, of course, you are still responsible for operating and maintaining it.

In Kafka terms, for many scenarios, these self-managed Kafka products (= car) are a better choice than the self-managed Kafka (= car engine) because they partially reduce the operations burden, risk, and (hopefully) TCO.

A complete car is still not self-driving!

However, as you know, today’s cars still need a lot of manual work: Driving, refueling, maintenance, and more. In Kafka terms: How much work do you still have to do by yourself? Do you have to handle rolling upgrades manually? Do you have to rebalance partitions on brokers? How do you scale up and down? Who fixes security issues and bugs? And so on.

Hence, it is really a pity that most vendors intentionally use misleading marketing! None of the above solutions is fully managed. All of them require you to do work to operate the Kafka cluster. All of them! Confluent Platform. Cloudera DataFlow. Red Hat AMQ. Red Hat OpenShift Streams for Apache Kafka. Amazon MSK. None of these services is fully managed!

Each car brand and model is different. If you buy a Porsche, you probably have very different expectations than when buying a small, medium-priced car from another brand. The same is true for all the self-managed Kafka products on the market. Each product is very different: Confluent Platform. Cloudera DataFlow. Red Hat AMQ / OpenShift Streams for Apache Kafka. Amazon MSK. All of them have strengths and weaknesses. Make sure the product fits your expectations so that you can solve your business problem within your required SLAs and budget.

Having said this, wouldn’t it be nice if you didn’t have to worry about all these things? Let’s explore the self-driving Kafka car next.

Self-driving Car: Fully-managed Cloud Kafka Service

A self-driving car provides a complete solution. You just tell it where you want to go. It drives you automatically. Chooses the best route. Allows relaxing, reading, playing games, or similar things. Of course, an autonomous car with level 5 automation is not mature yet (beyond some early stages, like Waymo operating in the desert in Phoenix where no rain and other weather or traffic issues occur).

In Kafka terms, the solution needs to be fully managed by the vendor.

Fully managed means serverless (i.e., you don’t have to care about the Kafka brokers and don’t even get access to them). Mission-critical SLAs. Usage-based billing. And so on. As you know from other truly fully managed cloud offerings such as AWS S3 or AWS Kinesis. These are fully managed. Amazon MSK is not!

Checklist to compare partially managed and fully-managed Kafka cloud services

Please compare different Kafka cloud offerings by yourself. Here are some bullet points to check:

Infrastructure management

Upgrades (latest stable version of Kafka)
Patching
Maintenance

Kafka-specific management

Sizing (retention, latency, throughput, storage, etc.)
Data balancing for optimal performance
Performance tuning for real-time and latency requirements
Fixing Kafka bugs
Uptime monitoring and proactive remediation of issues
Recovery support from data corruption

Scaling

Scaling the cluster as needed
Data balancing the cluster as nodes are added
Support for any Kafka issue with less than 60-minute response time
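
To make the sizing item in the checklist above a bit more tangible, here is a rough back-of-the-envelope sketch in Python. It assumes the common rule of thumb that cluster-wide disk demand is roughly write throughput × retention time × replication factor, plus some headroom. The function name, the 30% headroom default, and the example numbers are illustrative assumptions, not guidance from any vendor.

```python
def required_storage_gb(write_mb_per_sec: float,
                        retention_hours: float,
                        replication_factor: int = 3,
                        headroom: float = 0.3) -> float:
    """Rough cluster-wide disk estimate in GB (decimal: 1 GB = 1000 MB).

    Assumption: storage = throughput * retention * replication factor,
    inflated by a headroom fraction for bursts and rebalancing.
    """
    if write_mb_per_sec < 0 or retention_hours < 0:
        raise ValueError("throughput and retention must be non-negative")
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    # Data retained before replication, in MB.
    retained_mb = write_mb_per_sec * retention_hours * 3600
    # Every partition replica stores a full copy.
    replicated_mb = retained_mb * replication_factor
    return replicated_mb * (1 + headroom) / 1000


if __name__ == "__main__":
    # Hypothetical workload: 10 MB/s of writes, 7 days of retention.
    estimate = required_storage_gb(10, 24 * 7)
    print(f"Estimated storage: {estimate:.0f} GB")
```

With a self-managed or partially managed offering, somebody on your team owns this calculation and has to revisit it whenever throughput or retention changes. With a fully managed service, it is the vendor's problem.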

Most “Kafka as a service” offerings are only partially managed. That’s like a self-driving car which you actually have to control by yourself (more like level 3, not level 5 in autonomous driving terminology).

At this point, I have to do marketing for my employer. However, it is not an advertisement, but reality: Confluent Cloud is the only offering on the market that provides a complete, fully-managed Kafka SaaS offering. And it is available everywhere – on all major cloud providers (AWS, Azure, GCP). All the other Kafka offerings are NOT fully managed – even though most vendors claim it!

Other Vehicles on the Street: Comparison of Kafka-Compatible and Non-Kafka Offerings

On the street, we don’t see just one car brand or car model. Plenty of different ones exist. Nevertheless, they have to drive on the same streets. Competition creates innovation and tackles different markets and personal interests. That’s great. The same is true for Kafka!

I focused on the “mainstream Kafka vendors” in the above sections. Namely, Confluent, Cloudera, Red Hat, Amazon MSK. Obviously, more Kafka offerings exist on the market. Some are really good for some use cases. Others are more like an April Fools’ Joke, in my opinion. Let’s quickly walk through a few other offerings.

A few more car brands: Azure HDInsight’s Kafka, Aiven, cloudkarafka, Instaclustr. These Kafka-native PaaS vendors provision Kafka clusters for you, similar to Amazon MSK. These offerings differ slightly from each other. In summary, they typically ask you to do storage management, scalability configuration, performance tuning, etc., by yourself. This is definitely not self-driving!
A self-driving car: Azure Event Hubs is a SaaS offering from Microsoft supporting the Kafka protocol. It has several limitations regarding support of the Kafka API and infrastructure. A solid product. Contrary to Confluent Cloud, you don’t get additional capabilities such as fully-managed connectors, Schema Registry, RBAC, Audit Logs, and much more. And obviously, this product is only available on the Azure cloud.
A vintage car: TIBCO focuses on their legacy messaging solutions like TIBCO EMS. They (try to) provide support for Kafka (and Pulsar) to sell their proprietary technologies. Zero expertise or interest in Kafka. They even provide Kafka as a Windows .exe file even though this does not work well in reality. If you need to run Kafka brokers on Windows (e.g., for development), only use Kafka Docker containers and the Windows Subsystem for Linux 2 (WSL 2).

Non-Kafka offerings

Self-driving scooters: AWS Kinesis, GCP Pub/Sub, etc., are solid SaaS offerings that work well if you don’t need to be vendor-agnostic and if the feature set, scalability, and pricing work for you.
A few bicycles, motorbikes, and cars: Non-Kafka solutions, including message queuing (IBM MQ, RabbitMQ, NATS), stream processing (Flink, Spark Streaming), event streaming (Pulsar, Pravega), integration middleware (many open-source/proprietary and self-managed/SaaS). These are solid frameworks and products that you can compare to Kafka. There is no silver bullet! Make sure to understand the differences between MQ/ETL/ESB and Kafka when you do your evaluation.

Connected Car Fleet: Multiple Kafka Clusters and Hybrid Integration

The digital transformation around connected vehicles is a real game-changer. Vehicles talk to each other (V2V), their infrastructure like traffic lights (V2I), and to many other backend systems (V2X).

As a side note: If you are interested in the relation of Kafka and connected vehicles/mobility services, I covered use cases for connected vehicles and V2X in my blog series about Kafka and MQTT in more detail.

Today, we usually have to drive by ourselves. This is expected to change in the next five to ten years. However, even if Waymo, Tesla, and the like successfully deploy level 4 and level 5 cars to the street (including legal allowance), we will still only see a fraction of all cars driving themselves. It will be a connected fleet with regular cars and self-driving cars for at least a few decades. I am not even sure if self-driving cars can ever go to India.

The same is true for Kafka. Self-managed open-source Kafka is still mainstream today. Many enterprises move to Kubernetes and private or public cloud infrastructure, though. In parallel, most new Kafka clusters in the cloud are consumed from vendors that provide partially or fully managed services so that the enterprise can focus on their business problems.

Kafka is deployed across infrastructures. Often, new projects have a cloud-first strategy. But there are still a lot of data centers out there. Not just for legacy reasons. For instance, in Russia, there is no public cloud provider at all. Kafka has to be deployed on-premise. And there is the trend of deploying Kafka at the edge (i.e., outside a data center).

Architectures for Hybrid Kafka (SaaS + PaaS + Self-Managed)

Hence, a connected car fleet with various brands and operation types is required. Most enterprises use different vendors and cloud providers. Most enterprises have their own data centers and a multi-cloud strategy. Hybrid Kafka includes various architectures. This includes:

Kafka in one or multiple clouds. There is no Azure or GCP in China, only Alibaba and Tencent Cloud. This is why Audi built their connected car infrastructure in the cloud, but with Kafka instead of proprietary cloud services. They need to deploy globally.
Kafka at the edge outside the data center, e.g., in a smart factory, oil rigs, ships, retail stores, etc. Often deployed as a single broker on very lightweight hardware, without high availability.
Kafka stretched across regions, i.e., one single cluster operating across the US west, east, central. Confluent’s Multi-Region Clusters (MRC) is mainly used for this architecture.
Replication between different Kafka clusters. Use cases include aggregation, disaster recovery, global deployments, and more. Kafka-native technologies such as MirrorMaker 2, Confluent Replicator, or Confluent Cluster Linking enable these architectures.
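
As a minimal sketch of the last option, the following Python helper renders a properties file in the style used by MirrorMaker 2 when it runs as a dedicated cluster (cluster aliases, per-cluster bootstrap servers, and one `source->target` block per replication flow). The cluster aliases, host names, and topic pattern are hypothetical; check the exact property keys against the Kafka documentation for your version before using anything like this.

```python
def mm2_properties(clusters: dict[str, str],
                   flows: list[tuple[str, str, str]]) -> str:
    """Render a MirrorMaker 2 style properties file as a string.

    clusters: cluster alias -> bootstrap servers (hypothetical values)
    flows:    (source alias, target alias, topic regex) per replication flow
    """
    lines = ["clusters = " + ", ".join(clusters)]
    for alias, servers in clusters.items():
        lines.append(f"{alias}.bootstrap.servers = {servers}")
    for source, target, topics in flows:
        if source not in clusters or target not in clusters:
            raise ValueError(f"unknown cluster in flow {source}->{target}")
        # One block per direction; bidirectional replication needs two flows.
        lines.append(f"{source}->{target}.enabled = true")
        lines.append(f"{source}->{target}.topics = {topics}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Aggregation example: an edge cluster replicates sensor topics to the cloud.
    print(mm2_properties(
        {"edge": "edge-broker:9092", "cloud": "cloud-broker:9092"},
        [("edge", "cloud", r"sensor\..*")],
    ))
```

Note that with MirrorMaker 2's default replication policy, replicated topics show up on the target cluster prefixed with the source alias (for example, `edge.sensor.temperature`), which matters when you design aggregation or disaster recovery setups.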

“Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” explores this topic in more detail.

Focus on the Business Problem when making your Kafka Comparison!

This blog post explored the different deployment options for Kafka. Several open-source and commercial options exist.

If you want to remember one thing from this post: A fully-managed Kafka service (= real SaaS) takes over all the operations complexity and risk for you, similar to how a self-driving car handles all the actions on the street. However, most services available today only provide self-managed Kafka clusters. Fully managed is often only a marketing term.

A hybrid architecture is the norm in most enterprises. A combination of fully-managed Kafka in the public cloud with self-managed Kafka on premise or at the edge works very well and is the way to go for most enterprises across industries.

What Kafka car do you drive today? What is your plan for the future? Maybe you are already planning to migrate to a self-driving car to focus on your business problems – and consequently reducing cost and risk this way, too? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Comparison of Open Source Apache Kafka vs Vendors including Confluent, Cloudera, Red Hat, Amazon MSK appeared first on Kai Waehner.

Apache Kafka and MQTT (Part 5 of 5) – Smart City and 5G

Kai Waehner — Mon, 29 Mar 2021 07:10:02 +0000

Apache Kafka and MQTT are a perfect combination for many IoT use cases. This blog series covers the pros and cons of both technologies. Various use cases across industries, including connected vehicles, manufacturing, mobility services, and smart city are explored. The examples use different architectures, including lightweight edge scenarios, hybrid integrations, and serverless cloud solutions. This post is part five: Smart City and 5G.

Apache Kafka + MQTT Blog Series

The first blog post explores the relation between MQTT and Apache Kafka. Afterward, the other four blog posts discuss various use cases, architectures, and reference deployments.

Part 1 – Overview: Relation between Kafka and MQTT, pros and cons, architectures
Part 2 – Connected Vehicles: MQTT and Kafka in a private cloud on Kubernetes; use case: remote control and command of a car
Part 3 – Manufacturing: MQTT and Kafka at the edge in a smart factory; use case: Bidirectional OT-IT integration with Sparkplug between PLCs, IoT Gateways, Data Historian, MES, ERP, Data Lake, etc.
Part 4 – Mobility Services: MQTT and Kafka leveraging serverless cloud infrastructure; use case: Traffic jam prediction service using machine learning
Part 5 – Smart City (THIS POST): MQTT at the edge connected to fully-managed Kafka in the public cloud; use case: Intelligent traffic routing by combining and correlating 3rd party services

Subscribe to my newsletter to get updates immediately after the publication. Besides, I will also update the above list with direct links to this blog series’s posts as soon as published.

Use Case: Smart City and 5G

A smart city is an urban area that uses different types of electronic Internet of Things (IoT) sensors to collect data and then uses insights gained from that data to manage assets, resources, and services efficiently.

A smart city provides many benefits for civilization and city management. Some of the goals are:

Improved Pedestrian Safety
Improved Vehicle Safety
Proactively Engaged First Responders
Reduced Traffic Congestion
Connected / Autonomous Vehicles
Improved Customer Experience
Automated Business Processes

I covered the use cases in more detail in the post “Event Streaming with Kafka as Foundation for a Smart City“. For a specific 5G example, check out “Building a Smart Factory with Apache Kafka and 5G Campus Networks“.

Let’s now explore the relation of Kafka and MQTT for smart city use cases.

Architecture: MQTT and Kafka for a Smart City

The following architecture shows an infrastructure deployed at a stadium:

In this example, both MQTT and Kafka are deployed close to the stadium. For instance, AWS Wavelength is an innovative infrastructure option to build low latency 5G use cases. The connected “regular AWS cloud region” is still used for use cases that do not require low latency.

The combination of Kafka and MQTT enables connectivity and real-time data processing for various use cases:

Parking information and smart navigation.
Location-based shopping and restaurant experiences, including innovative scenarios such as monitoring of queues and geofencing.
Integration of loyalty platforms to earn rewards and points.
Live information about the game or concert.
Lottery drawing experiences while watching a sports game.

The possibilities are endless. Integration with 1st and 3rd party applications will create completely new opportunities to improve the customer experience, increase safety, and improve operational efficiency.

The stadium example is a particular scenario to explore the added value of processing data in motion. Let’s take a look at other real-world examples that leverage MQTT and Kafka.

Example: Cloud-based Traffic Control Systems @Berlex

The Swedish company Berlex designs and manufactures new ways to improve traffic safety.

Berlex provides cloud-based portable traffic signals. Their innovative R6 traffic signal is one of the first mobile traffic signals controlled by a cloud-based service. Berlex’s connected solution allows customers to monitor the new traffic signals on a smartphone, computer, or tablet anytime and from anywhere. MQTT enables real-time information delivery and constant monitoring.

The cloud-based service reduces the time that their customers need to spend in dangerous traffic work zones. The system enables customers to carry out numerous tasks such as checking the battery status of a traffic signal or performing an inspection remotely, with no need for risky and time-consuming on-site intervention.

Each portable R6 traffic signal is equipped with a radar that allows the signal to see traffic. Sensors within the signals publish detailed information on the current status of the signal as MQTT data. The Berlex Connect cloud service captures the continuous stream of MQTT data from each signal and shares the information with the appropriate subscribers.
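
The publish/subscribe routing described above relies on MQTT topic filters. As a simplified sketch, the following Python function shows how a broker decides whether a published topic matches a subscription, using the two MQTT wildcards: `+` for exactly one level and `#` for all remaining levels. The topic names are hypothetical (the actual Berlex topic structure is not public in this post), and the sketch ignores special cases such as topics starting with `$`.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic matches a subscription filter.

    Simplified: '+' matches exactly one level, '#' matches the parent
    level and everything below it and is only valid as the last level.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: valid only as the final filter level.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    # Without '#', filter and topic must have the same number of levels.
    return len(filter_levels) == len(topic_levels)


if __name__ == "__main__":
    # Hypothetical topics: signals/<signal-id>/<metric>
    print(topic_matches("signals/+/battery", "signals/r6-0042/battery"))
    print(topic_matches("signals/#", "signals/r6-0042/radar"))
    print(topic_matches("signals/+/battery", "signals/r6-0042/radar"))
```

A maintenance portal could subscribe to a filter like `signals/+/battery` to see only battery status across all signals, while an operations dashboard subscribes to `signals/#` to receive everything.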

To prevent interruption of the traffic signal operation, high availability is essential for the system. Berlex customers monitor the real-time information on individual portals with customized user roles that fit their specific use case.

Read the complete case study from HiveMQ for more details about this successful smart city project.

Example: The Life of Citizens as a Stream of Events @ NAV

NAV (Norwegian Work and Welfare Department) currently distributes more than one-third of the national budget to citizens in Norway or abroad. NAV assists people through all phases of life within work, family, health, retirement, and social security. Events happening throughout a person’s life determine which services NAV provides to them, how it provides them, and when it provides them.

In most countries, each person has to apply for these services, resulting in many tasks handled manually by various caseworkers in the organization. Their access to insight and useful information is limited, and that information is often hard to find, causing frustration for both caseworkers and users. By streaming a person’s life events through its Kafka pipelines, NAV revolutionized the way users experience government services and the way the employees work:

NAV and the government as a whole have access to vast amounts of data about the citizens, reported by health institutions, employers, various government agencies, or the users themselves. Some data is distributed by large batches, while others are available on-demand through APIs. The data is ingested into streams using Kafka, Streams API, and Java microservices. NAV distributes and acts on events about birth, death, relationships, employment, income, and business processes to vastly improve the user experience, provide real-time insight and reduce the need to apply for services the government already knows are needed.

NAV chose Confluent Platform to gain valuable insight from life and business events. Security is a key concern. Compliance with GDPR is essential for the success of this project.

More details about NAV’s Kafka usage in their Kafka Summit presentation.

Kafka + MQTT = Smart City

In conclusion, Apache Kafka and MQTT are a perfect combination for smart city and 5G use cases. Follow the blog series to learn about use cases such as connected vehicles, manufacturing, mobility services, and smart city. Every blog post also includes real-world deployments from companies across industries. It is key to understand the different architectural options to make the right choice for your project.

What are your experiences and plans in IoT projects? What use case and architecture did you implement? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka and MQTT (Part 5 of 5) – Smart City and 5G appeared first on Kai Waehner.

App Modernization and Hybrid Cloud Architectures with Apache Kafka

Kai Waehner — Wed, 10 Mar 2021 10:03:09 +0000

Hybrid cloud architectures are the new black for most companies. A cloud-first strategy is obvious for many, but legacy infrastructure must be maintained, integrated, and (maybe) replaced over time. Event Streaming with the Apache Kafka ecosystem is a perfect technology for building hybrid replication in real-time at scale.

App Modernization and Streaming Replication with Apache Kafka at Bayer

Most enterprises require a reliable and scalable integration between legacy systems such as IBM Mainframe, Oracle, SAP ERP, and modern cloud-native applications like Snowflake, MongoDB Atlas, or AWS Lambda.

Application modernization benefits from the Apache Kafka ecosystem for hybrid integration scenarios. The pharmaceutical and life sciences company Bayer AG is a great example of a hybrid multi-cloud infrastructure. They leverage the Apache Kafka ecosystem as “middleware” to build a bi-directional streaming replication and integration architecture between on-premises data centers and multiple cloud providers:

Learn about Bayer’s journey and how they built their hybrid and multi-cloud Enterprise DataHub with Apache Kafka and its ecosystem: Bayer’s Kafka Summit talk.

Hybrid Cloud Architectures with Apache Kafka

I already explored “architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” in 2020:

TL;DR: Various alternatives exist to deploy Apache Kafka across data centers, regions, and even continents. There is no single best architecture. It always depends on characteristics such as RPO / RTO, SLAs, latency, throughput, etc.

Some deployments focus on on-prem to cloud integration. Others link together Kafka clusters on multiple cloud providers. Technologies such as Apache Kafka’s MirrorMaker 2, Confluent Replicator, Confluent Multi-Region-Clusters, and Confluent Cluster Linking help build such an infrastructure.

Video and Live Demo of Hybrid Replication with Kafka

The following video recording discusses hybrid Kafka architectures in more detail. The focus is on the bi-directional replication between on-prem and cloud to modernize the infrastructure, integrate legacy with modern applications, and move to a more cloud-native architecture with all its benefits:

If you want to see the live demo, go to minute 14:00. The demo shows the real-time replication between a Kafka cluster on-premise and Confluent Cloud, including stream processing with ksqlDB and data integration with Kafka Connect (using the fully-managed AWS S3 connector).

The live demo uses AWS, but the same architecture is possible on Azure and GCP, of course. Even more exciting is the option to use on-prem products from the cloud vendors, such as AWS Outposts or Google Anthos. As another example, I am currently working with colleagues from Confluent and the AWS Wavelength team on a live demo for 5G use cases such as smart factories and connected vehicles. Apache Kafka’s beauty is the freedom to choose the right architecture and infrastructure for your use case!

Summary

Hybrid cloud architectures are the new black for most companies. Consequently, event streaming with the Apache Kafka ecosystem is a perfect technology for building hybrid replication in real-time at scale. It is battle-tested across industries and regions. Leverage Kafka to build a modern and cloud-native infrastructure with on-premise, cloud, and edge workloads!

What are your experiences and plans for building hybrid architectures? Did you already build infrastructure with Apache Kafka to connect your legacy and modern applications? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post App Modernization and Hybrid Cloud Architectures with Apache Kafka appeared first on Kai Waehner.

Apache Kafka in the Financial Services Industry

Kai Waehner — Mon, 18 Jan 2021 13:24:41 +0000

Event streaming in financial services is growing like crazy. Continuous real-time data integration and processing are mandatory for many use cases. Many business departments in the financial services sector deploy Apache Kafka for mission-critical transactional workloads and big data analytics. High scalability, high reliability, and an elastic open infrastructure are the key reasons for Kafka’s success. This blog post explores different use cases, architectures, and real-world examples in the FinServ sector.

FinServ Enterprise Reality: Innovate or be disrupted!

There is no way around it: The new business reality is very different from the last decades:

In the past:

Technology was a support function.
Innovation was required for growth.
It was “good enough” to run on yesterday’s data.

Today:

Technology is the business.
Innovation is required for survival.
Yesterday’s data = failure.
Modern, real-time data infrastructure is required.

Only two options exist for enterprises in the finance sector: Innovate or be disrupted!

The New FinServ Enterprise Reality – Every Company is a Software Company

Please take a look at your favorite traditional bank and what its market cap looks like compared to new FinTech companies such as Robinhood, Stripe, Square, or Revolut.

Some traditional companies re-invented themselves to focus on innovative new products and great customer experience to stay competitive. Software is eating the world, including the finance sector.

Here are a few examples:

Capital One: 10,000 of 40,000 employees are software engineers
Goldman Sachs: 1.5B (billion!) lines of code across 7,000+ applications
JPMorgan Chase & Co: Employs over 50,000 people in technology and has $10B+ technology spend.

Most successful post-modern companies in the finance sector rely heavily on Apache Kafka. This is true for emerging FinTechs but also for (some) traditional banks.

Apache Kafka in Financial Services

Various use cases have emerged for event streaming with Apache Kafka in the finance industry. These include mission-critical transactional workloads like payment processing or regulatory reporting, as well as big data analytics projects leveraging Machine Learning, data lakes, etc.

Examples for Real-World Deployments

Here are a few companies leveraging Apache Kafka for banking projects:

Check past Kafka Summit video recordings and slides for details about use cases and architectures of these companies from the finance sector.

Here are a few concrete examples:

Capital One: Becoming truly event-driven – offering a service other parts of the bank can use.
ING: Significantly improved customer experience – as a differentiator + Fraud detection and cost savings.
Nordea: Able to meet strict regulatory requirements around real-time reporting + cost savings.
PayPal: Processing 400+ billion events per day for user behavioral tracking, merchant monitoring, risk & compliance, fraud detection, and other use cases.
Royal Bank of Canada (RBC): Mainframe off-load, better CX & fraud detection – brought many parts of the bank together
10X Banking: Cloud-native and open core-banking platform to implement a next-generation FinServ platform
Robinhood: Commission-free stock trading using a mobile app and website.

This is just a short list of companies in the financial sector that use Apache Kafka as an event streaming platform at the heart of their business. Plenty of other examples are available from dozens of global banks that leverage Apache Kafka for many use cases.

Kafka makes your Business Real-Time

The huge advantage is that Kafka allows decoupling your applications and infrastructure in a domain-driven design (DDD). Each microservice can use its own technology or product but leverage the same data (with security and privacy in mind, of course):

It is great to see that many FinServ companies do not just leverage Kafka in their applications but also contribute to the community. For instance, Robinhood published Faust, a stream processing library that ports the ideas of Kafka Streams from Java to Python.

This is a great example of a microservice architecture and the freedom of technology choice: Robinhood did not want to use Java for (some) applications and chose Python instead. That is no problem with Kafka because the brokers are dumb. The data processing and business logic happen in the clients, written in your favorite programming language.
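The idea of dumb brokers and smart clients can be illustrated with a minimal, in-memory sketch. It does not use Kafka or Faust at all. The `DumbBroker` class, the `fraud_service` and `reporting_service` functions, and the sample payments are hypothetical names and data invented for this illustration. The toy broker only stores records and tracks one read position per consumer group, while each client applies its own business logic to the same data:

```python
from collections import defaultdict


class DumbBroker:
    """Toy stand-in for a topic: an append-only log plus one offset per group."""

    def __init__(self):
        self._log = []                    # ordered records, never modified
        self._offsets = defaultdict(int)  # consumer group -> next offset to read

    def produce(self, record):
        self._log.append(record)

    def poll(self, group):
        """Return the records this group has not read yet and advance its offset."""
        start = self._offsets[group]
        records = self._log[start:]
        self._offsets[group] = len(self._log)
        return records


def fraud_service(payments, limit=10_000):
    """Hypothetical client logic: flag payments above a threshold."""
    return [p["id"] for p in payments if p["amount"] > limit]


def reporting_service(payments):
    """Hypothetical client logic: total amount per account."""
    totals = defaultdict(int)
    for p in payments:
        totals[p["account"]] += p["amount"]
    return dict(totals)


broker = DumbBroker()
for payment in [
    {"id": 1, "account": "A", "amount": 250},
    {"id": 2, "account": "B", "amount": 15_000},
    {"id": 3, "account": "A", "amount": 900},
]:
    broker.produce(payment)

# Both services read the same records, independently of each other.
flagged = fraud_service(broker.poll("fraud"))
totals = reporting_service(broker.poll("reporting"))
print(flagged)
print(totals)
```

The broker never inspects or transforms a record, so either service could be rewritten in another language or replaced without touching the other one. A real deployment would add partitions, replication, persistence, and security, all of which this sketch leaves out.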

And to be clear: Financial services are important in every company! Payments, orders, fraud, and similar transactional and analytical data rely on FinServ applications and the integration with partners.

Slides – The Rise of Event Streaming in FinServ

The following slide deck goes into more detail. Learn about the rise of event streaming in the financial services industry. Kafka is adopted in more and more scenarios:

Kafka in Banking for Middleware, Mainframe, Machine Learning, Open API, and more

The following links share additional content related to many banking and FinServ use cases and architectures:

Please check them out to learn more about the usage of event streaming with Kafka and its ecosystem across various business units in the FinServ sector.

Last but not least, please be aware that the term “real-time” is used in many contexts and can have different meanings. Read “Kafka is NOT hard real-time” to understand why Kafka is used in most banking projects, but not for the specific use case of trading in microseconds.

Software is Eating the Banks and FinServ Industry

Software is eating the world, including financial services. Continuous real-time data integration and processing are mandatory for many use cases. Apache Kafka is deployed across industries for mission-critical transactional workloads and big data analytics. No matter if you need to integrate with legacy systems, process mission-critical payment data, or build batch reports and analytic models, Kafka is a predominant choice as part of the architecture. Hybrid, edge, and multi-cloud deployments of Kafka are the new black.

What are your experiences and plans for event streaming in the financial services industry? Did you already build applications with Apache Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Financial Services Industry appeared first on Kai Waehner.

Architecture Archives - Kai Waehner

Apache Kafka Cluster Type Deployment Strategies

Apache Kafka – The De Facto Standard for Event-Driven Architectures and Data Streaming

Different Apache Kafka Cluster Types

Apache Kafka Cluster Strategies and Architectures

Bridging Hybrid Kafka Clusters

RPO vs. RTO = Data Loss vs. Downtime

Stretched Kafka Cluster – Zero Data Loss with Synchronous Replication across Data Centers

Pricing of Kafka Cloud Offerings (vs. Self-Managed)

Kafka Storage – Tiered Storage and Iceberg Table Format to Store Data Only Once

Real-World Success Stories for Multiple Kafka Clusters

Paypal – Separation by Security Zone

JioCinema – Separation by Use Case and SLA

Audi – Operations vs. Analytics for Connected Cars

New Relic – Worldwide Multi-Cloud Observability

Multiple Kafka Clusters are the Norm; Not an Exception!

Apache Kafka in the Public Sector – Part 2: Smart City

Blog series: Apache Kafka in the Public Sector and Government

Real-time is Mandatory for a Smart City Everywhere

The Need for Real-time Data Processing Everywhere in a Smart City and how Kafka helps

Low Latency and 5G Networks for (some) Data Streaming Use Cases

Collaboration between Government, City, and 3rd Party via Open API

Ohio Department of Transportation (ODOT) – A Government-Owned Event Streaming Platform

Deutsche Bahn – Single Source of Truth and Google Maps Integration in Real-time

Free Now – Mobility Service in the Cloud Connected to Regional Laws and Vehicles

Data in Motion with Kafka for a Connected and Innovative Smart City

Apache Kafka in the Public Sector – Blog Series about Use Cases and Architectures

Blog series: Apache Kafka in the Public Sector and Government

The Public Sector is a Broad Spectrum of Use Cases

Real-time Data Beats Slow Data in the Public Sector

Open API and Partnerships are Mandatory

Data Mesh for Sharing Events between Government and 3rd Party Applications and Services

Data in Motion as Paradigm Shift in the Public Sector

When to Use Reverse ETL and when it is an Anti-Pattern

What are ETL and Reverse ETL?

ETL (Extract-Transform-Load)

ELT (Extract-Load-Transform)

Reverse ETL

Products and SaaS Offerings for Reverse ETL

Reverse ETL == Real-Time Data for Sales, Marketing, Customer Success

Reverse ETL + Data Lake + Real-Time == Myth

Reverse ETL in the Enterprise Architecture

Reverse ETL == SQL Queries vs Change Data Capture

Event-driven Architecture + Streaming ETL == Reverse ETL built-in

When do you need Reverse ETL?

Kafka Examples for Reverse ETL

Apache Kafka + Salesforce + Oracle CDC + Snowflake

Tapping into the Splunk Ingestion Layer in Motion with Kafka

Don’t Design for Data at Rest to Reverse it!

Apache Kafka in the Insurance Industry

Digital Transformation in the Insurance Industry

The Need for Brownfield Integration

Greenfield Applications at Insurtech Companies

Real-World Deployments of Kafka in the Insurance Industry

Generali – Kafka as Integration Platform

Design Principles of Generali’s Cloud-Native Architecture

Centene – Integration and Data Processing at Scale in Real-Time

Swiss Mobiliar – Decoupling and Orchestration

Humana – Real-Time Integration and Analytics

freeyou – Stateful Streaming Analytics

Tesla – Carmaker and Utility Company, now also Car Insurer

Tesla: “Much Better Feedback Loop”

Slide Deck: Kafka in the Insurance Industry

Kafka Changes How to Think Insurance

Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure

Low Latency Data Processing

What is real-time and low latency data processing?

Low latency = soft real-time; NOT hard real-time

Kafka support for low latency processing

Infrastructure for Low Latency Data Processing

Edge infrastructure for low latency data processing

Mobile Edge Compute / Multi-access Edge Compute (MEC)

5G and cloud-native Infrastructure are a key piece of a MEC infrastructure!

5G infrastructure for low latency and high throughput SLAs

Cloud-native infrastructure

Software for Low Latency Data Processing

Why event streaming with Apache Kafka for low latency?

Use Cases for Low Latency Data processing with Apache Kafka

Mobile Edge Compute / Multi-access Edge Compute (MEC) use cases for Kafka across industries

NOT every use case requires low latency or real-time