Hybrid Cloud Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/hybrid-cloud/

Real-Time Data Sharing in the Telco Industry for MVNO Growth and Beyond with Data Streaming
https://www.kai-waehner.de/blog/2025/04/30/real-time-data-sharing-in-the-telco-industry-for-mvno-growth-and-beyond-with-data-streaming/ (30 Apr 2025)

The telecommunications industry is transforming rapidly as Telcos expand partnerships with MVNOs, IoT platforms, and enterprise customers. Traditional batch-driven architectures can no longer meet the demands for real-time, secure, and flexible data access. This blog explores how real-time data streaming technologies like Apache Kafka and Flink, combined with hybrid cloud architectures, enable Telcos to build trusted, scalable data ecosystems. It covers the key components of a modern data sharing platform, critical use cases across the Telco value chain, and how policy-driven governance and tailored data products drive new business opportunities, operational excellence, and regulatory compliance. Mastering real-time data sharing positions Telcos to turn raw events into strategic advantage faster and more securely than ever before.

The telecommunications industry is entering a new era. Partnerships with MVNOs, IoT platforms, and enterprise customers demand flexible, secure, and real-time access to network and customer data. Traditional batch-driven architectures are no longer sufficient. Instead, real-time data streaming combined with policy-driven data sharing provides a powerful foundation for building scalable data products for internal and external consumers. A modern Telco must manage data collection, processing, governance, data sharing, and distribution with the same rigor as its core network services. Leading Telcos now operate centralized real-time data streaming platforms to integrate and share network events, subscriber information, billing records, and telemetry from thousands of data sources across the edge and core networks.

Data Sharing for MVNO Growth and Beyond with Data Streaming in the Telco Industry

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including a dedicated chapter about the telco industry.

Data Streaming in the Telco Industry

Telecommunications networks generate vast amounts of data every second. Every call, message, internet session, device interaction, and network event produces valuable information. Historically, much of this data was processed in batches — often hours or even days after it was collected. This delayed model no longer meets the needs of modern Telcos, partners, and customers.

Data streaming transforms how Telcos handle information. Instead of storing and processing data later, it is ingested, processed, and acted upon in real time as it is generated. This enables continuous intelligence across all parts of the network and business.

Learn more about “The Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming)“.

Business Value of Data Streaming in the Telecom Sector

Key benefits of data streaming for Telcos include:

  • Real-Time Visibility: Immediate insight into network health, customer behavior, fraud attempts, and service performance.
  • Operational Efficiency: Faster detection and resolution of issues reduces downtime, improves customer satisfaction, and lowers operating costs.
  • New Revenue Opportunities: Real-time data enables new services such as dynamic pricing, personalized offers, and proactive customer support.
  • Enhanced Security and Compliance: Immediate anomaly detection and instant auditability support regulatory requirements and protect against cyber threats.

Technologies like Apache Kafka and Apache Flink are now core components of Telco IT architectures. They allow Telcos to integrate massive, distributed data flows from radio access networks (RAN), 5G core systems, IoT ecosystems, billing and support platforms, and customer devices.

Modern Telcos use data streaming to not only improve internal operations but also to deliver trusted, secure, and differentiated services to external partners such as MVNOs, IoT platforms, and enterprise customers.

Learn More about Data Streaming in Telco

Learn more about data streaming in the telecommunications sector in my related blog posts.

Data streaming is not an all-rounder that solves every problem. Hence, a modern enterprise architecture combines data streaming with purpose-built, telco-specific platforms and SaaS solutions, and with data lakes/warehouses/lakehouses like Snowflake or Databricks for analytical workloads.

I have already written about combining data streaming platforms like Confluent with Snowflake and Microsoft Fabric. A blog series about data streaming with Confluent combined with AI and analytics using Databricks will follow right after this blog post.

Building a Real-Time Data Sharing Platform in the Telco Industry with Data Streaming

By mastering real-time data streaming, Telcos unlock the ability to share valuable insights securely and efficiently with internal divisions, IoT platforms, and enterprise customers.

Mobile Virtual Network Operators (MVNOs) — companies that offer mobile services without owning their own network infrastructure — are an equally important group of consumers. As MVNOs deliver niche services, competitive pricing, and tailored customer experiences, real-time data sharing becomes essential to support their growth and enable differentiation in a highly competitive market.

Real-Time Data Sharing Between Organizations Is Necessary in the Telco Industry

A strong real-time data sharing platform in the telco industry integrates multiple types of components and stakeholders, organized into four critical areas:

Data Sources

A real-time data platform aggregates information from a wide range of technical systems across the Telco infrastructure (a minimal producer sketch follows after this list).

  • Radio Access Network (RAN) Metrics: Capture real-time information about signal quality, handovers, and user session performance.
  • 5G Core Network Functions: Manage traffic flows, session lifecycles, and device mobility through UPF, SMF, and AMF components.
  • Operational Support Systems (OSS) and Business Support Systems (BSS): Provide data for service assurance, provisioning, customer management, and billing processes.
  • IoT Devices: Send continuous telemetry data from connected vehicles, industrial assets, healthcare monitors, and consumer electronics.
  • Customer Premises Equipment (CPE): Supply performance and operational data from routers, gateways, modems, and set-top boxes.
  • Billing Events: Stream usage records, real-time charging information, and transaction logs to support accurate billing.
  • Customer Profiles: Update subscription plans, user preferences, device types, and behavioral attributes dynamically.
  • Security Logs: Capture authentication events, threat detections, network access attempts, and audit trail information.
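
Most of these feeds ultimately arrive as events on Kafka topics. As a concrete illustration, here is a minimal Java producer sketch for a telemetry source. The broker address, topic name, key, and JSON payload are illustrative assumptions, not a prescribed schema; a real deployment would typically register an Avro or Protobuf schema in a schema registry instead of sending raw strings.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DeviceTelemetryProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // illustrative address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by device id so all events of one device stay ordered
            // within a single partition of the (hypothetical) topic.
            producer.send(new ProducerRecord<>("device-telemetry",
                "cpe-4711", "{\"signalDb\":-67,\"uptimeSec\":86400}"));
        }
    }
}
```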

Stream Processing

Stream processing technologies ensure raw events are turned into enriched, actionable data products as they move through the system (see the sketch after this list).

  • Real-Time Data Ingestion: Continuously collect and process events from all sources with low latency and high reliability.
  • Data Aggregation and Enrichment: Transform raw network, billing, and device data into structured, valuable datasets.
  • Actionable Data Products: Create enriched, ready-to-consume information for operational and business use cases across the ecosystem.
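
The following Kafka Streams sketch illustrates the enrichment step mentioned above: raw network events are joined with subscriber profiles to build a consumable data product. All topic names, the plain string serialization, and the join logic are illustrative assumptions, not a reference implementation.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class NetworkQualityDataProduct {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "network-quality-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Raw RAN events, keyed by subscriber id (hypothetical topic).
        KStream<String, String> ranMetrics = builder.stream("ran-metrics-raw");
        // Subscriber profiles as a continuously updated lookup table.
        KTable<String, String> profiles = builder.table("subscriber-profiles");

        // Enrich each raw event with profile data and publish the result
        // as a ready-to-consume data product topic.
        ranMetrics.join(profiles, (metric, profile) -> metric + " | " + profile)
                  .to("network-quality-data-product");

        new KafkaStreams(builder.build(), props).start();
    }
}
```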

Data Governance

Effective governance frameworks guarantee that data sharing is secure, compliant, and aligned with commercial agreements.

  • Policy-Based Access Control: Enforce business, regulatory, and contractual rules on how data is shared internally and externally (see the sketch after this list).
  • Data Protection Techniques: Apply masking, anonymization, and encryption to secure sensitive information at every stage.
  • Compliance Assurance: Meet regulatory requirements like GDPR, CCPA, and telecom-specific standards through real-time monitoring and enforcement.
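
In Kafka-based platforms, policy-based access control ultimately maps down to authorization rules such as topic-level ACLs. The following sketch uses Apache Kafka's standard ACL tooling to grant an external partner read-only access to a single curated topic. The principal, topic, and group names are illustrative; fully managed services usually expose the same concept through their own RBAC or API layer.

```bash
# Allow the MVNO partner's principal to read exactly one data product
# topic with its own consumer group - and nothing else.
kafka-acls --bootstrap-server broker:9092 \
  --add \
  --allow-principal User:mvno-partner-a \
  --operation Read \
  --topic network-quality-data-product \
  --group mvno-partner-a
```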

Data Consumers

Multiple internal and external stakeholders rely on tailored, policy-controlled access to real-time data streams to achieve business outcomes.

  • MVNO Partners: Consume real-time network metrics, subscriber insights, and fraud alerts to offer better customer experiences and safeguard operations.
  • Internal Telco Divisions: Use operational data to improve network uptime, optimize marketing initiatives, and detect revenue leakage early.
  • IoT Platform Services: Rely on device telemetry and mobility data to improve fleet management, predictive maintenance, and automated operations.
  • Enterprise Customers: Integrate real-time network insights and SLA compliance monitoring into private network and corporate IT systems.
  • Regulatory and Compliance Bodies: Access live audit streams, security incident data, and privacy-preserving compliance reports as required by law.

Key Data Products Driving Value for Data Sharing in the Telco Industry

In modern Telco architectures, data products act as the building blocks for a data mesh approach, enabling decentralized ownership, scalable integration with microservices, and direct access for consumers across the business and partner ecosystem.

Data Sharing in Telco with a Data Mesh and Data Products using Data Streaming with Apache Kafka

The right data products accelerate time-to-insight and enable additional revenue streams. Leading Telcos typically offer:

  • Network Quality Metrics: Monitoring service degradation, latency spikes, and coverage gaps continuously.
  • Customer Behavior Analytics: Tracking app usage, mobility patterns, device types, and engagement trends.
  • Fraud and Anomaly Detection Feeds: Capturing unusual usage, SIM swaps, or suspicious roaming activities in real time.
  • Billing and Charging Data Streams: Delivering session records and consumption details instantly to billing systems or MVNO partners.
  • Device Telemetry and Health Data: Providing operational status and error signals from smartphones, CPE, and IoT devices.
  • Subscriber Profile Updates: Streaming changes in service plans, device upgrades, or user preferences.
  • Location-Aware Services Data: Powering geofencing, smart city applications, and targeted marketing efforts.
  • Churn Prediction Models: Scoring customer retention risks based on usage behavior and network experience.
  • Network Capacity and Traffic Forecasts: Helping optimize resource allocation and investment planning.
  • Policy Compliance Monitoring: Ensuring real-time validation of internal and external SLAs, privacy agreements, and regulatory requirements.

These data products can be offered via APIs, secure topics, or integrated into partner platforms for direct consumption.
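
For consumption via secure topics, a partner application is, at its core, a plain Kafka consumer whose credentials are scoped to its contracted data products. A minimal sketch with illustrative names, assuming string-serialized records (authentication settings omitted for brevity):

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class MvnoDataProductConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // illustrative address
        props.put("group.id", "mvno-partner-a");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("network-quality-data-product"));
            while (true) {
                // Poll the curated data product topic and process each event.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
            }
        }
    }
}
```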

How Each Data Consumer Gains Strategic Value

Real-time data streaming empowers each data consumer within the Telco ecosystem to achieve specific business outcomes, drive operational excellence, and unlock new growth opportunities based on continuous, trusted insights.

Internal Telco Divisions

Real-time insights into network behavior allow proactive incident management and customer support. Marketing teams optimize campaigns based on live subscriber data, while finance teams minimize revenue leakage by tracking billing and usage patterns instantly.

MVNO Partners

Access to live network quality indicators helps MVNOs improve customer satisfaction and loyalty. Real-time fraud monitoring protects against financial losses. Tailored subscriber insights enable MVNOs to offer personalized plans and upsells based on actual usage.

IoT Platform Services

Large-scale telemetry streaming enables better device management, predictive maintenance, and operational automation. Real-time geolocation data improves logistics, fleet management, and smart infrastructure performance. Event-driven alerts help detect and resolve device malfunctions rapidly.

Enterprise Customers

Private 5G networks and managed services depend on live analytics to meet SLA obligations. Enterprises integrate real-time network telemetry into their own systems for smarter decision-making. Data-driven optimizations ensure higher uptime, better resource utilization, and enhanced customer experiences.

Building a Trusted Data Ecosystem for Telcos with Real-Time Streaming and Hybrid Cloud

Real-time data sharing is no longer a luxury for Telcos — it is a competitive necessity. A successful platform must balance openness with control, ensuring that every data exchange respects privacy, governance, and commercial boundaries.

Hybrid cloud architectures play a critical role in this evolution. They enable Telcos to process, govern, and share real-time data across on-premises infrastructure, edge environments, and public clouds seamlessly. By combining the flexibility of cloud-native services with the security and performance of on-premises systems, hybrid cloud ensures that data remains accessible, scalable, cost-efficient and compliant wherever it is needed.

Hybrid 5G Telco Architecture with Data Streaming with AWS Cloud and Confluent Edge and Cloud

By deploying scalable data streaming solutions across a hybrid cloud environment, Telcos enable secure, real-time data sharing with MVNOs, IoT platforms, enterprise customers, and internal business units. This empowers critical use cases such as dynamic quality of service monitoring, real-time fraud detection, customer behavior analytics, predictive maintenance for connected devices, and SLA compliance reporting — all without compromising performance or regulatory requirements.

The future of telecommunications belongs to those who implement real-time data streaming and controlled data sharing — turning raw events into strategic advantage faster, more securely, and more effectively than ever before.

How do you share data in your organization? Do you already leverage data streaming or still operate in batch mode? Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

Multi-Cloud Replication in Real-Time with Apache Kafka and Cluster Linking
https://www.kai-waehner.de/blog/2024/08/14/multi-cloud-replication-in-real-time-with-apache-kafka-and-cluster-linking-between-aws-azure/ (14 Aug 2024)

Multiple Apache Kafka clusters are the norm; not an exception anymore. Hybrid integration and multi-cloud replication for migration or disaster recovery are common use cases. This blog post explores a real-world success story from financial services around the transition of a large traditional bank from on-premise data centers into the public cloud for multi-cloud data sharing between AWS and Azure.

Multi-Cloud Replication in Real-Time with Apache Kafka and Cluster Linking

What is Multi-Cloud and How Does Apache Kafka Help?

Multi-cloud refers to the use of multiple cloud computing services from different providers in a single heterogeneous IT environment. This approach enhances flexibility, performance, and reliability while avoiding vendor lock-in.

Here are the key benefits of multi-cloud:

  • Avoidance of Vendor Lock-In: By utilizing multiple cloud providers, organizations can avoid dependency on a single vendor, reducing the risk associated with vendor-specific outages and price changes.
  • Optimization of Performance and Cost: Different cloud providers offer varying strengths, pricing models, and geographic availability. Multi-cloud strategies enable organizations to choose the best provider for each workload to optimize performance and cost.
  • Enhanced Redundancy and Resilience: Multi-cloud setups can provide higher availability and disaster recovery capabilities by distributing workloads across multiple cloud environments, thus reducing the impact of localized outages.
  • Regulatory and Compliance Benefits: Some industries and regions have specific regulatory requirements that may be easier to meet using a multi-cloud approach, ensuring data residency and compliance.

The real-time capabilities of Apache Kafka are a perfect match for multi-cloud architectures. Information is replicated and synchronized directly after the creation of an event. Apache Kafka’s combination of high throughput, low latency, and durability provides strong data consistency guarantees across multiple cloud providers like AWS, Azure, GCP, and Alibaba, no matter whether the data sources and sinks are real-time, batch, or API-driven (request-response).

One Apache Kafka Cluster Does NOT Fit All Use Cases

Organizations require multiple Kafka cluster strategies for various use cases: Hybrid integration, aggregation, migration and disaster recovery. I explored the architecture options and trade-offs in a dedicated blog post: “Apache Kafka Cluster Type Deployment Strategies“.

One Apache Kafka Cluster Type Does NOT Fit All Use Cases

Multi-cloud is a special case with even higher challenges regarding security, cost, and latency. Nevertheless, all larger organizations have multi-cloud infrastructure and integration needs. Let’s explore the multi-cloud journey of Fidelity Investments and how data streaming with Apache Kafka helps.

Fidelity’s Hybrid Cloud Data Streaming Journey

Fidelity Investments is a leading financial services company that provides a wide range of investment management, retirement planning, brokerage, and wealth management services to individuals and institutions. Founded in 1946, Fidelity has grown to become one of the largest asset managers in the world, with trillions of dollars in assets under management. The company is known for its comprehensive research tools, innovative technology platforms, and commitment to customer service, helping clients achieve their financial goals.

Fidelity Investments presented at Kafka Summit events about their data streaming journey transitioning from on-premise to hybrid cloud infrastructure.

Fidelity Cloud Journey in Banking FinServ
Source: Fidelity Investments

Fidelity’s Event Streaming Platform

Fidelity Investments built a streaming platform that hosts business application events for event driven architectures and integration with other business applications through events and streams.

Fidelity Event Streaming Platform Architecture with Apache Kafka and Confluent
Source: Fidelity Investments

A few impressive numbers from Fidelity’s event streaming platform infrastructure (presented at Kafka Summit London in 2023):

  • 4 years in public cloud
  • 16k+ producer and consumer applications
  • 6B+ events per day
  • 72+ self-service APIs
  • 300+ observability metrics

One of the first critical use cases was the integration and offloading from IBM z Systems mainframe via IBM MQ, Kafka Connect (deployed on the mainframe) and Confluent Platform.

For architectures and best practices around mainframe modernization, check out my article “Mainframe Integration, Offloading and Replacement with Apache Kafka“.

Fidelity’s Cloud Journey: From Point-to-Point to Decoupling Applications with Kafka and Cluster Linking between AWS and Azure

The Kafka Summit talk “Multi-Cloud Data Sharing: Make the Data Move for you Across CSPs using Cluster Linking” explored Fidelity Investment’s transition to the cloud.

Fidelity Investments designed its multi-cloud event streaming platform to enable applications residing in different cloud service providers to seamlessly share data between them.

BEFORE: Point-to-Point Multi-Cloud Replication

Many architects call this the spaghetti integration architecture. All applications do a point-to-point connection to each other application:

Fidelity Point to Point Integration Across Multi Cloud AWS Azure
Source: Fidelity Investments

This setup is costly, error-prone and hard to maintain or innovate.

AFTER: Apache Kafka and Cluster Linking for Real-Time Data Sharing and Decoupled Applications

One of Apache Kafka’s unique values is truly decoupling between applications. The event-based durable commit log guarantees data consistency but also allows choosing the right technology or API in each business unit.

Replication between the Kafka clusters running in different cloud infrastructures and regions on AWS and Azure is implemented with Confluent Cluster Linking.

Fidelity Data Sharing with Apache Kafka and Confluent Cluster Linking
Source: Fidelity Investments

Confluent Cluster Linking plays a crucial role in this design for real-time replication between Kafka clusters. It uses the Kafka protocol for replication, providing all the benefits of Kafka without the additional infrastructure and operations overhead of tools like MirrorMaker. Using the Kafka protocol for multi-cloud replication also reduces network cost significantly because it avoids many of the translation and compression tasks MirrorMaker requires.
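
As a rough operational sketch (link and topic names are illustrative, and the exact tool names and flags vary by Confluent Platform version): a cluster link is created on the destination cluster pointing at the source, and mirror topics then become byte-for-byte, read-only copies replicated over the Kafka protocol.

```bash
# On the destination cluster (e.g., Azure): define where the source lives.
cat > link.properties <<EOF
bootstrap.servers=source-cluster-on-aws:9092
EOF

# Create the cluster link on the destination cluster.
kafka-cluster-links --bootstrap-server destination-cluster-on-azure:9092 \
  --create --link aws-to-azure --config-file link.properties

# Create a mirror topic that continuously replicates the source topic.
kafka-mirrors --bootstrap-server destination-cluster-on-azure:9092 \
  --create --mirror-topic payments --link aws-to-azure
```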

Fidelity’s Multi-Cloud Requirements: Data Ownership, Data Contracts and Self-Service API

Besides reliable real-time replication across multi-cloud environments, other important aspects of Fidelity Investments’ multi-cloud Kafka enterprise architecture include:

  • Data ownership in a multi-cloud environment
  • Schema Registry to provide common data contracts (often called data products in a data mesh architecture) across Kafka clusters in different cloud providers
  • Self-service management API plane allowing teams to manage their multi-cloud topic replications with as little as a single configuration change.

Transition to Cloud with Data Streaming Across Industries

Multi-cloud use cases include data integration, migration, aggregation, and disaster recovery scenarios across the financial services, healthcare, and telecom sectors.

Even if you do not plan a multi-cloud infrastructure because you focus on a single cloud provider across regions, you can be sure: the next merger and acquisition (M&A) will come… 🙂 Multi-cloud scenarios are not an exception, but the norm in larger organizations.

Do you already deploy across multiple cloud providers? What are the use cases? How do you efficiently and reliably integrate these environments? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka Cluster Type Deployment Strategies
https://www.kai-waehner.de/blog/2024/07/29/apache-kafka-cluster-type-deployment-strategies/ (29 Jul 2024)

Organizations start their data streaming adoption with a single Apache Kafka cluster to deploy the first use cases. The need for group-wide data governance and security but different SLAs, latency, and infrastructure requirements introduce new Kafka clusters. Multiple Kafka clusters are the norm, not an exception. Use cases include hybrid integration, aggregation, migration, and disaster recovery. This blog post explores real-world success stories and cluster strategies for different Kafka deployments across industries.

One Apache Kafka Cluster Type Does NOT Fit All Use Cases

Apache Kafka – The De Facto Standard for Event-Driven Architectures and Data Streaming

Apache Kafka is an open-source, distributed event streaming platform designed for high-throughput, low-latency data processing. It allows you to publish, subscribe to, store, and process streams of records in real time.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Kafka serves as a popular choice for building real-time data pipelines and streaming applications. The Kafka protocol became the de facto standard for event streaming across various frameworks, solutions, and cloud services. It supports operational and analytical workloads with features like persistent storage, scalability, and fault tolerance. Kafka includes components like Kafka Connect for integration and Kafka Streams for stream processing, making it a versatile tool for various data-driven use cases.

While Kafka is famous for real-time use cases, many projects leverage the data streaming platform for data consistency across the entire enterprise architecture, including databases, data lakes, legacy systems, Open APIs, and cloud-native applications.

Different Apache Kafka Cluster Types

Kafka is a distributed system. A production setup usually requires at least four brokers. Hence, most people automatically assume that all you need is a single distributed cluster you scale up when you add throughput and use cases. This is not wrong in the beginning. But…

One Kafka cluster is NOT the right answer for every use case. Various characteristics influence the architecture of a Kafka cluster:

  • Availability: Zero downtime? 99.99% uptime SLA? Non-critical analytics?
  • Latency: End-to-end processing in <100ms (including processing)? 10-minute end-to-end data warehouse pipeline? Time travel for re-processing historical events?
  • Cost: Value vs. cost? Total Cost of Ownership (TCO) matters! For instance, in the public cloud, networking can be up to 80% of the total Kafka cost!
  • Security and Data Privacy: Data privacy (PCI data, GDPR, etc.)? Data governance and compliance? End-to-end encryption on the attribute level? Bring your own key? Public access and data sharing? Air-gapped edge environment?
  • Throughput and Data Size: Critical transactions (typically low volume)? Big data feeds (clickstream, IoT sensors, security logs, etc.)?

Related topics like on-premise vs. public cloud, regional vs. global, and many other requirements also affect the Kafka architecture.

Apache Kafka Cluster Strategies and Architectures

A single Kafka cluster is often the right starting point for your data streaming journey. It can onboard multiple use cases from different business domains and process gigabytes per second (if operated and scaled the right way). However, depending on your project requirements, you need an enterprise architecture with multiple Kafka clusters. Here are a few common examples:

Apache Kafka Cluster Deployment and Replication Strategies

Bridging Hybrid Kafka Clusters

These options can be combined. For instance, a single broker at the edge typically replicates some curated data to a remote data center. And hybrid clusters have very different architectures depending on how they are bridged: connections over the public internet, private link, VPC peering, transit gateway, and so on.

Having seen the development of Confluent Cloud over the years, I totally underestimated how much engineering time needs to be spent on security and connectivity. However, missing security bridges are the main blocker for the adoption of a Kafka cloud service. So, there is no way around providing various security bridges between Kafka clusters beyond just public internet.

There are even use cases where organizations need to replicate data from the data center to the cloud, but the cloud service is NOT allowed to initiate the connection. Confluent built a specific feature, “source-initiated link”, for such security requirements: the source (i.e., the on-premise Kafka cluster) always initiates the connection, even though the cloud Kafka cluster consumes the data:

Source-Initiated Kafka Cluster Link for Kafka Cluster for Security and Compliance
Source: Confluent

As you see, it gets complex quickly. Find the right experts to help you from the beginning, not after you have already deployed the first clusters and applications.

A long time ago, I described the architecture patterns for distributed, hybrid, edge, and global Apache Kafka deployments in a detailed presentation. Look at that slide deck and video recording for more details about the deployment options and trade-offs.

RPO vs. RTO = Data Loss vs. Downtime

RPO and RTO are two critical KPIs you need to discuss before deciding for a Kafka cluster strategy:

  • RPO (Recovery Point Objective) is the maximum acceptable amount of data loss measured in time, indicating how frequently backups should occur to minimize data loss.
  • RTO (Recovery Time Objective) is the maximum acceptable duration of time it takes to restore business operations after a disruption. Together, they help organizations plan their data backup and disaster recovery strategies to balance cost and operational impact.

While people often start with the goal of RPO = 0 and RTO = 0, they quickly realize how hard (but not impossible) this is to achieve. You need to decide how much data you can afford to lose in a disaster, and you need a disaster recovery plan if disaster strikes. The legal and compliance teams have to tell you whether it is okay to lose a few data sets in case of a disaster or not. These and many other challenges need to be discussed when evaluating your Kafka cluster strategy.

The replication between Kafka clusters with tools like MirrorMaker or Cluster Linking is asynchronous, meaning RPO > 0. Only a stretched Kafka cluster provides RPO = 0.
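
To make the asynchronous option concrete, here is a minimal MirrorMaker 2 configuration sketch for one-way replication from a "primary" to a "dr" cluster. Cluster aliases, bootstrap servers, and topic patterns are illustrative assumptions. Because replication happens after the fact, the copy always lags slightly behind the source, which is exactly why RPO > 0.

```properties
# mm2.properties - run with: connect-mirror-maker.sh mm2.properties
clusters = primary, dr
primary.bootstrap.servers = primary-broker:9092
dr.bootstrap.servers = dr-broker:9092

# Replicate selected topics one way, from primary to dr.
primary->dr.enabled = true
primary->dr.topics = payments.*

# Replication factors for mirrored data and MM2-internal topics.
replication.factor = 3
checkpoints.topic.replication.factor = 3
heartbeats.topic.replication.factor = 3
offset-syncs.topic.replication.factor = 3
```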

Stretched Kafka Cluster – Zero Data Loss with Synchronous Replication across Data Centers

Most deployments with multiple Kafka clusters use asynchronous replication across data centers or clouds via tools like MirrorMaker or Confluent Cluster Linking. This is good enough for most use cases. But in case of a disaster, you lose a few messages. The RPO is > 0.

A stretched Kafka cluster deploys Kafka brokers of ONE SINGLE CLUSTER across three data centers. The replication is synchronous (as this is how Kafka replicates data within one cluster) and guarantees zero data loss (RPO = 0) – even in the case of a disaster!
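
A sketch of the configuration behind that guarantee (rack identifiers are illustrative): Kafka's rack awareness spreads the replicas of each partition across the three sites, and combining acks=all with min.insync.replicas ensures a write is only acknowledged once at least two sites have persisted it.

```properties
# Broker config (server.properties): pin each broker to its site.
broker.rack=dc-west

# Topic-level setting (replication factor 3 is chosen at topic creation,
# and the rack-aware assigner places one replica per site):
min.insync.replicas=2

# Producer config: wait for all in-sync replicas to acknowledge, so a
# committed write survives the loss of an entire data center (RPO = 0).
acks=all
```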

Why shouldn’t you always do stretched clusters?

  • Low latency (<~50ms) and stable connection required between data centers
  • Three (!) data centers are needed, two is not enough as a majority (quorum) must acknowledge writes and reads to ensure the system’s reliability
  • Hard to set up, operate, and monitor – much harder than a cluster running in one data center
  • Cost vs. value is not worth it in many use cases – during a real disaster, most organizations and use cases have bigger problems than losing a few messages (even if it is critical data like a payment or order).

To be clear: In the public cloud, a region usually has three data centers (= availability zones). Hence, in the cloud, it depends on your SLAs if one cloud region counts as a stretched cluster or not. Most SaaS Kafka offerings deploy in a stretched cluster here. However, many compliance scenarios do NOT see a Kafka cluster in one cloud region as good enough for guaranteeing SLAs and business continuity if a disaster strikes.

Confluent built a dedicated product to solve (some of) these challenges: Multi-Region Clusters (MRC). It provides capabilities for synchronous and asynchronous replication within a stretched Kafka cluster.

Multi-Region Stretched Kafka Cluster in FinServ (MRC)

For example, in a financial services scenario, MRC replicates low-volume critical transactions synchronously, but high-volume logs asynchronously:

  • ‘Payment’ transactions enter from us-east and us-west with fully synchronous replication
  • ‘Log’ and ‘Location’ information in the same cluster uses asynchronous replication, optimized for latency
  • Automated disaster recovery (zero downtime, zero data loss)

More details about stretched Kafka clusters vs. active-active / active-passive replication between two Kafka clusters in my global Kafka presentation.

Pricing of Kafka Cloud Offerings (vs. Self-Managed)

The above sections explain why you need to consider different Kafka architectures depending on your project requirements. Self-managed Kafka clusters can be configured the way you need. In the public cloud, fully managed offerings look different (the same way as any other fully managed SaaS). Pricing is different because SaaS vendors need to configure reasonable limits. The vendor has to provide specific SLAs.

The data streaming landscape includes various Kafka cloud offerings. Here is an example of Confluent’s current cloud offerings, including multi-tenant and dedicated environments with different SLAs, security features, and cost models.

Confluent Cloud Cluster Types SLA and Pricing
Source: Confluent

Make sure to evaluate and understand the various cluster types from different vendors available in the public cloud, including TCO, provided uptime SLAs, replication costs across regions or cloud providers, and so on. The gaps and limitations are often intentionally hidden in the details.

For instance, if you use Amazon Managed Streaming for Apache Kafka (MSK), you should be aware that the terms and conditions tell you that “The service commitment does not apply to any unavailability, suspension or termination … caused by the underlying Apache Kafka or Apache Zookeeper engine software that leads to request failures”.

But pricing and support SLAs are just one critical piece of such a comparison. There are lots of “build vs. buy” decisions you have to make as part of evaluating a data streaming platform, as I pointed out in my detailed article comparing Confluent to Amazon MSK Serverless.

Kafka Storage – Tiered Storage and Iceberg Table Format to Store Data Only Once

Apache Kafka added Tiered Storage to separate compute and storage. The capability enables more scalable, reliable, and cost-efficient enterprise architectures. Tiered Storage for Kafka enables a new Kafka cluster type: Storing Petabytes of data in the Kafka commit log in a cost-efficient way (like in your data lake) with timestamps and guaranteed ordering to travel back in time for re-processing historical data. KOR Financial is a nice example of using Apache Kafka as a database for long-term persistence.
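
As a sketch of how Tiered Storage is enabled (assuming Kafka 3.6+ with a RemoteStorageManager plugin, such as an S3-backed implementation, already installed on the brokers; topic name and retention values are illustrative):

```bash
# Broker side (server.properties): enable the remote storage subsystem.
#   remote.log.storage.system.enable=true

# Topic side: offload older segments to object storage, keep a small
# local hot set (1 hour), and retain the log forever for time travel.
kafka-configs --bootstrap-server broker:9092 --alter \
  --entity-type topics --entity-name telemetry \
  --add-config remote.storage.enable=true,local.retention.ms=3600000,retention.ms=-1
```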

Kafka enables a Shift Left Architecture to store data only once for operational and analytical datasets:

Shift Left Architecture with Apache Kafka Flink and Iceberg

With this in mind, think again about the use cases I described above for multiple Kafka clusters. Should you still replicate data in batch at rest in the database, data lake, or lakehouse from one data center or cloud region to another? No. You should synchronize data in real-time, store the data once (usually in an object store like Amazon S3), and then connect all analytical engines like Snowflake, Databricks, Amazon Athena, Google Cloud BigQuery, and so on to this standard table format.

Learn more about the unification of operational and analytical data in my article “Apache Iceberg – The Open Table Format for Lakehouse AND Data Streaming“.

Real-World Success Stories for Multiple Kafka Clusters

Most organizations have multiple Kafka clusters. This section explores four success stories across different industries:

  • Paypal (Financial Services) – US: Instant payments, fraud prevention.
  • JioCinema (Telco/Media) – APAC: Data integration, clickstream analytics, advertisement, personalization.
  • Audi (Automotive/Manufacturing) – EMEA: Connected cars with critical and analytical requirements.
  • New Relic (Software/Cloud) – US: Observability and application performance management (APM) across the world.

Paypal – Separation by Security Zone

PayPal is a digital payment platform that allows users to send and receive money online securely and conveniently around the world in real time. This requires a scalable, secure and compliant Kafka infrastructure.

During Black Friday 2022, Kafka traffic volume peaked at about 1.3 trillion messages daily! At present, PayPal has 85+ Kafka clusters, and every holiday season it flexes up its Kafka infrastructure to handle the traffic surge. The Kafka platform continues to scale seamlessly to support this traffic growth without any impact on the business.

Today, PayPal’s Kafka fleet consists of over 1,500 brokers that host over 20,000 topics. The events are replicated among the clusters, offering 99.99% availability.

Kafka cluster deployments are separated into different security zones within a data center:

Paypal Multiple Kafka Cluster Deployments Separated by Security Zones in the Data Center
Source: Paypal

The Kafka clusters are deployed across these security zones, based on data classification and business requirements. Real-time replication with tools such as MirrorMaker (in this example, running on Kafka Connect infrastructure) or Confluent Cluster Linking (a simpler and less error-prone approach that uses the Kafka protocol directly for replication) mirrors the data across the data centers, which helps with disaster recovery and enables communication between security zones.

JioCinema – Separation by Use Case and SLA

JioCinema is a rapidly growing video streaming platform in India. The telco OTT service is known for its expansive content offerings, including live sports like the Indian Premier League (IPL) for cricket, a newly launched Anime Hub, and comprehensive plans for covering major events like the Paris 2024 Olympics.
The data architecture leverages Apache Kafka, Flink, and Spark for data processing, as presented at Kafka Summit India 2024 in Bangalore:

JioCinema Telco Cloud Enterprise Architecture with Apache Kafka Spark Flink
Source: JioCinema

Data streaming plays a pivotal role in various use cases to transform user experiences and content delivery. Over ten million messages per second enhance analytics, user insights, and content delivery mechanisms.

JioCinema’s use cases include:

  • Inter Service Communication
  • Clickstream / Analytics
  • Ad Tracker
  • Machine Learning and Personalization

Kushal Khandelwal, Head of Data Platform, Analytics, and Consumption at JioCinema, explained that not all data is equal and the priorities and SLAs differ per use case:

JioCinema - Viacom18 - One Kafka Cluster does NOT fit All Use Cases Uptime SLAs and Cost
Source: JioCinema

Data streaming is a journey. Like so many other organizations worldwide, JioCinema started with one large Kafka cluster using 1000+ Kafka Topics and 100,000+ Kafka Partitions for various use cases. Over time, a separation of concerns regarding use cases and SLAs developed into multiple Kafka clusters:

JioCinema Journey of Kafka Clusters from One to Many with different SLAs and Cost
Source: JioCinema

The success story of JioCinema shows the common evolution of a data streaming organization. Let’s now explore another example where two very different Kafka clusters were deployed from the beginning for one use case.

Audi – Operations vs. Analytics for Connected Cars

The car manufacturer Audi provides connected cars featuring advanced technology that integrates internet connectivity and intelligent systems. Audi’s cars enable real-time navigation, remote diagnostics, and enhanced in-car entertainment. These vehicles are equipped with Audi Connect services. Features include emergency calls, online traffic information, and integration with smart home devices, to enhance convenience and safety for drivers.

Audi Data Collector for Mobility Services Built with Apache Kafka
Source: Audi

Audi presented their connected car architecture in the keynote of Kafka Summit in 2018. The Audi enterprise architecture relies on two Kafka clusters with very different SLAs and use cases.

Audi Connected Car Analytics Architecture with Kafka Spark Flink MQTT
Source: Audi

The Data Ingestion Kafka cluster is very critical. It needs to run 24/7 at scale. It provides last-mile connectivity to millions of cars using Kafka and MQTT. Backchannels from the IT side to the vehicle help with service communication and over-the-air updates (OTA).

ACDC Cloud is the analytics Kafka cluster of Audi’s connected car architecture. The cluster is the foundation of many analytical workloads. These process enormous volumes of IoT and log data at scale with batch processing frameworks, like Apache Spark.

This architecture was already presented in 2018. Audi’s slogan “Progress through Technology” shows how the company applied new technology for innovation long before most car manufacturers deployed similar scenarios. All sensor data from the connected cars is processed in real time and stored for historical analysis and reporting.

New Relic – Worldwide Multi-Cloud Observability

New Relic is a cloud-based observability platform that provides real-time performance monitoring and analytics for applications and infrastructure to customers around the world.

Andrew Hartnett, VP of Software Engineering, at New Relic explains how data streaming is crucial for the entire business model of New Relic:

“Kafka is our central nervous system. It is a part of everything that we do. Most services across 110 different engineering teams with hundreds of services touch Kafka in some way, shape, or form in our company, so it really is mission-critical. What we were looking for is the ability to grow, and Confluent Cloud provided that.”

New Relic ingested up to 7 billion data points per minute and was on track to ingest 2.5 exabytes of data in 2023. As New Relic expands its multi-cloud strategies, teams will use Confluent Cloud for a single pane of glass view across all environments.

“New Relic is multi-cloud. We want to be where our customers are. We want to be in those same environments, in those same regions, and we wanted to have our Kafka there with us,” says Hartnett in a Confluent case study.

Multiple Kafka Clusters are the Norm; Not an Exception!

Event-driven architectures and stream processing have existed for decades. The adoption grows with open source frameworks like Apache Kafka and Flink in combination with fully managed cloud services. More and more organizations struggle with their Kafka scale. Enterprise-wide data governance, center of excellence, automation of deployment and operations, and enterprise architecture best practices help to successfully provide data streaming with multiple Kafka clusters for independent or collaborating business domains.

Multiple Kafka clusters are the norm, not an exception. Use cases such as hybrid integration, disaster recovery, migration or aggregation enable real-time data streaming everywhere with the needed SLAs.

What does your enterprise architecture look like? How many Kafka clusters do you have? And how do you decide about data governance, separation of concerns, multi-tenancy, security, and similar challenges in your data streaming organization? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

ARM CPU for Cost-Effective Apache Kafka at the Edge and Cloud
https://www.kai-waehner.de/blog/2024/02/22/apache-kafka-arm-cpu-edge-hybrid-cloud/ (22 Feb 2024)

ARM CPUs often outperform x86 CPUs in scenarios requiring high energy efficiency and lower power consumption. These characteristics make ARM preferred for edge and cloud environments. This blog post discusses the benefits of using Apache Kafka alongside ARM CPUs for real-time data processing in edge and hybrid cloud setups, highlighting energy-efficiency, cost-effectiveness, and versatility. A wide range of use cases are explored across industries, including manufacturing, retail, smart cities and telco.

Data Streaming with Apache Kafka and ARM CPU at the Edge and in the Cloud

Apache Kafka at the Edge and Hybrid Cloud

Apache Kafka is a distributed event streaming platform that enables building real-time streaming data pipelines and applications by providing capabilities for publishing, subscribing to, storing, and processing streams of records in a scalable and fault-tolerant way.

Various examples exist for Kafka deployments at the edge. These use cases relate to several of the above categories and requirements, such as a low hardware footprint, disconnected offline processing, hundreds of locations, and hybrid architectures.

Data Streaming Hybrid Edge Multi Cloud for Manufacturing

Use Cases for Apache Kafka at the Edge

I have worked with enterprises across industries and the globe on the following scenarios:

  • Public Sector: Local administration in each city, smart city projects including public transportation, traffic management, integration of various connected car platforms from different carmakers, cybersecurity (including IoT use cases such as capturing and processing camera images)
  • Transportation / Logistics / Railway / Aviation: Track and trace, Kafka in the trains for offline and local processing / storage, traveller information (delayed or canceled flight / train / bus), real-time loyalty platforms (class upgrade, lounge access)
  • Manufacturing (Automotive, Aerospace, Semiconductors, Chemical, Food, and others): IoT aftermarket customer services, OEM in machines and vehicles, embedding into standard software such as ERP or MES systems, cybersecurity, a digital twin of devices/machines/production lines/processes, production line monitoring in factories for predictive maintenance/quality control/production efficiency, operations dashboards and line wellness (on-site for the plant manager, and aggregated global KPIs for executive management), track&trace and geofencing on the shop floor
  • Energy / Utility / Oil & Gas: Smart home, smart buildings, smart meters, monitoring of remote machines (e.g., for drilling, windmills, mining), pipeline and refinery operations (e.g., predictive failure or anomaly detection)
  • Telecommunications / Media: OSS real-time monitoring/problem analysis/metrics reporting/root cause analysis/action response of the network devices and infrastructure (routers, switches, other network devices), BSS customer experience and OTT services (mobile app integration for millions of users), 5G edge (e.g., street sensors)
  • Healthcare: Track & trace in the hospital, remote monitoring, machine sensor analytics
  • Retailing / Food / Restaurants / Banking: Customer communication, cross-/up-selling, loyalty system, payments in retail stores, perpetual inventory, Point-of-Sale (PoS) integration for (local) payments and (remote) CRM integration, EFTPOS (Electronic funds transfer at point of sale)

Benefits for Kafka at the Edge AND in the Cloud

Deploying the same technology in hybrid environments is not a new idea. Project teams see tremendous benefits when using Kafka at the edge and in the data center or cloud:

  • Same APIs, concepts, development tools and testing
  • Same architecture for streaming, storing, processing and connecting systems, even if at very different scale
  • Real-time synchronization between multiple environments included out-of-the-box via the Kafka protocol

Hybrid Edge to Cloud Architecture for Low Latency with 5G Kafka and AWS Wavelength

Let’s explore how ARM CPUs fit into this discussion.

What is ARM CPU?

An ARM CPU refers to a family of CPUs based on the Advanced RISC Machine (ARM) architecture, which is a type of Reduced Instruction Set Computing (RISC) architecture. ARM CPUs have a reputation for high performance, power efficiency, and low cost. These characteristics make them particularly popular in mobile devices such as smartphones and tablets, and in an increasingly wide range of other devices like IoT (Internet of Things) gadgets, servers, and even desktop computers.

The ARM architecture performs operations with a smaller number of computer instructions, allowing it to achieve high performance with lower power consumption compared to more complex instruction set computing (CISC) architectures like x86 used by Intel and AMD CPUs. This efficiency is a key advantage for battery-powered devices, where energy conservation is critical.

ARM Holdings, the company behind the ARM architecture, does not manufacture CPUs but licenses the architecture to other companies. These companies can then implement their own ARM-based processors, potentially customizing them for specific needs. This licensing model has led to a wide adoption of ARM processors across various segments of the technology industry.

ARM32 vs. ARM64

ARM architectures come in different versions, primarily distinguished by their instruction set architectures and addressing capabilities. The most commonly referenced are ARMv7 and ARMv8 (also called AArch64), which correspond to 32-bit and 64-bit processing capabilities, respectively.

Newer hardware for industrial PCs and home computers incorporates ARMv8 (64-bit). It is the foundation for smartphones, tablets, servers, and processors like Apple’s A-series chips in iPhones and iPads. Even the cloud providers use the ARM architecture to build new processors for cloud computing, like Amazon’s Graviton. ARMv8 processors can run both 32-bit and 64-bit applications, offering greater versatility and performance.

Key Features and Benefits of ARM CPUs

The key features and benefits of ARM CPUs include:

  • Power Efficiency: Their design allows for significant power savings, extending battery life in portable devices.
  • Performance: While historically seen as less powerful than their x86 counterparts, modern ARM processors offer competitive performance, especially in multi-core configurations.
  • Customization: Companies can license the ARM architecture and customize their own chips, allowing for optimized processors that meet specific product requirements.
  • Ecosystem: A broad adoption across mobile, embedded, and increasingly in server and desktop markets ensures a robust ecosystem of software and development tools.

ARM CPUs are central to the development of mobile computing and are becoming more important in other areas, including edge computing, data centers, and as part of the shift towards more energy-efficient computing solutions.

Why ARM CPUs at the Edge (e.g., for Industrial IoT)?

ARM architecture is favored for edge computing, including Industrial IoT. It provides high power efficiency and performance within compact form factors. These characteristics ensure devices can handle compute-intensive tasks locally. Only relevant data is transmitted to the cloud, which saves bandwidth and decreases latency.

The efficiency of ARM CPUs is crucial for industrial applications where real-time processing and long battery life are essential. ARM’s versatility and low power consumption make it ideal for the diverse needs of edge computing in various environments.

For instance, in manufacturing, ARM-powered sensors on machines enable predictive maintenance by monitoring conditions like vibration and temperature. These sensors process data locally, offering real-time alerts on potential failures, reducing downtime, and saving costs. ARM’s efficiency supports widespread deployment, making it ideal for continuous, autonomous monitoring in industrial environments.

Why ARM in the Cloud?

ARM’s efficiency and performance advantages are driving its adoption in cloud computing. ARM-based processors, like Amazon’s AWS Graviton, offer an attractive mix of high performance and lower power consumption compared to traditional x86 CPUs. This efficiency translates into cost savings and reduced environmental impact for cloud service providers and their customers.

AWS Graviton, specifically designed for cloud workloads, exemplifies how ARM architecture can optimize operations in data centers, enhancing the performance of web servers, containerized applications, and microservices at a lower cost. This shift towards ARM in the cloud represents a significant move towards more energy-efficient and cost-effective data center operations.

Apache Kafka on ARM – A Match Made in Heaven for Edge and Cloud Workloads

Using ARM architecture together with Apache Kafka, a distributed streaming platform, offers several advantages, especially in scenarios that demand high throughput, scalability, and energy efficiency.

  1. Energy Efficiency and Cost-Effectiveness: ARM processors are known for their low power consumption, which makes them cost-effective for running distributed systems like Kafka. Deploying Kafka on ARM-based servers can reduce operational costs, particularly in large-scale environments where energy consumption can significantly affect the budget.
  2. Scalability: Kafka handles large volumes of data and high throughput, characteristics that align well with the scalability of ARM processors in cloud environments. ARM’s efficiency enables scaling out Kafka clusters more economically, allowing for the processing of streaming data in real-time without incurring high energy or hardware costs.
  3. Edge Computing: Kafka is a common choice for real-time data processing and aggregation in edge computing scenarios. ARM’s dominance in IoT and edge devices makes it a natural fit for these use cases. Running Kafka on ARM enables efficient data processing closer to the source, reducing latency and bandwidth usage by minimizing the need to send large volumes of data to central data centers.
  4. Eco-Friendly Solutions: With growing environmental concerns, ARM’s energy efficiency contributes to more sustainable computing solutions. Deploying Kafka on ARM can be part of an eco-friendly strategy for organizations looking to minimize their carbon footprint.
  5. Innovative Use Cases: Combining Kafka with ARM opens up new possibilities for innovative applications in IoT, real-time analytics, and mobile applications. The efficiency of ARM allows for cost-effective experimentation and deployment of new services that require real-time data processing and streaming capabilities.

Examples and Case Studies for Kafka at the Edge

Overall, the combination of ARM and Apache Kafka supports the development of efficient, scalable, and sustainable data processing architectures, particularly suited for modern applications that require real-time performance with minimal energy consumption.

Data Processing at the Edge with Kafka in offline and disconnected mode

For several use cases, architectures, and case studies about data streaming at the edge and in hybrid cloud, check out my related articles.

Most of these blog posts are a few years old, but they are as relevant today as at the time of writing. Actually, the official support of ARM CPUs at the edge completely changes the conversation about the challenges and solutions of deploying Kafka on edge infrastructure. The deployment of Kafka at the edge was never easier: if you buy a new Industrial PC (IPC) today, it will have enough hardware power to easily run Kafka and its ecosystem for data integration and stream processing.
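
As a small sketch of how low the entry barrier has become (the version tag is illustrative; the official apache/kafka Docker image, available since Kafka 3.7, is published as a multi-arch image, so the same command pulls the arm64 variant on ARM hardware):

```bash
# Start a single-node Kafka broker in KRaft mode on an ARM64 edge device.
docker run -d --name edge-kafka -p 9092:9092 apache/kafka:3.7.0

# Verify which architecture variant was pulled on the device.
docker image inspect apache/kafka:3.7.0 --format '{{.Architecture}}'
```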

Confluent Platform on ARM Infrastructure for Edge Deployments

Confluent Platform is Confluent’s data streaming platform for self-managed deployments of Apache Kafka. Most deployments operate in a traditional data center. However, this is more and more shifting to deploy at the edge, i.e., outside of a data center, too.

Since version 7.6, Confluent Platform officially supports ARM64 Linux architectures. Confluent Platform’s architecture allows you to run it wherever your IT systems are, across a global footprint. This includes running it as a mission-critical cluster in data centers, but also on edge sites like retail stores, ships, or factories, or as a single broker on edge devices.

Confluent itself recognized the benefits of ARM64 CPUs: The company moved the entire AWS fleet of the fully managed Confluent Cloud to ARM-based images over the past months.

The Confluent Server Broker powered by Apache Kafka enables end-to-end data pipelines:

  • Collect data from any source
  • Persist the data in the event storage with separate compute and storage
  • Process the events with stream processing
  • Share data with downstream applications
  • Replicate selected events across the WAN through the native Kafka protocol via Cluster Linking

Now, you can deploy this in production on low-cost, small-footprint ARM64 architecture infrastructure at the edge and also synchronize with a data center or cloud Kafka cluster.

Kafka + ARM = Cost-Effective and Sustainable

The article outlined the synergistic relationship between Apache Kafka and ARM CPUs. It enables efficient, scalable, and sustainable data processing architectures for edge and hybrid cloud environments.

The adoption of ARM in cloud computing marks a significant shift towards more sustainable and performance-optimized computing solutions. The combination of Kafka and ARM CPUs is poised to drive innovation in real-time analytics, IoT, and mobile applications. A few great examples:

  • AWS Graviton to operate Kafka cost-efficient in the public cloud.
  • Confluent Platform’s compatibility and support for ARM64 architectures at the edge.

The sustainability of energy-efficient ARM CPUs is a perfect segue to the data streaming article “Green Data, Clean Insights: How Kafka and Flink Power ESG Transformations”.

Do you already use ARM processors in your edge or cloud Kafka environment? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post ARM CPU for Cost-Effective Apache Kafka at the Edge and Cloud appeared first on Kai Waehner.

]]>
Data Streaming is not a Race, it is a Journey! https://www.kai-waehner.de/blog/2023/04/13/data-streaming-is-not-a-race-it-is-a-journey/ Thu, 13 Apr 2023 07:15:31 +0000 https://www.kai-waehner.de/?p=5352 Data Streaming is not a race, it is a journey! Event-driven architectures and technologies like Apache Kafka or Apache Flink require a mind shift in architecting, developing, deploying, and monitoring applications. This blog post explores success stories from data streaming journeys across industries, including banking, retail, insurance, manufacturing, healthcare, energy & utilities, and software companies.

The post Data Streaming is not a Race, it is a Journey! appeared first on Kai Waehner.

]]>
Data Streaming is not a race, it is a Journey! Event-driven architectures and technologies like Apache Kafka or Apache Flink require a mind shift in architecting, developing, deploying, and monitoring applications. Legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud setups are the norm, not an exception. This blog post explores success stories from data streaming journeys across industries, including banking, retail, insurance, manufacturing, healthcare, energy & utilities, and software companies.

Data Streaming is not a Race it is a Journey

Data Streaming is a Journey, not a Race!

Confluent’s maturity model is used across thousands of customers to analyze the status quo, deploy real-time infrastructure and applications, and plan for a strategic event-driven architecture to ensure success and flexibility in the future of the multi-year data streaming journey:

Data Streaming Maturity Model
Source: Confluent

The following sections show success stories from various companies across industries that moved through the data streaming journey. Each journey looks different. Each company has different technologies, vendors, strategies, and legal requirements.

Before we begin, I must stress that these journeys are NOT just successful because of the newest technologies like Apache Kafka, Kubernetes, and public cloud providers like AWS, Azure, and GCP. Success comes with a combination of technologies and expertise.

Consulting – Expertise from Software Vendors and System Integrators

All the below success stories combined open source technologies, software vendors, cloud providers, internal business and technical experts, system integrators, and consultants of software vendors.

Long story short: Technology and expertise are required to make your data streaming journey successful.

We not only sell data streaming products and cloud services but also offer advice and essential support. Note that the bundle numbers (1 to 5) in the following diagram are related to the above data streaming maturity curve:

Professional Services along the Data Streaming Journey
Source: Confluent

Other vendors have similar strategies to support you. The same is true for the system integrators. Learn together. Bring people from different companies into the room to solve your business problems.

With this background, let’s look at the fantastic data streaming journeys we heard about at past Kafka Summits, Data in Motion events, Confluent blog posts, or other similar public knowledge-sharing alternatives.

Customer Journeys across Industries powered by Apache Kafka

Apache Kafka is the DE FACTO standard for data streaming. However, each data streaming journey looks different. Don’t underestimate how much you can learn from other industries. These companies might have different legal or compliance requirements, but the overall IT challenges are often very similar.

We cover stories from various industries, including banks, retailers, insurers, manufacturers, healthcare organizations, energy providers, and software companies.

The following data streaming journeys are explored in the below sections:

  1. AO.com: Real-Time Clickstream Analytics
  2. Nordstrom: Event-driven Analytics Platform
  3. Migros: End-to-End Supply Chain with IoT
  4. NordLB: Bank-wide Data Streaming for Core Banking
  5. Raiffeisen Bank International: Strategic Real-Time Data Integration Platform Across 13 Countries
  6. Allianz: Legacy Modernization and Cloud-Native Innovation
  7. Optum (UnitedHealth Group): Self-Service Data Streaming
  8. Intel: Real-Time Cyber Intelligence at Scale with Kafka and Splunk
  9. Bayer: Transition from On-Premise to Multi-Cloud
  10. Tesla: Migration from a Message Broker (RabbitMQ) to Data Streaming (Kafka)
  11. Siemens: Integration between On-Premise SAP and Salesforce CRM in the Cloud
  12. 50Hertz: Transition from Proprietary SCADA to Cloud-Native Industrial IoT

There is no specific order besides industries. If you read the stories, you see that all are different, but you can still learn a lot from them, no matter what industry your business is in.

For more information about data streaming in a specific industry, just search my blog for case studies and architectures. Recent blog posts focus on the state of data streaming in manufacturing, in financial services, and so on.

AO.com – Real-Time Clickstream Analytics

AO.com is an electrical retailer. The hyper-personalized online retail experience turns each customer visit into a one-on-one marketing opportunity. The technical implementation correlates historical customer data with real-time digital signals.

Years ago, Apache Hadoop and Apache Spark referred to this kind of clickstream analytics as the “Hello World” example. AO.com does real-time clickstream analytics powered by data streaming.

AO.com’s journey started with relatively simple data streaming pipelines leveraging the Schema Registry for decoupling and API contracts. Over time, the platform thinking of data streaming added more business value, and operations shifted to the fully managed Confluent Cloud to focus on implementing business applications, not operating infrastructure.

Apache Kafka at AO for Retail and Customer 360 Omnichannel Clickstream Analytics
Source: AO.com

Nordstrom – Event-driven Analytics Platform

Nordstrom is an American luxury department store chain. Meanwhile, more than 50% of revenue comes from online sales. The company built the Nordstrom Analytical Platform (NAP) as the heart of its event-driven architecture. Nordstrom can use a singular event for analytical, functional, operational, and model-building purposes.

As Nordstrom said at Kafka Summit: “If it’s not in NAP, it didn’t happen.”

Nordstrom’s journey started in 1901 with the first stores. The last 25 years brought the retailer into the digital world and online retail. NAP has been a core component since 2017 for real-time analytics. The future is all automated decision-making in real-time.

Nordstrom Analytics Platform with Apache Kafka and Data Lake for Online Retail
Source: Nordstrom

Migros – End-to-End Supply Chain with IoT

Migros is Switzerland’s largest retail company, the largest supermarket chain, and the largest employer. They leverage data streaming with Confluent to distribute master data across many systems.

Migros’ supply chain is optimized with a single data streaming pipeline (including replaying entire days of events). For instance, real-time transportation information is visualized with MQTT and Kafka. Afterward, more advanced business logic was implemented, like forecasting truck arrival times and planning or rescheduling truck tours accordingly.

The data streaming journey started with a single Kafka cluster. And grew to various independent Kafka clusters and a strategic Kafka-powered enterprise integration platform.

History of Apache Kafka for Retail and Logistics at Migros
Source: Migros

NordLB – Bank-wide Data Streaming for Core Banking

Norddeutsche Landesbank (NordLB) is one of the largest commercial banks in Germany. They implemented an enterprise-wide transformation. The new Confluent-powered core banking platform enables event-based and truly decoupled stream processing. Use cases include improved real-time analytics, fraud detection, and customer retention.

Unfortunately, NordLB’s slide is only available in German. But I guess you can still follow their data streaming journey moving over the years from on-premise big data batch processing with Hadoop to real-time data streaming and analytics in the cloud with Kafka and Confluent:

Enterprise Wide Apache Kafka Adoption at NordLB for Banking and Financial Services
Source: Norddeutsche Landesbank

Raiffeisen Bank International – Strategic Real-Time Data Integration Platform across 13 Countries

Raiffeisen Bank International (RBI) operates as a corporate and investment bank in Austria and as a universal bank in Central and Eastern Europe (CEE).

RBI’s bank transformation across 13 countries includes various components:

  • Bank-wide transformation program
  • Realtime Integration Center of Excellence (“RICE“)
  • Central platform and reference architecture for self-service re-use
  • Event-driven integration platform (fully-managed cloud and on-premise)
  • Group-wide API contracts (=schemas) and data governance

While I don’t have a nice diagram of RBI’s data streaming journey over the past few years, I can show you one of the most impressive migration stories. The context is horrible: RBI had to move from Ukrainian data centers to the public cloud after the Russian invasion started in early 2022. The journey was super impressive from the IT perspective, as the migration happened without downtime or data loss.

RBI’s CTO recapped:

“We did this migration in three months, because we didn’t have a choice.”

Migration from On Premise to Cloud in Ukraine at Raiffeisen Bank International
Source: McKinsey

Allianz – Legacy Modernization and Cloud-Native Innovation

Allianz is a European multinational financial services company headquartered in Munich, Germany. Its core businesses are insurance and asset management. The company is one of the largest insurers and financial services groups.

A large organization like Allianz does not have just one enterprise architecture. This section has two independent stories of two separate Allianz business units.

One team of Allianz has built a so-called Core Insurance Service Layer (CISL). The Kafka-powered enterprise architecture is flexible and evolutionary. Why? Because it needs to integrate with hundreds of old and new applications. Some are running on the mainframe or use a batch file transfer. Others are cloud-native, running in containers or as a SaaS application in the public cloud.

Data streaming ensures the decoupling of applications via events and the reliable integration of different backend applications, APIs, and communication paradigms. Legacy and new applications connect but can also run in parallel (old V1 and new V2) before migrating away from and shutting down the legacy component.
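
This parallel operation works because Kafka truly decouples producers from consumers: each application reads the same topic with its own consumer group and offsets. Here is a minimal sketch of the pattern; topic and group names are hypothetical placeholders, not Allianz’s actual implementation:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ParallelConsumer {
        // Both the legacy V1 app and the new V2 app run this logic with a
        // different group.id, so each receives every event independently.
        static KafkaConsumer<String, String> createConsumer(String groupId) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            return new KafkaConsumer<>(props);
        }

        public static void main(String[] args) {
            // e.g., "claims-app-v1" for the legacy system, "claims-app-v2" for the new one
            try (KafkaConsumer<String, String> consumer = createConsumer(args[0])) {
                consumer.subscribe(List.of("insurance-events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    records.forEach(r -> System.out.printf("%s: %s%n", r.key(), r.value()));
                }
            }
        }
    }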

Allianz Insurance Core Integration Layer with Data Streaming and Kafka
Source: Allianz

Allianz Direct is a separate Kafka deployment. This business unit started with a greenfield approach to build insurance as a service in the public cloud. The cloud-native platform is elastic, scalable, and open. Hence, one platform can be used across countries with different legal, compliance, and data governance requirements.

This data streaming journey is best described by looking at a quote from Allianz Direct’s CTO:

Data Streaming is a Cloud Journey at Allianz Direct
Source: Linkedin (November 2022)

Optum – Self-Service Data Streaming

Optum is an American pharmacy benefit manager and healthcare provider (a UnitedHealth Group subsidiary). Optum started with a single data source connecting to Kafka. Today, they provide Kafka as a Service within the UnitedHealth Group. The service is centrally managed and used by over 200 internal application teams.

This data streaming approach provides a repeatable, scalable, and cost-efficient way to standardize data. Optum leverages data streaming for many use cases, from mainframe integration via change data capture (CDC) to modern data processing and analytics tools.

IT Modernization with Apache Kafka and Data Streaming in Healthcare at Optum
Source: Optum

Intel – Real-Time Cyber Intelligence at Scale with Kafka and Splunk

Intel Corporation (commonly known as Intel) is an American multinational corporation and technology company headquartered in Santa Clara, California. It is one of the world’s largest semiconductor chip manufacturers by revenue.

Intel’s Cyber Intelligence Platform leverages the entire Kafka-native ecosystem, including Kafka Connect, Kafka Streams, Multi-Region Clusters (MRC), and more…

Their Kafka maturity curve shows how Intel started with a few data pipelines, for instance, getting data from various sources into Splunk for situational awareness and threat intelligence use cases. Later, Intel added Kafka-native stream processing for streaming ETL (to pre-process, filter, and aggregate data) instead of ingesting raw data (with high $$$ bills) and for advanced analytics by combining streaming analytics with Machine Learning.
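
Conceptually, such a streaming ETL step is a small Kafka Streams topology that filters and condenses the raw event firehose before it reaches the expensive downstream ingest. The following sketch is illustrative only; topic names and the filter predicate are hypothetical, not Intel’s actual logic:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;
    import java.util.Properties;

    public class SecurityEventEtl {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "security-streaming-etl");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("raw-security-events", Consumed.with(Serdes.String(), Serdes.String()))
                   // Drop low-severity noise before it hits the costly SIEM ingest.
                   .filter((host, event) -> event.contains("\"severity\":\"HIGH\""))
                   // Forward a condensed payload instead of the full raw event.
                   .mapValues(event -> event.substring(0, Math.min(event.length(), 512)))
                   .to("siem-ingest", Produced.with(Serdes.String(), Serdes.String()));

            new KafkaStreams(builder.build(), props).start();
        }
    }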

Cybersecurity Intelligence Platform with Data Streaming Apache Kafka Confluent and Splunk at Intel
Source: Intel

Brent Conran (Vice President and Chief Information Security Officer, Intel) described the benefits of data streaming:

“Kafka enables us to deploy more advanced techniques in-stream, such as machine learning models that analyze data and produce new insights. This helps us reduce the meantime to detect and respond.”

Bayer AG – Transition from On-Premise to Multi-Cloud

Bayer AG is a German multinational pharmaceutical and biotechnology company and one of the largest pharmaceutical companies in the world. They adopted a cloud-first strategy and started a multi-year transition to the cloud.

While Bayer AG has various data streaming projects powered by Kafka, my favorite is the success story of their Monsanto acquisition. The Kafka-based cross-data center DataHub was created to facilitate migration and to drive a shift to real-time stream processing.

Instead of using many words, let’s look at Bayer’s impressive data streaming journey from on-premise to multi-cloud, connecting various cloud-native Kafka and non-Kafka technologies over the years.

Hybrid Multi Cloud Data Streaming Journey at Bayer in Healthcare and Pharma
Source: Bayer AG

Tesla – Migration from a Message Broker (RabbitMQ) to Data Streaming (Kafka)

Tesla is a carmaker, utility company, insurance provider, and more. The company processes trillions of messages per day for IoT use cases.

Tesla’s data streaming journey is fascinating because it focuses on migrating from a message broker to stream processing. The initial reasons for Kafka were the scalability and reliability of processing high volumes of IoT sensor data.

But as you can see, more use cases were added quickly, such as Kafka Connect for data integration.

Tesla Automotive Journey from RabbitMQ to Kafka Connect
Source: Tesla

Tesla’s blog post details its advanced stream processing use cases. I often call streaming analytics the “secret sauce” (as data is only valuable if you correlate it in real-time instead of just ingesting it into a batch data warehouse or data lake).

Siemens – Integration between On-Premise SAP and Salesforce CRM in the Cloud

Siemens is a German multinational conglomerate corporation and the largest industrial manufacturing company in Europe.

One strategic data streaming use case allowed Siemens to move “from batch to faster” data processing. For instance, Siemens connected its SAP PMD to Kafka on-premise. The SAP infrastructure is very complex, with 80% “customized ERP”. They improved the business processes and integration workflows from daily or weekly batches to real-time communication by integrating SAP’s proprietary IDoc messages within the event streaming platform.

Siemens later migrated from self-managed on-premise Kafka to Confluent Cloud via Confluent Replicator. Integrating Salesforce CRM via Kafka Connect was the first step of Siemens’ cloud strategy. As usual, more and more projects and applications join the data streaming journey as it is super easy to tap into the event stream and connect it to your favorite tools, APIs, and SaaS products.

Siemens Apache Kafka and Data Streaming Journey to the Cloud across the Supply Chain
Source: Siemens

A few important notes on this migration:

  • Confluent Cloud was chosen because it enables focusing on business logic and handing over complex operations to the expert, including mission-critical SLAs and support.
  • Migration from one Kafka cluster to another (in this case, from open source to Confluent Cloud) is possible with zero downtime and no data loss. Confluent Replicator and Cluster Linking, in conjunction with the expertise and skills of consultants, make such a migration simple and safe.

50Hertz – Transition from Proprietary SCADA to Cloud-Native Industrial IoT

50Hertz is a transmission system operator for electricity in Germany. They built a cloud-native 24/7 SCADA system with Confluent. The solution is developed and tested in the public cloud but deployed in safety-critical air-gapped edge environments. A unidirectional hardware gateway helps replicate data from the edge into the cloud safely and securely.

This data streaming journey is one of my favorites as it combines so many challenges well known in any OT/IT environment across industries like manufacturing, energy and utilities, automotive, healthcare, etc.:

  • From monolithic and proprietary systems to open architectures and APIs
  • From batch with mainframe, files, and old Windows servers to real-time data processing and analytics
  • From on-premise data centers (and air-gapped production facilities) to a hybrid edge and cloud strategy, often with a cloud-first focus for new use cases.

The following graphic is fantastic. It shows the data streaming journey from left to right, including the bridge in the middle (as such a project is not a big bang but usually takes years, at a minimum):

50Hertz Journey with Data Streaming for OT IT Modernization
Source: 50Hertz

Technical Evolution during the Data Streaming Journey

The different data streaming journeys showed you a few things: It is not a big bang. Old and new technology and processes need to work together. Only technology AND expertise drive successful projects.

Therefore, I want to share a few more resources to think about this from a technical perspective:

The data streaming journey never ends. Apache Kafka and Apache Flink increasingly join forces to build innovative real-time stream processing applications. We are just at the beginning.

What does your data streaming journey look like? Where are you right now, and what are your plans for the next few years? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Data Streaming is not a Race, it is a Journey! appeared first on Kai Waehner.

]]>
Disaster Recovery with Kafka across the Edge and Hybrid Cloud (QCon Talk) https://www.kai-waehner.de/blog/2022/04/06/disaster-recovery-kafka-across-edge-hybrid-cloud-qcon-talk/ Wed, 06 Apr 2022 11:15:46 +0000 https://www.kai-waehner.de/?p=4434 I spoke at QCon London in April 2022 about building disaster recovery and resilient real-time enterprise architectures with Apache Kafka. This blog post summarizes the use cases, architectures, and real-world examples. The slide deck and video recording of the presentation is included as well.

The post Disaster Recovery with Kafka across the Edge and Hybrid Cloud (QCon Talk) appeared first on Kai Waehner.

]]>
I spoke at QCon London in April 2022 about building disaster recovery and resilient real-time enterprise architectures with Apache Kafka. This blog post summarizes the use cases, architectures, and real-world examples. The slide deck and video recording of the presentation is included as well.

What is QCon?

QCon is a leading software development conference that has been held across the globe for 16 years. It provides a realistic look at what is trending in tech. The QCon events are organized by InfoQ, a well-known website for professional software development with over two million unique visitors per month.

QCon in 2022 uncovers emerging software trends and practices. Developers and architects learn how to solve complex engineering challenges without the product pitches.

There is no Call for Papers (CfP) for QCon. The organizers invite trusted speakers to talk about trends, best practices, and real-world stories. This makes QCon so strong and respected in the software development community.

QCon London 2022

Disaster Recovery and Resiliency with Apache Kafka

Apache Kafka is the de facto data streaming platform for analytical AND transactional workloads. Multiple options exist to design Kafka for resilient applications. For instance, MirrorMaker 2 and Confluent Replicator enable uni- or bi-directional real-time replication between independent Kafka clusters in different data centers or clouds.

Cluster Linking is a more advanced and straightforward option from Confluent leveraging the native Kafka protocol instead of additional infrastructure and complexity using Kafka Connect (like MirrorMaker 2 and Replicator).

Stretching a single Kafka cluster across multiple regions is the best option to guarantee no downtime and seamless failover in the case of a disaster. However, such a stretched cluster is hard to operate and is only recommended across distances with enhanced add-ons to open-source Kafka that make the deployment consistent, stable, and mission-critical:

Disaster Recovery and Resiliency across Multi Region with Apache Kafka
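
On the application side, a (heavily simplified) client failover pattern restarts consumers against the DR cluster when the primary becomes unreachable. The sketch below only illustrates the idea: real deployments need proper outage detection and offset translation (e.g., via Cluster Linking or MirrorMaker 2 checkpoints), because offsets are not identical across independent clusters. Cluster addresses and topic names are placeholders:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class FailoverConsumer {
        static void consumeFrom(String bootstrapServers) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-app");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("payments"));
                while (true) {
                    consumer.poll(Duration.ofSeconds(1)).forEach(r -> System.out.println(r.value()));
                }
            }
        }

        public static void main(String[] args) {
            try {
                consumeFrom("primary-dc:9092");  // normal operation
            } catch (Exception e) {
                // NOTE: poll() retries internally, so a real implementation needs
                // explicit health checks and timeouts plus offset translation.
                consumeFrom("dr-site:9092");     // disaster recovery cluster
            }
        }
    }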

QCon Presentation: Disaster Recovery with Apache Kafka

In my QCon talk, I intentionally showed the broad spectrum of real-world success stories across industries for data streaming with Apache Kafka from companies such as BMW, JPMorgan Chase, Robinhood, Royal Caribbean, and Devon Energy.

Best practices explored how to build resilient enterprise architectures with disaster recovery, keeping RPO (Recovery Point Objective) and RTO (Recovery Time Objective) in mind. The audience learned how to get the SLAs and requirements for downtime and data loss right.

The examples looked at serverless cloud offerings integrating with the IoT edge, hybrid retail architectures, and the disconnected edge in military scenarios.

The agenda looks like this:

  1. Resilient enterprise architectures
  2. Real-time data streaming with the Apache Kafka ecosystem
  3. Cloud-first and serverless Industrial IoT in automotive
  4. Multi-region infrastructure for core banking
  5. Hybrid cloud for customer experiences in retail
  6. Disconnected edge for safety and security in the public sector

Slide Deck from QCon Talk:

Here is the slide deck of my presentation from QCon London 2022:

We also had a great panel that discussed lessons learned from building resilient applications on the code and infrastructure level, plus the organizational challenges and best practices:

QCon Panel about Resilient Architectures

Video Recording from QCon Talk:

With the risk of Covid in mind, InfoQ decided not to record QCon sessions live.

Kai Waehner speaking at QCon London April 2022 about Resiliency with Apache Kafka

Instead, a pre-recorded video had to be submitted by the speakers. The video recording is already available for QCon attendees (no matter if on-site in London or at the QCon Plus virtual event):

Disaster Recovery at the Edge and in Hybrid Data Streaming Architectures with Apache Kafka (QCon Talk)

QCon makes conference talks available for free sometime after the event. I will update this post with the free link as soon as it is available.

Disaster Recovery with Apache Kafka across all Industries

I hope you enjoyed the slides and video on this exciting topic. Hybrid and global Kafka infrastructures for disaster recovery and other use cases are the norm, not exceptions.

Real-time data beats slow data. That is true in almost every use case. Hence, data streaming with the de facto standard Apache Kafka gets adopted more and more across all industries.

How do you leverage data streaming with Apache Kafka for building resilient applications and enterprise architectures? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Disaster Recovery with Kafka across the Edge and Hybrid Cloud (QCon Talk) appeared first on Kai Waehner.

]]>
Apache Kafka in the Public Sector – Part 4: Energy and Utilities https://www.kai-waehner.de/blog/2021/10/18/apache-kafka-public-sector-part-4-energy-utilities-smart-grid/ Mon, 18 Oct 2021 12:47:01 +0000 https://www.kai-waehner.de/?p=3811 The public sector includes many different areas. Some groups, like the military, leverage cutting-edge technology. Others, like the public administration, are years or even decades behind. This blog series explores both ends of the spectrum to show how data in motion powered by Apache Kafka adds value for innovative new applications and for modernizing legacy IT infrastructures. This is part 4: Use cases and architectures for energy, utilities, and smart grid infrastructures.

The post Apache Kafka in the Public Sector – Part 4: Energy and Utilities appeared first on Kai Waehner.

]]>
The public sector includes many different areas. Some groups, like the military, leverage cutting-edge technology. Others, like the public administration, are years or even decades behind. This blog series explores how the public sector leverages data in motion powered by Apache Kafka to add value for innovative new applications and to modernize legacy IT infrastructures. This is part 4: Use cases and architectures for energy, utilities, and smart grid infrastructure.

Apache Kafka for Public Utilities and Energy Sector

Blog series: Apache Kafka in the Public Sector and Government

This blog series explores why many governments and public infrastructure sectors leverage event streaming for various use cases. Learn about real-world deployments and different architectures for Kafka in the public sector:

  1. Life is a Stream of Events
  2. Smart City
  3. Citizen Services
  4. Energy and Utilities (THIS POST)
  5. National Security

Subscribe to my newsletter to get updates immediately after the publication. Besides, I will also update the above list with direct links to this blog series’s posts once published.

As a side note, in case you wonder why healthcare is not on the above list: Healthcare is another blog series of its own. While the government can provide public health care through national healthcare systems, it is part of the private sector in many other cases.

Energy, Utilities, Smart Grid, and the Public Sector

The energy sector is different across countries and even states. Public utilities are subject to public control and regulation, ranging from local community-based groups to statewide government monopolies. Hence, some markets are private businesses, some are entirely controlled by the government, and some are a mix of both. For instance, here is the complex US Regulated vs. Deregulated Electricity Market:

Nevertheless, one thing is clear: The energy sector is changing, no matter whether the government entirely regulates the market or not:

Smart Grid - Energy Industry

Let’s look at a few real-world examples for Apache Kafka in the Energy Sector, its relation to the public sector, and a few possible enterprise architectures.

Kafka Examples for Public Utilities

First of all, I already wrote about data in motion powered by Kafka in the energy sector. I also explored edge and hybrid architectures in a great panel discussion about Kafka and 5G networks in the oil and gas and mining industry.

Let’s now take a look at two more examples:

  • Stadtwerke Leipzig: A government-owned electricity provider
  • Tesla: A private company heavily influenced by the public administration

Stadtwerke Leipzig – Digital Customer Interface for Public Utilities

Stadtwerke Leipzig is a municipal energy utility in central Germany that provides electricity, natural gas, and district heating. They are wholly owned by LVV Leipziger Versorgungs- und Verkehrsgesellschaft, in which the City of Leipzig holds a 100% stake.

Leipziger Stadtwerke built a digital customer interface to connect public utilities, grid operators, the housing industry, end consumers, and industrial customers:

Apache Kafka at Leipziger Stadtwerke Utilities Energy Public Sector

Unfortunately, the picture is of bad quality and not available in a better version. However, the essential point is that the long green rectangle in the middle is Apache Kafka. Kafka is the central nervous system connecting edge devices, proprietary protocols, and open standards such as MQTT, OPC-UA, XML, JSON, etc. This way, the OT and IT worlds are connected with a single, scalable real-time pipeline.

Instead of having various data silos, the data is now accessible by any interested consumer in real-time at scale. Hence, this architecture solves one of the biggest challenges in energy infrastructures: getting value out of the massive volumes of OT data. At the same time, the enterprise architecture allows for different technologies and brownfield integration. Kafka provides automatic backpressure handling and preprocessing.
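
To make the OT-to-IT bridge more tangible, here is a hedged sketch of a minimal MQTT-to-Kafka forwarder in Java using the Eclipse Paho client. Broker addresses and topic names are illustrative placeholders, not details of the Leipzig deployment; a production setup would more likely use Kafka Connect or a dedicated IoT gateway:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import org.eclipse.paho.client.mqttv3.MqttClient;
    import java.util.Properties;

    public class MqttKafkaBridge {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            KafkaProducer<String, String> producer = new KafkaProducer<>(props);

            MqttClient mqtt = new MqttClient("tcp://mqtt-broker:1883", MqttClient.generateClientId());
            mqtt.connect();
            // Forward every MQTT message into Kafka, keyed by the MQTT topic,
            // so downstream consumers get one scalable real-time pipeline.
            mqtt.subscribe("plant/+/telemetry", (topic, message) ->
                    producer.send(new ProducerRecord<>("ot-telemetry", topic, new String(message.getPayload()))));
        }
    }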

Leipziger Stadtwerke combines Kafka with other great technologies to build innovative digital services. For instance, Kunbus edge devices (a PI with custom Linux) and over-the-air updates (OTA) with Mender.

Tesla – Streaming IoT Data for Innovative Services

Tesla is a private enterprise, not within the public sector. However, living in Germany, I see how closely related the company is to the public administration, government, law, etc. The Gigafactory in Berlin is in the press every week. The innovation around electric cars is a widespread public discussion; even German competitors like Volkswagen acknowledge Tesla’s innovative business. As the public sector often does not talk to the public about its projects, I thought Tesla’s Kafka success story is worth mentioning in this post.

Why?

Well, because Tesla has a considerable energy business (they don’t just sell cars), innovates like few other car and energy companies, and needs to collaborate with governments across the globe regarding law compliance, charging infrastructure, and other crucial topics.

Tesla processes trillions of messages per day for IoT use cases with Kafka. The data comes from connected cars, power packs, factories, charging stations, etc. Tesla’s Kafka Summit talk shared exciting information about their Kafka journey:


Tesla using Apache Kafka for IoT and Energy Sector

Hybrid IoT Architecture for the Energy Sector

IT architectures in the public sector look very similar to the private sector. The main difference is the more limited usage of public cloud providers. Nevertheless, most energy infrastructures require a hybrid approach with edge computing outside a data center or cloud.

Let’s take a look at a few example architectures for energy production from upstream and midstream to downstream:

Energy Production and Distribution with a Hybrid Architecture using Kafka

Event Streaming enables data integration and data processing in motion, whether it has to happen at the edge or in the data center/cloud.

Edge Computing with Kafka in Disconnected Offline Mode

From the edge perspective, data is often filtered, preprocessed, and aggregated locally for latency, security, or cost reasons:

Event Streaming for Energy Production Upstream and Midstream at the Edge with a 5G Campus Network and Kafka

Disconnected data processing at the edge is crucial in many energy and utilities use cases. It has to work even without an internet connection in “offline mode”:

Energy Production at the Disconnected Edge Upstream with Apache Kafka in the Public Sector

The same is valid on the consumer side. The point-of-sale (POS) has to run 24/7 for transactional workloads, no matter if there is an internet connection:


Edge Processing at the Intelligent Gas Station Downstream with Apache Kafka
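
A local broker at the edge provides full offline operation; in addition, Kafka clients themselves can bridge short uplink outages through buffering and retries. Here is a hedged sketch of producer settings for an edge device with an unreliable connection. The concrete values are illustrative and must be sized to the expected outage window and message rate:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class OfflineTolerantProducer {
        public static KafkaProducer<String, String> create() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "edge-broker:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // Buffer up to 256 MB of unsent records in memory while disconnected.
            props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "268435456");
            // Keep retrying each record for up to 10 minutes before giving up.
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "600000");
            // Idempotence avoids duplicates when retries eventually succeed.
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
            // Block send() for up to 60 s when the buffer is full instead of failing fast.
            props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");

            return new KafkaProducer<>(props);
        }
    }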

I covered edge use cases for Kafka and security implications with Kafka in a zero-trust air-gapped environment in separate posts.

Cybersecurity – The Threat is Real for Public Sector and the Energy Infrastructure

Cybersecurity is crucial everywhere in the public sector, including citizen services, smart city, and mobility services. But in these “convenience use cases”, we “only” talk about data privacy. Yes, this is very important. But in the energy sector, we are talking about safety and human lives at risk. The Colonial Pipeline ransomware attack in May 2021 in the US is just one of many successful attacks in the past few quarters.

National security is a huge topic for the energy sector. Electric utilities can be affected by cyberattacks across the whole value chain. McKinsey has an exciting diagram explaining this:

Cybersecurity The Threat is Real in Public Sector and Energy Infrastructure


The discussion around cybersecurity is a primer to the last post of this blog series.

Of course, my general blog series about Apache Kafka for Cybersecurity (Situational Awareness, Threat Intelligence, Forensics, Zero Trust, SIEM/SOAR Modernization) is helpful, too.

Data in Motion for Reliable and Scalable Smart Grid Infrastructure

This post showed a few real-world examples and architectures for data in motion in hybrid architectures in the energy industry. The private sector has fewer examples than the public sector. But the architectures look the same, no matter who is responsible.

The private energy sector needs to collaborate with the government and public administration like the public energy sector. The integration and processing of data in motion with Apache Kafka is a game-changer for improving existing processes and building new innovative solutions.

For instance, Tesla is a very innovative private company with cutting-edge business models that are only possible if you collect, aggregate, and leverage data streams from various data sources. Tesla’s new car insurance service is an excellent example of this. The insurance business is backed by data from many IoT sensors and applied in real-time to provide context-specific information. That’s the way to go for the public sector, too.

How do you leverage event streaming in the public sector? Are you working on energy/utility projects or building a smart grid? What technologies and architectures do you use? What projects did you already work on or are in the planning? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Public Sector – Part 4: Energy and Utilities appeared first on Kai Waehner.

]]>
Panel Discussion about Kafka, Edge, Networking and 5G in Oil and Gas and Mining Industry https://www.kai-waehner.de/blog/2021/08/20/panel-discussion-apache-kafka-edge-networking-5g-oil-and-gas-mining-industry/ Fri, 20 Aug 2021 07:40:43 +0000 https://www.kai-waehner.de/?p=3698 The oil & gas and mining industries require edge computing for low latency and zero trust use cases. Most IT architectures are hybrid with big data analytics in the cloud and safety-critical data processing in disconnected and often air-gapped environments. This blog post shares a panel discussion that explores the challenges, use cases, and hardware/software/network technologies to reduce cost and innovate. A key focus is on the open-source framework Apache Kafka, the de facto standard for processing data in motion at the edge and in the cloud.

The post Panel Discussion about Kafka, Edge, Networking and 5G in Oil and Gas and Mining Industry appeared first on Kai Waehner.

]]>
The oil & gas and mining industries require edge computing for low latency and zero trust use cases. Most IT architectures are hybrid with big data analytics in the cloud and safety-critical data processing in disconnected and often air-gapped environments. This blog post shares a panel discussion that explores the challenges, use cases, and hardware/software/network technologies to reduce cost and innovate. A key focus is on the open-source framework Apache Kafka, the de facto standard for processing data in motion at the edge and in the cloud.

Apache Kafka and Edge Networks in Oil and Gas and Mining

Apache Kafka at the Edge and in Hybrid Cloud

I explored the usage of event streaming at the edge and in hybrid cloud scenarios in lots of detail in the past. Hence, instead of yet another description, check out the following posts to learn about use cases and architectures:

Panel Discussion: Kafka, Network Infrastructure, Edge, and Hybrid Cloud in Oil and Gas

Here is the panel discussion. The conversation includes the software and the hardware/infrastructure/networking perspective. It explores use cases from the oil & gas and mining industries for processing data in motion, plus technical facts about communication/radio/telco infrastructures. It was a great mix of topics that are heavily related and depend on each other to deploy a project successfully.

Speakers:

  • Andrew Duong (Confluent): Moderator
  • Kai Waehner (Confluent): Expert on hybrid software architectures and data in motion
  • Dion Stevenson (Tait Communications): Expert on hardware and network infrastructure
  • Sohan Domingo (Tait Communications): Expert on hardware and network infrastructure

Now enjoy the discussion and feel free to share any thoughts or feedback:

Kafka in the Energy Sector including Oil and Gas, Mining, Smart Grids

An example architecture for hybrid event streaming in the oil and gas industry can look like the following:

Data in Motion for Energy Production - Upstream Midstream Downstream - at the Edge with Kafka in Oil and Gas and Mining


If you want to learn more about event streaming with Apache Kafka in the energy industry (including oil and gas, mining, smart grids), check out the following blog post:

Notes about the Kafka, Edge, Oil, and Gas, Mining Conversation

If you prefer reading or just listening to a few of the sections, here are some notes about the flow of the panel discussion:

  • 0:00–4:20 – Introduction to Tait Communications.
  • 4:45–7:20 – Introduction to Confluent and a high-level definition of Edge & IoT.
  • 7:30–10:10 – Voice communication discussion about connectivity and the importance of context at the right point in time through data, so the right response can be determined sooner. No matter where people are or what they are doing, communications at the edge suit the needs of a modern workforce.
  • 10:15–12:10 – ML/AI at the Edge. Continuous monitoring of all infrastructure and sensors for safety purposes. Event streaming helps send alerts in real-time and supports post-event analysis, too. There is a process to get into AI: infrastructure first, then pipelines, then AI, not the other way around.
  • 12:15–14:42 – 5G cannot solve all problems: security, privacy, and compliance considerations determine where to process the data, and beyond this, cost is also a factor. Considerations for cloud and on-premise.
  • 14:50–16:03 – 5G discussion. There are real-world limitations like cell towers. You also need contextual awareness at the edge and decision-making there (local awareness), e.g., gas on a vehicle that is disconnected from the backend.
  • 16:15–20:10 – Manufacturing & supply chain, radios & communications today, and what is possible in the future. IoT at the edge enables manufacturing optimizations with low latency requirements where cloud-first does not make sense. On the flip side, workloads that are not safety-critical, or systems like an ERP, can be pushed into the cloud.
  • 20:10–23:35 – The mining side of things: lacking connectivity and a preference for edge-based processing. Autonomous trucks make decisions at the edge rather than accepting delays of even milliseconds by going to the cloud. Doing it locally at the edge is more efficient in some cases. All sensors on the trucks (temperatures, etc.) are collected even whilst disconnected; once the connection is re-established at the base, that data can be uploaded. “Last mile” analytics. Confluent is IT, not OT: we integrate with IT systems, but the OT world is separated.
  • 23:38–26:25 – Digital mobile radios and voice communications; with autonomous trucks, you don’t have that. This is where Tait’s Unified Vehicle comes in: a combination of Digital Mobile Radio (DMR) and LTE, where intelligent algorithms help with failover from DMR to LTE if there are connectivity issues. Voice is still important despite the amount of technology in use and data exploration.
  • 27:03–31:15 – Where to start with data exploration: Start with your requirements. Does the problem really need computing at the edge, or can the cloud work? Event streaming at the edge where the use case makes sense. How customers get started: simple use cases are solved first before the more advanced ones (building the foundations, data pipelines, simple rules, and testing). Collaboration with the AWS Wavelength team; the edge makes sense with low latency and security requirements.
  • 31:15–32:54 – Consider your bandwidth & latency as to whether edge computing makes sense. Driverless cars.
  • 33:15–37:49 – Where to go from here with existing customers: how they upgrade, what customers come to Tait for, and the use of video as part of all this for public safety. Health & safety: monitoring driver alertness in NZ. Truck performance, driver performance, and when to take a break. That decision needs to be made as a combination of edge and cloud.
  • 37:50–40:55 – Connected vehicles and cars: it is not as hard as it looks. Gas stations with edge computing, loyalty systems, etc., and the importance of after-sales for connected vehicles. GDPR and compliance through aggregation of data, as some countries have strict privacy requirements.
  • 41:00–44:10 – Joint project with Tait in the law enforcement space. Voice to text, use of metadata, and combining voice + video with event streaming.

Kafka for Next-Generation Edge Computing

The energy industry, including oil&gas and mining, is super interesting from a technical perspective. It requires edge and cloud computing. Upstream, midstream, and downstream is a complex and safety-critical supply chain. Processing data in motion with Apache Kafka leveraging various network infrastructures is a great opportunity to innovate and reduce costs across various use cases.

Do you already leverage Apache Kafka for processing data in motion in the oil and gas, mining, or any other industry? How does your (future) edge or hybrid architecture look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.


The post Panel Discussion about Kafka, Edge, Networking and 5G in Oil and Gas and Mining Industry appeared first on Kai Waehner.

]]>
Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure https://www.kai-waehner.de/blog/2021/05/23/apache-kafka-cloud-native-telco-infrastructure-low-latency-data-streaming-5g-aws-wavelength/ Sun, 23 May 2021 08:06:59 +0000 https://www.kai-waehner.de/?p=3401 This blog post explores low latency data processing and edge computing with Apache Kafka, 5G telco networks, and cloud-native AWS Wavelength infrastructure. Learn about use cases and architectures across industries to combine mission-critical and analytics workloads, and a concrete hybrid implementation for energy production and distribution.

The post Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure appeared first on Kai Waehner.

]]>
Many mission-critical use cases require low latency data processing. Running these workloads close to the edge is mandatory if the applications cannot run in the cloud. This blog post explores architectures for low latency deployments leveraging a combination of cloud-native infrastructure at the edge, such as AWS Wavelength, 5G networks from Telco providers, and event streaming with Apache Kafka to integrate and process data in motion.

The blog post is structured as follows:

  • Definition of “low latency data processing” and the relation to Apache Kafka
  • Cloud-native infrastructure for low latency computing
  • Low latency mission-critical use cases for Apache Kafka and its relation to analytical workloads
  • Example for a hybrid architecture with AWS Wavelength, Verizon 5G, and Confluent

Low Latency Data Processing and Edge Computing with Apache Kafka, 5G Telco Network and AWS Wavelength

Low Latency Data Processing

Let’s begin with a definition. “Real-time” and “low latency” are terms that different industries, vendors, and consultants use very differently.

What is real-time and low latency data processing?

For the context of this blog, real-time data processing with low latency means processing low or high volumes of data in ~5 to 50 milliseconds end-to-end. On a high level, this includes three parts:

  • Consume events from one or more data sources, either directly from a Kafka client or indirectly via a gateway or proxy.
  • Process and correlate events from one or more data sources, either stateless or stateful, with the internal state in the application and stream processing features like sliding windows.
  • Produce events to one or more data sinks, either directly from a Kafka client or indirectly via a gateway or proxy. The data sinks can include the data sources and/or other applications.

These parts are the same as for “traditional event streaming use cases”. However, for low latency use cases that require zero downtime and zero data loss, the architecture often looks different to reach the defined goals and SLAs. A single infrastructure is usually the better choice than a best-of-breed approach with many different frameworks or products. That’s where the Kafka ecosystem shines! The Kafka vs. MQ/ETL/ESB/API blog explores this discussion in more detail.
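
To illustrate the “process and correlate” step, here is a hedged Kafka Streams sketch that performs a stateful count per key over a sliding window, similar to a simple fraud heuristic. Topic names, the window size, and the threshold are placeholder assumptions, not values from a specific deployment:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.SlidingWindows;
    import java.time.Duration;
    import java.util.Properties;

    public class LowLatencyCorrelation {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-correlation");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()))
                   .groupByKey()
                   // Stateful correlation: count events per account over a 10-second sliding window.
                   .windowedBy(SlidingWindows.ofTimeDifferenceWithNoGrace(Duration.ofSeconds(10)))
                   .count()
                   .toStream()
                   // Flag accounts with suspiciously many transactions in the window.
                   .filter((windowedAccount, count) -> count > 5)
                   .map((windowedAccount, count) -> KeyValue.pair(windowedAccount.key(), "possible-fraud:" + count))
                   .to("fraud-alerts", Produced.with(Serdes.String(), Serdes.String()));

            new KafkaStreams(builder.build(), props).start();
        }
    }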

Low latency = soft real-time; NOT hard real-time

Make sure to understand that real-time in the IT world (that includes Kafka) is not hard real-time. Latency spikes and non-deterministic network behavior exist. The chosen software or framework does not matter. Hence, in the IT world, real-time means soft real-time. In contrast, in the OT world and Industrial IoT, real-time means zero latency and deterministic networks. This is the world of embedded software for sensors, robots, or cars.

For more details, read the blog post “Kafka is NOT hard-real-time“.

Kafka support for low latency processing

Apache Kafka provides very low end-to-end latency for large volumes of data. This means the amount of time it takes for a record that is produced to Kafka to be fetched by the consumer is short.

For example, detecting fraud for online banking transactions has to happen in real-time to deliver business value without adding more than 50 to 100 ms of overhead to each transaction to maintain a good customer experience.

Here is the technical architecture for end-to-end latency with Kafka:

Low Latency Data Processing with Apache Kafka

Latency objectives are expressed as both target latency and the importance of meeting this target. For instance, a latency objective says: “I would like to get 99th percentile end-to-end latency of 50 ms from Kafka.” The right Kafka configuration options need to be optimized to achieve this. The blog post “99th Percentile Latency at Scale with Apache Kafka” shares more details.
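
Which knobs matter depends on the workload, but a hedged starting point for latency-sensitive Kafka clients looks like the following. The values are illustrative trade-offs to tune against throughput and durability requirements, not universal recommendations:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import java.util.Properties;

    public class LowLatencyConfigs {
        public static Properties producerProps() {
            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Send immediately instead of waiting to fill a batch.
            p.put(ProducerConfig.LINGER_MS_CONFIG, "0");
            // Skip compression to avoid CPU time on the hot path (trades off bandwidth).
            p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");
            // acks=1 lowers produce latency; acks=all is safer but slower.
            p.put(ProducerConfig.ACKS_CONFIG, "1");
            return p;
        }

        public static Properties consumerProps() {
            Properties c = new Properties();
            c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Return fetches as soon as a single byte is available...
            c.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1");
            // ...and don't let the broker wait long for more data.
            c.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "10");
            return c;
        }
    }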

After exploring what low latency and real-time data processing mean in Kafka’s context, let’s now discuss the infrastructure options.

Infrastructure for Low Latency Data Processing

Low latency always requires a short distance between data sources, data processing platforms, and data sinks due to physics. Latency optimization is relatively straightforward if all your applications run in the same public cloud. Low end-to-end latency gets much more difficult as soon as some software, mobile apps, sensors, machines, etc., run elsewhere. Think about connected cars, mobile apps for mobility services like ride-hailing, location-based services in retail, machines/robots in factories, etc.

The remote data center or remote cloud region cannot provide low latency data processing! The focus of this post is software that has to provide low end-to-end latency outside a central data center or public cloud. This is where edge computing and 5G networks come into play.

Edge infrastructure for low latency data processing

As for real-time and low latency, we need to define the term first, as everyone uses it differently. When I talk about the edge in the context of Kafka, it means:

  • Edge is NOT a regular data center or cloud region, but has limited compute, storage, and network bandwidth.
  • Edge can be a regional cloud-native infrastructure enabled for low-latency use cases – often provided by Telco enterprises in conjunction with cloud providers.
  • Kafka clients AND the Kafka broker(s) are deployed here, not just the client applications.
  • Often 100+ locations, like restaurants, coffee shops, or retail stores, or even embedded into 1000s of devices or machines.
  • Offline business continuity, i.e., the workloads continue to work even if there is no connection to the cloud.
  • Low-footprint and low-touch, i.e., Kafka can run as a normal highly available cluster or as a single broker (no cluster, no high availability); often shipped “as a preconfigured box” in OEM hardware (e.g., Hivecell).
  • Hybrid integration, i.e., most use cases require uni- or bidirectional communication with a remote Kafka cluster in a data center or the cloud.

Check out my infrastructure checklist for Apache Kafka at the edge and use cases for Kafka at the edge across industries for more details.

Mobile Edge Compute / Multi-access Edge Compute (MEC)

In addition to edge computing, a few industries (especially everyone related to the Telco sector) uses the terms Mobile Edge Compute / Multi-access Edge Compute (MEC) to describe use cases around edge computing, low latency, 5G, and data processing.

MEC is an ETSI-defined network architecture concept that enables cloud computing capabilities and an IT service environment at the edge of the cellular network and, more generally, at the edge of any network. The basic idea behind MEC is that by running applications and performing related processing tasks closer to the cellular customer, network congestion is reduced, and applications perform better.

MEC technology is designed to be implemented at the cellular base stations or other edge nodes. It enables flexible and rapid deployment of new applications and services for customers. Combining elements of information technology and telecommunications networking, MEC also allows cellular operators to open their radio access network (RAN) to authorized third parties, such as application developers and content providers.

5G and cloud-native infrastructure are key pieces of a MEC infrastructure!

Low-latency data processing outside a cloud region requires a cloud-native infrastructure and 5G networks. Let’s explore this combination in more detail.

5G infrastructure for low latency and high throughput SLAs

On a high level from a use case perspective, it is important to understand that 5G is much more than just higher speed and lower latency:

  • Public 5G telco infrastructure: That’s what Verizon, AT&T, T-Mobile, Dish, Vodafone, Telefonica, and all the other telco providers talk about in their TV spots. The end consumer gets higher download speeds and lower latency (at least in theory). This infrastructure integrates vehicles (e.g., cars) and devices (e.g., mobile phones) to the 5G network (V2N).
  • Private 5G campus networks: That’s what many enterprises are most interested in. The enterprise can set up private 5G networks with guaranteed quality of service (QoS) using acquired 5G slices from the 5G spectrum. Enterprises work with telco providers, telco hardware vendors, and sometimes also with cloud providers to provide cloud-native infrastructure (e.g., AWS Outposts, Azure Edge Zones, Google Anthos). This infrastructure is used similarly to the public 5G but deployed, e.g., in a factory or hospital. The trade-offs are guaranteed SLAs and increased security vs. higher cost. Lufthansa Technik and Vodafone’s standalone private 5G campus network at the aircraft hangar is a great example of various use cases like maintenance via video streaming and augmented reality.
  • Direct connection between devices: That’s for interlinking the communication between two or more vehicles (V2V) or vehicles and infrastructure (V2I) via unicast or multicast. There is no need for a network hop to the cell tower due to using a 5G technique called 5G sidelink communications. This enables new use cases, especially in safety-critical environments (e.g., autonomous driving) where Bluetooth, Wi-Fi, and similar network communications do not work well for different reasons.

Cloud-native infrastructure

Cloud-native infrastructure provides capabilities to build applications in an elastic, scalable, and automated way. Software development concepts like microservices, DevOps, and containers usually play a crucial role here.

A fantastic example is Dish Network in the US. Dish builds a brand new 5G network completely on cloud-native AWS infrastructure with cloud-native 1st and 3rd party software. Thus, even the network providers – where enterprises build their applications – build the underlying infrastructure this way.

Cloud-native infrastructure is required in the public cloud (where it is the norm) and at the edge. Flexibility for agile development and deployment of applications is only possible this way. Hence, technologies such as Kubernetes and on-premise solutions from cloud providers are adopted more and more to achieve this goal.

The combination of 5G and cloud-native infrastructure enables building low latency applications for data processing everywhere.

Software for Low Latency Data Processing

5G and cloud-native infrastructure provide the foundation for building mission-critical low latency applications everywhere. Let’s now talk about the software part and with that about event streaming with Kafka.

Why event streaming with Apache Kafka for low latency?

Apache Kafka provides a complete software stack for real-time data processing, including:

  1. Messaging (real-time pub/sub)
  2. Storage (caching, backpressure handling, decoupling)
  3. Data integration (IoT data, legacy platforms, modern microservices, and databases)
  4. Stream processing (stateless/stateful correlation of data)

This is super important because simplicity and cost-efficient operations matter much more at the edge than in a public cloud infrastructure where various SaaS services can be glued together.

Hence, Kafka is uniquely positioned to run mission-critical and analytics workloads at the edge on cloud-native infrastructure via 5G networks. Bi-directional replication to “regular” data centers or public clouds for integration with other systems is also possible via the Kafka protocol.
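To make the messaging part of this stack tangible, here is a minimal sketch of a latency-optimized Kafka producer in Java. The broker address, topic, and payload are placeholders for this example; the right acks and batching settings always depend on your durability and throughput requirements:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LowLatencyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder address for a broker running at the edge
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "edge-broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // Send immediately instead of waiting to fill a batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 0);
        // Leader-only acknowledgment: lower latency, weaker durability guarantees
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("machine-telemetry", "machine-42", "temperature=87.3"));
        }
    }
}
```

Disabling batching (linger.ms=0) and acknowledging on the leader only are typical latency-over-durability trade-offs; for mission-critical data, acks=all with a tuned batch size is usually the better compromise.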

Use Cases for Low Latency Data Processing with Apache Kafka

Low latency and real-time data processing are crucial for many use cases across industries. Hence, no surprise that Kafka plays a key role in many architectures – whether the infrastructure runs at the edge or in a close data center or cloud.

Mobile Edge Compute / Multi-access Edge Compute (MEC) use cases for Kafka across industries

Let’s take a look at a few examples:

  • Telco: Infrastructure like cloud-native 5G networks, OSS applications, and integration with BSS and OTT services requires integrating, orchestrating, and correlating huge volumes of data in real-time.
  • Manufacturing: Predictive maintenance, quality assurance, real-time locating systems (RTLS), and other shop floor applications are only effective and valuable with stable, continuous data processing.
  • Mobility Services: Ride-hailing, car sharing, or parking services can only provide a great customer experience if the events from thousands of regional end-users are processed in real-time.
  • Smart City: Cars from various carmakers, infrastructures such as traffic lights, smart buildings, and many other things need to get real-time information from a central data hub to improve safety and new innovative customer experiences.
  • Media: Interactive live video streams, real-time interactions, a hyper-personalized experience, augmented reality (AR) and virtual reality (VR) applications for training/maintenance/customer experience, and real-time gaming can only work well with stable, high throughput, and low latency.
  • Energy: Utilities, oil rigs, solar parks, and other energy upstream/distribution/downstream infrastructures are supercritical environments and very expensive. Every second counts for safety and efficiency/cost reasons. Optimizations combine data from all machines in a plant to achieve greater efficiency – not just optimizing one unit but for the entire system.
  • Retail: Location-based services for better customer experience and cross-/upselling need notifications while customers are looking at a product or in front of the checkout.
  • Military: Border control, surveillance, and other location-based applications only work efficiently with low latency.
  • Cybersecurity: Continuous monitoring and signal processing for threat detection and proactive prevention are fundamental for any security operation center (SOC) and SIEM/SOAR implementation.

For a concrete example, check out my blog “Building a Smart Factory with Apache Kafka and 5G Campus Networks“.

NOT every use case requires low latency or real-time

Real-time data in motion beats data at rest in databases or data lakes in most scenarios. However, not every use case can be or needs to be real-time. In these cases, low latency networks and communication are not required. A few examples:

  • Reporting (traditional business intelligence)
  • Batch analytics (processing high volumes of data in bulk; for instance, Hadoop and Spark’s map-reduce, shuffling, and similar data processing steps only make sense in batch mode)
  • Model training as part of a machine learning infrastructure (while model scoring and monitoring often require real-time predictions, model training is a batch process in almost all currently available ML algorithms)

These use cases can be outsourced to a remote data center or public cloud. Low latency networking in the range of milliseconds does not matter here and would likely only increase the infrastructure cost. For that reason, most architectures are hybrid and separate low latency workloads from analytics workloads.

Let’s now take a concrete example after all the theory in the last sections.

Hybrid Architecture for Critical Low Latency and Analytical Batch Workloads

Many enterprises I talk to don’t have and don’t want to build their own infrastructure at the edge. Cloud providers understand this pain and have started rolling out offerings that provide cloud-native infrastructure close to the customer’s sites. AWS Outposts, Azure Edge Zones, and Google Anthos exist for this reason. This solves the problem of providing cloud-native infrastructure.

But what about low latency?

AWS is once again the first to build a new product category: AWS Wavelength is a service that enables you to deliver ultra-low latency applications for 5G devices. It is built on top of AWS Outposts. AWS works with telco providers like Verizon, Vodafone, KDDI, or SK Telecom to build this offering. A win-win-win: cloud-native infrastructure + low latency + no need to build your own data centers at the edge.

This is the foundation for building low latency applications at the edge for mission-critical workloads, plus bi-directional integration with the regular public cloud region for analytics workloads and integration with other cloud applications.

Let’s see what this looks like in a real example.

Use Case: Energy Production and Distribution

Energy production and distribution are perfect examples. They require reliability, flexibility, sustainability, efficiency, security, and safety. These are perfect ingredients for a hybrid architecture powered by cloud-native infrastructure, 5G networks, and event streaming.

The energy sector usually separates analytical capabilities (in the data center or cloud) and low-latency computing for mission-critical workloads (at the edge). Kafka became a critical component for various energy use cases.

For more details, check out the blog post “Apache Kafka for Smart Grid, Utilities and Energy Production” which also covers real-world examples from EON, Tesla, and Devon Energy.

Architecture with AWS Wavelength, Verizon 5G, and Confluent

The concrete example uses:

  • AWS Public Cloud for analytics workloads
  • Confluent Cloud for event streaming in the cloud and integration with 1st party (e.g., AWS S3 and Amazon Redshift) and 3rd party SaaS (e.g., MongoDB Atlas, Snowflake, Salesforce CRM)
  • AWS Wavelength with Verizon 5G for low latency workloads
  • Confluent Platform with Kafka Connect and ksqlDB for low latency computing in the Wavelength 5G zone
  • Confluent Cluster Linking to glue together the Wavelength zone and the public AWS region using the native Kafka protocol for bi-directional replication in real-time

 

Energy Production and Distribution with a Hybrid Architecture using Apache Kafka and AWS Wavelength

 

The following diagram shows the same architecture from the perspective of the Wavelength zone where the low latency processing happens:

Energy Production at the Edge with Apache Kafka and AWS Wavelength

Implementation: Hybrid data processing with Kafka/Confluent, AWS Wavelength, and Verizon 5G

Diagrams are nice. But a real implementation is even better to demonstrate the value of low latency computing close to the edge, plus the integration with the edge devices and the public cloud. My colleague Joseph Morais led the implementation of a low-latency Kafka scenario with infrastructure provided by AWS and Verizon:

AWS Wavelength Kafka Confluent Cloud Verizon MEC Edge Architecture

We implemented a use case around real-time analytics with machine learning. A single data pipeline provides end-to-end integration in real-time across locations. The data comes from edge locations. The low latency processing happens in the AWS Wavelength zone. This includes data integration, preprocessing like filtering/aggregations, and model scoring for anomaly detection.
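To make the preprocessing step more concrete, here is a minimal Kafka Streams sketch of edge-side filtering. The broker address, topic names, and threshold are illustrative assumptions; the actual demo used ksqlDB and a trained model for anomaly detection rather than a fixed threshold:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EdgePreprocessing {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "edge-preprocessing");
        // Placeholder for a broker running in the Wavelength zone
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "wavelength-broker:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Double().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Double> readings = builder.stream("sensor-telemetry");

        readings
            // Drop obvious noise at the edge so only relevant events are replicated to the cloud
            .filter((sensorId, value) -> value != null && value > 0)
            // A fixed threshold stands in for the model scoring step of the real demo
            .filter((sensorId, value) -> value > 95.0)
            .to("anomalies");

        new KafkaStreams(builder.build(), props).start();
    }
}
```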

Cluster Linking (a Kafka-native built-in replication feature) replicates the relevant data to Confluent Cloud in the local AWS region. The cloud is used for batch use cases such as model training with AWS Sagemaker.

The demo shows a realistic hybrid end-to-end scenario that combines mission-critical low latency and analytical batch workloads.

Curious about the relation between Kafka and machine learning? I have written various blog posts about it. A good starting point: “Machine Learning and Real-Time Analytics in Apache Kafka Applications”.

Last mile integration: Direct Kafka connection vs gateway / bridge (MQTT / HTTP)?

The last mile integration is an important aspect. How do you integrate “the last mile”? Examples include mobile apps (e.g., ride-hailing), connected vehicles (e.g., predictive maintenance), or machines (e.g., quality assurance for the production line).

This is worth a longer discussion in its own blog post, but let’s do a summary here:

Kafka was not built for bad networks. And Kafka was not built for tens of thousands of connections. Hence, the decision is pretty straightforward. Option 1 is a direct connection with a Kafka client (using Kafka client APIs for Java, C++, Go, etc.). Option 2 is a scalable gateway or bridge (like an MQTT broker or HTTP proxy). When to use which one?

  • Use a direct connection via a Kafka client API if you have a stable network and only a limited number of connections (usually not higher than 1000 or so).
  • Use a gateway or bridge if you have a bad network infrastructure and/or tens of thousands of connections.

The blog series “Use Case and Architectures for Kafka and MQTT” gives you some ideas about use cases that require a bridge or gateway, for instance, connected cars and mobility services. But keep it as simple as possible. If a direct connection works for your use case, why add yet another technology with all its implications regarding complexity and cost?
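For illustration, here is a minimal sketch of such a bridge in Java, using the Eclipse Paho MQTT client to fan device messages into Kafka. The broker addresses and topic names are assumptions for this example; a production-grade gateway additionally handles buffering, QoS mapping, security, and reconnects:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.eclipse.paho.client.mqttv3.MqttClient;

public class MqttToKafkaBridge {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);

        // One stable connection to the MQTT broker instead of thousands of direct Kafka clients
        MqttClient mqtt = new MqttClient("tcp://mqtt-broker:1883", "kafka-bridge");
        mqtt.connect();
        // Forward every device message to Kafka, keyed by its MQTT topic
        mqtt.subscribe("vehicles/+/telemetry", (topic, message) ->
            producer.send(new ProducerRecord<>("vehicle-telemetry", topic, message.getPayload())));

        Thread.currentThread().join(); // keep the bridge running
    }
}
```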

Low Latency Data Processing Requires the Right Architecture

Low latency data processing is crucial for many use cases across industries. Processing data close to the edge is necessary if the applications cannot run in the cloud. Dedicated offerings such as AWS Wavelength leverage 5G networks to provide this cloud-native infrastructure. Event streaming with Apache Kafka provides the capabilities to implement edge computing and the integration with the cloud.

What are your experiences and plans for low latency use cases? What use case and architecture did you implement? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

 

The post Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure appeared first on Kai Waehner.

]]>
Comparison of Open Source Apache Kafka vs Vendors including Confluent, Cloudera, Red Hat, Amazon MSK https://www.kai-waehner.de/blog/2021/04/20/comparison-open-source-apache-kafka-vs-confluent-cloudera-red-hat-amazon-msk-cloud/ Tue, 20 Apr 2021 07:49:39 +0000 https://www.kai-waehner.de/?p=3233 Apache Kafka became the de facto standard for event streaming. Various vendors added Kafka and related tooling to their offerings or provide a Kafka cloud service. This blog post uses the car analogy - from the motor engine to the self-driving car - to explore the different Kafka offerings available on the market. The goal is not a feature-by-feature comparison. Instead, the intention is to educate about the different deployment models, product strategies, and trade-offs from the available options.

The post Comparison of Open Source Apache Kafka vs Vendors including Confluent, Cloudera, Red Hat, Amazon MSK appeared first on Kai Waehner.

]]>
Apache Kafka became the de facto standard for event streaming. The open-source community is huge. Various vendors added Kafka and related tooling to their offerings or provide a Kafka cloud service. This blog post uses the car analogy – from the motor engine to the self-driving car – to explore the different Kafka offerings available on the market. I also cover a few other vehicles, meaning (partly) Kafka-compatible technologies. The goal is not a feature-by-feature comparison (that would be outdated the day after the publication). Instead, the intention is to educate about the different deployment models, product strategies, and trade-offs from the available options.

What car would you choose?

Disclaimer: I work for Confluent. However, the post is not about comparing features but explaining the concepts behind the alternatives. I talk to enterprises across the globe every week. I can assure you that many people I talk to are not aware of, or are misled about, what you read in the following sections. Hence, I hope that the following helps you to make the right decision: run open-source Apache Kafka, choose one of the various commercial Kafka offerings, or even combine both.

UPDATE (August 2022): This blog post was written before Amazon MSK Serverless was released. The below is still accurate and worth a read for comparing Kafka products and cloud services. Additionally, please check out the article “When NOT to choose Amazon MSK Serverless for Apache Kafka?”

Apache Kafka Components and Use Cases

The goal is not to introduce Kafka here. The minimum you should know is that Kafka is NOT just a messaging layer for data ingestion into a data lake. That is just a fraction of today’s use cases.

Kafka is an open-source framework under Apache 2.0 license. It provides messaging, storage, processing, and integration of high volumes of data at scale, in real-time, and in a fault-tolerant way. That’s what makes Kafka unique compared to other MQ, ETL, ESB, and API platforms.

Kafka is deployed in production for various use cases across industries. This includes analytical and mission-critical workloads. Different deployments require different SLAs. You should always ask yourself what happens if the Kafka infrastructure is in trouble. What are your RTO (Recovery Time Objective) and RPO (Recovery Point Objective)? Or in other words: How much data is it okay to lose? How much downtime is acceptable? Keep these questions in mind when you start your comparison of the options!

Kafka is the De Facto Standard API for Event Streaming like S3 API for Object Storage

Apache Kafka is mainstream! The latest proof: Check out the new ThoughtWorks Technology Radar: “Kafka API without Kafka“:

Kafka API without Kafka - Thoughtworks Technology Radar

Kafka became the de facto event streaming API, similar to how the S3 API became the de facto standard for object storage. Actually, the situation is even better for the Kafka API, as the S3 API is a proprietary protocol from AWS. In contrast, the Kafka API and protocol are open source under Apache 2.0 license.

De Facto Standard API - Amazon S3 for Object Storage and Apache Kafka for Event Streaming

Check out the blog “Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage” for more details.

A variety of very different Kafka alternatives is available today, from self-managed open source to fully managed cloud services.

That’s a lot of options. So, how do you make a Kafka comparison to choose the right one? Before we go into more detail, let’s explore how complex Kafka actually is and whether you have to care about this at all.

Should you care how complex or heavyweight your event streaming technology is?

Complexity matters (only) if you need to operate the infrastructure by yourself. The beauty of SaaS is that you just consume the service and focus on your business problems. For instance, the AWS S3 object storage is a simple API with a fully managed service under the hood. You do not need to worry about operations or monitoring. You just use the cloud service.

Having said this, it is a little bit strange that ThoughtWorks mentions the barriers and complexity of Kafka but then refers to the Pulsar wrapper. That is an (immature) single-class mapping implementation that only covers a small part of Kafka’s protocol (here is the producer wrapper as an example). Developers can use that wrapper to move data between Kafka clients and Pulsar brokers. However, Pulsar has a much more complex three-tier distributed architecture with ZooKeeper, BookKeeper, and Pulsar clusters. What is the benefit here? Is this really what you want to do in mission-critical workloads? Why? Please let me know if you seriously consider using such a wrapper architecture. Also, please read my post about the “Myths of Kafka vs. Pulsar”. Many arguments like the Kafka wrapper are just marketing and not usable for real-world projects!

Therefore, when I think about using the Kafka API without operating Kafka, then I have fully managed SaaS offerings such as Confluent Cloud or Azure Event Hubs in my mind.

Having said this, a fully managed cloud service is not always an option. For instance, Kafka at the edge is the new black. Plenty of use cases exist for a single broker or highly-available Kafka clusters at the edge.

But even if you want or need to operate Kafka by yourself: With KIP-500 and the removal of ZooKeeper, it gets easier and more lightweight than ever before. Many arguments for moving to a more “lightweight alternative” no longer exist. There might be good reasons to choose something like RedPanda. But the main argument of a simpler and more lightweight deployment no longer holds. Check out this video showing Kafka without ZooKeeper.

How to Choose the Right Kafka Distribution or Cloud Service?

So, how do you make a comparison to find out which Kafka distribution or cloud service is the right one for your project?

The answer is simpler than you might think: The ultimate goal is to focus on and solve your business problem.

How do you do that? By implementing business logic. Ideally, you don’t have to worry about infrastructure, operations, security, scalability, reliability, and non-business characteristics. Hence, SaaS with fully managed Kafka should be the first choice.

Unfortunately, SaaS is not always possible or the best option for many reasons:

  • Missing features
  • Technical limitations
  • Cost
  • Security requirements
  • On-premise or edge use cases

Therefore, we need to go one step back and understand what options you have to deploy and operate Kafka. Understanding these concepts without all the marketing fluff from the vendors is crucial to make the right decision!

The Kafka Car: An Analogy for a Product Comparison

It is often easier to compare technology by using an analogy from real life: something everybody understands, no matter what industry you are in or how technical you are. First, I thought about using the analogy of pizza, including self-made pizza, pizza ingredients, restaurants, delivery services, and other related topics. But pizza is used so often in the IT world. This originated in the early days of Amazon, when Jeff Bezos instituted a rule: Every internal team should be small enough to be fed with two pizzas.

Finally, I chose the analogy of a car because I think many of the arguments are less debatable this way. I guess we could never agree on what would be the best pizza option for most people… 🙂

Hence, let’s talk about car engines, car brands, self-driving, connected fleets, and vintage vehicles in the following sections.

How to choose the right Apache Kafka Offering - Confluent Cloudera Red Hat IBM Amazon AWS MSK

Give me a self-driving car, please!

Obviously, most people would prefer a self-driving car (if the price is right). It is safe, cost-efficient, and comfortable.

In the Kafka context, this means that the Kafka infrastructure would be

  • cloud-native (= elastic, scalable and automated, ideally fully managed)
  • complete (= entire set of security and operational features that enterprises require)
  • everywhere (= available in multiple public clouds, private cloud, on-premise, edge outside the data center)

Unfortunately, not every Kafka setup can be self-driving. We need to disassemble a car into its parts to understand what’s going on under the hood. Then we can choose the right car for our business problem.

Car Brands: Comparison of Confluent, Cloudera, Red Hat, Amazon MSK

Competition creates innovation. Hence, it is great to see many car brands and car models on the streets. Similarly, many competing companies fight for market share in the Kafka business. Let’s quickly think about the available car brands (= Kafka vendors) on the market.

Comparison of Apache Kafka Products and Services

I only focus on the most relevant ones that either care about the Kafka project and community, have a lot of market power, or ideally both.

The car brands are Confluent, Cloudera, Red Hat, and Amazon MSK. I have a section on other Kafka and non-Kafka streaming vendors at the end of the blog post to provide a more detailed comparison.

Again, the idea is NOT to have a feature-by-feature comparison or flame war. The following are a few facts about each vendor. I only focus on Kafka-related points. Hence, it is no surprise that Confluent looks best in the following list as they only focus on event streaming. But obviously, each vendor has strengths and weaknesses. For instance, if you want to discuss the overall cloud infrastructure capabilities and strategy, well, then AWS would look much stronger than all the other Kafka vendors… 🙂

Confluent – The Leading Apache Kafka Vendor

A few facts about Confluent:

  • Focus on event streaming
  • Original creators of Kafka
  • The main contributor to the Apache Kafka project with 80% of Kafka commits
  • Always the latest Kafka version (without limitations) and full support
  • Rich Kafka ecosystem (connectors, governance, security, etc.)
  • Hybrid architectures (including the only true fully-managed and complete Kafka service)
  • Partnership and 1st party integration into cloud providers (AWS, GCP, Azure) – e.g., you can use your cloud provider credits and account to consume Confluent Cloud
  • Certified for self-managed operations on cloud providers’ edge offerings (e.g., AWS Outpost including Wavelength, Google’s Anthos)

Cloudera – Big Data Analytics Suite

A few facts about Cloudera:

  • Focus on big data analytics
  • Provides a platform around tens of different big data frameworks for storage, batch, and real-time analytics
  • Kafka is part of the platform (Hadoop, Spark, Flume, Flink, many more) with tooling and support for the whole platform
  • Hybrid architectures (but no fully-managed Kafka service)
  • Partnership and 3rd party integration into cloud providers (AWS, GCP, Azure)

Red Hat (IBM) – Cloud-native PaaS Infrastructure

A few facts about Red Hat (IBM):

  • Focus on infrastructure (mainly around Linux and Kubernetes)
  • Kafka is available as part of the Red Hat AMQ product portfolio, combined with other open-source frameworks like ActiveMQ or Camel
  • OpenShift Streams for Apache Kafka provides integration with Kubernetes
  • Focus on open source frameworks; working actively with the community (for Kafka, Red Hat, e.g., contributes to Debezium for CDC and the Strimzi Kubernetes Operator)
  • Hybrid Architectures (but no fully-managed Kafka service)
  • Partnership and 3rd party integration into cloud providers (AWS, GCP, Azure)

Interesting side notes for the relationship between Confluent, Red Hat, and IBM:

  • IBM acquired Red Hat in 2019.
  • Confluent and IBM announced a strategic partnership in 2020
  • IBM deprecated its own Kafka offering (IBM Streams) in March 2021.
  • Confluent is the way to go with IBM as part of the IBM Cloud Pak for Integration. Even IBM’s salespeople sell Confluent.

Amazon Web Services (AWS) – The Leading Cloud Provider

AWS focuses on cloud infrastructure and 1st party fully managed cloud services (S3, Kinesis, Lambda, etc.).

A few facts about Amazon MSK, the AWS offering for Kafka:

  • MSK misses several key Kafka features, including Kafka Connect or Kafka Streams
  • Cloud-only (but only self-managed, not fully managed)
  • MSK is not cloud-native (like S3 or Kinesis) but just provisioned infrastructure
  • Obviously only available on AWS
  • For on-premise deployments (like AWS Outpost or AWS Wavelength), the recommended Kafka product is the certified Confluent Platform

Interesting side note about the commercial support and SLAs of AWS’s Kafka offering: Kafka is excluded from MSK support! Quote from the MSK SLAs: “The Service Commitment DOES NOT APPLY to any unavailability, suspension, or termination… caused by the underlying Apache Kafka or Apache ZooKeeper engine software that leads to request failures…”

Event Streaming Technology and Cloud-native Infrastructure are Complementary!

The above showed a few facts for the main Kafka vendors: Confluent, Cloudera, Red Hat, AWS. However, it is worth explicitly pointing out that these vendors are often complementary. For instance, most Confluent Platform deployments I see on Kubernetes on-premise are actually on Red Hat OpenShift. And with AWS’s huge market share, most self-managed Confluent deployments in the cloud are on AWS.

Also, Confluent Platform is certified on AWS Outpost and Google Anthos. Hence, you can even combine cloud-native technologies at the edge. A great example is smart factory 5G use cases leveraging Confluent Platform on AWS Wavelength. Consequently, a Kafka comparison does typically not eliminate all the other Kafka vendors from the project.

The following architecture depicts the combination of Confluent Cloud in AWS plus Confluent Platform on AWS Wavelength leveraging 5G Carrier networks:

Hybrid Cloud Architecture for Energy and Smart Grid with Apache Kafka and AWS Wavelength 5G

This is not just theory. The joint teams from AWS and Confluent are working on this example in the real world while I am writing this blog post.

Cloud-native? Complete? Everywhere? What Kafka should I buy?

After exploring different vendors, let’s now walk through the different deployment options and commercial offerings.

Again, I will not make a feature-by-feature comparison. Way more important is to understand the different concepts and architecture principles: First of all, you need to decide if a self-driving car (= fully managed Kafka) works for you. In that case, why bother at all about Kafka operations? Otherwise, project teams must evaluate partially managed (= complete car) or self-managed (car engine) Kafka offerings.

Here is an overview showing the event streaming landscape. It contains native Kafka offerings, (partly) Kafka-protocol compatible products, and a few relevant non-Kafka solutions:

Event Streaming Landscape with Apache Kafka Related Products and Cloud Services including Confluent Cloudera Red Hat IBM Amazon MSK Redpanda Pulsar Azure Event Hubs

Let’s now take a deeper look into these alternatives to find out how to choose the right one for your next project.

Car Engine: Self-managed Open Source Apache Kafka

The car engine is the heart of the car. It provides the power. It brings you from your source to your destination. However, a lot of work is needed around the engine: tires, steering wheel, brakes, and much more are required. Hence, this is a great solution for playing around, learning how a car works, or for enthusiastic car fanatics who want to build their own car.

If you download open-source Apache Kafka from the Apache website or related Docker images, then you can use it for free in all your projects. No limitations. You should be able to get it running quickly. However, be aware that, similar to the motor engine of a car, there is much more to do: operating and monitoring the ZooKeeper and Kafka clusters, rebalancing partitions, scaling up and down, managing storage, securing and encrypting the end-to-end communication between producers, the Kafka cluster, and consumers, and so much more.
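To give a taste of what that self-managed operations work looks like, here is a small sketch using the Kafka AdminClient in Java to check for under-replicated partitions: one of many health checks you would have to build, schedule, and act on yourself (the bootstrap address is a placeholder):

```java
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class ClusterHealthCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            Set<String> topics = admin.listTopics().names().get();
            for (TopicDescription topic : admin.describeTopics(topics).all().get().values()) {
                topic.partitions().forEach(p -> {
                    // Fewer in-sync replicas than configured replicas means under-replicated
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("Under-replicated: %s-%d%n", topic.name(), p.partition());
                    }
                });
            }
        }
    }
}
```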

If you can handle the operations burden and risk of downtime, open-source Apache Kafka might be a good option. Some tech giants from Silicon Valley do exactly this. They have hired masses of tech experts (or car fanatics to keep the analogy) to run huge Kafka clusters to process trillions of messages and gigabytes of data per second.

Free Kafka add-ons to build your car

Plenty of open-source Kafka add-ons originated this way. Just to name a few tools for very different purposes: kafkacat, Kafka Manager, Kafdrop, Burrow, Cruise Control, and many more. Some are maintained well, others not at all. Of course, you will never get guarantees for a version upgrade or bug fix. Many were built by a tech giant for its specific scenario and are not easily usable outside that organization, nor do they have a big community.

Alternatively, there are well-maintained community projects like Confluent’s Schema Registry, REST Proxy, and ksqlDB, all under Confluent Community License (CCL). This is not open source but free to use if you are not a cloud provider like AWS. Confluent also provides some components under Apache 2.0 license, such as the widely used non-Java Kafka clients based on librdkafka or the parallel-consumer to integrate with non-scalable interfaces like web services in a scalable and performant way.

Tuned Car Engine: Self-managed Kafka Product

If you want or need to self-manage your Kafka infrastructure, then you still have more options than just using open-source Apache Kafka and (well or not so well maintained) open-source add-ons:

  • Open-source Apache Kafka with additional commercial tools for operations and monitoring. For instance, Lenses or Conduktor.
  • Complete commercial platforms. For instance, Confluent Platform, Red Hat AMQ, Cloudera DataFlow.

These “tuned car engines” are based on top of Apache Kafka (or at least parts of it) and provide additional tooling for development, operations, monitoring, security, etc. The maturity of the tools, support SLAs, expertise, and consulting vary a lot between vendors. I recommend talking to your potential vendors. Ask the right questions to understand if they really understand what they seem to sell and support.

Shiny user interfaces attract many people. Just be careful. The underlying technology needs to work reliably and scale for your needs. The UI is nice to have on top of the infrastructure. Nevertheless, a good UI can improve the developer experience, increase time to market, and bring other benefits.

Should I use a (Tuned) Car Engine and Build my own Car?

All the explored options above are still self-managed. If you consider building your own car with a car engine, always evaluate the cost-benefit equation.

Remember, at the beginning of this post, I talked about solving business problems. Hence, don’t forget to consider all the impacts on:

  • Total Cost of Ownership (TCO)
  • Risk (downtime, data loss, security, governance, etc.)
  • Return on Investment (ROI)
  • Time-to-market (Increased developer velocity and increased business agility)

At Confluent, we do TCO assessments with our prospects and customers so that they understand the complete costs and risks of a Kafka project. Such an assessment should be part of every Kafka comparison!

Hence, don’t forget to seriously evaluate alternatives to open-source Apache Kafka. If self-managed Apache Kafka still works for you after the evaluation, then do it! But be aware that even the tech giants from Silicon Valley consider and buy other options today. Many had to build their own Kafka infrastructure years ago because no alternative existed at the time.

Does my favorite open-source vendor really provide open-source?

Also, be careful: Some open-source solutions don’t provide an easy way to build the product. So, evaluate what exactly is available from a so-called “open source offering”: Only a binary download? Docker images? Or can you also build and deploy everything from scratch easily and in a documented way (not just in theory!) using Maven, Gradle, Terraform, Ansible, or similar build and automation tools?

Before building your own car with an available car engine, why not buy a car? Let’s consider next if a complete car might make more sense for you.

Complete Car: Kafka Products and Kubernetes-based PaaS

I am not a car fanatic. I want to buy a complete car that I can drive everywhere. You should also at least consider this option!

In Kafka terms, this means you get help from the product for running the Kafka infrastructure. Buzzwords from vendors include terms like “platform as a service”, “private cloud”, “fully managed”, and “cloud-native”. In the end, the products help you with provisioning, operating, and monitoring everything.

The main benefit compared to the (tuned) car engines is that these products give you a more elastic, scalable, and automated infrastructure. Two options exist today:

  • Kubernetes-based products that can run everywhere on-premise and across multiple cloud providers. Examples: Confluent Platform, Cloudera DataFlow (CDF), Red Hat AMQ respectively Red Hat OpenShift Streams for Apache Kafka.
  • Proprietary cloud offerings that are typically tied to the related hyperscaler. Example: Amazon MSK.

This is similar to buying a car: It comes preassembled. Although, of course, you are still responsible for operating and maintaining it.

In Kafka terms, for many scenarios, these self-managed Kafka products (= car) are a better choice than the self-managed Kafka (= car engine) because they partially reduce the operations burden, risk, and (hopefully) TCO.

A complete car is still not self-driving!

However, as you know, today’s cars still need a lot of manual work: driving, refueling, maintenance, and more. In Kafka terms: How much work do you still have to do by yourself? Do you have to handle rolling upgrades manually? Do you have to rebalance partitions on brokers? How do you scale up and down? Who fixes security issues and bugs? And so on.

Hence, it is really a pity that most vendors intentionally use incorrect marketing! None of the above solutions is fully managed. All of them require work from you to operate the Kafka cluster. All of them! Confluent Platform. Cloudera DataFlow. Red Hat AMQ. Red Hat OpenShift Streams for Apache Kafka. Amazon MSK. None of these services is fully managed!

Each car brand and model is different. If you buy a Porsche, you probably have very different expectations than when buying a medium-priced car from another brand. The same is true for all the self-managed Kafka products on the market: Confluent Platform, Cloudera DataFlow, Red Hat AMQ / OpenShift Streams for Apache Kafka, and Amazon MSK. All of them have strengths and weaknesses. Make sure the product fits your expectations so that you can solve your business problem within your required SLAs and budget.

Having said this, wouldn’t it be nice if you don’t have to worry about all these things? Let’s explore the self-driving Kafka car next.

Self-driving Car: Fully-managed Cloud Kafka Service

A self-driving car provides a complete solution. You just tell it where you want to go. It drives you automatically, chooses the best route, and lets you relax, read, play games, or do similar things. Of course, an autonomous car with level 5 automation is not mature yet (beyond some early stages, like Waymo operating in the desert around Phoenix, where hardly any rain or other weather and traffic issues occur).

In Kafka terms, the solution needs to be fully managed by the vendor.

Fully managed means serverless (i.e., you don’t have to care about, and don’t even get access to, the Kafka brokers at all). Mission-critical SLAs. Usage-based billing. And so on. Just like you know it from other truly fully managed cloud offerings such as AWS S3 or AWS Kinesis. These are fully managed. Amazon MSK is not!

Checklist to compare partially managed and fully-managed Kafka cloud services

Please compare different Kafka cloud offerings by yourself. Here are some bullet points to check:

Infrastructure management

  • Upgrades (latest stable version of Kafka)
  • Patching
  • Maintenance

Kafka-specific management

  • Sizing (retention, latency, throughput, storage, etc.)
  • Data balancing for optimal performance
  • Performance tuning for real-time and latency requirements
  • Fixing Kafka bugs
  • Uptime monitoring and proactive remediation of issues
  • Recovery support from data corruption

Scaling

  • Scaling the cluster as needed
  • Data balancing the cluster as nodes are added
  • Support for any Kafka issue with less than 60-minute response time

Most “Kafka as a service” offerings are only partially managed. That’s like a self-driving car which you actually have to control by yourself (more like level 3, not level 5 in autonomous driving terminology).

At this point, I have to do marketing for my employer. However, it is not an advertisement, but reality: Confluent Cloud is the only offering on the market that provides a complete, fully managed Kafka SaaS offering. And it is available everywhere – on all major cloud providers (AWS, Azure, GCP). All the other Kafka offerings are NOT fully managed – even though most vendors claim it!

Other Vehicles on the Street: Comparison of Kafka-Compatible and Non-Kafka Offerings

On the street, we don’t see just one car brand or car model. Plenty of different ones exist. Nevertheless, they have to drive on the same streets. Competition creates innovation and tackles different markets and personal interests. That’s great. The same is true for Kafka!

I focused on the “mainstream Kafka vendors” in the above sections. Namely, Confluent, Cloudera, Red Hat, Amazon MSK. Obviously, more Kafka offerings exist on the market. Some are really good for some use cases. Others are more like an April Fools’ Joke, in my opinion. Let’s quickly walk through a few other offerings.

  • A few more car brands: Azure HDInsight’s Kafka, Aiven, CloudKarafka, Instaclustr. These Kafka-native PaaS vendors provision Kafka clusters for you, similar to Amazon MSK. The offerings slightly differ from each other, but they typically ask you to do storage management, scalability configuration, performance tuning, etc., by yourself. This is definitely not self-driving!
  • A self-driving car: Azure Event Hubs is a SaaS offering from Microsoft supporting the Kafka protocol. It has several limitations regarding support of the Kafka API and infrastructure. A solid product. Contrary to Confluent Cloud, you don’t get additional capabilities such as fully-managed connectors, Schema Registry, RBAC, Audit Logs, and much more. And obviously, this product is only available on the Azure cloud.
  • A vintage car: TIBCO focuses on its legacy messaging solutions like TIBCO EMS. They (try to) provide support for Kafka (and Pulsar) to sell their proprietary technologies, with zero expertise or interest in Kafka. They even ship Kafka as a Windows .exe file, even though this does not work well in reality. If you need to run Kafka brokers on Windows (e.g., for development), only use Kafka Docker containers and the Windows Subsystem for Linux 2 (WSL 2).

Non-Kafka offerings

  • Self-driving scooters: AWS Kinesis, GCP Pub/Sub, etc., are solid SaaS offerings that work well if you don’t need to be vendor-agnostic and if the feature set, scalability, and pricing work for you.
  • A few bicycles, motorbikes, and cars: Non-Kafka solutions, including message queuing (IBM MQ, RabbitMQ, NATS), stream processing (Flink, Spark Streaming), event streaming (Pulsar, Pravega), integration middleware (many open-source/proprietary and self-managed/SaaS). These are solid frameworks and products that you can compare to Kafka. There is no silver bullet! Make sure to understand the differences between MQ/ETL/ESB and Kafka when you do your evaluation.

Connected Car Fleet: Multiple Kafka Clusters and Hybrid Integration

The digital transformation around connected vehicles is a real game-changer. Vehicles talk to each other (V2V), their infrastructure like traffic lights (V2I), and to many other backend systems (V2X).

As a side note: If you are interested in the relation of Kafka and connected vehicles/mobility services, I covered use cases for connected vehicles and V2X in my blog series about Kafka and MQTT in more detail.

Today, we usually have to drive by ourselves. This is expected to change in the next five to ten years. However, even if Waymo, Tesla, and the like successfully deploy level 4 and level 5 cars to the streets (including legal approval), we will still only see a fraction of all cars driving themselves. It will be a connected fleet with regular cars and self-driving cars for at least a few decades. I am not even sure if self-driving cars can ever go to India 🙂

The same is true for Kafka. Self-managed open-source Kafka is still mainstream today. Many enterprises move to Kubernetes and private or public cloud infrastructure, though. In parallel, most new Kafka clusters in the cloud are consumed from vendors that provide partially or fully managed services so that the enterprise can focus on their business problems.

Kafka is deployed across infrastructures. Often, new projects have a cloud-first strategy. But there are still a lot of data centers out there. Not just for legacy reasons. For instance, in Russia, there is no public cloud provider at all. Kafka has to be deployed on-premise. And there is the trend of deploying Kafka at the edge (i.e., outside a data center).

Architectures for Hybrid Kafka (SaaS + PaaS + Self-Managed)

Hence, a connected car fleet with various brands and operation types is required. Most enterprises use different vendors and cloud providers. Most enterprises have their own data centers and a multi-cloud strategy. Hybrid Kafka covers various architectures, including:

  • Kafka in one or multiple clouds. There is no Azure or GCP in China, only Alibaba and Tencent Cloud. This is why Audi built their connected car infrastructure in the cloud, but with Kafka instead of proprietary cloud services. They need to deploy globally.
  • Kafka at the edge outside the data center, e.g., in a smart factory, oil rigs, ships, retail stores, etc. Often deployed as a single broker on very lightweight hardware, without high availability.
  • Kafka stretched across regions, i.e., one single cluster operating across the US west, east, central. Confluent’s Multi-Region Clusters (MRC) is mainly used for this architecture.
  • Replication between different Kafka clusters. Use cases include aggregation, disaster recovery, global deployments, and more. Kafka-native technologies such as MirrorMaker 2, Confluent Replicator, or Confluent Cluster Linking enable these architectures (see the sketch after this list).
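Conceptually, replication between clusters is just consuming from one cluster and producing to another. The following naive Java sketch illustrates the idea only; the cluster addresses and topic are placeholders, and real deployments should use MirrorMaker 2, Confluent Replicator, or Cluster Linking, which additionally handle offset translation, topic configuration, ordering, and failover:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NaiveReplicator {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "source-cluster:9092"); // placeholder
        consumerProps.put("group.id", "naive-replicator");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "target-cluster:9092"); // placeholder
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofMillis(500))) {
                    // Forward each record to the same topic name on the target cluster
                    producer.send(new ProducerRecord<>(record.topic(), record.key(), record.value()));
                }
            }
        }
    }
}
```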

The blog post “Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” explores this topic in more detail.

Focus on the Business Problem when making your Kafka Comparison!

This blog post explored the different deployment options for Kafka. Several open-source and commercial options exist.

If you want to remember one thing from this post: A fully managed Kafka service (= real SaaS) takes over all the operations complexity and risk for you, similar to how a self-driving car handles all the actions on the street. However, most services available today only provide self-managed Kafka clusters. “Fully managed” is often only a marketing term.

A hybrid architecture is the norm in most enterprises. A combination of fully managed Kafka in the public cloud with self-managed Kafka on-premise or at the edge works very well and is the way to go for most enterprises across industries.

What Kafka car do you drive today? What is your plan for the future? Maybe you are already planning to migrate to a self-driving car to focus on your business problems – and consequently reduce cost and risk, too? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Comparison of Open Source Apache Kafka vs Vendors including Confluent, Cloudera, Red Hat, Amazon MSK appeared first on Kai Waehner.

]]>