Fraud Detection in Mobility Services (Ride-Hailing, Food Delivery) with Data Streaming using Apache Kafka and Flink

Mobility services like Uber, Grab, FREE NOW (Lyft), and DoorDash are built on real-time data. Every trip, delivery, and payment relies on accurate, instant decision-making. But as these services scale, they become prime targets for sophisticated fraud—GPS spoofing, fake accounts, payment abuse, and more. Traditional, batch-based fraud detection can’t keep up. It reacts too late, misses complex patterns, and creates blind spots that fraudsters exploit. To stop fraud before it happens, mobility platforms need data streaming technologies like Apache Kafka and Apache Flink for fraud detection. This blog explores how leading platforms are using real-time event processing to detect and block fraud as it happens—protecting revenue, user trust, and platform integrity at scale.

Fraud Prevention in Mobility Services with Data Streaming using Apache Kafka and Flink with AI Machine Learning

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

The Business of Mobility Services (Ride-Hailing, Food Delivery, Taxi Aggregators, etc.)

Mobility services have become an essential part of modern urban life. They offer convenience and efficiency through ride-hailing, food delivery, car-sharing, e-scooters, taxi aggregators, and micro-mobility options. Companies such as Uber, Lyft, FREE NOW (formerly MyTaxi; recently acquired by Lyft), Grab, Careem, and DoorDash connect millions of passengers, drivers, restaurants, retailers, and logistics partners to enable seamless transactions through digital platforms.

Taxis and Delivery Services in a Modern Smart City

These platforms operate in highly dynamic environments where real-time data is crucial for pricing, route optimization, customer experience, and fraud detection. However, this very nature of mobility services also makes them prime targets for fraudulent activities. Fraud in this sector can lead to financial losses, reputational damage, and deteriorating customer trust.

To effectively combat fraud, mobility services must rely on real-time data streaming with technologies such as Apache Kafka and Apache Flink. These technologies enable continuous event processing and allow platforms to detect and prevent fraud before transactions are finalized.

Why Fraud is a Major Challenge in Mobility Services

Fraudsters continually exploit weaknesses in digital mobility platforms. Some of the most common fraud types include:

  1. Fake Rides and GPS Spoofing: Drivers manipulate GPS data to simulate trips that never occurred. Passengers use location spoofing to receive cheaper fares or exploit promotions.
  2. Payment Fraud and Stolen Credit Cards: Fraudsters use stolen payment methods to book rides or order food.
  3. Fake Drivers and Passengers: Fraudsters create multiple accounts and pretend to be both the driver and passenger to collect incentives. Some drivers manipulate fares by manually adjusting distances in their favor.
  4. Promo Abuse: Users create multiple fake accounts to exploit referral bonuses and promo discounts.
  5. Account Takeovers and Identity Fraud: Hackers gain access to legitimate accounts, misusing stored payment information. Fraudsters use fake identities to bypass security measures.

Fraud not only impacts revenue but also creates risks for legitimate users and drivers. Without proper fraud prevention measures, ride-hailing and delivery companies could face serious losses, both financially and operationally.

The Unseen Enemy: Core Challenges in Mobility Fraud Detection

Traditional fraud detection relies on batch processing and manual rule-based systems. These approaches can no longer keep up with the speed and complexity of modern real-time mobility apps, nor with the constantly evolving fraud schemes that target them.

Payment Fraud – The Hidden Enemy in a Digital World

Key challenges in mobility fraud detection include:

  • Fraud occurs in real-time, requiring instant detection and prevention before transactions are completed.
  • Millions of events per second must be processed, requiring scalable and efficient systems.
  • Fraud patterns constantly evolve, making static rule-based approaches ineffective.
  • Platforms operate across hybrid and multi-cloud environments, requiring seamless integration of fraud detection systems.

To overcome these challenges, real-time streaming analytics powered by Apache Kafka and Apache Flink provide an effective solution.

Event-driven Architecture for Mobility Services with Data Streaming using Apache Kafka and Flink

Apache Kafka: The Backbone of Event-Driven Fraud Detection

Kafka serves as the core event streaming platform. It captures and processes real-time data from multiple sources such as:

  • GPS location data
  • Payment transactions
  • User and driver behavior analytics
  • Device fingerprints and network metadata

Kafka provides:

  • High-throughput data streaming, capable of processing millions of events per second to support real-time decision-making.
  • An event-driven architecture that enables decoupled, flexible systems—ideal for scalable and maintainable mobility platforms.
  • Seamless scalability across hybrid and multi-cloud environments to meet growing demand and regional expansion.
  • Always-on reliability, ensuring 24/7 data availability and consistency for mission-critical services such as fraud detection, pricing, and trip orchestration.
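
To make the ingestion side concrete, here is a minimal sketch of a Java producer publishing a GPS event to Kafka. The topic name, broker address, and JSON payload shape are illustrative assumptions, not details from any of the platforms discussed here.

Java Example (Kafka Producer):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class GpsEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by trip ID keeps all GPS points of one trip ordered within a partition
            String tripId = "trip-4711";
            String gpsEvent = "{\"tripId\":\"trip-4711\",\"lat\":52.52,\"lon\":13.405,\"ts\":1714286400000}";
            producer.send(new ProducerRecord<>("gps-events", tripId, gpsEvent));
        }
    }
}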

An excellent success story about the transition to data streaming comes from DoorDash: Why DoorDash migrated from Cloud-native Amazon SQS and Kinesis to Apache Kafka and Flink.

Apache Flink: Real-Time Fraud Detection with Stream Processing and AI

Apache Flink enables real-time fraud detection through advanced event correlation and applied AI:

  • Detects anomalies in GPS data, such as sudden jumps, route manipulation, or unrealistic movement patterns.
  • Analyzes historical user behavior to surface signs of account takeovers or other forms of identity misuse.
  • Joins multiple real-time streams—including payment events, location updates, and account interactions—to generate accurate, low-latency fraud scores.
  • Applies machine learning models in-stream, enabling the system to flag and stop suspicious transactions before they are processed.
  • Continuously adapts to new fraud patterns, updating models with fresh data in near real-time to reflect evolving user behavior and emerging threats.
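
As a minimal sketch of the first capability, a keyed Flink function can hold the previous GPS point per trip in state and flag physically impossible jumps. The GpsEvent and FraudAlert types, their field names, and the 200 km/h threshold are assumptions for illustration only.

Java Example (Apache Flink):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical POJOs: GpsEvent(lat, lon, timestamp) and FraudAlert(tripId, reason)
public class GpsJumpDetector extends KeyedProcessFunction<String, GpsEvent, FraudAlert> {

    private transient ValueState<GpsEvent> lastPoint;

    @Override
    public void open(Configuration parameters) {
        lastPoint = getRuntimeContext().getState(
                new ValueStateDescriptor<>("last-gps-point", GpsEvent.class));
    }

    @Override
    public void processElement(GpsEvent current, Context ctx, Collector<FraudAlert> out) throws Exception {
        GpsEvent previous = lastPoint.value();
        if (previous != null) {
            double km = haversineKm(previous, current);
            double hours = (current.timestamp - previous.timestamp) / 3_600_000.0;
            // An average speed far beyond what a car can drive suggests GPS spoofing
            if (hours > 0 && km / hours > 200.0) {
                out.collect(new FraudAlert(ctx.getCurrentKey(), "implausible speed between GPS points"));
            }
        }
        lastPoint.update(current);
    }

    // Standard great-circle distance between two coordinates in kilometers
    private static double haversineKm(GpsEvent a, GpsEvent b) {
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLon = Math.toRadians(b.lon - a.lon);
        double h = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371.0 * 2 * Math.asin(Math.sqrt(h));
    }
}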

With Kafka and Flink, fraud detection can shift from reactive to proactive to stop fraudulent transactions before they are completed.

I already covered various data streaming success stories from financial services companies such as PayPal, Capital One, and ING Bank in a dedicated blog post, as well as a separate case study about “Fraud Prevention in Under 60 Seconds with Apache Kafka: How A Bank in Thailand is Leading the Charge“.

Real-World Fraud Prevention Stories from Mobility Leaders

Fraud is not just a technical issue—it’s a business-critical challenge that impacts trust, revenue, and operational stability in mobility services. The following real-world examples from industry leaders like FREE NOW (Lyft), Grab, and Uber show how data streaming with advanced stream processing and AI is used around the world to detect and stop fraud in real time, at massive scale.

FREE NOW (Lyft): Detecting Fraudulent Trips in Real Time by Analyzing GPS Data of Cars

FREE NOW operates in more than 150 cities across Europe with 48 million users. It integrates multiple mobility services, including taxis, private vehicles, car-sharing, e-scooters, and bikes.

The company was recently acquired by Lyft, the U.S.-based ride-hailing giant known for its focus on multimodal urban transport and strong presence in North America. This acquisition marks Lyft’s strategic entry into the European mobility ecosystem, expanding its footprint beyond the U.S. and Canada.

FREE NOW - former MyTaxi - Company Overview
Source: FREE NOW

Fraud Prevention Approach leveraging Data Streaming (presented at Kafka Summit)

  • Uses Kafka Streams and Kafka Connect to analyze GPS trip data in real-time.
  • Deploys fraud detection models that identify anomalies in trip routes and fare calculations.
  • Operates data streaming on fully managed Confluent Cloud and applications on Kubernetes for scalable fraud detection.

Fraud Prevention in Mobility Services with Data Streaming using Kafka Streams and Connect at FREE NOW
Source: FREE NOW

Example: Detecting Fake Rides

  1. A driver inputs trip details into the app.
  2. Kafka Streams predicts expected trip fare based on distance and duration.
  3. GPS anomalies and unexpected route changes are flagged.
  4. Fraud alerts are triggered for suspicious transactions.
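
As a rough Kafka Streams sketch of steps 2 to 4, a topology could compare the reported fare against a simple expected-fare model and route outliers to an alerts topic. The topic names, the Trip type, and the linear fare model are invented for this example; the production models are far more sophisticated.

Java Example (Kafka Streams):

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Trip> trips = builder.stream("trips"); // hypothetical topic and Trip POJO

trips.filter((tripId, trip) -> {
        // Naive linear fare model: base fare plus per-km and per-minute rates (assumed values)
        double expectedFare = 3.0 + 1.5 * trip.getDistanceKm() + 0.3 * trip.getDurationMinutes();
        // Flag trips whose reported fare deviates more than 50 percent from the prediction
        return Math.abs(trip.getReportedFare() - expectedFare) / expectedFare > 0.5;
    })
    .to("fraud-alerts");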

By implementing real-time fraud detection with Kafka Streams and Kafka Connect, FREE NOW (Lyft) has significantly reduced fraudulent trips and improved platform security.

Grab: AI-Powered Fraud Detection for Ride-Hailing and Delivery with Data Streaming and AI/ML

Grab is a leading mobility platform in Southeast Asia, handling millions of transactions daily. In the region, fraud losses amount to 1.6 percent of total revenue.

To address these significant fraud numbers, Grab developed GrabDefence—an AI-powered fraud detection engine that leverages real-time data and machine learning to detect and block suspicious activity across its platform.

Fraud Detection and Presentation with Kafka and AI ML at Grab in Asia
Source: Grab

Fraud Detection Approach

  • Uses Kafka Streams and machine learning for fraud risk scoring.
  • Leverages Flink for feature aggregation and anomaly detection.
  • Detects fraudulent transactions before they are completed.

GrabDefence - Fraud Prevention with Data Streaming and AI / Machine Learning in Grab Mobility Service
Source: Grab

Example: Fake Driver and Passenger Fraud

  1. Fraudsters create accounts as both driver and passenger to claim rewards.
  2. Kafka ingests device fingerprints, payment transactions, and ride data.
  3. Flink aggregates historical fraud behavior and assigns risk scores.
  4. High-risk transactions are blocked instantly.
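
A drastically simplified version of step 3 could be expressed in Flink SQL, embedded here in Flink’s Java Table API. The rides table, its columns, and the threshold are assumptions for illustration; GrabDefence’s real feature pipeline is far richer.

Java Example (Apache Flink):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

// Assumes a 'rides' table (e.g., backed by a Kafka topic) with a device fingerprint per account
tableEnv.executeSql(
    "SELECT driver_device_id, COUNT(*) AS suspicious_rides " +
    "FROM rides " +
    "WHERE driver_device_id = passenger_device_id " + // same device acts as driver and passenger
    "GROUP BY driver_device_id " +
    "HAVING COUNT(*) > 3");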

With GrabDefence built with data streaming, Grab reduced fraud rates to 0.2 percent, well below the industry average. Learn more about GrabDefence in the Kafka Summit talk.

Uber: Project RADAR – AI-Powered Fraud Detection with Human Oversight

Uber processes millions of payments per second globally. Fraud detection is complex due to chargebacks and uncollected payments.

To combat this, Uber launched Project RADAR—a hybrid system that combines machine learning with human reviewers to continuously detect, investigate, and adapt to evolving fraud patterns in near real time. Because low latency is not required in this scenario and humans are in the loop of the business process, Apache Spark is sufficient for Uber.

Uber Project Radar for Scam Detection with Humans in the Loop
Source: Uber

Fraud Prevention Approach

  • Uses Kafka and Spark for multi-layered fraud detection.
  • Implements machine learning models to detect chargeback fraud.
  • Incorporates human analysts for rule validation.

Uber Project RADAR with Apache Kafka and Spark for Scam Detection with AI and Machine Learning
Source: Uber

Example: Chargeback Fraud Detection

  1. Kafka collects all ride transactions in real time.
  2. Stream processing detects anomalies in payment patterns and disputes.
  3. AI-based fraud scoring identifies high-risk transactions.
  4. Uber’s RADAR system allows human analysts to validate fraud alerts.

Uber’s combination of AI-driven detection and human oversight has significantly reduced chargeback-related fraud.

Fraud in mobility services is a real-time challenge that requires real-time solutions that work 24/7, even at extreme scale for millions of events. Traditional batch processing systems are too slow, and static rule-based approaches cannot keep up with evolving fraud tactics.

By leveraging data streaming with Apache Kafka in conjunction with Kafka Streams or Apache Flink, mobility platforms can:

  • Process millions of events per second to detect fraud in real time.
  • Prevent fraudulent transactions before they occur.
  • Use AI-driven real-time fraud scoring for accurate risk assessment.
  • Adapt dynamically through continuous learning to evolving fraud patterns.

Mobility platforms such as Uber, Grab, and FREE NOW (Lyft) are leading the way in using real-time streaming analytics to protect their platforms from fraud. By implementing similar approaches, other mobility businesses can enhance security, reduce financial losses, and maintain customer trust.

Real-time fraud prevention in mobility services is not an option; it is a necessity. The ability to detect and stop fraud in real time will define the future success of ride-hailing, food delivery, and urban mobility platforms.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And download my free book about data streaming use cases.

Apache Flink: Overkill for Simple, Stateless Stream Processing and ETL?

Discover when Apache Flink is the right tool for your stream processing needs. Explore its role in stateful and stateless processing, the advantages of serverless Flink SaaS solutions like Confluent Cloud, and how it supports advanced analytics and real-time data integration together with Apache Kafka. Dive into the trade-offs, deployment options, and strategies for leveraging Flink effectively across cloud, on-premise, and edge environments, and when to use Kafka Streams or Single Message Transforms (SMT) within Kafka Connect for ETL instead of Flink.

When discussing stream processing engines, Apache Flink often takes center stage for its advanced capabilities in stateful stream processing and real-time data analytics. However, a common question arises: is Flink too heavyweight for simple, stateless stream processing and ETL tasks? The short answer for open-source Flink is often yes. But the story evolves significantly when looking at SaaS Flink products such as Confluent Cloud’s Flink offering, with its serverless architecture, multi-tenancy, consumption-based pricing, and no-code/low-code capabilities like Flink Actions. This post explores the considerations and trade-offs to help you decide when Flink is the right tool for your data streaming needs, and when Kafka Streams or Single Message Transforms (SMT) within Kafka Connect are the better choice.

Apache Flink - Overkill for Simple Stateless Stream Processing

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (formerly Twitter) to stay in touch.

The Nature of Stateless Stream Processing

Stateless stream processing, as the name implies, processes each event independently, with no reliance on prior events or context. This simplicity lends itself to use cases such as filtering, transformations, and simple ETL operations. Stateless tasks are:

  • Efficient: They don’t require state management, reducing overhead.
  • Scalable: Easily parallelized since there is no dependency between events.
  • Minimalistic: Often achievable with simpler, lightweight frameworks like Kafka Streams or Kafka Connect’s Single Message Transforms (SMT).

For example, filtering transactions above a certain amount or transforming event formats for downstream systems are classic stateless tasks that demand minimal computational complexity.

In these scenarios, deploying a robust and feature-rich framework like open-source Apache Flink might seem excessive. Flink’s rich API and state management features are unnecessary for such straightforward use cases. Instead, tools with smaller footprints and simpler deployment models, such as Kafka Streams, often suffice.

Apache Flink is a powerhouse. It’s designed for advanced analytics, stateful processing, and complex event patterns. But this sophistication of the open source framework comes with complexity:

  1. Operational Overhead: Setting up and maintaining Flink in an open-source environment can require significant infrastructure and expertise.
  2. Resource Intensity: Flink’s distributed architecture and stateful processing capabilities are resource-hungry, often overkill for tasks that don’t require stateful processing.
  3. Complexity in Development: The Flink API is robust but comes with a steeper learning curve. The combination with Kafka (or another streaming engine) requires understanding two frameworks. In contrast, Kafka Streams is Kafka-native, offering a single, unified framework for stream processing, which can reduce complexity for basic tasks.

For organizations that need to perform straightforward stateless operations, investing in the full Flink stack can feel like using a sledgehammer to crack a nut. Having said this, FlinkSQL simplifies development for certain personas, providing a more accessible interface beyond just Java and Python.
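
To illustrate the point, a stateless routing task that would require boilerplate in Java boils down to a single statement in Flink SQL, shown here via the Java Table API. The payments and high_risk_payments tables and the threshold are assumptions for this sketch.

Java Example (Apache Flink):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

// Route high-value payments to a separate table (e.g., backed by a Kafka topic) in one statement
tableEnv.executeSql(
    "INSERT INTO high_risk_payments " +
    "SELECT * FROM payments WHERE amount > 100");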

The conversation shifts dramatically with Serverless Flink Cloud offerings, such as Confluent Cloud, which address many of the challenges associated with running open-source Flink. Let’s unpack how Serverless Flink becomes a more attractive choice, even for simpler use cases.

1. Serverless Architecture

With a Serverless stream processing service, Flink operates on a fully serverless model, eliminating the need for heavy infrastructure management. This means:

  • No Setup Hassles: Developers focus purely on application logic, not cluster provisioning or tuning.
  • Elastic Scaling: Resources automatically scale with the workload, ensuring efficient handling of varying traffic patterns without manual intervention. One of the biggest challenges of self-managing Flink is over-provisioning resources to handle anticipated peak loads. Elastic Scaling mitigates this inefficiency.

2. Multi-Tenancy

Multi-tenant design allows multiple applications, teams or organizations to share the same infrastructure securely. This reduces operational costs and complexity compared to managing isolated clusters for each workload.

3. Consumption-Based Pricing

One of the key barriers to adopting Flink for simple tasks is cost. A truly Serverless Flink offering mitigates this with a pay-as-you-go pricing model:

  • You only pay for the resources you use, making it cost-effective for both lightweight and high-throughput workloads.
  • It aligns with the scalability of stateless stream processing, where workloads may spike temporarily and then taper off.

4. Bridging the Gap with No-Code and Low-Code Solutions

The rise of citizen integrators and the demand for low-code/no-code solutions have reshaped how organizations approach data streaming. Less-technical users, such as business analysts or operational teams, often face challenges when trying to engage with technical platforms designed for developers.

Low-code/no-code tools address this by providing intuitive interfaces that allow users to build, deploy, and monitor pipelines without deep programming knowledge. These solutions empower business users to take charge of simple workflows and integrations, significantly reducing time-to-value while minimizing the reliance on technical teams.

For example, capabilities like Flink Actions in Confluent Cloud offer a user-friendly approach to deploying stream processing pipelines without coding. By simplifying the process and making it accessible to non-technical stakeholders, these tools enhance collaboration and ensure faster outcomes without compromising performance or scalability. For instance, you can do ETL functions such as transformation, deduplication, or masking fields:

Confluent Cloud - Apache Flink Action UI for No Code Low Code Streaming ETL Integration
Source: Confluent

Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Products

When choosing between SaaS and PaaS for data streaming, it’s essential to understand the key differences.

SaaS solutions, like Confluent Cloud, offer a fully managed, serverless experience with automatic scaling, low operational overhead, and pay-as-you-go pricing.

In contrast, PaaS requires users to manage infrastructure, configure scaling policies, and handle more operational complexity.

While many products are marketed as “serverless,” not all truly abstract infrastructure or eliminate idle costs—so scrutinize claims carefully.

SaaS is ideal for teams focused on rapid deployment and simplicity, while PaaS suits those needing deep customization and control. Ultimately, SaaS ensures scalability and ease of use, making it a compelling choice for most modern streaming needs. Always dive into the technical details to ensure the platform aligns with your goals. Don’t trust the marketing slogans of the vendors!

Stateless vs. Stateful Stream Processing: Blurring the Lines

Even if your current use case is stateless, it’s worth considering the potential for future needs. Stateless pipelines often evolve into more complex systems as businesses grow, requiring features like:

  • State Management: For event correlation and pattern detection.
  • Windows and Aggregations: To derive insights from time-series data.
  • Joins: To enrich data streams with contextual information.
  • Integrating Multiple Data Sources: To seamlessly combine information from various streams for a comprehensive and cohesive analysis.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless and Stateful Stream Processing with Kafka Streams and Apache Flink

With a SaaS Flink service such as Confluent Cloud, you can start small with stateless tasks and seamlessly scale into stateful operations as needed, leveraging Flink’s full capabilities without a complete overhaul.

While Flink may feel like overkill for simple, stateless tasks in its open-source form, its potential is unmatched in these scenarios:

  • Enterprise Workloads: Scalable, reliable, and fault-tolerant systems for mission-critical applications.
  • Data Integration and Preparation (Streaming ETL): Flink enables preprocessing, cleansing, and enriching data at the streaming layer, ensuring high-quality data reaches downstream systems like data lakes and warehouses.
  • Complex Event Processing (CEP): Detecting patterns across events in real time.
  • Advanced Analytics: Stateful stream processing for aggregations, joins, and windowed computations.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless stream processing is often achieved using lightweight tools like Kafka Streams or Single Message Transforms (SMTs) within Kafka Connect. SMTs enable inline transformations, such as normalization, enrichment, or filtering, as events pass through the integration framework. This functionality is available in Kafka Connect (provided by Confluent, IBM/Red Hat, Amazon MSK and others) and tools like Benthos for Redpanda. SMTs are particularly useful for quick adjustments and filtering data before it reaches the Kafka cluster, optimizing resource usage and data flow.
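
As a hypothetical example of such an inline transformation, a connector configuration could mask a sensitive field before records leave Kafka Connect. MaskField is a standard SMT shipped with Apache Kafka; the transform alias and field name below are invented for illustration.

Configuration Example (Kafka Connect SMT):

# Mask the credit card number in the record value as it passes through the connector
# (by default, MaskField replaces the field with the null-equivalent of its type)
transforms=maskCard
transforms.maskCard.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.maskCard.fields=card_number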

While Kafka Streams and Kafka Connect’s SMTs handle many stateless workloads effectively, Apache Flink offers significant advantages for all types of workloads—whether simple or complex, stateless or stateful.

Stream processing in Flink enables true decoupling within the enterprise architecture (as it is not bound to the Kafka cluster like Kafka Streams and Kafka Connect). The benefits are separation of concerns with a domain-driven design (DDD) and improved data governance. Flink also provides interfaces for Java, Python, and SQL: something for (almost) everyone. This makes Flink ideal for ensuring clean, modular architectures and easier scalability.

Stream Processing and ETL with Apache Kafka Streams Connect SMT and Flink

By processing events from diverse sources and preparing them for downstream consumption, Flink supports both lightweight and comprehensive workflows while aligning with domain boundaries and governance requirements. This brings us to the shift left architecture.

The Shift Left Architecture

No matter what specific use cases you have in mind: The Shift Left Architecture brings data processing upstream with real-time stream processing, transforming raw data into high-quality data products early in the pipeline.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Apache Flink plays a key role as part of a complete data streaming platform by enabling advanced streaming ETL, data curation, and on-the-fly transformations, ensuring consistent, reliable, and ready-to-use data for both operational and analytical workloads, while reducing costs and accelerating time-to-market.

Shift Left Architecture with Apacke Kafka Flink and Iceberg

The decision to use Flink boils down to your use case, expertise, and growth trajectory:

  • For basic stateless tasks, consider lightweight options like Kafka Streams or SMTs within Kafka Connect unless you’re already invested in a SaaS such as Confluent Cloud where Flink is also the appropriate choice for simple ETL processes.
  • For evolving workloads or scenarios requiring scalability and advanced analytics, a Flink SaaS offers unparalleled flexibility and ease of use.
  • For on-premise or edge deployments, Flink’s flexibility makes it an excellent choice for environments where data processing must occur locally due to latency, security, or compliance requirements.

Understanding the deployment environment—cloud, on-premise, or edge—and the capabilities of the Flink product is crucial to choosing the right streaming technology. Flink’s adaptability ensures it can serve diverse needs across these contexts.

Kafka Streams is another excellent, Kafka-native stream processing alternative. Most importantly for this discussion, Kafka Streams is “just” a lightweight Java library, not a server infrastructure like Flink. Hence, it brings different trade-offs with it. I wrote a dedicated article about the trade-offs between Apache Flink and Kafka Streams for stream processing.

In its open-source form, Flink can seem excessive for simple, stateless tasks. However, a serverless Flink SaaS like Confluent Cloud changes the equation. Multi-tenancy and pay-as-you-go pricing make it suitable for a wider range of use cases, from basic ETL to advanced analytics. Serverless features like Confluent’s Flink Actions further reduce complexity, allowing non-technical users to harness the power of stream processing without coding.

Whether you’re just beginning your journey into stream processing or scaling up for enterprise-grade applications, Flink—as part of a complete data streaming platform such as Confluent Cloud—is a future-proof investment that adapts to your needs.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation.

Stateless vs. Stateful Stream Processing with Kafka Streams and Apache Flink

In the world of data-driven applications, the rise of stream processing has changed how we handle and act on data. While traditional databases, data lakes, and warehouses are effective for many batch-based use cases, they fall short in scenarios demanding low latency, scalability, and real-time decision-making. This post explores the key concepts of stateless and stateful stream processing, using Kafka Streams and Apache Flink as examples. These principles apply to any stream processing engine, whether open-source or a cloud service. Let’s break down the differences, practical use cases, the relation to AI/ML, and the immense value stream processing offers compared to traditional data-at-rest methods.

Stateless and Stateful Stream Processing with Kafka Streams and Apache Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (formerly Twitter) to stay in touch.

Rethinking Data Processing: From Static to Dynamic

In traditional systems, data is typically stored first in a database or data lake and queried later for computation. This works well for batch processing tasks, like generating reports or dashboards. The process usually looks something like this:

  1. Store Data: Data arrives and is stored in a database or data lake.
  2. Query & Compute: Applications request data for analysis or processing at a later time with a web service, request-response API or SQL script.

However, this approach fails when you need:

  • Immediate Action: Real-time responses to events, such as fraud detection.
  • Scalability: Handling thousands or millions of events per second.
  • Continuous Insights: Ongoing analysis of data in motion.

Enter stream processing—a paradigm where data is continuously processed as it flows through the system. Instead of waiting to store data first, stream processing engines like Kafka Streams and Apache Flink enable you to act on data instantly as it arrives.

Use Case: Fraud Prevention in Real-Time

The blog post uses a fraud prevention scenario to illustrate the power of stream processing. In this example, transactions from various sources (e.g., credit card payments, mobile app purchases) are monitored in real time.

Fraud Detection and Prevention with Stream Processing in Real-Time

The system flags suspicious activities using three methods:

  1. Stateless Processing: Each transaction is evaluated independently, and high-value payments are flagged immediately.
  2. Stateful Processing: Transactions are analyzed over a time window (e.g., 1 hour) to detect patterns, such as an unusually high number of transactions.
  3. AI Integration: A pre-trained machine learning model is used for real-time fraud detection by predicting the likelihood of fraudulent activity.

This example highlights how stream processing enables instant, scalable, and intelligent fraud detection, something not achievable with traditional batch processing.

To avoid confusion: while I use Kafka Streams for stateless and Apache Flink for stateful in the example, both frameworks are capable of handling both types of processing.

Other Industry Examples of Stream Processing

  • Predictive Maintenance (Industrial IoT): Continuously monitor sensor data to predict equipment failures and schedule proactive maintenance.
  • Real-Time Advertisement (Retail): Deliver personalized ads based on real-time user interactions and behavior patterns.
  • Real-Time Portfolio Monitoring (Finance): Continuously analyze market data and portfolio performance to trigger instant alerts or automated trades during market fluctuations.
  • Supply Chain Optimization (Logistics): Track shipments in real time to optimize routing, reduce delays, and improve efficiency.
  • Condition Monitoring (Healthcare): Analyze patient vitals continuously to detect anomalies and trigger immediate alerts.
  • Network Monitoring (Telecom): Detect outages or performance issues in real time to improve service reliability.

These examples highlight how stream processing drives real-time insights and actions across diverse industries.

What is Stateless Stream Processing?

Stateless stream processing focuses on processing each event independently. In this approach, the system does not need to maintain any context or memory of previous events. Each incoming event is handled in isolation, meaning the logic applied depends solely on the data within that specific event.

This makes stateless processing highly efficient and easy to scale, as it doesn’t require state management or coordination between events. It is ideal for use cases such as filtering, transformations, and simple ETL operations where individual events can be processed with no need for historical data or context.

1. Example: Real-Time Payment Monitoring

Imagine a fraud prevention system that monitors transactions in real time to detect and prevent suspicious activities. Each transaction, whether from a credit card, mobile app, or payment gateway, is evaluated as it occurs. The system checks for anomalies such as unusually high amounts, transactions from unfamiliar locations, or rapid sequences of purchases.

Fraud Detection - Stateless Transaction Monitoring with Kafka Streams

By analyzing these attributes instantly, the system can flag high-risk transactions for further inspection or automatically block them. This real-time evaluation ensures potential fraud is caught immediately, reducing the likelihood of financial loss and enhancing overall security.

You want to flag high-value payments for further inspection. In the following Kafka Streams example:

  • Each transaction is evaluated as it arrives.
  • If the transaction amount exceeds 100 (in your chosen currency), it’s sent to a separate topic for further review.

Java Example (Kafka Streams):

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Payment> payments = builder.stream("payments");

// Flag payments above the threshold and route them to a dedicated review topic
payments.filter((key, payment) -> payment.getAmount() > 100)
        .to("high-risk-payments");

Benefits of Stateless Processing

  • Low Latency: Immediate processing of individual events.
  • Simplicity: No need to track or manage past events.
  • Scalability: Handles large volumes of data efficiently.

This approach is ideal for use cases like filtering, data enrichment, and simple ETL tasks.

What is Stateful Stream Processing?

Stateful stream processing takes it a step further by considering multiple events together. The system maintains state across events, allowing for complex operations like aggregations, joins, and windowed analyses. This means the system can correlate data over a defined period, track patterns, and detect anomalies that emerge across multiple transactions or data points.

2. Example: Fraud Prevention through Continuous Pattern Detection

In fraud prevention, individual transactions may appear normal, but patterns over time can reveal suspicious behavior.

For example, a fraud prevention system might identify suspicious behavior by analyzing all transactions from a specific credit card within a one-hour window, rather than evaluating each transaction in isolation.

Fraud Detection - Stateful Anomaly Detection with Apache Flink SQL

Let’s detect anomalies by analyzing transactions with Apache Flink using Flink SQL. In this example:

  • The system monitors transactions for each credit card within a 1-hour window.
  • If a card is used over 10 times in an hour, it flags potential fraud.

SQL Example (Apache Flink):

SELECT card_number, COUNT(*) AS transaction_count
FROM payments
GROUP BY TUMBLE(transaction_time, INTERVAL '1' HOUR), card_number
HAVING COUNT(*) > 10;

Key Concepts in Stateful Processing

Stateful processing relies on maintaining context across multiple events, enabling the system to perform more sophisticated analyses. Here are the key concepts that make stateful stream processing possible:

  1. Windows: Define a time range to group events (e.g., sliding windows, tumbling windows).
  2. State Management: The system remembers past events within the defined window.
  3. Joins: Combine data from multiple sources for enriched analysis.
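
These concepts are not exclusive to Flink. As a point of comparison, here is a minimal Kafka Streams sketch of the same one-hour windowed count shown in the Flink SQL example above; topic names and the Payment type are the same assumptions as before.

Java Example (Kafka Streams):

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Payment> payments = builder.stream("payments");

payments
    .groupBy((key, payment) -> payment.getCardNumber())
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
    .count()
    .toStream()
    // Keep only cards with more than 10 transactions in the window
    .filter((windowedCard, count) -> count > 10)
    .map((windowedCard, count) -> KeyValue.pair(windowedCard.key(), count))
    .to("suspicious-cards");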

Benefits of Stateful Processing

Stateful processing is essential for advanced use cases like anomaly detection, real-time monitoring, and predictive analytics:

  • Complex Analysis: Detect patterns over time.
  • Event Correlation: Combine events from different sources.
  • Real-Time Decision-Making: Continuous monitoring without reprocessing data.

Bringing AI and Machine Learning into Stream Processing

Stream processing engines like Kafka Streams and Apache Flink also enable real-time AI and machine learning model inference. This allows you to integrate pre-trained models directly into your data processing pipelines.

3. Example: Real-Time Fraud Detection with AI/ML Models

Consider a payment fraud detection system that uses a TensorFlow model for real-time inference. In this system, transactions from various sources — such as credit cards, mobile apps, and payment gateways — are streamed continuously. Each incoming transaction is preprocessed and sent to the TensorFlow model, which evaluates it based on patterns learned during training.

Fraud Detection - Anomaly Detection with Predictive Al ML using Apache Flink Python API

The model analyzes features like transaction amount, location, device ID, and frequency to predict the likelihood of fraud. If the model identifies a high probability of fraud, the system can trigger immediate actions, such as flagging the transaction, blocking it, or alerting security teams. This real-time inference ensures that potential fraud is detected and addressed instantly, reducing risk and enhancing security.

Here is a code example using Apache Flink’s Python API for predictive AI:

Python Example (Apache Flink):

# Assumes a pre-trained model (e.g., TensorFlow) loaded outside the pipeline
def predict_fraud(payment):
    prediction = model.predict(payment.features)
    return prediction > 0.5  # flag transactions above the fraud probability threshold

stream = payments.map(predict_fraud)

Why Combine AI with Stream Processing?

Integrating AI with stream processing unlocks powerful capabilities for real-time decision-making, enabling businesses to respond instantly to data as it flows through their systems. Here are some key benefits of combining AI with stream processing:

  • Real-Time Predictions: Immediate fraud detection and prevention.
  • Automated Decisions: Integrate AI into critical business processes.
  • Scalability: Handle millions of predictions per second.

Apache Kafka and Flink deliver low-latency, scalable, and robust predictions. My article “Real-Time Model Inference with Apache Kafka and Flink for Predictive AI and GenAI” compares remote inference (via APIs) and embedded inference (within the stream processing application).

For large AI models (e.g., generative AI or large language models), inference is often done via remote calls to avoid embedding large models within the stream processor.
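
A common pattern for such remote calls in Flink is the async I/O operator, which keeps the pipeline non-blocking while model requests are in flight. The sketch below assumes an existing payments DataStream, a hypothetical scoreRemotely() client helper returning a CompletableFuture, and invented Payment and ScoredPayment types.

Java Example (Apache Flink):

// Wrap the payment stream with Flink's async I/O operator
DataStream<ScoredPayment> scored = AsyncDataStream.unorderedWait(
        payments,                   // DataStream<Payment> from a Kafka source
        new AsyncFraudScorer(),     // calls the remote model endpoint
        500, TimeUnit.MILLISECONDS, // timeout per request
        100);                       // max concurrent in-flight requests

class AsyncFraudScorer implements AsyncFunction<Payment, ScoredPayment> {
    @Override
    public void asyncInvoke(Payment payment, ResultFuture<ScoredPayment> resultFuture) {
        // scoreRemotely() is an assumed non-blocking HTTP client returning CompletableFuture<Double>
        scoreRemotely(payment).thenAccept(score ->
                resultFuture.complete(Collections.singleton(new ScoredPayment(payment, score))));
    }
}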

Stateless vs. Stateful Stream Processing: When to Use Each

Choosing between stateless and stateful stream processing depends on the complexity of your use case and whether you need to maintain context across multiple events. The following table outlines the key differences to help you determine the best approach for your specific needs.

Feature | Stateless | Stateful
Use Case | Simple Filtering, ETL | Aggregations, Joins
Latency | Very Low Latency | Slightly Higher Latency due to State Management
Complexity | Simple Logic | Complex Logic Involving Multiple Events
State Management | Not Required | Required for Context-aware Processing
Scalability | High | Depends on the Framework

Read my article “Apache Kafka (including Kafka Streams) + Apache Flink = Match Made in Heaven” to learn more about choosing the right stream processing engine for your use case.

And to clarify again: while this article uses Kafka Streams for stateless and Flink for stateful stream processing, both frameworks are capable of handling both types.

Video Recording

Below, I summarize this content as a ten-minute video on my YouTube channel:

Why Stream Processing is a Fundamental Change

Whether stateless or stateful, stream processing with Kafka Streams, Apache Flink, and similar technologies unlocks real-time capabilities that traditional databases simply cannot offer. From simple ETL tasks to complex fraud detection and AI integration, stream processing empowers organizations to build scalable, low-latency applications.

Stream Processing with Apache Kafka Flink SQL Java Python and AI ML

Investing in stream processing means:

  • Faster Innovation: Real-time insights drive competitive advantage.
  • Operational Efficiency: Automate decisions and reduce latency.
  • Scalability: Handle millions of events seamlessly.

Stream processing isn’t just an evolution of data handling—it’s a revolution. If you’re not leveraging it yet, now is the time to explore this powerful paradigm. If you want to learn more, check out my light board video exploring the core value of Apache Flink:

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation.

Top Trends for Data Streaming with Apache Kafka and Flink in 2025

The evolution of data streaming has transformed modern business infrastructure, establishing real-time data processing as a critical asset across industries. At the forefront of this transformation, Apache Kafka and Apache Flink stand out as leading open-source frameworks that serve as the foundation for cloud services, enabling organizations to unlock the potential of real-time data. Over recent years, trends have shifted from batch-based data processing to real-time analytics, scalable cloud-native architectures, and improved data governance powered by these technologies. Looking ahead to 2025, the data streaming ecosystem is set to undergo even greater changes. Here are the top trends shaping the future of data streaming for businesses.

Data Streaming Trends for 2025 - Leading with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (formerly Twitter) to stay in touch.

The Top Data Streaming Trends

Some followers might notice that this became a series with articles about the top 5 data streaming trends for 2021, the top 5 for 2022, the top 5 for 2023, and the top 5 for 2024. Trends change over time, but the huge value of having a scalable real-time infrastructure as the central data hub stays. Data streaming with Apache Kafka is a journey and evolution to set data in motion.

I recently explored the past, present, and future of data streaming tools and strategies from the past decades. Data streaming is becoming more and more mature and standardized, but also innovative.

Let’s now look at the top trends coming up more regularly in conversations with customers, prospects, and the broader data streaming community across the globe:

  1. The Democratization of Kafka: Apache Kafka has transitioned from a specialized tool to a key pillar in modern data infrastructure.
  2. Kafka Protocol as the Standard: Vendors adopt the Kafka wire protocol, enabling flexibility with compatibility and performance trade-offs.
  3. BYOC Deployment Model: Bring Your Own Cloud gains traction for balancing security, compliance, and managed services.
  4. Flink Becomes the Standard for Stream Processing: Apache Flink rises as the premier framework for stream processing, building integration pipelines and business applications.
  5. Data Streaming for Real-Time Predictive AI and GenAI: Real-time model inference drives predictive and generative AI applications.
  6. Data Streaming Organizations: Companies unify real-time data strategies to standardize processes, tools, governance, and collaboration.

The following sections describe each trend in more detail. The trends are relevant for many scenarios; no matter if you use open-source frameworks like Apache Kafka and Flink, a commercial platform, or a fully managed cloud service like Confluent Cloud.

Trend 1: The Democratization of Kafka

In the last decade, Apache Kafka has become the standard for data streaming, evolving from a specialized tool to an essential utility in the modern tech stack. With over 150,000 organizations using Kafka today, it has become the de facto choice for stream processing. Yet, with a market crowded by offerings from AWS, Microsoft Azure, Google GCP, IBM, Oracle, Confluent, and various startups, companies can no longer rely solely on Kafka for differentiation. The vast array of Kafka-compatible solutions means that businesses face more choices than ever, but also new challenges in selecting the solution that balances cost, performance, and features.

The Challenge: Finding the Right Fit in a Crowded Kafka Market

For end users, choosing the right Kafka solution is becoming increasingly complex. Basic Kafka offerings cover standard streaming needs but may lack advanced features, such as enhanced security, data governance, or integration and processing capabilities, that are essential for specific industries. In such a diverse market, businesses must navigate trade-offs, considering whether a low-cost option meets their needs or whether investing in a premium solution with added capabilities provides better long-term value.

The Solution: Prioritizing Features for Your Strategic Needs

As Kafka solutions evolve, users must look beyond price and consider features that offer real strategic value. For example, companies handling sensitive customer data might benefit from Kafka products with top-tier security features. Those focused on analytics may look for solutions with strong integrations into data platforms and low cost for high throughput. By carefully selecting a Kafka product that aligns with industry-specific requirements, businesses can leverage the full potential of Kafka while optimizing for cost and capabilities.

For instance, look at Confluent’s various cluster types for different requirements and use cases in the cloud:

Confluent Cloud Cluster Types for Different Requirements and Use Cases
Source: Confluent

As an example, Freight Clusters were introduced to provide an offering with up to 90 percent lower cost. The major trade-off is higher latency, but this is perfect for high-volume log analytics at GB/sec scale.

The Business Value: Affordable and Customized Data Streaming

Kafka’s commoditization means more affordable, customizable options for businesses of all sizes. This competition reduces costs, making high-performance data streaming more accessible, even to smaller organizations. By choosing a tailored solution, businesses can enhance customer satisfaction, speed up decision-making, and innovate faster in a competitive landscape.

Trend 2: The Kafka Protocol, not Apache Kafka, is the New Standard for Data Streaming

With the rise of cloud-native architectures, many vendors have shifted to supporting the Kafka protocol rather than the open-source Kafka framework itself, allowing for greater flexibility and cloud optimization. This change enables businesses to choose Kafka-compatible tools that better align with specific needs, moving away from a one-size-fits-all approach.

Confluent introduced its KORA engine, i.e., Kafka re-architected to be cloud-native. A deep technical whitepaper goes into the details (this is not a marketing document but really for software engineers).

Confluent KORA - Apache Kafka Re-Architected to be Cloud Native
Source: Confluent

Other players followed Confluent and introduced their own cloud-native “data streaming engines”. For instance, StreamNative has URSA powered by Apache Pulsar, Redpanda talks about its R1 Engine implementing the Kafka protocol, and Ververica recently announced VERA for its Flink-based platform.

Some vendors rely only on the Kafka protocol with a proprietary engine from the beginning. For instance, Azure Event Hubs or WarpStream. Amazon MSK also goes in this direction by adding proprietary features like Tiered Storage or even introducing completely new product options such as Amazon MSK Express brokers.

The Challenge: Limited Compatibility Across Kafka Solutions

When vendors implement the Kafka protocol instead of the entire Kafka framework, it can lead to compatibility issues, especially if the solution doesn’t fully support Kafka APIs. For end users, this can complicate integration, particularly for advanced features like Exactly-Once Semantics, the Transaction API, Compacted Topics, Kafka Connect, or Kafka Streams, which may not be supported or may not work as expected.

The Solution: Evaluating Kafka Protocol Solutions Critically

To fully leverage the flexibility of Kafka protocol-based solutions, a thorough evaluation is essential. Businesses should carefully assess the capabilities and compatibility of each option, ensuring it meets their specific needs. Key considerations include verifying the support of required features and APIs (such as the Transaction API, Kafka Streams, or Connect).

It is also crucial to evaluate the level of product support provided, including 24/7 availability, uptime SLAs, and compatibility with the latest versions of open-source Apache Kafka. This detailed evaluation ensures that the chosen solution integrates seamlessly into existing architectures and delivers the reliability and performance required for modern data streaming applications.

The Business Value: Expanded Options and Cost-Efficiency

Kafka protocol-based solutions offer greater flexibility, allowing businesses to select Kafka-compatible services optimized for their specific environments. This flexibility opens doors for innovation, enabling companies to experiment with new tools without vendor lock-in.

For instance, consider innovations such as a “direct write to S3 object store” architecture, as seen in WarpStream, Confluent Freight Clusters, and other data streaming startups that build proprietary engines around the Kafka protocol. The result is a more cost-effective approach to data streaming, though it may come with trade-offs, such as increased latency. Check out this video about the evolution of Kafka Storage to learn more.

Trend 3: BYOC (Bring Your Own Cloud) as a New Deployment Model for Security and Compliance

As data security and compliance concerns grow, the Bring Your Own Cloud (BYOC) model is gaining traction as a new way to deploy Apache Kafka. BYOC allows businesses to host Kafka in their own Virtual Private Cloud (VPC) while the vendor manages the control plane to handle complex orchestration tasks like partitioning, replication, and failover.

This BYOC approach offers organizations enhanced control over their data while retaining the operational benefits of a managed service. BYOC provides a middle ground between self-managed and fully managed solutions, addressing specific regulatory and security needs without sacrificing scalability or flexibility.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

The Challenge: Balancing Security and Ease of Use

Ensuring data sovereignty and compliance is non-negotiable for organizations in highly regulated industries. However, traditional fully managed cloud solutions can pose risks due to vendor access to sensitive data and infrastructure. Many BYOC solutions claim to address these issues but fall short when it comes to minimizing external access to customer environments. Common challenges include:

  • Vendor Access to VPCs: Many BYOC offerings require vendors to have access to customer VPCs for deployment, cluster management, and troubleshooting. This introduces potential security vulnerabilities.
  • IAM Roles and Elevated Privileges: Cross-account Identity and Access Management (IAM) roles are often necessary for managing BYOC clusters, which can expose sensitive systems to unnecessary risks.
  • VPC Peering Complexity: Traditional BYOC solutions often rely on VPC peering, a complex and expensive setup that increases operational overhead and opens additional points of failure.

These limitations create significant challenges for security-conscious organizations, as they undermine the core promise of BYOC: control over the data environment.

The Solution: Gaining Control with a “Zero Access” BYOC Model

WarpStream redefines the BYOC model with a “zero access” architecture, addressing the challenges of traditional BYOC solutions. Unlike other BYOC offerings using the Kafka protocol, WarpStream ensures that no data leaves the customer’s environment, delivering a truly secure-by-default platform. Hence this section discusses specifically WarpStream, not BYOC Kafka offerings in general.

WarpStream BYOC Zero Access Kafka Architecture with Control and Data Plane
Source: WarpStream

Key features of WarpStream include:

  • Zero Access to Customer VPCs: WarpStream eliminates vendor access by deploying stateless agents within the customer’s environment, handling compute operations locally without requiring cross-account IAM roles or elevated privileges, which reduces security risks.
  • Data/Metadata Separation: Raw data remains entirely within the customer’s network for full sovereignty, while only metadata is sent to WarpStream’s control plane for centralized management, ensuring data security and compliance.
  • Simplified Infrastructure: WarpStream avoids complex setups like VPC peering and cross-IAM roles, minimizing operational overhead while maintaining high performance.

Comparison with Other BYOC Solutions using the Kafka protocol:

Unlike most other BYOC offerings (e.g., Redpanda), WarpStream doesn’t require direct VPC access or elevated permissions, avoiding risks like data exposure or remote troubleshooting vulnerabilities. Its “zero access” architecture ensures unparalleled security and compliance.

The Business Value: Secure, Compliant, and Scalable Data Streaming

WarpStream’s innovative approach to BYOC delivers exceptional business value by addressing security and compliance concerns while maintaining operational simplicity and scalability:

  • Uncompromised Security: The zero-access architecture ensures that raw data remains entirely within the customer’s environment, meeting the strictest security and compliance requirements for regulated industries like finance, healthcare, and government.
  • Operational Efficiency: By eliminating the need for VPC peering, cross-IAM roles, and remote vendor access, WarpStream simplifies BYOC deployments and reduces operational complexity.
  • Cost Optimization: WarpStream’s reliance on cloud-native technologies like object storage reduces infrastructure costs compared to traditional disk-based approaches. Stateless agents also enable efficient scaling without unnecessary overhead.
  • Data Sovereignty: The data/metadata split guarantees that data never leaves the customer’s environment, ensuring compliance with regulations such as GDPR and HIPAA.
  • Peace of Mind for Security Teams: With no vendor access to the VPC or object storage, WarpStream’s zero-access model eliminates concerns about external breaches or elevated privileges, making it easier to gain buy-in from security and infrastructure teams.

BYOC Strikes the Balance Between Control and Managed Services

BYOC offers businesses the ability to strike a balance between control and managed services, but not all BYOC solutions are created equal. WarpStream’s “zero access” architecture sets a new standard, addressing the critical challenges of security, compliance, and operational simplicity. By ensuring that raw data never leaves the customer’s environment and eliminating the need for vendor access to VPCs, WarpStream delivers a BYOC model that meets the highest standards of security and performance. For organizations seeking a secure, scalable, and compliant approach to data streaming, WarpStream represents the future of BYOC data streaming.

But just to be clear: If a data streaming project goes to the cloud, fully managed Kafka (and Flink) should always be the first option, as it is much easier to manage and operate, letting teams focus on fast time-to-market and business innovation. Choose BYOC only if fully managed does not work for you because of security requirements.

Trend 4: Apache Flink Becomes the De Facto Standard for Stream Processing

Apache Flink has emerged as the premier choice for organizations seeking a robust and versatile framework for continuous stream processing. Its ability to handle complex data pipelines with high throughput, low latency, and advanced stateful operations has solidified its position as the de facto standard for stream processing. Flink’s support for Java, Python, and SQL further enhances its appeal, enabling developers to build powerful data-driven applications using familiar tools.

Apache Flink Adoption Curve Compared to Kafka

As Flink adoption grows, it increasingly complements Apache Kafka as part of the modern data streaming ecosystem, while the Kafka Streams (Java-only) library remains relevant for lightweight, application-embedded use cases.

The Challenge: Handling Complex, High-Throughput Data Streams

Modern businesses increasingly rely on real-time data for both operational and analytical needs, spanning mission-critical applications like fraud detection, predictive maintenance, and personalized customer experiences, as well as Streaming ETL for integrating and transforming data. These diverse use cases demand robust stream processing capabilities.

Apache Flink’s versatility makes it uniquely positioned to meet the demands of both streaming ETL for data integration and building real-time business applications. Flink provides the following capabilities, with a minimal code sketch after the list:

  • Low Latency: Near-instantaneous processing is crucial for enabling real-time decision-making in business applications, timely updates in analytical systems, and supporting transactional workloads where rapid processing and immediate consistency are essential for ensuring smooth operations and seamless user experiences.
  • High Throughput and Scalability: The ability to process millions of events per second, whether for aggregating operational metrics or moving massive volumes of data into data lakes or warehouses, without bottlenecks.
  • Stateful Processing: Support for maintaining and querying the state of data streams, essential for performing complex operations like aggregations, joins, and pattern detection in business applications, as well as data transformations and enrichment in ETL pipelines.
  • Multiple Programming Languages: Support for Java, Python, and SQL ensures accessibility for a wide range of developers, enabling efficient implementation across various use cases.
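To make the stateful processing point concrete, here is a minimal sketch of a Flink job using the Java DataStream API. The inline elements stand in for a Kafka source, and all names and values are illustrative assumptions, not taken from any specific deployment.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PaymentAggregationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a Kafka source: (customerId, amount) events
        env.fromElements(
                Tuple2.of("customer-1", 10.0),
                Tuple2.of("customer-2", 99.0),
                Tuple2.of("customer-1", 45.0))
           // Stateful processing: Flink maintains a running aggregate per key
           .keyBy(event -> event.f0)
           .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
           .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
           .print();

        env.execute("payment-aggregation");
    }
}
```

In a real pipeline, the in-memory elements would be replaced by a Kafka source connector, and the windowed aggregate would typically be written back to a Kafka topic or a sink system.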

The rise of cloud services has further propelled Flink’s adoption, with offerings from major providers like Confluent, Amazon, IBM, and emerging startups. These cloud-native solutions simplify Flink deployments, making it easier for organizations to operationalize real-time analytics.

While Apache Flink has emerged as the de facto standard for stream processing, other frameworks like Apache Spark and its streaming module, Structured Streaming, continue to compete in this space. However, Spark Streaming has notable limitations that make it less suitable for many of the complex, high-throughput workloads modern enterprises demand.

The Challenges with Spark Streaming

Apache Spark, originally designed as a batch processing framework, introduced Spark Streaming and later Structured Streaming to address real-time processing needs. However, its batch-oriented roots present inherent challenges:

  • Micro-Batch Architecture: Spark Structured Streaming relies on micro-batches, where data is divided into small time intervals for processing. This approach, while effective for certain workloads, introduces higher latency compared to Flink’s true streaming architecture. Applications requiring millisecond-level processing or transactional workloads may find Spark unsuitable.
  • Limited Stateful Processing: While Spark supports stateful operations, its reliance on micro-batches adds complexity and latency. This makes Spark Streaming less efficient for use cases that demand continuous state updates, such as fraud detection or complex event processing (CEP).
  • Fault Tolerance Complexity: Spark’s recovery model is rooted in its lineage-based approach to fault tolerance, which can be less efficient for long-running streaming applications. Flink, by contrast, uses checkpointing and savepoints to handle failures more gracefully, ensuring state consistency with minimal overhead.
  • Performance Overhead: Spark’s general-purpose design often results in higher resource consumption compared to Flink, which is purpose-built for stream processing. This can lead to increased infrastructure costs for high-throughput workloads.
  • Scalability Challenges for Stateful Workloads: While Spark scales effectively for batch jobs, its scalability for complex stateful stream processing is more limited, as distributed state management in micro-batches can become a bottleneck under heavy load.

By addressing these limitations, Apache Flink provides a more versatile and efficient solution than Apache Spark for organizations looking to handle complex, real-time data processing at scale.

Flink’s architecture is purpose-built for streaming, offering native support for stateful processing, low-latency event handling, and fault-tolerant operation, making it the preferred choice for modern real-time applications. But to be clear: Apache Spark, including Spark Streaming, has its place in data lakes and lakehouses to process analytical workloads.

Flink’s technical capabilities bring tangible business benefits, making it an essential tool for modern enterprises. By providing real-time insights, Flink enables businesses to respond to events as they occur, such as detecting and mitigating fraudulent transactions instantly, reducing losses, and enhancing customer trust.

Flink’s support for both transactional workloads (e.g., fraud detection or payment processing) and analytical workloads (e.g., real-time reporting or trend analysis) ensures versatility across a range of critical business functions. Scalability and resource optimization keep infrastructure costs manageable, even for demanding, high-throughput workloads, while features like checkpointing streamline failure recovery and upgrades, minimizing operational overhead.

Flink stands out with its dual focus on streaming ETL for data integration and building business applications powered by real-time analytics. Its rich APIs for Java, Python, and SQL make it easy for developers to implement complex workflows, accelerating time-to-market for new applications.

Trend 5: Real-Time Model Inference with Data Streaming for Predictive and Generative AI

Data streaming has powered AI/ML infrastructure for many years because of its capabilities to scale to high volumes, process data in real-time, and integrate with transactional (payments, orders, ERP, etc.) and analytical (data warehouse, data lake, lakehouse) systems. My first article about Apache Kafka and Machine Learning was published in 2017: “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka“.

As AI continues to evolve, real-time model inference powered by data streaming is opening up new possibilities for predictive and generative AI applications. By integrating model inference with stream processors such as Apache Flink, businesses can perform on-demand predictions for fraud detection, customer personalization, and more.

The Challenge: Provide Context for AI Applications In Real-Time

Traditional batch-based AI inference is too slow for many applications, delaying responses and leading to missed opportunities or wrong business decisions. To fully harness AI in real-time, businesses need to embed model inference directly within streaming pipelines.

Generative AI (GenAI) demands new design patterns like Retrieval Augmented Generation (RAG) to ensure accuracy, relevance, and reliability in its outputs. Without data streaming, RAG struggles to provide large language models (LLMs) with the real-time, domain-specific context they need, leading to outdated or hallucinated responses. Context is essential to ensure that LLMs deliver accurate and trustworthy outputs by grounding them in up-to-date and precise information.

The Solution: Real-Time Model Inference Embedded in Streaming Pipelines

GenAI Remote Model Inference with Stream Processing using Apache Kafka and Flink

Apache Flink enables real-time model inference by connecting data streams to external AI models through APIs. This setup allows companies to use centralized model servers for inference, providing flexibility and scalability while keeping data streams fast and responsive. By processing data in real-time, Flink ensures that generative AI models operate with the most current and relevant context, reducing errors and hallucinations.

Flink’s real-time processing capabilities also support advanced machine learning workflows. This enables use cases like predictive analytics, anomaly detection, and generative AI applications that require instantaneous decision-making. The ability to join live data streams with historical or external datasets enriches the context for model inference, enhancing both accuracy and relevance.

Additionally, Flink facilitates feature extraction and data preprocessing directly within the stream to ensure that the inputs to AI models are optimized for performance. This seamless integration with model servers and vector databases allows organizations to scale their AI systems effectively, leveraging real-time insights to drive innovation and deliver immediate business value.
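The remote model inference pattern can be sketched with Flink’s Async I/O API, which issues non-blocking calls from the stream to a central model server. Everything specific below is an assumption for illustration: the model server endpoint, the JSON payload, and the topic stand-in; only the pattern itself is the point.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class RemoteModelInferenceJob {

    // Calls a (hypothetical) model server for each event without blocking the stream
    static class ScoringFunction extends RichAsyncFunction<String, String> {
        private transient HttpClient client;

        @Override
        public void open(Configuration parameters) {
            client = HttpClient.newHttpClient();
        }

        @Override
        public void asyncInvoke(String event, ResultFuture<String> resultFuture) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://model-server:8501/v1/models/fraud:predict")) // assumed endpoint
                    .POST(HttpRequest.BodyPublishers.ofString(event))
                    .build();
            // Error handling (timeouts, retries) is omitted in this sketch
            client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                  .thenAccept(response ->
                          resultFuture.complete(Collections.singleton(response.body())));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> events = env.fromElements("{\"amount\": 42.0}"); // stand-in for a Kafka source

        // Up to 100 in-flight requests, 5 second timeout per call
        AsyncDataStream.unorderedWait(events, new ScoringFunction(), 5, TimeUnit.SECONDS, 100)
                       .print();

        env.execute("remote-model-inference");
    }
}
```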

The Business Value: Immediate, Actionable AI Insights

Real-time AI model inference with Flink enables businesses to provide personalized customer experiences, detect fraud as it happens, and perform predictive maintenance with minimal latency. This real-time responsiveness empowers companies to make AI-driven decisions in milliseconds, improving customer satisfaction and operational efficiency.

By integrating Flink with event-driven architectures like Apache Kafka, businesses can ensure that AI systems are always fed with up-to-date and trustworthy data, further enhancing the reliability of predictions.

The integration of Flink and data streaming offers a clear path to measurable business impact. By aligning real-time AI capabilities with organizational goals, organizations can drive innovation while reducing operational costs, such as automating customer support to lower reliance on service agents.

Furthermore, Flink’s ability to process and enrich data streams at scale supports strategic initiatives like hyper-personalized marketing or optimizing supply chains in real-time. These benefits directly translate into enhanced competitive positioning, faster time-to-market for AI-driven solutions, and the ability to make more confident, data-driven decisions at the speed of business.

Trend 6: Becoming a Data Streaming Organization

To harness the full potential of data streaming, companies are shifting toward structured, enterprise-wide data streaming strategies. Moving from a tactical, ad-hoc approach to a cohesive top-down strategy enables businesses to align data streaming with organizational goals, driving both efficiency and innovation.

The Challenge: Fragmented Data Streaming Efforts

Many companies face challenges due to disjointed streaming efforts, leading to data silos and inconsistencies that prevent them from reaping the full benefits of real-time data processing. At Confluent, we call this the enterprise adoption barrier:

Data Streaming Maturity Model - The Enterprise Adoption Barrier
Source: Confluent

This fragmentation results in inefficiencies, duplication of efforts, and a lack of standardized processes. Without a unified approach, organizations struggle with:

  • Data Silos: Limited data sharing across teams creates bottlenecks for broader use cases.
  • Inconsistent Standards: Different teams often use varying schemas, patterns, and practices, leading to integration challenges and data quality issues.
  • Governance Gaps: A lack of defined roles, responsibilities, and policies results in limited oversight, increasing the risk of data misuse and compliance violations.

These challenges prevent organizations from scaling their data streaming capabilities and realizing the full value of their real-time data investments.

The Solution: Building an Integrated Data Streaming Organization

By adopting a comprehensive data streaming strategy, businesses can create a unified data platform with standardized tools and practices. A dedicated streaming platform team, often called the Center of Excellence (CoE), ensures consistent operations. An internal developer platform provides governed, self-serve access to streaming resources.

Key elements of a data streaming organization include:

  • Unified Platform: Move from disparate tools and approaches to a single, standardized data streaming platform. This includes consistent policies for cluster management, multi-tenancy, and topic naming, ensuring a reliable foundation for data streaming initiatives.
  • Self-Service: Provide APIs, UIs, and other interfaces for teams to onboard, create, and manage data streaming resources. Self-service capabilities ensure governed access to topics, schemas, and streaming capabilities, empowering developers while maintaining compliance and security (a topic provisioning sketch follows this list).
  • Data as a Product: Adopt a product-oriented mindset where data streams are treated as reusable assets. This includes formalizing data products with clear contracts, ownership, and metadata, making them discoverable and consumable across the organization.
  • Alignment: Define clear roles and responsibilities, from platform operators and developers to data product owners. Establishing an enterprise-wide data streaming function ensures coordination and alignment across teams.
  • Governance: Implement automated guardrails for compliance, quality, and access control. This ensures that data streaming efforts remain secure, trustworthy, and scalable.
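As a hedged illustration of what governed self-service could look like behind an internal API, the following sketch uses Kafka’s AdminClient to provision a topic with platform-enforced defaults. The naming convention, partition count, replication factor, and config values are assumptions, not a prescription.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class SelfServiceTopicProvisioner {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Enforce an (assumed) naming convention: <domain>.<dataProduct>.<version>
            String topicName = "payments.transactions.v1";

            // Platform-team defaults applied on behalf of the requesting team
            NewTopic topic = new NewTopic(topicName, 6, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "604800000",   // 7 days
                            "min.insync.replicas", "2"));

            admin.createTopics(Set.of(topic)).all().get();
            System.out.println("Provisioned topic " + topicName);
        }
    }
}
```

In practice, such a provisioner would sit behind the internal developer platform’s API or UI, so teams never need direct admin access to the cluster.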

The Business Value: Consistent, Scalable, and Agile Data Streaming

Becoming a Data Streaming Organization unlocks significant value by turning data streaming into a strategic asset. The benefits include:

  • Enhanced Agility: A unified platform reduces time-to-market for new data-driven products and services, allowing businesses to respond quickly to market trends and customer demands.
  • Operational Efficiency: Streamlined processes and self-service capabilities reduce the overhead of managing multiple tools and teams, improving productivity and cost-effectiveness.
  • Scalable Innovation: Standardized patterns and reusable data products enable the rapid development of new use cases, fostering a culture of innovation across the enterprise.
  • Improved Governance: Clear policies and automated controls ensure data quality, security, and compliance, building trust with customers and stakeholders.
  • Cross-Functional Collaboration: By breaking down silos, organizations can leverage data streams across teams, creating a network effect that accelerates value creation.

To successfully adopt a Data Streaming Organization model, companies must combine technical capabilities with cultural and structural change. This involves not just deploying tools but establishing shared goals, metrics, and education to bring teams together around the value of real-time data. As organizations embrace data streaming as a strategic function, they position themselves to thrive in a data-driven world.

Embracing the Future of Data Streaming

As data streaming continues to mature, it has become the backbone of modern digital enterprises. It enables real-time decision-making, operational efficiency, and transformative AI applications. Trends such as the commoditization of Kafka, the adoption of the Kafka protocol, BYOC deployment models, and the rise of Flink as the standard for stream processing demonstrate the rapid evolution and growing importance of this technology. These innovations not only streamline infrastructure but also empower organizations to harness real-time insights, foster agility, and remain competitive in the ever-changing digital landscape.

These advancements in data streaming present a unique opportunity to redefine data strategy. Leveraging data streaming as a central pillar of IT architecture allows businesses to break down silos, integrate machine learning into critical workflows, and deliver unparalleled customer experiences. The convergence of data streaming with generative AI, particularly through frameworks like Flink, underscores the importance of embracing a real-time-first approach to data-driven innovation.

Looking ahead, organizations that invest in scalable, secure, and strategic data streaming infrastructures will be positioned to lead in 2025 and beyond. By adopting these trends, enterprises can unlock the full potential of their data, drive business transformation, and solidify their place as leaders in the digital era. The journey to set data in motion is not just about technology—it’s about building the foundation for a future where real-time intelligence powers every decision and every experience.

What trends do you see for data streaming? Which ones are your favorites? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

When NOT to Use Apache Kafka? (Lightboard Video) https://www.kai-waehner.de/blog/2024/03/26/when-not-to-use-apache-kafka-lightboard-video/ Tue, 26 Mar 2024 06:45:11 +0000 https://www.kai-waehner.de/?p=6262

Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post contains a lightboard video that gives you a twenty-minute explanation of the DOs and DONTs.

When NOT to Use Apache Kafka?

Disclaimer: This blog post shares a lightboard video to watch an explanation about when NOT to use Apache Kafka. For a much more detailed and technical blog post with various use cases and case studies, check out this blog post from 2022 (which is still valid today – whenever you read it).

What is Apache Kafka, and what is it NOT?

Kafka is often misunderstood. For instance, I still hear way too often that Kafka is a message queue. Part of the reason is that some vendors only pitch it for a specific problem (such as data ingestion into a data lake or data warehouse) to sell their products. So, in short:

Kafka is…

  • a scalable real-time messaging platform to process millions of messages per second.
  • a data streaming platform for massive volumes of big data analytics and small volumes of transactional data processing.
  • a distributed storage layer that provides true decoupling for backpressure handling, support of various communication protocols, and replayability of events with guaranteed ordering.
  • a data integration framework (Kafka Connect) for streaming ETL.
  • a data processing framework (Kafka Streams) for continuous stateless or stateful stream processing.

This combination of characteristics in a single platform makes Kafka unique (and successful).
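To illustrate the replayability characteristic from the list above, here is a minimal sketch of a plain Java consumer that rewinds its assigned partitions to the beginning of a topic. The topic name and settings are illustrative assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // assumed topic name
            consumer.poll(Duration.ofMillis(0));   // join the group and receive assignments

            // Replay: rewind all assigned partitions to the earliest retained offset
            consumer.seekToBeginning(consumer.assignment());

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```

A database-backed consumer could not do this after the data has been processed once; Kafka’s distributed log makes historical events replayable for as long as the retention policy keeps them.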

Kafka is NOT…

  • a proxy for millions of clients (like mobile apps) – but Kafka-native proxies (like REST or MQTT) exist for some use cases.
  • an API Management platform – but these tools are usually complementary and used for the creation, life cycle management, or the monetization of Kafka APIs.
  • a database for complex queries and batch analytics workloads – but good enough for transactional queries and relatively simple aggregations (especially with ksqlDB).
  • an IoT platform with features such as device management – but direct Kafka-native integration with some IoT protocols such as MQTT or OPC-UA is possible and the appropriate approach for certain use cases.
  • a technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different kind of software!

For these reasons, Kafka is complementary, not competitive, to these other technologies. Choose the right tool for the job and combine them!

Lightboard Video: When NOT to use Apache Kafka

The following video explores the key concepts of Apache Kafka. Afterwards, the DOs and DONTs of Kafka show how to complement data streaming with other technologies for analytics, APIs, IoT, and other scenarios.

Data Streaming Vendors and Cloud Services

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations.

Plenty of vendors offer Kafka platforms and cloud services. Many complementary open source stream processing frameworks like Apache Flink and related cloud offerings emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to gain market share leveraging the Kafka protocol. Check out the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook on potential new entrants in 2025.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

Apache Kafka is a Data Streaming Platform: Combine it with other Platforms when needed!

Over 150,000 organizations use Apache Kafka in the meantime. The Kafka protocol is the de facto standard for many open source frameworks, commercial products and serverless cloud SaaS offerings.

However, Kafka is not an all-rounder for every use case. Many projects combine Kafka with other technologies, such as databases, data lakes, data warehouses, IoT platforms, and so on. Additionally, Apache Flink is becoming the de facto standard for stream processing (but Kafka Streams is not going away and is the better choice for specific use cases).
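As a minimal sketch of why Kafka Streams remains attractive for lightweight, application-embedded use cases: the entire stream processor below is just a Java library running inside a normal application, with no separate processing cluster to operate. Topic names and the filter condition are assumptions.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PaymentFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments"); // assumed input topic

        // Simple stateless processing embedded in the application - no extra cluster needed
        payments.filter((key, value) -> value != null && value.contains("\"currency\":\"EUR\""))
                .to("payments-eur"); // assumed output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```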

Where do you (not) use Apache Kafka? What other technologies do you combine Kafka with? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Machine Learning and Data Science with Kafka in Healthcare https://www.kai-waehner.de/blog/2022/04/18/machine-learning-data-science-with-kafka-in-healthcare-pharma/ Mon, 18 Apr 2022 11:44:09 +0000 https://www.kai-waehner.de/?p=4446

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part five: Machine Learning and Data Science. Examples include Recursion and Humana.

Machine Learning and Data Science with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains. Use cases often overlap. I tried to categorize a few real-world deployments into different technical scenarios and added a few real-world examples.

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Machine Learning and Data Science with Data Streaming using Apache Kafka

The relationship between Apache Kafka and machine learning (ML) is getting more and more traction for data engineering at scale and robust model deployment with low latency.

The Kafka ecosystem helps in different ML use cases for model training, model serving, and model monitoring. The core of most ML projects requires reliable and scalable data engineering pipelines across

  • different technologies
  • communication paradigms (REST, gRPC, data streaming)
  • programming languages (like Python for the data scientist or Java/Go/C++ for the production engineer)
  • APIs
  • commercial products
  • SaaS offerings

Here is an architecture diagram that shows how Kafka helps in data science projects:

The beauty of Kafka is that it combines real-time data processing with extreme scalability and true decoupling between systems.

Tiered Storage adds cost-efficient storage of big data sets and replayability with guaranteed ordering.
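A common pattern in such ML architectures is to embed a trained model directly into a Kafka Streams application for low-latency scoring. The sketch below illustrates the idea only: the predict function is a stand-in for a real model loaded into memory, and the topic names are assumptions.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ModelScoringApp {

    // Stand-in for a real trained model loaded into memory; the "logic" is purely illustrative
    static double predict(String features) {
        return features.contains("anomaly") ? 0.95 : 0.05;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "model-scoring");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> vitals = builder.stream("patient-vitals"); // assumed input topic

        // Score every event in the stream with the embedded model - no remote call needed
        vitals.mapValues(features -> "score=" + predict(features))
              .to("risk-scores"); // assumed output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```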

I’ve written about this relationship between Kafka and Machine Learning in various articles.

Let’s look at a few real-world deployments for Apache Kafka and Machine Learning in the healthcare sector.

Humana – Real-Time Interoperability at the Point of Care

Humana Inc. is a for-profit American health insurance company. They leverage data streaming with Apache Kafka to improve real-time interoperability at the point of care.

The interoperability platform enables the transition from an insurance company with elements of health to truly a health company with elements of insurance.

Their core principles include:

  • Consumer-centric
  • Health plan agnostic
  • Provider agnostic
  • Cloud resilient
  • Elastic scale
  • Event-driven and real-time

A critical characteristic is inter-organization data sharing (known as “data exchange/data sharing”).

Humana’s use cases include

  • real-time updates of health information, for instance, connecting health care providers to pharmacies
  • reducing pre-authorizations from 20-30 minutes to 1 minute
  • real-time home healthcare assistant communication

The Humana interoperability platform combines data streaming (= the Kafka ecosystem) with artificial intelligence and machine learning (= IBM Watson) to correlate data, train analytic models, and act on new events in real-time.

Humana’s data journey is described in this diagram from their Kafka Summit talk:

Real-Time Healthcare Insurance at Humana with Apache Kafka Data Streaming

Learn more details about Humana’s use cases and architecture in the keynote of another Kafka Summit session.

Recursion – Industrial Revolution of Drug Discovery with Kafka and Deep Learning

Recursion is a clinical-stage biotechnology company that built the “industrial revolution of drug discovery”. They decode biology by integrating technological innovations across biology, chemistry, automation, machine learning, and engineering to industrialize drug discovery.

Industrial pharma revolution - accelerate drug discovery at recursion

Kafka-powered data streaming speeds up the pharma processes significantly. Recursion has already made significant strides in accelerating drug discovery, with over 30 disease models in discovery, another nine in preclinical development, and two in clinical trials.

With serverless Confluent Cloud and the new data streaming approach, the company has built a platform that makes it possible to screen much larger experiments with thousands of compounds against hundreds of disease models in minutes, at a lower cost than alternative discovery approaches.

From a technical perspective, Recursion finds drug treatments by processing biological images. A massively parallel system combines experimental biology, artificial intelligence, automation, and real-time data streaming:

Apache Kafka and Machine Learning at Recursion for Drug Discovery in Pharma

Recursion went from ‘drug discovery in manual and slow, not scalable, bursty BATCH MODE’ to ‘drug discovery in automated, scalable, reliable REAL-TIME MODE’.

Recursion leverages Dagger, an event-driven workflow and orchestration library for Kafka Streams that enables engineers to orchestrate services by defining workloads as high-level data structures. Dagger combines Kafka topics and schemas with external tasks for actions completed outside of the Kafka Streams applications.

Drug Discovery in automated, scalable, reliable real time Mode

In the meantime, Recursion did not just migrate from manual batch workloads to Kafka but also migrated to serverless Kafka, leveraging Confluent Cloud to focus its resources on business problems instead of infrastructure operations.

Machine Learning and Data Science with Kafka for Intelligent Healthcare Applications

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka ecosystem for machine learning infrastructures. Real-world deployments from Humana and Recursion showed how enterprises successfully deploy Kafka together with Machine Learning frameworks like TensorFlow for their use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Streaming ETL with Apache Kafka in the Healthcare Industry https://www.kai-waehner.de/blog/2022/04/01/streaming-etl-with-apache-kafka-healthcare-pharma-industry/ Fri, 01 Apr 2022 05:47:00 +0000 https://www.kai-waehner.de/?p=4402

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part three: Streaming ETL. Examples include Babylon Health and Bayer.

Streaming ETL with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains. Use cases often overlap. I tried to categorize a few real-world deployments into different technical scenarios and added a few real-world examples.

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Streaming ETL with Apache Kafka

Streaming ETL is similar to concepts you might know from traditional ETL tools. I have already explored how data streaming with Kafka differs from data integration tools and iPaaS cloud services. The critical difference is that you leverage a single platform for data integration and processing at scale in real-time. There is no need to combine several platforms to achieve this. The result is a Kappa architecture that enables real-time but also batch workloads with a single integration architecture.

Streaming ETL with Apache Kafka Streams Connect ksqlDB

Streaming ETL with Kafka combines different components and features:

  • Kafka Connect as Kafka-native integration framework
  • Kafka Connect source and sink connectors to consume and produce data from/to any other database, application, or API
  • Single Message Transform (SMT) – an optional Kafka Connect feature – to process (filter, change, remove, etc.) incoming or outgoing messages within the connector deployment (a custom SMT sketch follows this list)
  • Kafka Streams or ksqlDB for continuous data processing in real-time at scale for stateless or stateful ETL jobs
  • Data governance via schema management, enforcement and versioning using the Schema Registry
  • Security and access control using features like role-based access control, audit logs, and end-to-end encryption
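For illustration, here is a minimal sketch of a custom Single Message Transform. Kafka Connect ships with built-in SMTs for many common tasks; this hypothetical transform just shows the extension point by masking values that look like they contain an email address.

```java
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

// A (hypothetical) custom SMT that replaces the record value with a masked
// placeholder when it looks like it contains an email address
public class MaskPiiTransform<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        Object value = record.value();
        if (value instanceof String && ((String) value).contains("@")) {
            return record.newRecord(record.topic(), record.kafkaPartition(),
                    record.keySchema(), record.key(),
                    record.valueSchema(), "***MASKED***", record.timestamp());
        }
        return record;
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef(); // no configuration options in this sketch
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // nothing to configure
    }

    @Override
    public void close() {
        // no resources to release
    }
}
```

Such a transform would be referenced in the connector configuration, so the masking happens inside the connector deployment, before data ever reaches the downstream system.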

In the cloud, you can leverage a serverless Kafka offering for the whole Streaming ETL pipeline. Confluent Cloud fully manages Kafka’s end-to-end infrastructure, including connectors, ksqlDB workloads, data governance, and security.

One last general note: Don’t Design for Data at Rest to Reverse it! Learn more here: “When to Use Reverse ETL and when it is an Anti-Pattern“. Instead, use real-time Streaming ETL for Data in Motion and the Kappa architecture from scratch.

Let’s look at a few real-world deployments in the healthcare sector.

Babylon Health – PII and GDPR Compliant Security

Babylon Health is a digital-first health service provider and value-based care company that combines an artificial intelligence-powered platform with virtual clinical operations for patients. Patients are connected with health care professionals through its web and mobile application.

Babylon’s mission is to put an accessible and affordable health service in the hands of every person on earth. For that mission, Babylon built an agile microservice architecture with the Kafka ecosystem:

Kafka for Streaming ETL at Babylon Health

Here are the “wonders of working” in Healthcare for Babylon (= reasons to choose Kafka):

  • Real-time data processing
  • Replayability of historical information
  • Order matters and is ensured with guaranteed ordering
  • GDPR and data ownership for PII-compliant security
  • Data governance via the schema registry to provide true decoupling and access via many programming languages like Java, Python, and Ruby

Bayer – Data Integration and Processing in R&D

Bayer AG is a German multinational pharmaceutical and life sciences company and one of the largest pharmaceutical companies in the world. They leverage Kafka in various use cases and business domains.

The following scenario is from the research and development department of the pharma business unit. Their focus areas are cardiovascular diseases, oncology, and women’s health. The division employs over 7,500 R&D people and spends over 2.75 billion euros on R&D.

The use case Bayer presented at a recent Kafka Summit is about analyzing clinical trials, patents, reports, news, and literature leveraging the Kafka ecosystem. The R&D team processes 250 million documents from 30+ individual data sources. The data includes 7 TB of raw text-rich data with daily updates, additions, and deletions. Algorithms and data evolve. Bayer needs to completely reprocess the data regularly. Various document streams with different formats and schemas flow through several text processing and enrichment steps.

Research and Development from Molecules to Medicine at Bayer

Scalable, reliable Kafka pipelines with Kafka Streams (Java) and Faust (Python) replaced custom, error-prone, non-scalable scripts. Schemas are used as the data interface to ensure data governance. Avro is the first-class citizen data format to enable compression and better throughput.

Streaming ETL Pipeline with Apache Kafka at Bayer

The true decoupling of Kafka in conjunction with the Schema Registry guarantees interoperability among different components and technologies (java, python, commercial tools, open-source, scientific, proprietary).
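A hedged sketch of this schema-first approach: a Java producer that uses an Avro schema as the data contract via Confluent Schema Registry. The schema, topic, and registry address are illustrative assumptions, not Bayer’s actual setup.

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroDocumentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumed registry address

        // Illustrative Avro schema acting as the data contract between teams
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Document\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"text\",\"type\":\"string\"}]}");

        GenericRecord document = new GenericData.Record(schema);
        document.put("id", "doc-1");
        document.put("text", "clinical trial report ...");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("documents", "doc-1", document)); // assumed topic
            producer.flush();
        }
    }
}
```

Because the serializer registers and validates the schema centrally, a Python consumer (e.g., a Faust application) can deserialize the same records without any coordination beyond the shared schema.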

Streaming ETL with Kafka for Real-Time Data Integration at any Scale

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka Ecosystem for Streaming ETL. Real-world deployments from Babylon Health and Bayer showed how enterprises successfully deploy Kafka for different enterprise architecture use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka in the Public Sector – Part 3: Government and Citizen Services https://www.kai-waehner.de/blog/2021/10/15/apache-kafka-public-sector-part-3-government-citizen-services/ Fri, 15 Oct 2021 06:54:26 +0000 https://www.kai-waehner.de/?p=3807

The public sector includes many different areas. Some groups, like the military, leverage cutting-edge technology. Others, like the public administration, are years or even decades behind. This blog series explores how the public sector leverages data in motion powered by Apache Kafka to add value for innovative new applications and modernizing legacy IT infrastructures. This post is part 3: Use cases and architectures for Government and Citizen Services.

Apache Kafka for Government and Citizen Services in the Public Sector

Blog series: Apache Kafka in the Public Sector and Government

This blog series explores why many governments and public infrastructure sectors leverage event streaming for various use cases. Learn about real-world deployments and different architectures for Kafka in the public sector:

  1. Life is a Stream of Events
  2. Smart City
  3. Citizen Services (THIS POST)
  4. Energy and Utilities
  5. National Security

Subscribe to my newsletter to get updates immediately after the publication. Besides, I will also update the above list with direct links to this blog series’s posts once published.

As a side note: If you wonder why healthcare is not on the above list, that is because healthcare is a blog series on its own. While the government can provide public health care through national healthcare systems, it is part of the private sector in many other cases.

Government and Citizen Services powered by Apache Kafka

We talk a lot about customer 360 and customer experience in the private sector. The public sector is different. Increasing revenue is usually not the primary motivation. For that reason, most countries’ citizen-related processes are terrible experiences.

Nevertheless, the same fact holds true as in the private sector: Real-time data beats slow data! It increases efficiency and makes the citizens happy. I want to share some real-world examples from the US and Europe, where data in motion improved processes and reduced bureaucracy.

Norwegian Work and Welfare Department (NAV) – Personal Life as a Stream of Events

The Norwegian Work and Welfare Department (NAV) supports unemployment benefits, health insurance, social security, pensions, and parental benefits. The organization has 23,000 employees and handles USD 50Bn in disbursements per year. They assist people through all phases of life within work, family, health, retirement, and social security.

NAV had an impressive vision: Imagine a government that knows just enough about you to provide services without you applying for them first:

True Decoupling and Domain Driven Design with Apache Kafka at NAV

This vision is a reality today. NAV presented the implementation at the Kafka Summit 2018 already. NAV implemented the “life is a stream of events” vision by leveraging event streaming technology. Citizens have a tremendous real-time experience while the public administration has optimized processes and reduced cost:

Life is a Stream of Events powered by Apache Kafka at NAV

NAV’s real-time event streaming infrastructure provides several vast benefits to the enterprise architecture:

  • The integration infrastructure enables proper decoupling between the applications with domain-driven design (DDD) powered by Apache Kafka. This approach supports diverging forces in the government.
  • Treating data as a product enables announcing the arrival of events so the various business domains can trigger processes and update domain-specific data. You might know this concept from the new buzzword “data mesh“.
  • Digitalization reduced bureaucracy, not by turning paper into digital forms, but by reengineering and optimizing the business processes (by avoiding spaghetti architectures).
  • Data privacy and accuracy are ensured, including compliance with laws like GDPR via data minimization, purpose limitation, and data portability.

U.S. Department of Veterans Affairs (VA) – Improved Benefit Services

The United States Department of Veterans Affairs (VA) is a department of the federal government charged with providing life-long healthcare services to eligible military veterans.

Event streaming powered by Apache Kafka improved the government’s veteran benefit services for ratings, awards, and claims. The business value is enormous for the veterans and their families and the government:

  • Assess, route, and verify the service request in real-time.
  • Improve the experience for veterans and their families.
  • Reduce management and operations complexity.

Government providing Omnichannel Claim Management leveraging Kafka

Let’s take a look at the claim example. Several ways exist to file a claim: In-Person (VSO), internet, or postal mail into VA. Such a challenge is called omnichannel customer experience in the retail industry. The IT infrastructure needs to handle changes in the status of a claim, requests for additional documentation, context when calling into the call center, due benefits when checking in at a hospital, and so on.

Event Streaming powered by Apache Kafka is a great approach to implement omnichannel requirements due to the unique combination of real-time data processing and storage, not just in retail but also in government services. VA chose Confluent for true decoupling, real-time integration, and omnichannel communication. Components include Kafka Streams, ksqlDB, Oracle CDC, Tiered Storage, etc.

Here is an excellent quote from the Kafka Summit presentation: “Implementing Confluent enables our agile teams to create reusable solutions that unlock Veteran data; provide real-time actionable information all while reducing complexity and load on existing data sources.”

University of California, San Diego – Integration Platform as a Service (iPaaS)

The University of California, San Diego (UC) is one of the world’s leading public research universities, located in beautiful La Jolla, California. The COVID-19 pandemic forced them to do a “once in a lifetime transition”.

UC had to build out their online learning platform due to the pandemic, while adding a new #1 priority: A comfortable and reliable student testing process. The biggest challenge in this process is the integration between many legacy applications and modern technologies.

Apache Kafka is the de facto standard for modern integration and middleware projects today. You might have seen my content discussing the difference between traditional middleware like MQ, ETL, ESB, iPaaS, and the Apache Kafka ecosystem. For similar reasons, the University of California, San Diego chose Confluent as the cloud-native Integration-platform-as-a-service (iPaaS) middleware layer to set data in motion for 90 million records a day.

iPaaS “Swiss Army Knife” of Integration for the Government with Kafka

iPaaS Swiss Army Knife of Integration powered by Apache Kafka

A key benefit is faster time to market thanks to agile development and decoupled applications. Additionally, this opens up new revenue streams for other UC campuses, including the UC Office of the President, and supports tracing student health.

Government Benefit: From Cost Center to Profit Center with Kafka

A modern, scalable, and reliable real-time middleware layer enables new use cases beyond integration. A great example from UC is their use case of providing the next best action and contextual knowledge in real-time:

Next Best Action with Kafka Streams and Stream Processing

Continuous data processing from different sources in motion (aka stream processing or streaming analytics) makes these use cases possible. UC leverages Kafka Streams. Kafka-native tools such as Kafka Streams or ksqlDB enable end-to-end data processing at scale in real-time without the need for yet another big data framework (like Storm, Flink, or Spark Streaming).
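A next-best-action pipeline of this kind often boils down to enriching an event stream with the latest state from a changelog table. The following Kafka Streams sketch joins a stream of (hypothetical) student events with a profile table; all topic names and the join logic are assumptions for illustration.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class NextBestActionApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "next-best-action");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("student-events");   // assumed topic
        KTable<String, String> profiles = builder.table("student-profiles"); // assumed topic

        // Enrich each event with the latest profile state to derive a contextual action
        events.join(profiles, (event, profile) ->
                        "action for " + profile + " based on " + event)
              .to("next-best-actions"); // assumed topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```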

Data in Motion for Comfortable Citizen Services and Reduced Government Bureaucracy

Real-time data beats slow data. That’s not just true for the private sector. This post showed several real-world examples of how the government can improve processes, reduce costs and bureaucracy, and improve citizens’ experience.

Apache Kafka and its ecosystem provide the capabilities to implement a modern, scalable, reliable real-time middleware layer. Additionally, stream processing allows building new innovative applications that have not been possible before.

How do you leverage event streaming in the public sector? Are you working on citizen services or other government projects? What technologies and architectures do you use? What projects did you already work on or are in the planning? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Kafka for Cybersecurity (Part 3 of 6) – Cyber Threat Intelligence https://www.kai-waehner.de/blog/2021/07/15/kafka-cybersecurity-siem-soar-part-3-of-6-cyber-threat-intelligence/ Thu, 15 Jul 2021 06:39:35 +0000 https://www.kai-waehner.de/?p=3555

Apache Kafka became the de facto standard for processing data in motion across enterprises and industries. Cybersecurity is a key success factor across all use cases. Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to detect anomalies and respond to incidents. This blog series explores use cases and architectures for Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part three: Cyber Threat Intelligence.

Cyber Threat Intelligence with Apache Kafka and SIEM SOAR Machine Learning

Blog series: Apache Kafka for Cybersecurity

This blog series explores why security features such as RBAC, encryption, and audit logs are only the foundation of a secure event streaming infrastructure. Learn about use cases, architectures, and reference deployments for Kafka in the cybersecurity space:

Subscribe to my newsletter to get updates immediately after the publication. Besides, I will also update the above list with direct links to this blog series’s posts as soon as published.

Cyber Threat Intelligence

Threat intelligence, or cyber threat intelligence, reduces harm by improving decision-making before, during, and after cybersecurity incidents, reducing operational mean time to recovery and adversary dwell time in information technology environments.

Threat intelligence is evidence-based knowledge, including context, mechanisms, indicators, implications, and action-oriented advice about an existing or emerging menace or hazard to assets. This intelligence can be used to inform decisions regarding the subject’s response to that menace or hazard.

Threat intelligence solutions gather raw data about emerging or existing threat actors & threats from various sources. This data is then analyzed and filtered to produce threat intel feeds and management reports that contain information that automated security control solutions can use.

Threat intelligence keeps organizations informed of the risks of advanced persistent threats, zero-day threats and exploits, and how to protect against them.

Situational Awareness is Not Enough…

… but the foundation to collect and pre-process data in real-time at scale. Only real-time situational awareness enables real-time threat intelligence to provide huge benefits to the enterprise:

  • Mitigate harmful events in cyberspace
  • Proactive cybersecurity posture that is predictive, not just reactive
  • Bolster overall risk management policies
  • Improved detection of threats
  • Better decision-making during and following the detection of a cyber intrusion

In summary, threat intelligence allows to:

  • See the whole board. And see it more quickly.
  • See around corners.
  • See the enemy before they see you.

Threat Intelligence for Prevention or Mitigation across the Cyber Kill Chain

Threat intelligence is the knowledge that allows you to prevent or mitigate cyberattacks. It covers all the phases of the so-called “Cyber Kill Chain”:

Intrusion Kill Chain for InfoSec

Threat intelligence provides several benefits:

  • Empowers organizations to develop a proactive cybersecurity posture and to bolster overall risk management policies
  • Drives momentum toward a cybersecurity posture that is predictive, not just reactive
  • Enables improved detection of threats
  • Informs better decision-making during and following the detection of a cyber intrusion

Transactional Data vs. Analytics Data

Most use cases around data-in-motion are about all the data. This is true for all transactional use cases and even for many analytical use cases. Each event is valuable: A sale, an order, a payment, an alert from a machine, etc.

However, data is often full of noise. As I discussed earlier in this blog series, the goal in the cybersecurity space is to find the needle in the haystack and to reduce false-positive alerts.

SIEM, SOAR, OT, and ICS are almost always analytic processing regimes, BUT knowing when they are not is important. Kafka topics can be configured and tuned for transactions or analytics, which is unprecedented in the history of data processing. Threat intelligence (= awareness-in-motion) assumes the PATTERN is valuable, not the data. A sketch of such topic tuning profiles follows below.
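As a hedged illustration, the following Java snippet contrasts two sets of topic-level configurations, one tuned for transactional correctness and one for analytical throughput. The values are illustrative, not recommendations for any specific workload; they would typically be applied via the AdminClient or the kafka-configs tooling.

```java
import java.util.Map;

public class TopicTuningProfiles {

    // Transactional profile: durability and consistency first (illustrative values)
    static final Map<String, String> TRANSACTIONAL = Map.of(
            "min.insync.replicas", "2",                 // require two in-sync replicas per write
            "unclean.leader.election.enable", "false",  // never elect an out-of-sync leader
            "cleanup.policy", "compact");               // keep the latest state per key

    // Analytical profile: throughput and storage efficiency first (illustrative values)
    static final Map<String, String> ANALYTICAL = Map.of(
            "retention.ms", "604800000",                // keep seven days of raw events
            "cleanup.policy", "delete",                 // age out old segments
            "compression.type", "lz4");                 // cheap compression for high volume
}
```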

Analytics in Motion powered by Kafka Streams / ksqlDB

As you can hopefully imagine from the above requirements and characteristics, event streaming with Apache Kafka and its streaming analytics ecosystem is a perfect fit for the technical infrastructure for threat intelligence.

Threat detection makes sense of the signal and the noise of the data by continuously processing signatures. This enables organizations to detect, contain, and neutralize threats proactively:

Threat Intelligence with Kafka Streams ksqlDB and Machine Learning

Analytics can be many things in such a scenario, from simple custom rules to embedded rules engines and trained models.

On a high level, the advantages of using Kafka Streams or ksqlDB for threat intelligence can be described as follows:

  • A single scalable and reliable real-time infrastructure for end-to-end data integration and data processing
  • Flexibility to write custom rules and embed other rules engines, frameworks, or trained models
  • Integration with other threat detection systems like IDS, SIEM, SOAR

The business logic for cyber threat detection looks different for every use case. Known attack patterns like MITRE ATT&CK help with the implementation. However, situational awareness and threat detection also need to detect unknown anomalies.
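
As an illustration of this flexibility, here is a minimal Kafka Streams sketch that combines a hand-written rule with a pluggable scoring function where a rules engine or trained model could be embedded. The topic names, the JSON string match, and the score() stub are hypothetical placeholders:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class ThreatDetectionApp {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> events = builder.stream("auth-events",
            Consumed.with(Serdes.String(), Serdes.String()));

        events
            // Custom rule: a naive string match as a stand-in for a real rules engine.
            .filter((sourceIp, json) -> json.contains("\"login_failed\"")
                // Embedded analytics: delegate to a scoring function where a
                // trained model (e.g., loaded at startup) could be plugged in.
                || score(json) > 0.9)
            .to("threat-alerts", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "threat-detection");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }

    // Stub: replace with an actual model invocation or rules engine call.
    static double score(String json) {
        return 0.0;
    }
}
```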

Let’s now take a look at a concrete example.

Intel’s Cyber Intelligence Platform

Let me quote Intel themselves:

As cyber threats continuously grow in sophistication and frequency, companies need to quickly acclimate to effectively detect, respond, and protect their environments. At Intel, we’ve addressed this need by implementing a modern, scalable Cyber Intelligence Platform (CIP) based on Splunk and Confluent. We believe that CIP positions us for the best defense against cyber threats well into the future.

Our CIP ingests tens of terabytes of data each day and transforms it into actionable insights through streams processing, context-smart applications, and advanced analytics techniques. Kafka serves as a massive data pipeline within the platform. It provides us the ability to operate on data in-stream, enabling us to reduce Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR). Faster detection and response ultimately leads to better prevention.

Let’s explore Intel’s CIP for threat intelligence in more detail.

Detecting Vulnerabilities with Stream Processing

Intel’s CIP leverages the whole Kafka ecosystem provided by Confluent:

  • Ingestion: Kafka producer clients for various sources such as databases, scanning engines, IP address management, asset management inventory, etc.
  • Streaming analytics: Kafka Streams for filtering vulnerabilities by business unit, joining asset ownership with vulnerable assets, etc.
  • Egress: Kafka Connect sink connectors for data lakes, IT partners, other business units, SIEM, SOAR, etc.
  • High availability: Multi-Region Clusters (MRC) for high availability across regions
  • And much more…

Here is a high-level architecture:

Stream Processing with Kafka at Intel
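
Zooming into the streaming analytics step of this architecture: "joining asset ownership with vulnerable assets" can be pictured as a stream-table join in Kafka Streams. The following sketch is speculative and illustrative only, not Intel's actual implementation; topic names and value formats are invented:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class VulnerabilityEnrichment {
    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Changelog of asset ownership, keyed by asset id (e.g., hostname).
        KTable<String, String> owners = builder.table("asset-ownership",
            Consumed.with(Serdes.String(), Serdes.String()));

        // Stream of findings from the scanning engines, keyed the same way.
        KStream<String, String> vulns = builder.stream("vulnerability-findings",
            Consumed.with(Serdes.String(), Serdes.String()));

        // Enrich each finding with its owner so downstream consumers
        // (SIEM, ticketing, business units) receive contextually rich events.
        vulns.join(owners, (finding, owner) -> finding + " | owner=" + owner)
             .to("vulnerabilities-enriched", Produced.with(Serdes.String(), Serdes.String()));

        return builder;
    }
}
```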

Intel’s Kafka Maturity Timeline

Building a cybersecurity infrastructure is not a big bang. A step-by-step approach starts with integrating the first sources and sinks, some simple stream processing, and deployment as a pilot project. Over time, more and more data sources and sinks are added, the business logic gets more powerful, and the scale increases.

Intel’s Kafka maturity timeline shows their learning curve:

Intel Kafka Maturity Timeline

Kafka Benefits to Intel

Intel describes their benefits for leveraging event streaming as follows:

  • Economies of scale
  • Operate on data in the stream
  • Reduce technical debt and downstream costs
  • Generates contextually rich data
  • Global scale and reach
  • Always on
  • Modern architecture with a thriving community
  • Kafka leadership through Confluent expertise

Those are pretty much the same reasons I use in many of my other blog posts to explain the rise of data in motion powered by Apache Kafka across industries and use cases. 🙂

For more intel on Intel’s Cyber Intelligence Platform powered by Confluent and Splunk, check out their whitepaper and Kafka Summit talk.

Scalable Real-time Cyber Threat Intelligence with Kafka

Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to implement cyber threat intelligence.

The Cyber Intelligence Platform from Intel is a great example of a Kafka-powered cybersecurity solution. It leverages the whole Kafka ecosystem to build a scalable and reliable real-time integration and processing layer. The streaming analytics logic depends on the use case. It can cover simple business logic but also external rules engines or analytic models.

How do you fight against cybersecurity risks? What technologies and architectures do you use to implement cyber threat intelligence? How does Kafka complement other tools? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka for Cybersecurity (Part 3 of 6) – Cyber Threat Intelligence appeared first on Kai Waehner.

]]>
Kafka for Cybersecurity (Part 2 of 6) – Cyber Situational Awareness https://www.kai-waehner.de/blog/2021/07/08/kafka-cybersecurity-part-2-of-6-cyber-situational-awareness-real-time-scalability/ Thu, 08 Jul 2021 09:01:29 +0000 https://www.kai-waehner.de/?p=3529 This blog series explores use cases and architectures for Apache Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part two: Situational awareness with continuous real-time data integration and data processing at scale.

The post Kafka for Cybersecurity (Part 2 of 6) – Cyber Situational Awareness appeared first on Kai Waehner.

]]>
Apache Kafka became the de facto standard for processing data in motion across enterprises and industries. Cybersecurity is a key success factor across all use cases. Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to detect anomalies and respond to incidents. This blog series explores use cases and architectures for Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part two: Cyber Situational Awareness.

Cyber Situational Awareness with Apache Kafka

Blog series: Apache Kafka for Cybersecurity

This blog series explores why security features such as RBAC, encryption, and audit logs are only the foundation of a secure event streaming infrastructure. Learn about use cases, architectures, and reference deployments for Kafka in the cybersecurity space:

  • Part 1 – Overview
  • Part 2 – Cyber Situational Awareness (this post)
  • Part 3 – Cyber Threat Intelligence
  • Part 4 – Forensics
  • Part 5 – Air-Gapped and Zero Trust Environments
  • Part 6 – SIEM / SOAR Modernization

Subscribe to my newsletter to get updates immediately after publication. I will also update the above list with direct links to the posts of this blog series as soon as they are published.

The Four Stages of an Adaptive Security Architecture

Gartner defines four stages of adaptive security architecture to prevent, detect, respond and predict cybersecurity incidents:

The Four Stages of an Adaptive Security Architecture by Gartner

Continuous monitoring and analytics are the keys to building a successful proactive cybersecurity solution. It should be obvious: continuous monitoring and analytics require a scalable real-time infrastructure. Batch processes over data at rest, i.e., data stored in a database, data warehouse, or data lake, cannot provide continuous real-time monitoring.

Situational Awareness

“Situation awareness is the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future.” Source: Endsley, M. R. SAGAT: A methodology for the measurement of situation awareness (NOR DOC 87-83). Hawthorne, CA: Northrop Corp.

Here is a theoretical view on situational awareness and the relation between humans and computers:

Human – Computer Interface for Decision Making
Endsley, M. R. Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors, 1995

Cyber Situational Awareness = Continuous Real-Time Analytics

Cyber Situational Awareness is the subset of situational awareness necessary to support taking action in the cyber domain. It is the mandatory key concept for defending against cybersecurity attacks.

Automation and analytics in real-time are key characteristics:

Situational Awareness and Automated Analytics
Endsley, M. R. Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors, 1995

No matter how good the threat detection algorithms and security platforms are, prevention or at least detection of attacks needs to happen in real-time. And predictions with cutting-edge machine learning models do not help if they are executed in a batch process overnight.

Situational awareness covers various levels beyond the raw network events. It includes all environments, including application data, logs, people, and processes.

I covered the challenges in the first post of this blog series. In summary, cybersecurity experts’ key challenge is finding the needle(s) in the haystack. The haystack is typically huge, i.e., massive volumes of data. Often, it is not just one haystack but many. Hence, a key task is to reduce false positives.

Situational awareness is not just about viewing a dashboard but about understanding what's going on in real-time. Situational awareness finds the relevant data to create critical (rare) alerts automatically. No human can handle the huge volumes of data.

Situational Awareness in Motion with Kafka

The Kafka ecosystem provides the components to correlate massive volumes of data in real-time, providing situational awareness across the enterprise and finding all the needles in the haystack:

The Confluent Curation Fabric for Cybersecurity powered by Apache Kafka and KSQL

Event streaming powered by the Kafka ecosystem delivers contextually rich data to reduce false positives:

  • Collect all events from data sources with Kafka Connect
  • Filter event streams with Kafka Connect’s Single Message Transforms (SMT) so that only relevant data gets into the Kafka topic
  • Empower real-time streaming applications with Kafka Streams or ksqlDB to correlate events across various source interfaces
  • Forward priority events to other systems such as the SIEM/SOAR with Kafka Connect or any other Kafka client (Java, C, C++, .NET, Go, JavaScript, HTTP via REST Proxy, etc.)
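
The last step of the list above, forwarding only priority events, can be expressed as a branched stream. Here is a minimal sketch using the Kafka Streams split() API (available since Kafka 2.8); the topic names and the severity predicate are assumptions:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;

public class AlertRouting {
    public static void route(StreamsBuilder builder) {
        KStream<String, String> alerts = builder.stream("correlated-events",
            Consumed.with(Serdes.String(), Serdes.String()));

        alerts.split()
              // Priority events go straight to the SIEM/SOAR ingest topic.
              .branch((key, value) -> value.contains("\"severity\":\"critical\""),
                      Branched.withConsumer(ks -> ks.to("siem-priority-events")))
              // Everything else lands in a cheaper archive topic.
              .defaultBranch(Branched.withConsumer(ks -> ks.to("events-archive")));
    }
}
```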

Example: Situational Awareness with Kafka and SIEM/SOAR

SIEM/SOAR modernization is its own blog post of this series. But the following picture depicts how Kafka enables true decoupling between applications in a domain-driven design (DDD):

Deliver Contextually Rich Data To Reduce False Positives

A traditional data store like a data lake is NOT the right spot for implementing situational awareness as it is data at rest. Data at rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) require this approach. Real-time beats slow data. Hence, event streaming with the de facto standard Apache Kafka is the right fit for situational awareness. 

Event streaming and data lake technologies are complementary, not competitive. The blog post “Serverless Kafka in a Cloud-native Data Lake Architecture” explores this discussion in much more detail by looking at AWS’ lake house strategy and its relation to event streaming.

The Data

Situational awareness requires data. A lot of data. Across various interfaces and communication paradigms. A few examples:

  • Text files (TXT)
  • Firewalls and network devices
  • Binary files
  • Antivirus
  • Databases
  • Access
  • APIs
  • Audit logs
  • Network flows
  • Intrusion detection
  • Syslog
  • And many more…

Let’s look at the three steps of implementing situational awareness: Data producers, data normalization and enrichment, and data consumers.

Data Producers

Data comes from various sources. This includes real-time systems, batch systems, files, and much more. The data includes high-volume logs (including Netflow and, indirectly, PCAP) and low-volume transactional events:

Data Producers

Data Normalization and Enrichment

The key success factor to implementing situational awareness is data correlation in real-time at scale. This includes data normalization and processing such as filter, aggregate, transform, etc.:

Data Normalization and Enrichment for Situational Awareness with Kafka

With Kafka, end-to-end data integration and continuous stream processing are possible with a single scalable and reliable platform. This is very different from the traditional MQ/ETL/ESB approach. Data governance concepts for enforcing data structures and ensuring data quality are crucial on both the client side and the server side. For this reason, the Schema Registry is a mandatory component in most Kafka architectures.
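
To illustrate the client-side enforcement, here is a minimal producer configuration using Confluent's Avro serializer, which validates every record against the schema registered for the topic. The URLs, the topic name, and the record placeholder are assumptions:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class GovernedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // Confluent's Avro serializer enforces the registered schema on the client side.
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            // Placeholder; in practice this is an Avro GenericRecord or generated
            // SpecificRecord that conforms to the schema registered for the topic.
            Object record = null;
            producer.send(new ProducerRecord<>("syslog-events", "host-42", record));
        }
    }
}
```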

Data Consumers

A single system cannot implement cyber situational awareness. Different questions, challenges, and problems require different tools. Hence, most Kafka deployments run various Kafka consumers using different communication paradigms at different speeds:

Data Consumers

Some workloads require data correlation in real-time to detect anomalies or even prevent threats as soon as possible. This is where Kafka-native applications come into play. The client technology is flexible depending on the infrastructure, use case, and developer experience: Java, C, C++, and Go are some coding options. Kafka Streams and ksqlDB provide out-of-the-box stream processing capabilities. The latter is the recommended option, as it provides many features built-in, such as sliding windows for stateful aggregations.
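
As a sketch of what such a stateful aggregation looks like in Kafka Streams (ksqlDB expresses the same logic declaratively in SQL), here is a windowed count of events per source IP. The topic name, the assumption that events are keyed by source IP, and the five-minute window size are all illustrative; TimeWindows.ofSizeWithNoGrace requires Kafka 3.0 or newer:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedCounts {
    public static void build(StreamsBuilder builder) {
        builder.stream("network-events", Consumed.with(Serdes.String(), Serdes.String()))
               // Events are assumed to be keyed by the source IP address.
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // Count events per source IP in five-minute tumbling windows.
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
               .count()
               .toStream()
               // Downstream this would feed an alerting topic; print for the sketch.
               .foreach((windowedIp, count) ->
                   System.out.printf("%s -> %d events%n", windowedIp.key(), count));
    }
}
```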

A SIEM, SOAR, or data lake is complementary to run other analytics use cases for threat detection, intrusion detection, or reporting. The SIEM/SOAR modernization blog post of this series explores this combination in more detail.

Situational Awareness with Kafka and Sigma

Let’s take a look at a concrete example. A few of my colleagues built a great implementation for cyber situational awareness: Confluent Sigma. So, what is it?

Sigma – An Open Signature Format for Cyber Detection

Sigma is a generic and open signature format that allows you to describe relevant log events straightforwardly. The rule format is very flexible, easy to write, and applicable to any log file. The main purpose of the project is to provide a structured form in which cybersecurity engineers can describe their detection methods and share them with others, either within the company or with the broader community.

A few characteristics that describe Sigma:

  • Open-source framework
  • A domain-specific language (DSL)
  • Specify patterns in cyber data
  • Sigma is for log files what Snort is for network traffic, and YARA is for files

Sigma provides integration with various tools such as ArcSight, QRadar, Splunk, Elasticsearch, Sumologic, and many more. However, as you learned in this post, many scenarios for cyber situational awareness require real-time data correlation at scale. That’s where Kafka comes into play. Having said this, a huge benefit is that you can specify a Sigma signature once and then use all the mentioned tools.

Confluent Sigma

Confluent Sigma is an open-source project implemented by a few of my colleagues. Kudos to Michael Peacock, William LaForest, and a few more. The project integrates Sigma into Kafka by embedding the Sigma rules into stream processing applications powered by Kafka Streams and ksqlDB:

Confluent Sigma for Situational Awareness powered by Apache Kafka

Situational Awareness with Zeek, Kafka Streams, KSQL, and Sigma

Here is a concrete event streaming architecture for situational awareness:

Confluent Sigma for Kafka powered Cybersecurity and Situational Awareness

A few notes on the above picture:

  • Sigma defines the signature rules
  • Zeek provides incoming IDS log events at high volume in real-time
  • Confluent Platform processes and correlates the data in real-time
  • The stream processing part built with Kafka Streams and ksqlDB includes stateless functions such as filtering and stateful functions such as aggregations
  • The calculated detections get ingested into a Zeek dashboard and other Kafka consumers

Here is an example of a Sigma Rule for windowing and aggregation of logs:

Sigma Rule with Aggregation

The Kafka Streams topology of the example looks like this:

Sigma Stream Topology with Kafka Streams
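
To give a feel for what such a topology does, here is a strongly simplified Kafka Streams sketch of the pattern: a stateless filter against a rule condition, a stateful windowed count per source, and a detection event once a threshold is crossed. This illustrates the pattern only and is not the actual Confluent Sigma code; topic names, the rule condition, and the threshold are invented:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class SigmaLikeDetection {
    public static void build(StreamsBuilder builder) {
        builder.stream("zeek-dns", Consumed.with(Serdes.String(), Serdes.String()))
               // Stateless step: keep only records matching the rule condition.
               // A naive length check stands in for a real Sigma condition.
               .filter((sourceIp, dnsJson) -> dnsJson.length() > 500)
               // Stateful step: count matches per source IP in a one-minute window.
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .count()
               // Emit a detection only when the count crosses the rule's threshold.
               .filter((windowedIp, count) -> count > 100)
               .toStream((windowedIp, count) -> windowedIp.key())
               .mapValues(count -> "DNS anomaly suspected: " + count + " matches/min")
               .to("sigma-detections", Produced.with(Serdes.String(), Serdes.String()));
    }
}
```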

My colleagues will do a webinar to demonstrate Confluent Sigma in more detail, including a live demo. I will update and share the on-demand link here as soon as it is available. Some demo code is available on GitHub.

Cisco ThousandEyes Endpoint Agents

Let's take a look at a concrete Kafka-native example of implementing situational awareness at scale in real-time. Cisco ThousandEyes Endpoint Agents is a monitoring solution to gain visibility from any employee to any application over any network. It provides proactive and real-time monitoring of application experience and network connectivity.

The platform leverages the whole Kafka ecosystem for data integration and stream processing:

  • Kafka Streams for stateful network tests
  • Interactive queries for fetching results (see the sketch after this list)
  • Kafka Streams for windowed aggregations for alerting use cases
  • Kafka Connect for integration with backend systems such as MySQL, Elastic, MongoDB
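
Interactive queries let an application read the latest aggregation results directly from the local state store of a running Kafka Streams instance, without a detour through an external database. Here is a minimal sketch; the store name and value type are assumptions that would have to match a materialized aggregation in the topology:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class TestResultLookup {
    // Query the local state of a running topology, e.g. the latest
    // aggregated network-test result per agent id.
    public static Long latestResult(KafkaStreams streams, String agentId) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
            StoreQueryParameters.fromNameAndType(
                "network-test-results",                // assumed store name
                QueryableStoreTypes.keyValueStore()));
        return store.get(agentId);
    }
}
```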

ThousandEyes’ tech blog is a great resource to understand the implementation in more detail.

Kafka is a Key Piece to Implement Cyber Situational Awareness

Cyber situational awareness is mandatory to defend against cybersecurity attacks. A successful implementation requires continuous real-time analytics at scale. This is why the Kafka ecosystem is a perfect fit.

The Confluent Sigma implementation shows how to build a flexible but scalable and reliable architecture for realizing situational awareness. Event streaming is a key piece of the puzzle.

However, it is not a replacement for other tools such as Zeek for network analysis or Splunk as SIEM. Instead, event streaming complements these technologies and provides a central nervous system that connects and truly decouples these other systems. Additionally, the Kafka ecosystem provides the right tools for real-time stream processing.

How did you implement cyber situational awareness? Is it already real-time and scalable? What technologies and architectures do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka for Cybersecurity (Part 2 of 6) – Cyber Situational Awareness appeared first on Kai Waehner.

]]>