Kafka Connect Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/kafka-connect/

Fraud Detection in Mobility Services (Ride-Hailing, Food Delivery) with Data Streaming using Apache Kafka and Flink
https://www.kai-waehner.de/blog/2025/04/28/fraud-detection-in-mobility-services-ride-hailing-food-delivery-with-data-streaming-using-apache-kafka-and-flink/
Mon, 28 Apr 2025 06:29:25 +0000

Mobility services like Uber, Grab, and FREE NOW (Lyft) rely on real-time data to power seamless trips, deliveries, and payments. But this real-time nature also opens the door to sophisticated fraud schemes—ranging from GPS spoofing to payment abuse and fake accounts. Traditional fraud detection methods fall short in speed and adaptability. By using Apache Kafka and Apache Flink, leading mobility platforms now detect and block fraud as it happens, protecting their revenue, users, and trust. This blog explores how real-time data streaming is transforming fraud prevention across the mobility industry.

Mobility services like Uber, Grab, FREE NOW (Lyft), and DoorDash are built on real-time data. Every trip, delivery, and payment relies on accurate, instant decision-making. But as these services scale, they become prime targets for sophisticated fraud—GPS spoofing, fake accounts, payment abuse, and more. Traditional, batch-based fraud detection can’t keep up. It reacts too late, misses complex patterns, and creates blind spots that fraudsters exploit. To stop fraud before it happens, mobility platforms need data streaming technologies like Apache Kafka and Apache Flink for fraud detection. This blog explores how leading platforms are using real-time event processing to detect and block fraud as it happens—protecting revenue, user trust, and platform integrity at scale.

Fraud Prevention in Mobility Services with Data Streaming using Apache Kafka and Flink with AI Machine Learning

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

The Business of Mobility Services (Ride-Hailing, Food Delivery, Taxi Aggregators, etc.)

Mobility services have become an essential part of modern urban life. They offer convenience and efficiency through ride-hailing, food delivery, car-sharing, e-scooters, taxi aggregators, and micro-mobility options. Companies such as Uber, Lyft, FREE NOW (formerly MyTaxi; recently acquired by Lyft), Grab, Careem, and DoorDash connect millions of passengers, drivers, restaurants, retailers, and logistics partners to enable seamless transactions through digital platforms.

Taxis and Delivery Services in a Modern Smart City

These platforms operate in highly dynamic environments where real-time data is crucial for pricing, route optimization, customer experience, and fraud detection. However, this very nature of mobility services also makes them prime targets for fraudulent activities. Fraud in this sector can lead to financial losses, reputational damage, and deteriorating customer trust.

To effectively combat fraud, mobility services must rely on real-time data streaming with technologies such as Apache Kafka and Apache Flink. These technologies enable continuous event processing and allow platforms to detect and prevent fraud before transactions are finalized.

Why Fraud is a Major Challenge in Mobility Services

Fraudsters continually exploit weaknesses in digital mobility platforms. Some of the most common fraud types include:

  1. Fake Rides and GPS Spoofing: Drivers manipulate GPS data to simulate trips that never occurred. Passengers use location spoofing to receive cheaper fares or exploit promotions.
  2. Payment Fraud and Stolen Credit Cards: Fraudsters use stolen payment methods to book rides or order food.
  3. Fake Drivers and Passengers: Fraudsters create multiple accounts and pretend to be both the driver and passenger to collect incentives. Some drivers manipulate fares by manually adjusting distances in their favor.
  4. Promo Abuse: Users create multiple fake accounts to exploit referral bonuses and promo discounts.
  5. Account Takeovers and Identity Fraud: Hackers gain access to legitimate accounts, misusing stored payment information. Fraudsters use fake identities to bypass security measures.

Fraud not only impacts revenue but also creates risks for legitimate users and drivers. Without proper fraud prevention measures, ride-hailing and delivery companies could face serious losses, both financially and operationally.

The Unseen Enemy: Core Challenges in Mobility Fraud Detection

Traditional fraud detection relies on batch processing and manual rule-based systems. These approaches are no longer effective: modern mobile apps deliver real-time experiences at high speed, and fraud schemes have become just as fast and complex.

Payment Fraud – The Hidden Enemy in a Digital World

Key challenges in mobility fraud detection include:

  • Fraud occurs in real-time, requiring instant detection and prevention before transactions are completed.
  • Millions of events per second must be processed, requiring scalable and efficient systems.
  • Fraud patterns constantly evolve, making static rule-based approaches ineffective.
  • Platforms operate across hybrid and multi-cloud environments, requiring seamless integration of fraud detection systems.

To overcome these challenges, real-time streaming analytics powered by Apache Kafka and Apache Flink provide an effective solution.

Event-driven Architecture for Mobility Services with Data Streaming using Apache Kafka and Flink

Apache Kafka: The Backbone of Event-Driven Fraud Detection

Kafka serves as the core event streaming platform. It captures and processes real-time data from multiple sources such as:

  • GPS location data
  • Payment transactions
  • User and driver behavior analytics
  • Device fingerprints and network metadata

Kafka provides:

  • High-throughput data streaming, capable of processing millions of events per second to support real-time decision-making.
  • An event-driven architecture that enables decoupled, flexible systems—ideal for scalable and maintainable mobility platforms.
  • Seamless scalability across hybrid and multi-cloud environments to meet growing demand and regional expansion.
  • Always-on reliability, ensuring 24/7 data availability and consistency for mission-critical services such as fraud detection, pricing, and trip orchestration.
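To make this concrete, below is a minimal sketch of a Kafka producer publishing GPS events for downstream fraud detection. The topic name, the keying by driver ID, and the JSON payload are illustrative assumptions and not taken from any specific platform.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class GpsEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // durability matters for fraud-relevant events

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by driver ID keeps all events of one driver in the same partition,
            // so downstream trip analysis sees them in order.
            String driverId = "driver-4711";
            String gpsEvent = "{\"driverId\":\"driver-4711\",\"lat\":52.52,\"lon\":13.40,\"ts\":1714286400000}";
            producer.send(new ProducerRecord<>("ride-gps-events", driverId, gpsEvent));
        }
    }
}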

An excellent success story about the transition to data streaming comes from DoorDash: Why DoorDash migrated from Cloud-native Amazon SQS and Kinesis to Apache Kafka and Flink.

Apache Flink enables real-time fraud detection through advanced event correlation and applied AI:

  • Detects anomalies in GPS data, such as sudden jumps, route manipulation, or unrealistic movement patterns.
  • Analyzes historical user behavior to surface signs of account takeovers or other forms of identity misuse.
  • Joins multiple real-time streams—including payment events, location updates, and account interactions—to generate accurate, low-latency fraud scores.
  • Applies machine learning models in-stream, enabling the system to flag and stop suspicious transactions before they are processed.
  • Continuously adapts to new fraud patterns, updating models with fresh data in near real-time to reflect evolving user behavior and emerging threats.
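As a simplified illustration of the first point above, the following Flink DataStream sketch flags GPS jumps that imply an impossible speed between two consecutive points of the same driver. The GpsEvent POJO, the stream wiring (shown only as a comment), and the 300 km/h threshold are assumptions for this example; production systems combine such rules with ML-based scoring.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Flags consecutive GPS points of one driver that imply an impossible speed. */
public class GpsJumpDetector extends KeyedProcessFunction<String, GpsJumpDetector.GpsEvent, String> {

    /** Hypothetical GPS event POJO. */
    public static class GpsEvent {
        public String driverId;
        public double lat;
        public double lon;
        public long timestampMillis;
    }

    private transient ValueState<GpsEvent> lastEvent;

    @Override
    public void open(Configuration parameters) {
        lastEvent = getRuntimeContext().getState(
                new ValueStateDescriptor<>("last-gps-event", GpsEvent.class));
    }

    @Override
    public void processElement(GpsEvent current, Context ctx, Collector<String> out) throws Exception {
        GpsEvent previous = lastEvent.value();
        if (previous != null) {
            double km = haversineKm(previous.lat, previous.lon, current.lat, current.lon);
            double hours = (current.timestampMillis - previous.timestampMillis) / 3_600_000.0;
            if (hours > 0 && km / hours > 300.0) { // faster than any car, likely spoofed
                out.collect("Possible GPS spoofing for driver " + current.driverId);
            }
        }
        lastEvent.update(current);
    }

    private static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                  * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }
}

// Usage (keyed by driver): gpsStream.keyBy(e -> e.driverId).process(new GpsJumpDetector())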

With Kafka and Flink, fraud detection can shift from reactive to proactive to stop fraudulent transactions before they are completed.

I already covered various data streaming success stories from financial services companies such as PayPal, Capital One, and ING Bank in a dedicated blog post. And there is a separate case study about “Fraud Prevention in Under 60 Seconds with Apache Kafka: How A Bank in Thailand is Leading the Charge”.

Real-World Fraud Prevention Stories from Mobility Leaders

Fraud is not just a technical issue—it’s a business-critical challenge that impacts trust, revenue, and operational stability in mobility services. The following real-world examples from industry leaders like FREE NOW (Lyft), Grab, and Uber show how data streaming with advanced stream processing and AI are used around the world to detect and stop fraud in real time, at massive scale.

FREE NOW (Lyft): Detecting Fraudulent Trips in Real Time by Analyzing GPS Data of Cars

FREE NOW operates in more than 150 cities across Europe with 48 million users. It integrates multiple mobility services, including taxis, private vehicles, car-sharing, e-scooters, and bikes.

The company was recently acquired by Lyft, the U.S.-based ride-hailing giant known for its focus on multimodal urban transport and strong presence in North America. This acquisition marks Lyft’s strategic entry into the European mobility ecosystem, expanding its footprint beyond the U.S. and Canada.

FREE NOW - former MyTaxi - Company Overview
Source: FREE NOW

Fraud Prevention Approach leveraging Data Streaming (presented at Kafka Summit)

  • Uses Kafka Streams and Kafka Connect to analyze GPS trip data in real-time.
  • Deploys fraud detection models that identify anomalies in trip routes and fare calculations.
  • Operates data streaming on fully managed Confluent Cloud and applications on Kubernetes for scalable fraud detection.
Fraud Prevention in Mobility Services with Data Streaming using Kafka Streams and Connect at FREE NOW
Source: FREE NOW

Example: Detecting Fake Rides

  1. A driver inputs trip details into the app.
  2. Kafka Streams predicts expected trip fare based on distance and duration.
  3. GPS anomalies and unexpected route changes are flagged.
  4. Fraud alerts are triggered for suspicious transactions.
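A minimal Kafka Streams sketch of steps 2 to 4 could look like the following. The topic names, the CSV-style message format, and the naive linear fare model are assumptions for illustration only; real fare models are far more sophisticated.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class FareAnomalyApp {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topic; value format: "tripId,distanceKm,durationMin,chargedFare"
        KStream<String, String> trips = builder.stream("completed-trips");

        trips.filter((tripId, value) -> {
                    String[] fields = value.split(",");
                    double distanceKm = Double.parseDouble(fields[1]);
                    double durationMin = Double.parseDouble(fields[2]);
                    double chargedFare = Double.parseDouble(fields[3]);
                    // Naive fare model: base fee + per-km rate + per-minute rate.
                    double expectedFare = 3.0 + 1.5 * distanceKm + 0.3 * durationMin;
                    // Flag trips charged far above what the model predicts.
                    return chargedFare > expectedFare * 1.5;
                })
             .to("suspicious-trips");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fare-anomaly-detector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}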

By implementing real-time fraud detection with the Kafka ecosystem, FREE NOW (Lyft) has significantly reduced fraudulent trips and improved platform security.

Grab: AI-Powered Fraud Detection for Ride-Hailing and Delivery with Data Streaming and AI/ML

Grab is a leading mobility platform in Southeast Asia, handling millions of transactions daily. Fraud accounts for 1.6 percent of total revenue loss in the region.

To address these significant fraud numbers, Grab developed GrabDefence—an AI-powered fraud detection engine that leverages real-time data and machine learning to detect and block suspicious activity across its platform.

Fraud Detection and Presentation with Kafka and AI ML at Grab in Asia
Source: Grab

Fraud Detection Approach

  • Uses Kafka Streams and machine learning for fraud risk scoring.
  • Leverages Flink for feature aggregation and anomaly detection.
  • Detects fraudulent transactions before they are completed.
GrabDefence - Fraud Prevention with Data Streaming and AI / Machine Learning in Grab Mobility Service
Source: Grab

Example: Fake Driver and Passenger Fraud

  1. Fraudsters create accounts as both driver and passenger to claim rewards.
  2. Kafka ingests device fingerprints, payment transactions, and ride data.
  3. Flink aggregates historical fraud behavior and assigns risk scores.
  4. High-risk transactions are blocked instantly.
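A strongly simplified sketch of the same-device check behind steps 1 to 3 is shown below. It uses Kafka Streams for brevity (GrabDefence itself relies on Flink and machine learning models for risk scoring), and the topic names and message format are hypothetical.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class SameDeviceFraudTopology {

    public StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topic; key = device fingerprint, value = "accountId:ROLE" with ROLE in {DRIVER, PASSENGER}.
        // Default key/value serdes are assumed to be configured as String serdes.
        KStream<String, String> logins = builder.stream("account-device-logins");

        // Remember which roles have been seen per device fingerprint.
        KTable<String, String> rolesPerDevice = logins
                .mapValues(value -> value.split(":")[1])
                .groupByKey()
                .reduce((seenRoles, role) -> seenRoles.contains(role) ? seenRoles : seenRoles + "," + role,
                        Materialized.with(Serdes.String(), Serdes.String()));

        // A device acting as both driver and passenger is a strong incentive-fraud signal.
        rolesPerDevice.toStream()
                .filter((device, roles) -> roles.contains("DRIVER") && roles.contains("PASSENGER"))
                .to("fraud-risk-alerts");

        return builder;
    }
}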

With GrabDefence, built on data streaming, Grab reduced its fraud rate to 0.2 percent, well below the industry average. Learn more about GrabDefence in the Kafka Summit talk.

Uber: Project RADAR – AI-Powered Fraud Detection with Human Oversight

Uber processes millions of payments per second globally. Fraud detection is complex due to chargebacks and uncollected payments.

To combat this, Uber launched Project RADAR—a hybrid system that combines machine learning with human reviewers to continuously detect, investigate, and adapt to evolving fraud patterns in near real time. Low latency is not required in this scenario, and humans are in the loop of the business process. Hence, Apache Spark is sufficient for Uber.

Uber Project Radar for Scam Detection with Humans in the Loop
Source: Uber

Fraud Prevention Approach

  • Uses Kafka and Spark for multi-layered fraud detection.
  • Implements machine learning models to detect chargeback fraud.
  • Incorporates human analysts for rule validation.
Uber Project RADAR with Apache Kafka and Spark for Scam Detection with AI and Machine Learning
Source: Uber

Example: Chargeback Fraud Detection

  1. Kafka collects all ride transactions in real time.
  2. Stream processing detects anomalies in payment patterns and disputes.
  3. AI-based fraud scoring identifies high-risk transactions.
  4. Uber’s RADAR system allows human analysts to validate fraud alerts.
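A simplified sketch of steps 1 to 3 with Spark Structured Streaming (reading from Kafka) could look as follows. The topic name, JSON fields, the one-day window, and the threshold of three disputes are illustrative assumptions; the actual RADAR models and rules are far more elaborate, and the required spark-sql-kafka package is assumed to be on the classpath.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class ChargebackMonitor {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("chargeback-monitor").getOrCreate();

        // Hypothetical topic with JSON dispute events; field names are illustrative.
        Dataset<Row> disputes = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "payment-disputes")
                .load()
                .selectExpr("CAST(value AS STRING) AS json")
                .selectExpr("get_json_object(json, '$.accountId') AS accountId",
                            "CAST(get_json_object(json, '$.eventTime') AS TIMESTAMP) AS eventTime");

        // Count disputes per account per day; many disputes are a chargeback-fraud signal
        // that human analysts then review (as in RADAR).
        Dataset<Row> flagged = disputes
                .withWatermark("eventTime", "1 hour")
                .groupBy(window(col("eventTime"), "1 day"), col("accountId"))
                .count()
                .filter(col("count").gt(3));

        StreamingQuery query = flagged.writeStream()
                .outputMode("update")
                .format("console")
                .start();
        query.awaitTermination();
    }
}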

Uber’s combination of AI-driven detection and human oversight has significantly reduced chargeback-related fraud.

Fraud in mobility services is a real-time challenge that requires real-time solutions working 24/7, even at extreme scale with millions of events. Traditional batch processing systems are too slow, and static rule-based approaches cannot keep up with evolving fraud tactics.

By leveraging data streaming with Apache Kafka in conjunction with Kafka Streams or Apache Flink, mobility platforms can:

  • Process millions of events per second to detect fraud in real time.
  • Prevent fraudulent transactions before they occur.
  • Use AI-driven real-time fraud scoring for accurate risk assessment.
  • Adapt dynamically through continuous learning to evolving fraud patterns.

Mobility platforms such as Uber, Grab, and FREE NOW (Lyft) are leading the way in using real-time streaming analytics to protect their platforms from fraud. By implementing similar approaches, other mobility businesses can enhance security, reduce financial losses, and maintain customer trust.

Real-time fraud prevention in mobility services is not an option; it is a necessity. The ability to detect and stop fraud in real time will define the future success of ride-hailing, food delivery, and urban mobility platforms.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And download my free book about data streaming use cases.

Apache Flink: Overkill for Simple, Stateless Stream Processing and ETL?
https://www.kai-waehner.de/blog/2025/01/14/apache-flink-overkill-for-simple-stateless-stream-processing/
Tue, 14 Jan 2025 07:48:04 +0000

Discover when Apache Flink is the right tool for your stream processing needs. Explore its role in stateful and stateless processing, the advantages of serverless Flink SaaS solutions like Confluent Cloud, and how it supports advanced analytics and real-time data integration together with Apache Kafka. Dive into the trade-offs, deployment options, and strategies for leveraging Flink effectively across cloud, on-premise, and edge environments, and when to use Kafka Streams or Single Message Transforms (SMT) within Kafka Connect for ETL instead of Flink.

When discussing stream processing engines, Apache Flink often takes center stage for its advanced capabilities in stateful stream processing and real-time data analytics. However, a common question arises: is Flink too heavyweight for simple, stateless stream processing  and ETL tasks? The short answer for open-source Flink is often yes. But the story evolves significantly when looking at SaaS Flink products such as Confluent Cloud’s Flink offering, with its serverless architecture, multi-tenancy, consumption-based pricing, and no-code/low-code capabilities like Flink Actions. This post explores the considerations and trade-offs to help you decide when Flink is the right tool for your data streaming needs, and when Kafka Streams or Single Message Transform (SMT) within Kafka Connect are the better choice.

Apache Flink - Overkill for Simple Stateless Stream Processing

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch.

The Nature of Stateless Stream Processing

Stateless stream processing, as the name implies, processes each event independently, with no reliance on prior events or context. This simplicity lends itself to use cases such as filtering, transformations, and simple ETL operations. Stateless tasks are:

  • Efficient: They don’t require state management, reducing overhead.
  • Scalable: Easily parallelized since there is no dependency between events.
  • Minimalistic: Often achievable with simpler, lightweight frameworks like Kafka Streams or Kafka Connect’s Single Message Transforms (SMT).

For example, filtering transactions above a certain amount or transforming event formats for downstream systems are classic stateless tasks that demand minimal computational complexity.
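The first of these tasks, for instance, fits into a few lines with a lightweight library like Kafka Streams. The sketch below uses hypothetical topic names and omits the configuration boilerplate.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

public class LargePaymentFilter {

    /** Stateless topology: each payment is evaluated on its own, no state store involved. */
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("payments")                 // value: amount as plain text (illustrative)
               .filter((txId, amount) -> Double.parseDouble(amount) > 10_000.0)
               .to("large-payments");
        return builder.build();
    }
}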

In these scenarios, deploying a robust and feature-rich framework like open-source Apache Flink might seem excessive. Flink’s rich API and state management features are unnecessary for such straightforward use cases. Instead, tools with smaller footprints and simpler deployment models, such as Kafka Streams, often suffice.

Apache Flink is a powerhouse. It’s designed for advanced analytics, stateful processing, and complex event patterns. But this sophistication of the open source framework comes with complexity:

  1. Operational Overhead: Setting up and maintaining Flink in an open-source environment can require significant infrastructure and expertise.
  2. Resource Intensity: Flink’s distributed architecture and stateful processing capabilities are resource-hungry, often overkill for tasks that don’t require stateful processing.
  3. Complexity in Development: The Flink API is robust but comes with a steeper learning curve. The combination with Kafka (or another streaming engine) requires understanding two frameworks. In contrast, Kafka Streams is Kafka-native, offering a single, unified framework for stream processing, which can reduce complexity for basic tasks.

For organizations that need to perform straightforward stateless operations, investing in the full Flink stack can feel like using a sledgehammer to crack a nut. Having said this, Flink SQL simplifies development for certain personas, providing a more accessible interface beyond just Java and Python.

The conversation shifts dramatically with serverless Flink cloud offerings, such as Confluent Cloud, which address many of the challenges associated with running open-source Flink. Let’s unpack what makes serverless Flink a more attractive choice, even for simpler use cases.

1. Serverless Architecture

With a Serverless stream processing service, Flink operates on a fully serverless model, eliminating the need for heavy infrastructure management. This means:

  • No Setup Hassles: Developers focus purely on application logic, not cluster provisioning or tuning.
  • Elastic Scaling: Resources automatically scale with the workload, ensuring efficient handling of varying traffic patterns without manual intervention. One of the biggest challenges of self-managing Flink is over-provisioning resources to handle anticipated peak loads.  Elastic Scaling mitigates this inefficiency.

2. Multi-Tenancy

Multi-tenant design allows multiple applications, teams or organizations to share the same infrastructure securely. This reduces operational costs and complexity compared to managing isolated clusters for each workload.

3. Consumption-Based Pricing

One of the key barriers to adopting Flink for simple tasks is cost. A truly Serverless Flink offering mitigates this with a pay-as-you-go pricing model:

  • You only pay for the resources you use, making it cost-effective for both lightweight and high-throughput workloads.
  • It aligns with the scalability of stateless stream processing, where workloads may spike temporarily and then taper off.

4. Bridging the Gap with No-Code and Low-Code Solutions

The rise of citizen integrators and the demand for low-code/no-code solutions have reshaped how organizations approach data streaming. Less-technical users, such as business analysts or operational teams, often face challenges when trying to engage with technical platforms designed for developers.

Low-code/no-code tools address this by providing intuitive interfaces that allow users to build, deploy, and monitor pipelines without deep programming knowledge. These solutions empower business users to take charge of simple workflows and integrations, significantly reducing time-to-value while minimizing the reliance on technical teams.

For example, capabilities like Flink Actions in Confluent Cloud offer a user-friendly approach to deploying stream processing pipelines without coding. By simplifying the process and making it accessible to non-technical stakeholders, these tools enhance collaboration and ensure faster outcomes without compromising performance or scalability. For instance, you can perform ETL functions such as transformation, deduplication, or field masking:

Confluent Cloud - Apache Flink Action UI for No Code Low Code Streaming ETL Integration
Source: Confluent

Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Products

When choosing between SaaS and PaaS for data streaming, it’s essential to understand the key differences.

SaaS solutions, like Confluent Cloud, offer a fully managed, serverless experience with automatic scaling, low operational overhead, and pay-as-you-go pricing.

In contrast, PaaS requires users to manage infrastructure, configure scaling policies, and handle more operational complexity.

While many products are marketed as “serverless,” not all truly abstract infrastructure or eliminate idle costs—so scrutinize claims carefully.

SaaS is ideal for teams focused on rapid deployment and simplicity, while PaaS suits those needing deep customization and control. Ultimately, SaaS ensures scalability and ease of use, making it a compelling choice for most modern streaming needs. Always dive into the technical details to ensure the platform aligns with your goals. Don’t trust the marketing slogans of the vendors!

Stateless vs. Stateful Stream Processing: Blurring the Lines

Even if your current use case is stateless, it’s worth considering the potential for future needs. Stateless pipelines often evolve into more complex systems as businesses grow, requiring features like:

  • State Management: For event correlation and pattern detection.
  • Windows and Aggregations: To derive insights from time-series data.
  • Joins: To enrich data streams with contextual information.
  • Integrating Multiple Data Sources: To seamlessly combine information from various streams for a comprehensive and cohesive analysis.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.
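To make the contrast to stateless processing tangible, here is a minimal sketch of a stateful windowed aggregation, written with Kafka Streams for brevity; the equivalent in Flink would be a keyed window aggregation or a TUMBLE query in Flink SQL. Topic names are hypothetical, and default String serdes are assumed.

import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class ClicksPerUserTopology {

    /** Stateful topology: counts events per user in 5-minute tumbling windows. */
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.<String, String>stream("user-clicks")              // key: userId (illustrative)
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
               .count()                                            // backed by a windowed state store
               .toStream((windowedKey, count) -> windowedKey.key() + "@" + windowedKey.window().start())
               .to("clicks-per-user-5m", Produced.with(Serdes.String(), Serdes.Long()));

        return builder.build();
    }
}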

Stateless and Stateful Stream Processing with Kafka Streams and Apache Flink

With a SaaS Flink service such as Confluent Cloud, you can start small with stateless tasks and seamlessly scale into stateful operations as needed, leveraging Flink’s full capabilities without a complete overhaul.

While Flink may feel like overkill for simple, stateless tasks in its open-source form, its potential is unmatched in these scenarios:

  • Enterprise Workloads: Scalable, reliable, and fault-tolerant systems for mission-critical applications.
  • Data Integration and Preparation (Streaming ETL): Flink enables preprocessing, cleansing, and enriching data at the streaming layer, ensuring high-quality data reaches downstream systems like data lakes and warehouses.
  • Complex Event Processing (CEP): Detecting patterns across events in real time.
  • Advanced Analytics: Stateful stream processing for aggregations, joins, and windowed computations.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless stream processing is often achieved using lightweight tools like Kafka Streams or Single Message Transforms (SMTs) within Kafka Connect. SMTs enable inline transformations, such as normalization, enrichment, or filtering, as events pass through the integration framework. This functionality is available in Kafka Connect (provided by Confluent, IBM/Red Hat, Amazon MSK and others) and tools like Benthos for Redpanda. SMTs are particularly useful for quick adjustments and filtering data before it reaches the Kafka cluster, optimizing resource usage and data flow.

While Kafka Streams and Kafka Connect’s SMTs handle many stateless workloads effectively, Apache Flink offers significant advantages for all types of workloads—whether simple or complex, stateless or stateful.

Stream processing with Flink enables true decoupling within the enterprise architecture (as it is not bound to the Kafka cluster like Kafka Streams and Kafka Connect). The benefits are separation of concerns with a domain-driven design (DDD) and improved data governance. And Flink provides interfaces for Java, Python, and SQL. Something for (almost) everyone. This makes Flink ideal for ensuring clean, modular architectures and easier scalability.

Stream Processing and ETL with Apache Kafka Streams Connect SMT and Flink

By processing events from diverse sources and preparing them for downstream consumption, Flink supports both lightweight and comprehensive workflows while aligning with domain boundaries and governance requirements. This brings us to the shift left architecture.

The Shift Left Architecture

No matter what specific use cases you have in mind: The Shift Left Architecture brings data processing upstream with real-time stream processing, transforming raw data into high-quality data products early in the pipeline.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Apache Flink plays a key role as part of a complete data streaming platform by enabling advanced streaming ETL, data curation, and on-the-fly transformations, ensuring consistent, reliable, and ready-to-use data for both operational and analytical workloads, while reducing costs and accelerating time-to-market.

Shift Left Architecture with Apache Kafka, Flink and Iceberg

The decision to use Flink boils down to your use case, expertise, and growth trajectory:

  • For basic stateless tasks, consider lightweight options like Kafka Streams or SMTs within Kafka Connect unless you’re already invested in a SaaS such as Confluent Cloud where Flink is also the appropriate choice for simple ETL processes.
  • For evolving workloads or scenarios requiring scalability and advanced analytics, a Flink SaaS offers unparalleled flexibility and ease of use.
  • For on-premise or edge deployments, Flink’s flexibility makes it an excellent choice for environments where data processing must occur locally due to latency, security, or compliance requirements.

Understanding the deployment environment—cloud, on-premise, or edge— and the capabilities of the Flink product is crucial to choosing the right streaming technology. Flink’s adaptability ensures it can serve diverse needs across these contexts.

Kafka Streams is another excellent, Kafka-native stream processing alternative. Most importantly for this discussion, Kafka Streams is “just” a lightweight Java library, not a server infrastructure like Flink. Hence, it brings different trade-offs with it. I wrote a dedicated article about the trade-offs between Apache Flink and Kafka Streams for stream processing.

In its open-source form, Flink can seem excessive for simple, stateless tasks. However, a serverless Flink SaaS like Confluent Cloud changes the equation. Multi-tenancy and pay-as-you-go pricing make it suitable for a wider range of use cases, from basic ETL to advanced analytics. Serverless features like Confluent’s Flink Actions further reduce complexity, allowing non-technical users to harness the power of stream processing without coding.

Whether you’re just beginning your journey into stream processing or scaling up for enterprise-grade applications, Flink—as part of a complete data streaming platform such as Confluent Cloud—is a future-proof investment that adapts to your needs.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation.

When NOT to Use Apache Kafka? (Lightboard Video)
https://www.kai-waehner.de/blog/2024/03/26/when-not-to-use-apache-kafka-lightboard-video/
Tue, 26 Mar 2024 06:45:11 +0000

Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post contains a lightboard video that gives you a twenty-minute explanation of the DOs and DONTs.

When NOT to Use Apache Kafka?

Disclaimer: This blog post shares a lightboard video explaining when NOT to use Apache Kafka. For a much more detailed and technical blog post with various use cases and case studies, check out this blog post from 2022 (which is still valid today – whenever you read it).

What is Apache Kafka, and what is it NOT?

Kafka is often misunderstood. For instance, I still hear way too often that Kafka is a message queue. Part of the reason is that some vendors only pitch it for a specific problem (such as data ingestion into a data lake or data warehouse) to sell their products. So, in short:

Kafka is…

  • a scalable real-time messaging platform to process millions of messages per second.
  • a data streaming platform for massive volumes of big data analytics and small volumes of transactional data processing.
  • a distributed storage layer that provides true decoupling for backpressure handling, support of various communication protocols, and replayability of events with guaranteed ordering.
  • a data integration framework (Kafka Connect) for streaming ETL.
  • a data processing framework (Kafka Streams) for continuous stateless or stateful stream processing.

This combination of characteristics in a single platform makes Kafka unique (and successful).

Kafka is NOT…

  • a proxy for millions of clients (like mobile apps) – but Kafka-native proxies (like REST or MQTT) exist for some use cases.
  • an API Management platform – but these tools are usually complementary and used for the creation, life cycle management, or the monetization of Kafka APIs.
  • a database for complex queries and batch analytics workloads – but good enough for transactional queries and relatively simple aggregations (especially with ksqlDB).
  • an IoT platform with features such as device management  – but direct Kafka-native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the appropriate approach for (some) use cases.
  • a technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different kind of software!

For these reasons, Kafka is complementary, not competitive, to these other technologies. Choose the right tool for the job and combine them!

Lightboard Video: When NOT to use Apache Kafka

The following video explores the key concepts of Apache Kafka. Afterwards, the DOs and DONTs of Kafka show how to complement data streaming with other technologies for analytics, APIs, IoT, and other scenarios.

Data Streaming Vendors and Cloud Services

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations.

Plenty of vendors offer Kafka platforms and cloud services. Many complementary open-source stream processing frameworks like Apache Flink and related cloud offerings emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to gain market share leveraging the Kafka protocol. Check out the data streaming landscape of 2024 for a summary of existing solutions and market trends. The end of the article gives an outlook on potential new entrants in 2025.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

Apache Kafka is a Data Streaming Platform: Combine it with other Platforms when needed!

In the meantime, over 150,000 organizations use Apache Kafka. The Kafka protocol is the de facto standard for many open-source frameworks, commercial products, and serverless cloud SaaS offerings.

However, Kafka is not an all-rounder for every use case. Many projects combine Kafka with other technologies, such as databases, data lakes, data warehouses, IoT platforms, and so on. Additionally, Apache Flink is becoming the de facto standard for stream processing (but Kafka Streams is not going away and is the better choice for specific use cases).

Where do you (not) use Apache Kafka? What other technologies do you combine Kafka with? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Streaming ETL with Apache Kafka in the Healthcare Industry
https://www.kai-waehner.de/blog/2022/04/01/streaming-etl-with-apache-kafka-healthcare-pharma-industry/
Fri, 01 Apr 2022 05:47:00 +0000

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part three: Streaming ETL. Examples include Babylon Health and Bayer.

Streaming ETL with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains. Use cases often overlap. I tried to categorize a few real-world deployments into different technical scenarios and added a few real-world examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Streaming ETL with Apache Kafka

Streaming ETL is similar to concepts you might know from traditional ETL tools. I have already explored how data streaming with Kafka differs from data integration tools and iPaaS cloud services. The critical difference is that you leverage a single platform for data integration and processing at scale in real-time. There is no need to combine several platforms to achieve this. The result is a Kappa architecture that enables real-time but also batch workloads with a single integration architecture.

Streaming ETL with Apache Kafka Streams Connect ksqlDB

Streaming ETL with Kafka combines different components and features:

  • Kafka Connect as Kafka-native integration framework
  • Kafka Connect source and sink connectors to consume and produce data from/to any other database, application, or API
  • Single Message Transform (SMT) – an optional Kafka Connect feature – to process (filter, change, remove, etc.) incoming or outgoing messages within the connector deployment
  • Kafka Streams or ksqlDB for continuous data processing in real-time at scale for stateless or stateful ETL jobs
  • Data governance via schema management, enforcement and versioning using the Schema Registry
  • Security and access control using features like role-based access control, audit logs, and end-to-end encryption
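As a small illustration of the stateless ETL part of this list, the following Kafka Streams sketch drops empty events and masks a PII field before forwarding curated data downstream. Topic names, the JSON field, and the regex-based masking are illustrative assumptions only; in practice, schema-aware serdes and the Schema Registry would be used.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class PatientEventMaskingTopology {

    public StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topic with JSON patient events; field names are illustrative.
        KStream<String, String> raw = builder.stream("patient-events-raw");

        raw.filter((key, json) -> json != null && !json.isBlank())                                  // drop empty events
           .mapValues(json -> json.replaceAll("\"ssn\"\\s*:\\s*\"[^\"]*\"", "\"ssn\":\"***\""))      // mask PII field
           .to("patient-events-curated");

        return builder;
    }
}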

In the cloud, you can leverage a serverless Kafka offering for the whole Streaming ETL pipeline. Confluent Cloud fully manages Kafka’s end-to-end infrastructure, including connectors, ksqlDB workloads, data governance, and security.

One last general note: Don’t Design for Data at Rest to Reverse it! Learn more here: “When to Use Reverse ETL and when it is an Anti-Pattern“. Instead, use real-time Streaming ETL for Data in Motion and the Kappa architecture from scratch.

Let’s look at a few real-world deployments in the healthcare sector.

Babylon Health – PII and GDPR compliant Security

Babylon Health is a digital-first health service provider and value-based care company that combines an artificial intelligence-powered platform with virtual clinical operations for patients. Patients are connected with health care professionals through its web and mobile application.

Babylon’s mission is to put an accessible and affordable health service in the hands of every person on earth. For that mission, Babylon built an agile microservice architecture with the Kafka ecosystem:

Kafka for Streaming ETL at Babylon Health

Here are the “wonders of working” in Healthcare for Babylon (= reasons to choose Kafka):

  • Real-time data processing
  • Replayability of historical information
  • Order matters and is ensured with guaranteed ordering
  • GDPR and data ownership for PII compliant security
  • Data governance via the schema registry to provide true decoupling and access via many programming languages like Java, Python, and Ruby

Bayer – Data Integration and Processing in R&D

Bayer AG is a German multinational pharmaceutical and life sciences company and one of the largest pharmaceutical companies in the world. They leverage Kafka in various use cases and business domains.

The following scenario is from the research and development department of the pharma business unit. Their focus areas are cardiovascular diseases, oncology, and women’s health. The division employs over 7,500 R&D people and spends over 2.75 billion euros on R&D.

The use case Bayer presented at a recent Kafka Summit is about analyzing clinical trials, patents, reports, news, and literature leveraging the Kafka ecosystem. The R&D team processes 250 Million documents from 30+ individual data sources. The data includes 7 TB of raw text-rich data with daily updates, additions, and deletions. Algorithms and data evolve. Bayer needs to completely reprocess the data regularly. Various document streams with different formats and schemas flow through several text processing and enrichment steps.

Research and Development from Molecules to Medicine at Bayer

Scalable, reliable Kafka pipelines with Kafka Streams (Java) and Faust (Python) replaced custom, error-prone, non-scalable scripts. Schemas are used as the data interface to ensure data governance. Avro is the first-class citizen data format to enable compression and better throughput.

Streaming ETL Pipeline with Apache Kafka at Bayer

The true decoupling of Kafka in conjunction with the Schema Registry guarantees interoperability among different components and technologies (java, python, commercial tools, open-source, scientific, proprietary).

Streaming ETL with Kafka for Real-Time Data Integration at any Scale

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka Ecosystem for Streaming ETL. Real-world deployments from Babylon Health and Bayer showed how enterprises successfully deploy Kafka for different enterprise architecture use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

When to use Apache Camel vs. Apache Kafka?
https://www.kai-waehner.de/blog/2022/01/28/when-to-use-apache-camel-vs-apache-kafka-for-etl-application-integration-event-streaming/
Fri, 28 Jan 2022 06:31:02 +0000

Should I use Apache Camel or Apache Kafka for my next integration project? The question is very valid and comes up regularly. This blog post explores both open-source frameworks and explains the difference between application integration and event streaming. The comparison discusses when to use Kafka or Camel, when to combine them, when not to use them at all. A decision tree shows how you can quickly qualify out one for the other.

Apache Camel vs Apache Kafka Comparison

 

The history of application integration and event streaming

My personal history and experience in application integration and event streaming are the following. It shows my background and how I see the integration and data streaming markets.

A discussion that started over a decade ago…

With my background of work in the last decade at Talend, TIBCO, and Confluent, the comparison between Camel and Kafka is very exciting as I have spent a lot of time with both open-source frameworks:

Apache Camel powered Talend ESB. Talend had a visual coding tool to design Camel routes with code generation. Unfortunately, the tool’s primary focus was Talend Data Integration (ETL and batch). The Camel-powered ESB code was integrated, but it was neither perfect nor complete.

TIBCO BusinessWorks competed with Talend ESB while TIBCO StreamBase competed with other stream processing solutions. The Kafka ecosystem came up more and more in conversations with customers.

CamelOne Kai Waehner Conference Speaker Apache Camel Open Source

I posted about “When to use Apache Camel” in 2011 already. In 2012, I did my first talk at an international software conference in the US. The name of the conference? CamelOne! A forum only about Apache Camel. What an exciting time. Claus Ibsen, THE Camel guy, wrote an excellent summary of CamelOne 2012 in Boston.

In my conference summary, I talked about my two talks. One of them covered a comparison between Apache Camel, Spring Integration, and Mulesoft ESB. The presentation has over 35,000 views, and the number still goes up today.

 

… from application integration to event streaming

Over time, the buzzword “big data” came up more and more. I spent some time at Talend and TIBCO to learn new programming concepts such as Map-Reduce and Shuffling, mainly powered by Apache Hadoop and Apache Spark. The big data ecosystem snowballed with tens of frameworks such as Hive, HBase, Pig, and many more.

However, people soon realized that real-time data beats slow data in almost all use cases. The Lambda architecture was invented to separate real-time workloads from batch workloads. Event Streaming was born. Apache Kafka became the de facto standard for data streaming. Like CamelOne a decade ago, Kafka Summit is the one-stop show for Kafka use cases, architectures, and success stories. Contrary to the small CamelOne, Kafka Summit is a global series with events across the globe, plus online events.

Data in motion with the Kappa architecture replacing Lambda

In 2014, a guy called Jay Kreps (few people knew him) was already questioning the Lambda architecture. Instead, he proposed a single real-time layer to provide data for real-time and batch consumers. The Kappa architecture was born. Today, the Kappa architecture is mainstream, replacing Lambda. Various vendors have adopted Kafka in the meantime.

Kappa Architecture with one Pipeline for Real Time and Batch

Confluent became the clear leader in the event streaming software category. Confluent Platform is powered by Apache Kafka. The focus is on event streaming. That’s different from most other vendors like Cloudera; they focus on 10-20 frameworks or products and try to combine and integrate them somehow. Today, Confluent Cloud is a complete game-changer providing Apache Kafka and its ecosystem for application integration and stream processing as a serverless cloud offering.

This is where we are today in 2022. Application integration (= Camel) and event streaming (= Kafka) play a critical role in every modern enterprise architecture. Open-source is widely adopted and usually preferred compared to proprietary solutions for various reasons, including avoiding vendor lock-in. That’s true for self-managed and serverless cloud offerings.

Hence, the question arises: Should I use Apache Camel for application integration or Apache Kafka for event streaming? Or both? Or does one solve the other, too? These questions will be answered in the following sections, concluding with a decision tree to help you make the right choice for your project.

Let’s look at the similarities between Camel and Kafka, when to use which framework, when and how to combine them, and when not to use them at all.

Features in Apache Camel AND Apache Kafka

Camel and Kafka have many positive and negative characteristics in common. Hence, it is no surprise that people compare the two frameworks:

  • Open source under Apache 2.0 license
  • Vibrant community and adoption in the industry
  • Mature framework with deployments in enterprises across the globe
  • Fixing point-to-point spaghetti architectures with a central integration backbone
  • Open architecture and extensibility with custom functions and connectors
  • Small and big deployments possible, plus single-node deployments for non-mission-critical use cases
  • Re-engineered and optimized for cloud-native deployments (container, Kubernetes, cloud)
  • Connectivity to any technology, API, communication paradigm, and SaaS
  • Transformation of any data types and formats
  • Processes transactional and analytical workloads
  • Domain-specific language (DSL) for message-at-a-time processing, with similar logic such as aggregation, filtering, conditional processing
  • Relatively complex frameworks because of their robust feature sets, hence not suitable for solving a minor problem
  • Not a replacement of a database, data warehouse, or data lake

Beyond the similarities, Kafka and Camel have very different sweet spots built to solve distinct problems. Hence, comparing these two tools is a bit of an apples-and-oranges comparison. Some minor projects might use one or the other to solve the problem, but critical enterprise projects show the differences more quickly.

When to use Apache Camel?

The mission of Camel

Apache Camel is an integration framework. It solves a particular problem: Data integration between different applications, APIs, protocols, and communication paradigms. This concept is often called application integration or enterprise integration. Camel implements the famous Enterprise Integration Patterns (EIP). EIPs are based on messaging principles.
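To give a feel for this programming model, here is a minimal route sketch implementing the classic content-based router EIP in Camel's Java DSL. The endpoints (a file folder and two JMS queues) and the XPath condition are hypothetical; the required Camel components (camel-file, camel-jms, camel-xpath) and a message broker are assumed to be configured.

import org.apache.camel.builder.RouteBuilder;

/** Content-based routing: orders from gold customers go to a priority queue. */
public class OrderRoutingRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("file:orders/incoming")
            .choice()
                .when(xpath("/order/@customerType = 'gold'"))
                    .to("jms:queue:priority-orders")
                .otherwise()
                    .to("jms:queue:standard-orders")
            .end();
    }
}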

Camel’s strengths

  • Event-based backbone based on well-known and adopted EIP concepts
  • Connectivity to almost any API
  • Integration, processing, and routing of information with an intuitive domain-specific language (DSL) with a focus on integration; providing composability in a programming context for finer-grained control in code when doing conditional logic or transformations/reformatting
  • Powerful routing capabilities with many built-in EIPs
  • Many deployment options (standalone, web container, application server, Spring, OSGi, Kubernetes via the Camel K sub-project) – okay, I guess some options are not relevant in this decade anymore 🙂
  • Lightweight alternative to proprietary ETL and ESB tools

Camel’s weaknesses

  • Only a “routing machine”, i.e., not built for long-term storage (an additional cache or storage is needed); for that reason, Camel is not the right choice for a central nervous system like Kafka
  • No stream processing (like you know it from Kafka Streams or Apache Flink)
  • Limited scalability, not built for massive volumes of data
  • No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
  • No serverless cloud offering, and therefore not competing with other iPaaS offerings
  • Red Hat is the only vendor supporting it
  • Built to be deployed in a single data center or cloud region, not across hybrid or multi-cloud scenarios

The evolution of Apache Camel

Camel is widely adopted and has a strong community. Unfortunately, from a vendor and support perspective, the offerings declined in the last few years. One of the most significant pain points: I still don’t see a serverless cloud offering anywhere today:

The Evolution of Apache Camel 2

Camel TL;DR

Camel is an application integration framework to connect different applications and interfaces. Camel is NOT built for processing data in motion continuously, i.e., stream processing. Hence, it should be compared to ETL and ESB tools, not data streaming technologies like Kafka, Kinesis, or Flink. If you look for a serverless cloud offering, you are out of luck. If you look for vendor support, Red Hat is the only option.

When to use Apache Kafka?

The mission of Kafka

Real-time data beats slow data at any scale. The event streaming platform enables processing data in motion. Kafka is the de facto standard for event streaming, including messaging, data integration, stream processing, and storage. Kafka provides all capabilities in one infrastructure at scale. It is reliable and can process analytical and transactional workloads.

Kafka’s strengths

  • Event-based streaming platform
  • A unique combination of pub/sub messaging, data processing, data integration, and storage in a single framework
  • Built for massive volumes of data and extreme scale from the beginning, so a single framework can be used for transactional (low-volume) and analytics (high-volume) use cases
  • True decoupling between producers and consumers because of its storage component makes it the de facto standard for microservice architectures
  • Guaranteed ordering of events in the distributed commit log
  • Distributed data processing with fault-tolerance and recoverability built-in
  • Replayability of events
  • The de facto standard for event streaming
  • Built with hybrid and multi-cloud data replication in mind (with included tools like MirrorMaker and separate, more advanced, and more straightforward tools like Confluent Cluster Linking)
  • Support from many vendors, including Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and many more
  • Paradigm shift: Built to process data in motion end-to-end from source to one or more sinks

Kafka’s weaknesses

  • Paradigm shift: Enterprises need to learn and understand the added value of event streaming, a new software category that enables new use cases but also requires different design patterns and operations approaches
  • No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
  • Limited out-of-the-box routing capabilities (Kafka Connect SMT or a Kafka Streams / ksqlDB app do the job very well, but not as simply as Camel)
  • Complex operations (if you run it yourself instead of using third-party tools or, even better, a serverless cloud offering)

The evolution of Apache Kafka

Kafka was built at LinkedIn to process high volumes of data, as no other open-source framework could do this. Kafka found quick adoption after LinkedIn open-sourced it. Several vendors adopted Kafka and added it to their product portfolio. Some vendors just added Kafka for the sake of having it. Others innovated and used additional tools to make Kafka cloud-native for the next generation of event streaming. Kafka as a serverless cloud offering is a critical piece of many modern enterprise architectures today:

The Evolution of Apache Kafka 2

Kafka TL;DR

Kafka is an event streaming platform to process data in motion continuously. If you “just” need an integration framework to route data from a source to one or more sinks (= ETL / ESB), then Camel can be used, too. However, Kafka kills two birds with one stone (= integrating data AND processing it in motion where needed).

Plenty of Kafka offerings are available on the market. Check out the Apache Kafka landscape and comparison to understand the differences between offerings from Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and others.

Decision tree – Camel or Kafka?

The above sections explored when to use Camel and Kafka. So far, so good. Nevertheless, both frameworks overlap with their capabilities. Let’s get some help to choose the right one in that case.

Qualify out – the easiest way to start an evaluation!

The easiest way to decide on a specific option is to qualify out other frameworks that cannot fulfill the requirements.

Therefore, do you need

  • Big data processing?
  • A storage component for true decoupling and replayability of events?
  • Stateless or stateful stream processing?
  • A serverless cloud offering?

The above section discussed these differentiators of Kafka. In all these cases, you can qualify out Camel. It does not fulfill these requirements. These requirements are not necessarily a complete list. And you might also find a few aspects to qualify out Kafka from the beginning. Hence, you could also start from the Camel perspective and ask yourself: when should I not use Kafka? But I think it is easier the other way round.

Qualifying out solutions because of their limitations makes the decision tree and evaluation process much easier from the beginning.

Decision Tree for Camel and Kafka

Here is my decision tree to find out if Camel or Kafka is the right choice and what vendors you could evaluate:

Decision Tree Apache Camel vs Apache Kafka Comparison

When to use Camel and Kafka together?

It is possible to use Camel and Kafka together in a single integration architecture. Should you do that? Two options exist. One makes more sense than the other:

Kafka for event streaming and Camel for ETL

Camel and Kafka integrate well with each other. The native Kafka component of Camel is the best native integration point as a bridge between both environments:

Apache Camel and Apache Kafka in the Enterprise Architecture

The above architecture shows how Camel and Kafka live next to each other. Camel is used in a business domain for application integration. Kafka is the central nervous system between the Camel integration application and many other applications. I also added Kong as API Gateway to clarify that neither Camel nor Kafka is a silver bullet to solve every problem.
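
As a simple illustration of the bridge, here is a minimal Camel route (Java DSL) that picks up files in a business domain and publishes them to a Kafka topic via the camel-kafka component. Directory, topic name, and broker address are placeholders, and a real deployment would typically run the route in Spring Boot, Quarkus, or Karaf with the relevant Camel components on the classpath rather than a plain main method:

```java
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class CamelToKafkaBridge {

    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();

        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Application integration inside the business domain:
                // pick up files and publish each one as an event to Kafka
                from("file:orders/in?noop=true")
                    .routeId("orders-to-kafka")
                    .log("Forwarding ${header.CamelFileName} to Kafka")
                    .to("kafka:orders?brokers=localhost:9092");
            }
        });

        context.start();
        Thread.sleep(60_000); // keep the route running for a while in this sketch
        context.stop();
    }
}
```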

Once again, the vast advantage of Kafka as central integration layer is its unique combination of characteristics within a single infrastructure, including:

  • Real-time messaging at any scale
  • Storage for true decoupling between different applications and communication paradigms
  • Built-in backpressure handling and replayability of events
  • Data integration
  • Stream processing

Real-time data replication across hybrid and multi-cloud is not shown in the above picture but is also part of the enterprise architecture out-of-the-box, leveraging the Kafka protocol.

With true decoupling within a modern microservices architecture, each business team can decide whether it needs application integration (using Camel) or event streaming (using Kafka). Often, both could be used. Additional questions around single vs. multiple frameworks and APIs, vendor support, scalability needs, and other characteristics need to be evaluated to make the right choice for your business problem.

Camel connectors embedded into Kafka Connect

There is another way to combine Kafka and Camel: The “Camel Kafka Connector” sub-project of Apache Camel. Don’t get confused. This feature is not the Kafka component (= connector) of Camel! Instead, it is a relatively new initiative to deploy Camel components into the Kafka Connect infrastructure.

The obvious benefit: This way, you get hundreds of new connectors “for free” within the Kafka ecosystem. This capability sounds excellent. And it is!

However, consider the total cost of ownership and the overall efforts using this approach. Application integration is one of the most challenging problems in computer science – especially if you talk about transactional data sets that require zero data loss, exactly-once semantics, and no downtime. The more components you combine in the end-to-end data flow, the harder it gets to keep your performance and reliability SLAs.

Hence, using Camel components within Kafka Connect has a considerable disadvantage: Combining two frameworks with complexities and different design concepts. Just a few examples:

  • Kafka world: Partitions, Offsets, Leader and Follower, Key/Value/Header, connectors (based on Kafka Connect), Bootstrap Server, ConsumerRecord, Retention Time, etc.
  • Camel world: Routes, RouteBuilder, CamelContext, Exchange, Processor, components (Camel connectors), Endpoints, Type Converters, Registry, etc.

Please think twice before mixing two integration tools that are powerful but complex on their own. Getting this running is just one piece of the puzzle (the simple part). Don’t forget end-to-end testing, resiliency, SLAs, support across technologies and APIs. Even buying support for Camel and Kafka from Red Hat (i.e., a single vendor) does not improve this approach.

It is likely better to take the business logic and API calls out of the Camel component and copy it into a Kafka Connect connector template to run the integration natively with only Kafka code. This workaround allows a clean architecture, end-to-end integration with a single framework, a single vendor behind it, and much easier testing / debugging / monitoring.
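
To show what that means in practice, here is a heavily simplified, hypothetical skeleton of a native Kafka Connect source connector (class names, config keys, and the payload are made up). The Connect framework takes care of offsets, scaling, retries, and delivery, while the task only encapsulates the source-specific logic:

```java
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical skeleton of a native Kafka Connect source connector. */
public class MySystemSourceConnector extends SourceConnector {

    public static final ConfigDef CONFIG_DEF = new ConfigDef()
        .define("endpoint.url", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "Source system endpoint")
        .define("target.topic", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "Kafka topic to write to");

    private Map<String, String> config;

    @Override public void start(Map<String, String> props) { this.config = props; }
    @Override public Class<? extends Task> taskClass() { return MySystemSourceTask.class; }
    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        return Collections.singletonList(config); // this simple sketch only supports one task
    }
    @Override public void stop() { }
    @Override public ConfigDef config() { return CONFIG_DEF; }
    @Override public String version() { return "1.0"; }

    /** The task does the actual work: poll the source system and hand records to the framework. */
    public static class MySystemSourceTask extends SourceTask {

        private String topic;
        private String endpoint;

        @Override
        public void start(Map<String, String> props) {
            this.topic = props.get("target.topic");
            this.endpoint = props.get("endpoint.url");
            // open clients/connections to the external system here
        }

        @Override
        public List<SourceRecord> poll() throws InterruptedException {
            // In a real connector: fetch the next batch of changes from the source system.
            // Kafka Connect handles offsets, retries, scaling, and delivery to the topic.
            Thread.sleep(1000); // avoid busy-looping in this sketch
            String payload = "{\"source\":\"" + endpoint + "\",\"event\":\"example\"}";
            SourceRecord record = new SourceRecord(
                Collections.singletonMap("source", endpoint),  // source partition
                Collections.singletonMap("position", 0L),      // source offset
                topic, Schema.STRING_SCHEMA, payload);
            return Collections.singletonList(record);
        }

        @Override public void stop() { /* close clients */ }
        @Override public String version() { return "1.0"; }
    }
}
```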

TL;DR: I recommend only using the “Camel Kafka Connector” sub-project if the following options do not work:

  • Use only Apache Camel for application integration
  • Leverage Apache Kafka for event streaming and application integration
  • Choose separate deployments of Camel and Kafka and use the Camel-Kafka-Bridge

When NOT to use Camel or Kafka at all?

Once again, the easiest way for your evaluation to start is qualifying out tools that do not work to solve the problem.

Both Camel and Kafka are NOT built for the following scenarios:

  • A proxy for millions of clients (like mobile apps) – but native proxies (like a REST or MQTT Proxy for Kafka) exist for some use cases.
  • An API Management platform – but these tools are usually complementary and used for API lifecycle management or to monetize APIs deployed with Camel or Kafka.
  • A database for complex queries and batch analytics workloads
  • An IoT platform with features such as device management – but direct native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the right approach for (some) use cases.
  • A technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different kind of software than Camel or Kafka!

I wrote a very detailed post about this topic from a Kafka perspective. It maps almost 1:1 to the Camel world, too (and any related technology such as Flink, Spark, Pulsar, etc.): “When NOT to use Apache Kafka?”

Apache Camel vs. Apache Kafka – Who is the winner?

Simple answer: Both!

When you compare apples and oranges, both can make you happy when you are hungry, as both are good to eat. The same is true for Camel and Kafka. Both can do application integration. But they serve very different needs.

Many integration scenarios can use Camel or Kafka.

Camel is the right tool if you need to integrate data within an application context or business unit (with no need for stream processing, true decoupling, replayability, large scale, replication across data centers or cloud regions).

Kafka is the central event-based nervous system across business units, regions, and hybrid clouds. Kafka is all about event streaming. Application integration is just a piece of this puzzle. On the other hand, I have seen plenty of integration projects powered by Apache Kafka, often replacing other middleware. That’s true for ETL/ESB legacy modernization and for discussions about using a cloud-native iPaaS.

Do you use Camel or Kafka today? What use cases? How do you decide which one to choose? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to use Apache Camel vs. Apache Kafka? appeared first on Kai Waehner.

Kafka for Cybersecurity (Part 6 of 6) – SIEM / SOAR Modernization https://www.kai-waehner.de/blog/2021/08/09/kafka-cybersecurity-part-6-of-6-siem-soar-modernization/ Mon, 09 Aug 2021 14:07:51 +0000 https://www.kai-waehner.de/?p=3614 This blog series explores use cases and architectures for Apache Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part six: SIEM / SOAR modernization and integration.

The post Kafka for Cybersecurity (Part 6 of 6) – SIEM / SOAR Modernization appeared first on Kai Waehner.

Apache Kafka became the de facto standard for processing data in motion across enterprises and industries. Cybersecurity is a key success factor across all use cases. Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from real-time and batch data sources to detect anomalies and respond to incidents. This blog series explores use cases and architectures for Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part six: SIEM / SOAR Modernization.

SIEM and SOAR Modernization with Apache Kafka Elasticsearch Splunk QRadar Arcsight Cortex

Blog series: Apache Kafka for Cybersecurity

This blog series explores why security features such as RBAC, encryption, and audit logs are only the foundation of a secure event streaming infrastructure. Learn about use cases,  architectures, and reference deployments for Kafka in the cybersecurity space:

What are SIEM and SOAR?

SIEM (Security Information and Event Management) and SOAR (Security Orchestration, Automation, and Response) are terms coined by Gartner (as so often in the industry).

SIEM combines security information management (SIM) and security event management (SEM). SIEM tools provide analysis of security alerts generated by applications and network hardware. Vendors sell SIEM as software, as appliances, or as managed services; these products are also used for logging security data and generating reports for compliance purposes.

SOAR tools automate security incident management investigations via a workflow automation workbook. A cyber intelligence API enables the playbook to automate research related to a ticket (looking up a potential phishing URL, a suspicious hash, etc.). The first responder determines the criticality of the event; at this level, it is either a normal event or an escalation. SOAR includes security incident response platforms (SIRPs), security orchestration and automation (SOA), and threat intelligence platforms (TIPs).

In summary, SIEM and SOAR are key pieces of a modern cybersecurity infrastructure. The capabilities, use cases, and architectures are different for every company.

SIEM and SOAR Vendors

In practice, many products in this area will mix these functions, so there will often be some overlap. Many commercial vendors also promote their own terminology.

The leaders in Gartner’s Magic Quadrant for SIEM 2021 are Exabeam, IBM, Securonix, Splunk, Rapid7, and LogRhythm. Elastic is a niche player for SIEM but very prevalent in Kafka architectures for other use cases.

The Gartner Market Guide for SOAR 2020 includes Anomali, Cyware, D3 Security, DFLabs, EclecticIQ, FireEye, Fortinet (CyberSponse), Honeycomb.

These are obviously not complete lists of SIEM and SOAR vendors. Even more complex: Gartner says, “SIEM vendors are adopting and acquiring/integrating SOAR solutions in their ecosystems”. Now, if you ask another research analyst or the vendors themselves, you will get even more different opinions 🙂

Hence, as always in the software business, do your own evaluation to solve your business problems. Instead of evaluating vendors, you might first identify your pain points and the capabilities that solve them. Capabilities include data aggregation, correlation, dashboards, alerting, compliance, and forensic analysis to implement log analytics, threat detection, and incident management.

If you read some of the analyst reports or vendor websites/whitepapers, it becomes clear that the Kafka ecosystem also has many overlaps regarding capabilities.

The Challenge with SIEM / SOAR Platforms

SIEM and SOAR platforms provide you with various challenges:

  • Proprietary forwarders can only send data to a single tool
  • Data is locked from being shared
  • Difficult to scale with growing data volumes
  • High indexing costs of proprietary tools hinder wide adoption
  • Filtering out noisy data is complex and slows response
  • No one tool can support all security and SIEM / SOAR requirements

The consequence is a complex and expensive spaghetti architecture with different proprietary protocols:

The Challenge with SIEM and SOAR Platforms

The first post of this blog series explored why Kafka helps a central streaming backbone to avoid such a costly and complex spaghetti architecture. But let’s dig deeper into this discussion.

Kafka for SIEM and SOAR Modernization

A modern cybersecurity architecture has the following requirements:

  • Real-time data access to all your security experts
  • Historical and contextual data access for forensic reporting
  • Rapid detection of vulnerabilities and malicious behavior
  • Predictive modeling of security incidents using newer capabilities like ML/AI

Flexible Enterprise Architecture including Kafka, SIEM and SOAR:

Kafka is NOT a SIEM or SOAR. But it enables an open, real-time, and portable data architecture:

  • Ingest diverse, voluminous, and high-velocity data at scale
  • Scalable platform that grows with your data needs
  • Reduce indexing costs and OPEX managing legacy SIEM
  • Enable data portability to any SIEM tool or downstream application

The following diagram shows how the event streaming platform fits into the enterprise architecture together with SIEM/SOAR and other applications:

SIEM and SOAR Modernization with Apache Kafka

Here are the benefits of this approach for an enterprise architecture:

  • Data integration just once, ingest and aggregate events from everywhere: Web / Mobile, SaaS, Applications, Datastores, PoS Systems, loT Sensors, Legacy Apps, and Systems, Machine data
  • Data correlation independent from specific products: Join, enrich, transform, and analyze data in real-time to convert raw events into clean, usable data, avoiding the “garbage in, garbage out” issue of raw data lake architectures
  • Standardized, open, and elastic integration layer: Standardize schemas to ensure data compatibility to all downstream apps
  • Long-term storage: Store and persist events in a highly available and scalable platform for real-time analytics and forensics
  • Integration with one or more SIEM/SOAR tools
  • Additional truly decoupled consumers (real-time, batch, request-response): Choose the right tool or technology for the job with a domain-driven design

Reference Architectures for Kafka-powered SIEM and SOAR Infrastructures

Kafka-powered enterprise architecture is open and flexible—no need to try a big bang change. Business units change solutions based on their current need. Often, a combination of different technologies and migration from legacy to modern tools is the consequence. Streaming ETL as ingestion and pre-processing layer for SIEM/SOAR is one use case for Kafka.

However, ingestion and preprocessing for SIEM/SOAR is a tiny fraction of what Kafka can do with the traffic. Other business applications, ML platforms, BI tools, etc., can also consume the data from the event-based central nervous system. At their own speed. With their own technology. That’s what makes Kafka so unique and successful.

This section shows a few reference architectures and migration scenarios for SIEM and SOAR deployments.

SIEM Modernization: Kafka and Elasticsearch

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine. Elasticsearch is developed in Java and dual-licensed under the source-available Server Side Public License and the Elastic license. Other parts fall under the proprietary (source-available) Elastic License.

The open-source Elasticsearch project is NOT a complete SIEM but often a key building block. That’s why it comes up in SIEM discussions regularly.

Most scalable Elasticsearch architectures leverage Kafka as the ingestion layer. Other tools like Logstash or Beats are okay for small deployments. But Kafka provides various advantages:

  • Scalable: Start small, but scale up without changing the architecture
  • Decoupling: Elastic is one sink, but the integration layer provides the events to every consumer
  • Out-of-the-box integration: The Kafka Connect connector for Elastic is battle-tested and used in hundreds of companies across the globe.
  • Fully managed: In the cloud, the complete integration pipeline including Kafka, Kafka Connect, and Elastic is truly serverless end-to-end.
  • Backpressure handling: Elastic is not built for real-time ingestion but is a so-called slow consumer. Kafka handles the backpressure. Elastic indexes the events at its own speed, no matter how fast the data source produces events. Elastic Data Streams is actually improving this to provide more native streaming ingestion.

The following diagram shows the reference architecture for an end-to-end integration from data sources via Kafka into Elastic:

Apache Kafka and Elasticsearch Integration for SIEM with Kafka Connect and Elastic Connector
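
A sketch of what the sink side of this pipeline can look like, expressed as a plain Java map for readability (in practice, the same key/value pairs are submitted as JSON to the Kafka Connect REST API). The configuration keys follow the Confluent Elasticsearch sink connector; topic name, connection URL, and task count are assumptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ElasticsearchSinkConfig {

    public static void main(String[] args) {
        // Connector configuration, typically submitted as JSON to the Kafka Connect REST API
        Map<String, String> config = new LinkedHashMap<>();
        config.put("connector.class", "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector");
        config.put("tasks.max", "2");
        config.put("topics", "security-events");          // topic(s) to index
        config.put("connection.url", "http://elasticsearch:9200");
        config.put("key.ignore", "true");                 // let Elasticsearch generate document IDs
        config.put("schema.ignore", "true");              // index raw JSON documents

        config.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```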

SIEM Modernization: Kafka and Splunk

Splunk provides proprietary software. It is a leading SIEM player in the market. Splunk makes machine data accessible across an organization by identifying data patterns, providing metrics, diagnosing problems, and providing intelligence for business operations. The biggest complaint I hear about Splunk regularly is the costly licensing model. Also, the core of Splunk (like almost every other SIEM) is based on batch processing.

The combination of Kafka and Splunk reduces costs significantly. Here is the Confluent reference architecture:

Apache Kafka and Splunk Reference Architecture with S2S Forwarders and HEC Indexers

A prominent example of combining Confluent and Splunk is Intel’s Cyber Intelligence Platform (CIP), which I covered in part 3 of this blog series.

The above architecture shows an open, flexible architecture. Splunk provides several integration points. But at its core, Splunk uses the proprietary S2S (“Splunk-to-Splunk”) protocol. All universal forwarders (UFs) send data directly to indexers or heavy forwarders (HFs). The Confluent Splunk S2S Source Connector provides a way to integrate Splunk with Apache Kafka. The connector receives data from Splunk UFs.

This approach allows customers to cost-effectively and reliably read data from Splunk Universal Forwarders into Kafka. It enables users to forward data from universal forwarders into a Kafka topic to unlock the analytical capabilities of the data.

The direct S2S integration is beneficial if a company does not have Kafka out front and the data goes straight to Splunk indexers. To leverage Kafka, the connector “taps” into the Universal Forwarder infrastructure. Often, companies have 10,000s of UFs. If you are in the lucky situation of not having hundreds or thousands of Splunk UFs in place, then the regular Splunk sink connector for Kafka Connect might be sufficient for you.

The data is processed, filtered, and aggregated in real-time at scale with Kafka-native tools such as Kafka Streams or ksqlDB. The processed data is ingested into Splunk and potentially many other real-time or batch consumers that are completely decoupled from Splunk.
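
As a sketch of such preprocessing, the following Kafka Streams application filters out noisy, low-value events before they reach the volume-priced SIEM, while the unfiltered raw stream stays available in Kafka for other consumers. Topic names and filter conditions are illustrative only:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class SplunkPreprocessing {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "siem-preprocessing");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> rawLogs = builder.stream("s2s-raw-events");

        // Filter out noisy, low-value events before they hit the (volume-priced) SIEM,
        // and keep the full raw stream available in Kafka for other consumers.
        rawLogs
            .filter((host, logLine) -> logLine != null && !logLine.contains("DEBUG"))
            .filter((host, logLine) -> logLine.contains("auth") || logLine.contains("firewall"))
            .to("siem-filtered-events"); // consumed by the Splunk sink connector / HEC indexers

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```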

Using the right SIEM and SOAR for the Job

Most customers I talk to don’t use just one tool for solving their cybersecurity challenges. For that reason, Kafka is the perfect backbone for true decoupling and ingestion layer for different SIEM and SOAR tools:

True Decoupling for Multiple SIEM and SOAR Tools with Kafka

Kafka’s commit log stores the incoming data. Each consumer consumes the data as it can. In this example, different SIEM tools consume the same events. Elasticsearch and Splunk consume the same raw data in different near real-time or batch processes. Both are completely independent in how they consume the events.

IBM QRadar cannot process high volumes of data. Hence, ksqlDB continuously preprocesses the raw data and puts it into a new Kafka topic. QRadar consumes the aggregated data.

Obviously, SOAR can consume data similarly. As discussed earlier in this article, the SOAR functionality could also be part of one of the SIEMs. However, then it would (have to) consume the data in real-time to provide true situational awareness.

Legacy SIEM Replacement and Hybrid Cloud Migration

In the field, I see several reasons to migrate workloads away from a deployed SIEM:

  • Very high costs, usually with a throughput-based license model
  • Scalability issues for the growing volumes of data in the enterprise
  • Processing speed (batch) is not sufficient for real-time situational awareness and threat intelligence
  • Migration from on-premise to (multi) cloud across regions and data centers

Groupon published an exciting success story: “We Replaced Splunk at 100TB Scale in 120 Days“:

Splunk SIEM Replacement with Kafka and Elasticsearch at Groupon

The new platform leverages Kafka for high volume processing in real-time, migration, and backpressure handling. Elasticsearch provides reports, analytics, and dashboards.

The article also covers the repeating message that Logstash is not ideal for these kinds of workloads. Hence, the famous ELK stack with Elasticsearch, Logstash, and Kibana in most real-world deployments is actually an EKK stack with Elasticsearch, Kafka, and Kibana.

This story should make you aware that Logstash, FluentD, Sumo Logic, Cribl, and other log analytics platforms are built for exactly this one use case. Kafka-native processing enables the same use case but also many others.

Another key advantage of Kafka is the ability to operate as a resilient, hybrid migration pathway from on-premise to one or multiple clouds. Confluent can be deployed everywhere to coordinate log traffic across multiple data centers and cloud providers. I explored hybrid Kafka architectures in another blog post in detail.

Kafka-native SOAR: Cortex Data Lake from Palo Alto Networks (PANW)

I covered various SIEMs in this post. SOAR is a more modern concept than SIEM. Hence, the awareness and real-world deployments are still limited. I am glad that I found at least one public example I can show you in this post.

Cortex Data Lake is a Kafka-native SOAR that collects, transforms, and integrates enterprise security data at scale in real-time. Billions of messages pass through their Kafka clusters. Confluent Schema Registry enforces data governance. Palo Alto Networks (PANW) has multiple Kafka clusters in production with a size from 10 to just under 100 brokers each. Check out Palo Alto Networks’ engineering blog for more details.
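
To illustrate how Schema Registry enforces governance on such a pipeline, here is a minimal producer sketch using Avro and the Confluent serializer. The schema, topic, and field names are made up for this example; the key point is that every produced event is validated against a registered schema:

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class GovernedSecurityEventProducer {

    private static final String EVENT_SCHEMA =
        "{\"type\":\"record\",\"name\":\"SecurityEvent\",\"fields\":["
      + "{\"name\":\"source\",\"type\":\"string\"},"
      + "{\"name\":\"severity\",\"type\":\"string\"},"
      + "{\"name\":\"message\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        // The Avro serializer registers and validates the schema against Schema Registry on produce
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(EVENT_SCHEMA);
        GenericRecord event = new GenericData.Record(schema);
        event.put("source", "edge-firewall-42");
        event.put("severity", "HIGH");
        event.put("message", "Blocked outbound connection to known bad host");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("security-events", "edge-firewall-42", event));
            producer.flush();
        }
    }
}
```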

Here is the architecture of Cortex Data Lake:

Cortex Data Lake SIEM SOAR from Palo Alto Networks PANW powered by Apache Kafka

PANW’s design principles overlap significantly with the unique characteristics of Apache Kafka:

  • Cloud agnostic infrastructure
  • Massively scalable
  • Aggressive ETA on integrations
  • Schema versioning support
  • Microservices architecture
  • Operational efficiency

If you look at these design principles, it is obvious why the backbone of PANW’s product is Kafka.

Most Enterprises have more than one SIEM/SOAR

SIEM and SOAR are critical for every enterprise’s cybersecurity strategy. One SIEM/SOAR is typically not good enough. It is either not cost-efficient or has scalability/performance issues.

Kafka-native SIEM/SOAR modernization is prevalent across industries. The central event-based backbone enables the integration with different SIEM/SOAR products. Other consumers (like ML platforms, BI tools, business applications) can also access the data. Innovative Kafka integrations like Confluent’s S2S connector enable the modernization of monolithic Splunk deployments and significantly reduce costs. Many next-generation SOARs such as PANW’s Cortex Data Lake are even based on top of Kafka.

Do you use any SIEM / SOAR? How and why do you use or plan to use the Kafka ecosystem together with these tools? What does your (future) architecture look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka for Cybersecurity (Part 6 of 6) – SIEM / SOAR Modernization appeared first on Kai Waehner.

Kafka for Cybersecurity (Part 3 of 6) – Cyber Threat Intelligence https://www.kai-waehner.de/blog/2021/07/15/kafka-cybersecurity-siem-soar-part-3-of-6-cyber-threat-intelligence/ Thu, 15 Jul 2021 06:39:35 +0000 https://www.kai-waehner.de/?p=3555 This blog series explores use cases and architectures for Apache Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part three: Cyber Threat Intelligence.

The post Kafka for Cybersecurity (Part 3 of 6) – Cyber Threat Intelligence appeared first on Kai Waehner.

Apache Kafka became the de facto standard for processing data in motion across enterprises and industries. Cybersecurity is a key success factor across all use cases. Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to detect anomalies and respond to incidents. This blog series explores use cases and architectures for Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part three: Cyber Threat Intelligence.

Cyber Threat Intelligence with Apache Kafka and SIEM SOAR Machine Learning

Blog series: Apache Kafka for Cybersecurity

This blog series explores why security features such as RBAC, encryption, and audit logs are only the foundation of a secure event streaming infrastructure. Learn about use cases,  architectures, and reference deployments for Kafka in the cybersecurity space:

Subscribe to my newsletter to get updates immediately after the publication. Besides, I will also update the above list with direct links to this blog series’ posts as soon as they are published.

Cyber Threat Intelligence

Threat intelligence, or cyber threat intelligence, reduces harm by improving decision-making before, during, and after cybersecurity incidents, reducing operational mean time to recovery, and reducing adversary dwell time in information technology environments.

Threat intelligence is evidence-based knowledge, including context, mechanisms, indicators, implications, and action-oriented advice about an existing or emerging menace or hazard to assets. This intelligence can be used to inform decisions regarding the subject’s response to that menace or hazard.

Threat intelligence solutions gather raw data about emerging or existing threat actors & threats from various sources. This data is then analyzed and filtered to produce threat intel feeds and management reports that contain information that automated security control solutions can use.

Threat intelligence keeps organizations informed of the risks of advanced persistent threats, zero-day threats and exploits, and how to protect against them.

Situational Awareness is Not Enough…

… but the foundation to collect and pre-process data in real-time at scale. Only real-time situational awareness enables real-time threat intelligence to provide huge benefits to the enterprise:

  • Mitigate harmful events in cyberspace
  • Proactive cybersecurity posture that is predictive, not just reactive
  • Bolster overall risk management policies
  • Improved detection of threats
  • Better decision-making during and following the detection of a cyber intrusion

In summary, threat intelligence allows to:

  • See the whole board. And see it more quickly.
  • See around corners.
  • See the enemy before they see you.

Threat Intelligence for Prevention or Mitigation across the Cyber Kill Chain

Threat intelligence is the knowledge that allows you to prevent or mitigate cyberattacks. It covers all the phases of the so-called “Cyber Kill Chain“:

Intrusion Kill Chain for InfoSec

Threat intelligence provides several benefits:

  • Empowers organizations to develop a proactive cybersecurity posture and to bolster overall risk management policies
  • Drives momentum toward a cybersecurity posture that is predictive, not just reactive
  • Enables improved detection of threats
  • Informs better decision-making during and following the detection of a cyber intrusion

Transactional Data vs. Analytics Data

Most use cases around data-in-motion are about all the data. This is true for all transactional use cases and even for many analytical use cases. Each event is valuable: A sale, an order, a payment, an alert from a machine, etc.

However, data is often full of noise. As I discussed earlier in this blog series, the goal in the cybersecurity space is to find the needle in the haystack and to reduce false-positive alerts.

SIEM, SOAR, OT, and ICS are almost always analytic processing regimes, BUT knowing when they are not is important. Kafka topics can be configured and tuned for either transactional or analytical workloads. That is unprecedented in the history of data processing. Threat intelligence (= awareness-in-motion) assumes the PATTERN is valuable, not the data.
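
As a small illustration of this tuning, the following AdminClient sketch creates one topic configured for transactional, zero-data-loss workloads and one for high-volume analytical telemetry. Topic names and exact settings are assumptions and need to be adapted to your SLAs:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Arrays;
import java.util.Map;
import java.util.Properties;

public class TopicTuning {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // "Transactional" topic: every single event matters, so keep it durable and strictly acknowledged
            NewTopic payments = new NewTopic("payments", 6, (short) 3)
                .configs(Map.of(
                    "min.insync.replicas", "2",
                    "retention.ms", "-1")); // retain forever (or until archived elsewhere)

            // "Analytical" topic: high-volume telemetry where the pattern matters more than each single record
            NewTopic telemetry = new NewTopic("network-telemetry", 24, (short) 3)
                .configs(Map.of(
                    "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000), // keep 7 days
                    "compression.type", "lz4"));

            admin.createTopics(Arrays.asList(payments, telemetry)).all().get();
        }
    }
}
```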

Analytics in Motion powered by Kafka Streams / ksqlDB

As you can hopefully imagine from the above requirements and characteristics, event streaming with Apache Kafka and its streaming analytics ecosystem is a perfect fit for the technical infrastructure for threat intelligence.

Threat detection makes sense of the signal and the noise of the data by continuously processing signatures. This enables teams to detect, contain, and neutralize threats proactively:

Threat Intelligence with Kafka Streams ksqlDB and Machine Learning

Analytics can be many things in such a scenario: simple business rules, external rules engines, or trained machine learning models.

On a high level, the advantages of using Kafka Streams or ksqlDB for threat intelligence can be described as follows:

  • A single scalable and reliable real-time infrastructure for end-to-end data integration and data processing
  • Flexibility to write custom rules and embed other rules engines, frameworks, or trained models
  • Integration with other threat detection systems like IDS, SIEM, SOAR

The business logic for cyber threat detection looks different for every use case. Known attack patterns like MITRE ATT&CK help with the implementation. However, situational awareness and threat detection also need to detect unknown anomalies.
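
Here is a hedged sketch of what such detection logic can look like with Kafka Streams: a simple threshold rule that counts failed logins per source IP in a five-minute window and emits suspicious sources to a dedicated topic. Topic names, the log format, and the threshold are assumptions; in practice, the rule could be replaced by a rules engine or a trained model:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class FailedLoginAnomalyDetector {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "failed-login-anomaly-detector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Key = source IP, value = raw auth log line
        KStream<String, String> authEvents = builder.stream("auth-events");

        authEvents
            .filter((sourceIp, logLine) -> logLine != null && logLine.contains("LOGIN_FAILED"))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
            .count()
            .toStream()
            // Simple threshold rule; this is where a rules engine or trained model could be embedded instead
            .filter((windowedIp, failures) -> failures > 20)
            .map((windowedIp, failures) -> KeyValue.pair(
                windowedIp.key(),
                "{\"sourceIp\":\"" + windowedIp.key() + "\",\"failedLogins\":" + failures + "}"))
            .to("suspicious-logins"); // consumed by SIEM / SOAR or an alerting service

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```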

Let’s now take a look at a concrete example.

Intel’s Cyber Intelligence Platform

Let me quote Intel themselves:

As cyber threats continuously grow in sophistication and frequency, companies need to quickly acclimate to effectively detect, respond, and protect their environments. At Intel, we’ve addressed this need by implementing a modern, scalable Cyber Intelligence Platform (CIP) based on Splunk and Confluent. We believe that CIP positions us for the best defense against cyber threats well into the future.

Our CIP ingests tens of terabytes of data each day and transforms it into actionable insights through streams processing, context-smart applications, and advanced analytics techniques. Kafka serves as a massive data pipeline within the platform. It provides us the ability to operate on data in-stream, enabling us to reduce Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR). Faster detection and response ultimately leads to better prevention.

Let’s explore Intel’s CIP for threat intelligence in more detail.

Detecting Vulnerabilities with Stream Processing

Intel’s CIP leverages the whole Kafka ecosystem provided by Confluent:

  • Ingestion: Kafka producer clients for various sources such as databases, scanning engines, IP address management, asset management inventory, etc.
  • Streaming analytics: Kafka Streams for filtering vulnerabilities by business unit, joining asset ownership with vulnerable assets, etc.
  • Egress: Kafka Connect sink connectors for data lakes, IT partners, other business units, SIEM, SOAR, etc.
  • High availability: Multi-Region Clusters (MRC) for high availability across regions
  • And much more…

Here is a high-level architecture:

Stream Processing with Kafka at Intel

Intel’s Kafka Maturity Timeline

Building a cybersecurity infrastructure is not a big bang. A step-by-step approach starts with integrating the first sources and sinks, some simple stream processing, and deployment as a pilot project. Over time, more and more data sources and sinks are added, the business logic gets more powerful, and the scale increases.

Intel’s Kafka maturity timeline shows their learning curve:

Intel Kafka Maturity Timeline

Kafka Benefits to Intel

Intel describes their benefits for leveraging event streaming as follows:

  • Economies of scale
  • Operate on data in the stream
  • Reduce technical debt and downstream costs
  • Generates contextually rich data
  • Global scale and reach
  • Always on
  • Modern architecture with a thriving community
  • Kafka leadership through Confluent expertise

Those are pretty much the same reasons I use in many of my other blog posts to explain the rise of data in motion powered by Apache Kafka across industries and use cases. 🙂

For more intel on Intel’s Cyber Intelligence Platform powered by Confluent and Splunk, check out their whitepaper and Kafka Summit talk.

Scalable Real-time Cyber Threat Intelligence with Kafka

Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to implement cyber threat intelligence.

The Cyber Intelligence Platform from Intel is a great example of a Kafka-powered cybersecurity solution. It leverages the whole Kafka ecosystem to build a scalable and reliable real-time integration and processing layer. The streaming analytics logic depends on the use case. It can cover simple business logic but also external rules engines or analytic models.

How do you fight against cybersecurity risks? What technologies and architectures do you use to implement cyber threat intelligence? How does Kafka complement other tools? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka for Cybersecurity (Part 3 of 6) – Cyber Threat Intelligence appeared first on Kai Waehner.

Apache Kafka and MQTT (Part 1 of 5) – Overview and Comparison https://www.kai-waehner.de/blog/2021/03/15/apache-kafka-mqtt-sparkplug-iot-blog-series-part-1-of-5-overview-comparison/ Mon, 15 Mar 2021 13:04:33 +0000 https://www.kai-waehner.de/?p=3144 Apache Kafka and MQTT are a perfect combination for many IoT use cases. This blog series covers various use cases across industries including connected vehicles, manufacturing, mobility services, and smart city. This is part 1: Overview + Comparison.

The post Apache Kafka and MQTT (Part 1 of 5) – Overview and Comparison appeared first on Kai Waehner.

Apache Kafka and MQTT are a perfect combination for many IoT use cases. This blog series covers the pros and cons of both technologies. Various use cases across industries, including connected vehicles, manufacturing, mobility services, and smart city are explored. The examples use different architectures, including lightweight edge scenarios, hybrid integrations, and serverless cloud solutions. This post is part one: Overview and Comparison.

Apache Kafka and MQTT - Match Made in Heaven

Gartner predicts: “Around 10% of enterprise-generated data is created and processed outside a traditional centralized data center or cloud. By 2025, this figure will reach 75%“. Hence, looking at the combination of MQTT and Kafka makes a lot of sense!

Apache Kafka + MQTT Blog Series

The first blog post explores the relation between MQTT and Apache Kafka. Afterward, the other four blog posts discuss various use cases, architectures, and reference deployments.

  • Part 1 – Overview (THIS POST): Relation between Kafka and MQTT, pros and cons, architectures
  • Part 2 – Connected Vehicles: MQTT and Kafka in a private cloud on Kubernetes; use case: remote control and command of a car
  • Part 3 – Manufacturing: MQTT and Kafka at the edge in a smart factory; use case: Bidirectional OT-IT integration with Sparkplug B between PLCs, IoT Gateways, Data Historian, MES, ERP, Data Lake, etc.
  • Part 4 – Mobility Services: MQTT and Kafka leveraging serverless cloud infrastructure; use case: Traffic jam prediction service using machine learning
  • Part 5 – Smart City: MQTT at the edge connected to fully-managed Kafka in the public cloud; use case: Intelligent traffic routing by combining and correlating different 1st and 3rd party services

Subscribe to my newsletter to get updates immediately after the publication. Besides, I will also update the above list with direct links to this blog series’s posts as soon as published.

Apache Kafka vs. MQTT

MQTT is an open standard for a publish/subscribe messaging protocol. Open source and commercial solutions provide implementations of the different versions of the MQTT standard. MQTT was built for IoT use cases, including constrained devices and unreliable networks. However, it was not built for data integration and data processing.

Contrary to the above, Apache Kafka is not an IoT platform. Instead, Kafka is an event streaming platform and is used as the underpinning of an event-driven architecture for various use cases across industries. It provides a scalable, reliable, and elastic real-time platform for messaging, storage, data integration, and stream processing.

To clarify, MQTT and Kafka complement each other. Both have their strengths and weaknesses. They typically do not compete with each other.

When (not) to use MQTT?

This section explores the trade-offs of both technologies.

Pros of MQTT

  • Lightweight
  • Built for poor connectivity / high latency scenarios (e.g., mobile networks!)
  • High scalability and availability (with the right MQTT broker)
  • ISO Standard
  • Most popular IoT protocol
  • Deployable on all infrastructures (edge, data center, public cloud)

Cons of MQTT

  • Only pub/sub messaging, no stream processing, no data integration
  • Asynchronous processing (clients can be offline for a long time)
  • No reprocessing of events

When (not) to use Apache Kafka?

Pros of Kafka

  • Processing data in motion – not just pub/sub messaging, but also including stream processing and data integration
  • High throughput
  • Large scale
  • High availability
  • Long term storage and buffering for real decoupling of producers and consumers
  • Reprocessing of events
  • Good integration to the rest of the enterprise
  • Deployable on all infrastructures (edge, data center, public cloud)

Cons of Kafka

  • Not built for tens of thousands of connections
  • Requires stable network and good infrastructure
  • No IoT-specific features like keep alive or last will and testament

TL;DR

If you have a bad network, tens of thousands of clients, or the need for a lightweight push-based messaging solution, then MQTT is the right choice for messaging. Elsewhere, Kafka, a powerful event streaming platform, is probably a great choice for messaging, data integration, and data processing. In many IoT use cases, the architecture combines both technologies.
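
Conceptually, the combination often looks like the following sketch: an MQTT client (here the Eclipse Paho library) subscribes to device topics and forwards every message into Kafka. In production, this bridging is usually done by an MQTT broker extension, a Kafka Connect MQTT connector, or an MQTT proxy rather than hand-written code; broker addresses and topic names are placeholders:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.eclipse.paho.client.mqttv3.IMqttMessageListener;
import org.eclipse.paho.client.mqttv3.MqttClient;

import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class MqttToKafkaBridge {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Devices publish via MQTT (lightweight, works over unreliable networks)...
        MqttClient mqttClient = new MqttClient("tcp://localhost:1883", "kafka-bridge");
        mqttClient.connect();

        // ...the bridge forwards every message into Kafka for storage, integration, and stream processing
        IMqttMessageListener forwardToKafka = (mqttTopic, message) -> {
            String payload = new String(message.getPayload(), StandardCharsets.UTF_8);
            producer.send(new ProducerRecord<>("vehicle-telemetry", mqttTopic, payload));
        };
        mqttClient.subscribe("vehicles/+/telemetry", forwardToKafka);

        Thread.currentThread().join(); // keep the bridge running
    }
}
```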

Example:  Predictive Maintenance with 100,000 Connected Cars

We have built a “simple demo” with Confluent and HiveMQ running on cloud-native Kubernetes infrastructure. The use case implements data processing from 100,000 connected cars and real-time predictive maintenance with TensorFlow:

Apache Kafka and MQTT for Real Time Analytics and Machine Learning with TensorFlow

More details, a video of the demo, and the code are available on Github: 100,000 Connected Cars and Predictive Maintenance in Real-Time with MQTT, Kafka, Kubernetes, and TensorFlow.

Additionally, I blogged a lot about Kafka and Machine Learning. For instance, take a look at streaming machine learning without a data lake with Kafka and TensorFlow.

Further Slides, Articles, and Demos

I already created a lot of material (including blogs, slides, videos) around Kafka and MQTT. Check this out if you need to learn about the different broker alternatives, integration options, and best practices for the combination.

The following slide deck covers the content of this blog series on a high level:

Kafka + MQTT = Match Made in Heaven

In conclusion, Apache Kafka and MQTT are a perfect combination for many IoT use cases. Follow the blog series to learn about use cases such as connected vehicles, manufacturing, mobility services, and smart city. Every blog post also includes real-world deployments from companies across industries. It is key to understand the different architectural options to make the right choice for your project.

What are your experiences and plans in IoT projects? What use case and architecture did you implement? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka and MQTT (Part 1 of 5) – Overview and Comparison appeared first on Kai Waehner.

Building a Postmodern ERP with Apache Kafka https://www.kai-waehner.de/blog/2020/11/20/postmodern-erp-mes-scm-with-apache-kafka-event-streaming-edge-hybrid-cloud/ Fri, 20 Nov 2020 09:59:59 +0000 https://www.kai-waehner.de/?p=2847 Postmodern ERP represents the next generation of ERP architectures. It is real-time, scalable, and open by using a combination of open source technologies and proprietary standard software. This blog post explores why and how companies, both software vendors and end-users, leverage event streaming with Apache Kafka to implement a Postmodern ERP.

The post Building a Postmodern ERP with Apache Kafka appeared first on Kai Waehner.

Enterprise resource planning (ERP) exists for many years. It is often monolithic, complex, proprietary, batch, and not scalable. Postmodern ERP represents the next generation of ERP architectures. It is real-time, scalable, and open. A Postmodern ERP uses a combination of open source technologies and proprietary standard software. This blog post explores why and how companies, both software vendors and end-users, leverage event streaming with Apache Kafka to implement a Postmodern ERP.

Postmodern ERP with Apache Kafka

What is ERP (Enterprise Ressource Planning)?

Let’s define the term “ERP” first. This is not an easy task, as ERP is used for concepts and various standard software products.

Enterprise resource planning (ERP) is the integrated management of main business processes, often in real-time and mediated by software and technology.

ERP is usually referred to as a category of business management software—typically a suite of integrated applications – that an organization can use to collect, store, manage, and interpret data from many business activities.

ERP provides an integrated and continuously updated view of core business processes using common databases. These systems track business resources – cash, raw materials, production capacity – and the status of business commitments: orders, purchase orders, and payroll. The applications that make up the system share data across the various departments (manufacturing, purchasing, sales, accounting, etc.) that provide the data. ERP facilitates information flow between all business functions and manages connections to outside stakeholders.

It is important to understand that ERP is not just for manufacturing but relevant across various business domains. Hence, Supply Chain Management (SCM) is orthogonal to ERP.

ERP is a Zoo of Concepts, Technologies, and Products

An ERP is a key concept and typically uses various products as part of every supply chain where tangible goods are produced. For that reason, an ERP is very complex in most cases. It usually is not just one product, but a zoo of different components and technologies:

SAP ERP System - Zoo of Products including SCM MES CRM PLM WMS LMS

 

Example: SAP ERP – More than a Single Product…

SAP is the leading ERP vendor. I explored SAP, its product portfolio, and integration options for Kafka in a separate blog post: “Kafka SAP Integration – APIs, Tools, Connector, ERP et al.

Check that out if you want to get deeper into the complexity of a “single product and vendor”. You will be surprised how many technologies and integration options exist to integrate with SAP. SAP’s stack includes plenty of homegrown products like SAP ERP and acquisitions with their own codebase, including Ariba for supplier network, hybris for e-commerce solutions, Concur for travel & expense management, and Qualtrics for experience management. The article “The ERP is Dead. Long live the Distributed Planning System” from the SAP blog goes in a similar direction.

ERP Requirements are Changing…

This is not different for other big vendors. For instance, if you explore the Oracle website, you will also find a confusing product matrix. 🙂

That’s the status quo of most ERP vendors. However, things change due to shifting requirements: Digital Transformation, Cloud, Internet of Things (IoT), Microservices, Big Data, etc. You know what I mean… Requirements for standard software are changing massively.

Every ERP vendor (that wants to survive) is working on a Postmodern ERP these days by upgrading its existing software products or writing a completely new product – that’s often easier. Let’s explore what a Postmodern ERP is in the next section.

Introducing the Postmodern ERP

The term “Postmodern ERP” was coined by Gartner several years ago, already.

From the Gartner Glossary:

“Postmodern ERP is a technology strategy that automates and links administrative and operational business capabilities (such as finance, HR, purchasing, manufacturing, and distribution) with appropriate levels of integration that balance the benefits of vendor-delivered integration against business flexibility and agility.”

This definition shows the tight relation to other non-Core-ERP systems, the company’s whole supply chain, and partner systems.

The Architecture of a Postmodern ERP

According to Gartner’s definition of the postmodern ERP strategy, legacy, monolithic and highly customized ERP suites, in which all parts are heavily reliant on each other, should sooner or later be replaced by a mixture of both cloud-based and on-premises applications, which are more loosely coupled and can be easily exchanged if needed. Hint: This sounds a lot like Kafka, doesn’t it?

The basic idea is that there should still be a core ERP solution that would cover the most important business functions. In contrast, other functions will be covered by specialist software solutions that merely extend the core ERP.

There is, however, no golden rule as to what business functions should be part of the core ERP and what should be covered by supplementary solutions. According to Gartner, every company must define its own postmodern ERP strategy, based on its internal and external needs, operations, and processes. For example, a company may define that the core ERP solution should cover those business processes that must stay behind the firewall and choose to leave their core ERP on-premises. At the same time, another company may decide to host the core ERP solution in the cloud and move only a few ERP modules as supplementary solutions to on-premises.

Pros and Cons of a Postmodern ERP

SelectHub explores the pros and cons of a Postmodern ERP compared to legacy ERPs:

Pros and Cons of a Postmodern ERP

The pros are pretty obvious, and the main motivation why companies want or need to move away from their legacy ERP system. Software is eating the world. Companies (need to) become more flexible, elastic, and scalable. Applications (need to) become more personalized and context-specific, all of that (needs to be) in real-time. There is no way around a Postmodern ERP and all the related supply chain processes to solve these requirements.

The main benefits that companies will gain from implementing a Postmodern ERP strategy are speed and flexibility when reacting to unexpected changes in business processes or on the organizational level. With most applications having a relatively loose connection, it is fairly easy to replace or upgrade them whenever necessary. Companies can also select and combine cloud-based and on-premises solutions that are most suited for their ERP needs.

The cons are more interesting because they need to be solved to deploy a Postmodern ERP successfully. The key downside of a postmodern ERP is that it will most likely lead to an increased number of software vendors that companies will have to manage, and it poses additional integration challenges for central IT.

Coincidentally, I had similar discussions with customers in the past quarters regularly. More and more companies adopt Apache Kafka to solve these challenges to build a Postmodern ERP and flexible, scalable supply chain processes.

Kafka as the Foundation of a Postmodern ERP

If you follow my blog and presentations, you know that Kafka is used in all areas where an ERP is relevant, for instance, Industrial IoT (IIoT), Supply Chain Management, Edge Analytics, and many other scenarios. Check out “Kafka in Industry 4.0 and Manufacturing” to learn more details about various use cases.

Example: A Postmodern ERP built on top of Kafka

A Postmodern ERP built on top of Apache Kafka is part of this story:

Postmodern ERP with Apache Kafka SAP S4 Hana Oracle XML Web Services MES

This architecture shows a Postmodern ERP with various components. Note that the Core ERP is built on Apache Kafka. Many other systems and applications are integrated.

Each component of the Postmodern ERP has a different integration paradigm:

  • The TMS (Transportation Management System) is a legacy COTS application providing only a legacy XML-based SOAP Web Service interface. The integration is synchronous and not scalable but works for small transactional data sets.
  • The LMS (Labor Management System) is a legacy homegrown application. The integration is implemented via Kafka Connect and a CDC (Change-Data-Capture) connector to push changes from the relational Oracle database in real-time into Kafka.
  • The SRM (Supplier Relationship Management) is a modern application built on top of Kafka itself. Integration with the Core ERP is implemented with Kafka-native replication technologies like MirrorMaker 2, Confluent Replicator, or Confluent Cluster Linking to provide a scalable real-time integration.
  • The MES (Manufacturing Execution System) is an SAP COTS product and part of the SAP S4/Hana product portfolio. The integration options include REST APIs, the Eventing API, and Java APIs. The right choice depends on the use case. Again, read Kafka SAP Integration – APIs, Tools, Connector, ERP et al. to understand how complex the longer explanation is.
  • The CRM (Customer Relationship Management) is Salesforce, a SaaS cloud service, integrated via Kafka Connect and the Confluent connector.
  • Many more integrations to additional internal and external applications are needed in a real-world architecture.

This is a hypothetical implementation of a Postmodern ERP. However, more and more companies implement this architecture for all the discussed benefits. Unfortunately, such modern architecture also includes some challenges. Let’s explore them and discuss how to solve them with Apache Kafka and its ecosystem.

Solving the Challenges of a Postmodern ERP with Kafka

This section covers three main challenges of implementing a Postmodern ERP and how Kafka and its ecosystem help implement this architecture.

I quote the three main challenges from the blog post “Postmodern ERP: Just Another Buzzword?” and then explain how the Kafka ecosystem solved them more or less out-of-the-box.

Issue 1: More Complexity Between Systems!

“Because ERP modules and tools are built to work together, legacy systems can be a lot easier to configure than a postmodern solution composed entirely of best-of-breed solutions. Because postmodern ERP may involve different programs from different vendors, it may be a lot more challenging to integrate. For example, during the buying process, you would need to ask about compatibility with other systems to ensure that the solution that you have in mind would be sufficient.”

First of all, is your existing ERP system easy to integrate? Any ERP system older than five years uses proprietary interfaces (such as BAPI and iDoc in the case of SAP) or ugly/complex SOAP web services to integrate with other systems. Even if all the software components come from one single vendor, they were built by different business units or even acquired. The codebases and interfaces speak very different languages and technologies.

So, while a Postmodern ERP requires complex integration between systems, so does any legacy ERP system! Nevertheless:

How Kafka Helps…

Kafka provides an open, scalable, elastic real-time infrastructure for implementing the middleware between your ERP and other systems. More details in the comparison between Kafka and traditional middleware such as ETL and ESB products.

Kafka Connect is a key piece of this architecture. It provides a Kafka-native integration framework.

Additionally, another key reason why Kafka makes these complex integrations successful is that Kafka really decouples systems (in contrast to traditional messaging queues or synchronous SOAP/REST web services):

Domain-Driven Design and Decoupling for your Postmodern ERP with Kafka

The heart of Kafka is real-time and event-based. Additionally, Kafka decouples producers and consumers with its storage capabilities and handles the backpressure and optionally the long-term storage of events and data. This way, batch analytics platforms, REST interfaces (e.g., mobile apps) with request-response, and databases can access the data, too. Learn more about “Domain-driven Design (DDD) for decoupling applications and microservices with Kafka“.
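
A small sketch of what this decoupling means for a consumer: the application below joins late and simply replays the retained events from the beginning at its own pace, without the ERP producers knowing anything about it. Topic and group names are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class BatchAnalyticsConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "nightly-analytics");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // A late-joining or slow consumer simply starts from the oldest retained event:
        // the ERP producer does not know or care that this consumer exists.
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("erp-orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // e.g. load into a data warehouse or batch analytics platform at its own pace
                    System.out.printf("offset=%d key=%s value=%s%n", record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```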

Understanding the relation between event streaming with Kafka and non-streaming APIs (usually HTTP/REST) is also very important in this discussion. Check out “Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies?” for more details.

The integration capabilities and real decoupling with Kafka enable handling the complexity between systems.

Issue 2: More Difficult Upgrades!

“This con goes hand in hand with the increased complexity between systems. Because of this increased complexity and the fact that the solution isn’t an all-in-one program, making system upgrades can be difficult. When updates occur, your IT team will need to make sure that the relationship between the disparate systems isn’t negatively affected.”

How Kafka Helps…

The issue with upgrades is solved with Kafka out-of-the-box. Remember: Kafka really decouples systems from each other due to its storage capabilities. You can upgrade one system without even informing the other systems and without downtime! Two reasons why this works so well and out-of-the-box:

  1. Kafka is backward compatible. Even if you upgrade the server-side (Kafka brokers, ZooKeeper, Schema Registry, etc.), the other applications and interfaces continue to work without breaking changes. Server-side and client-side can be updated independently. Sometimes an older application is not updated anymore at all because it will be replaced soon. That’s totally fine. An old Kafka client can speak to a newer Kafka broker.
  2. Kafka uses rolling upgrades. The system continues to work without any downtime. 24/7. For mission-critical workloads like ERP or MES transactions. From the outside, the upgrade is not even noticeable.

Let’s take a look at an example with different components of the Postmodern ERP:

Postmodern ERP - Replication between Kafka and ERP Components

In this case, we see different versions and distributions of Kafka being used:

  • The Tier 1 Supplier uses the fully-managed and serverless Confluent Cloud solution. It automatically upgrades to the latest Kafka release under the hood (this is never a problem due to backward compatibility). The client applications use pretty old versions of Kafka.
  • The Core ERP uses open-source Kafka as it is a homegrown solution, not standard software. The operations and support are handled by the company itself (pretty risky for such a critical system, but totally valid). The Kafka version is relatively new. One client application even uses a Kafka version, which is newer than the server-side, to leverage a new feature in Kafka Streams (Kafka is backward compatible in both directions, so this is not a problem).
  • The MES vendor uses Confluent Platform, which embeds Apache Kafka. The version is up-to-date as the vendor does regular releases and supports rolling upgrades.
  • Integration between the different ERP applications is implemented with Kafka-native replication tools such as MirrorMaker 2 or Confluent Cluster Linking. As discussed in a former section, various other integration options are available, including REST, Kafka Connect, native Kafka clients in many programming languages, or any ETL or ESB tool.

Backward compatibility and rolling upgrades make updating systems easy and invisible for integrated systems. Business continuity is guaranteed out-of-the-box.

Issue 3: Lack of Access When Offline

“When you implement a cloud-based software, you need to account for the fact that you won’t be able to access it when you are offline. Many legacy ERP systems offer on-premise solutions, albeit with a high installation cost. However, this software is available offline. For cloud ERP solutions, you are reliant on the internet to access all of your data. Depending on your specific business needs, this may be a dealbreaker.”

How Kafka Helps…

Hybrid architectures are the new black. Local processing on-premise is required in most use cases. It is okay to build the next generation ERP in the cloud. But the integration between cloud and on-premise/edge is key for success. A great example is Mojix, a Kafka-native cloud platform for real-time retail & supply chain IoT processing with Confluent Cloud.

When tangible goods are produced and sold, some processing needs to happen on-premise (e.g., in a factory) or even closer to the edge (e.g., in a restaurant or retail store). Having no access to your data is a dealbreaker. Having no capability for local processing is a dealbreaker. Latency and cost of a cloud-only approach can be another dealbreaker.

Kafka works well on-premise and at the edge. Plenty of examples exist. Including Kafka-native bi-directional real-time replication between on-premise / edge and the cloud.

I have covered these topics many times already; therefore, I will just share a few links to read:

I specifically recommend the latter link. It covers hybrid architectures where processing at the edge (i.e. outside the data center) is key and required even offline, like in the following example running Kafka in a factory (including the server-side):

Edge Computing with Kafka in Manufacturing and Industry 4.0 MES ERP

The hybrid integration capabilities of Kafka and its ecosystem solve the issue of lack of access when offline.

Kafka and Event Streaming as Foundation for a Postmodern ERP Infrastructure

Postmodern ERP represents the next generation of ERP architectures. It is real-time, scalable, and open by using a combination of open source technologies and proprietary standard software. This blog post explored how software vendors and end-users leverage event streaming with Apache Kafka to implement a Postmodern ERP.

What are your experiences with ERP systems? Did you already implement a Postmodern ERP architecture? Which approach works best for you? What is your strategy? Let’s connect on LinkedIn and discuss it!

The post Building a Postmodern ERP with Apache Kafka appeared first on Kai Waehner.

]]>
Use Cases and Architectures for Kafka at the Edge https://www.kai-waehner.de/blog/2020/10/14/use-cases-architectures-apache-kafka-edge-computing-industrial-iot-retail-store-cell-tower-train-factory/ Wed, 14 Oct 2020 12:54:32 +0000 https://www.kai-waehner.de/?p=2723 Use cases and architectures for Kafka deployments at the edge, including retail stores, cell towers, trains, small factories, restaurants... Hardware and software components to realize edge and hybrid Kafka infrastructures.

The post Use Cases and Architectures for Kafka at the Edge appeared first on Kai Waehner.

]]>
Event streaming with Apache Kafka at the edge is not cutting edge anymore. It is a common approach to providing the same open, flexible, and scalable architecture at the edge as in the cloud or data center. Possible locations for a Kafka edge deployment include retail stores, cell towers, trains, small factories, restaurants, etc. I already discussed the concepts and architectures in detail in the past: “Apache Kafka is the New Black at the Edge” and “Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments“. This blog post is an add-on focusing on use cases across industries for Kafka at the edge.

To be clear before you read on: Edge is NOT a data center.

And “Edge Kafka” is not simply yet another IoT project using Kafka in a remote location. Edge Kafka is actually an essential component of a streaming nervous system that spans IoT (or OT in Industrial IoT) and non-IoT (traditional data-center / cloud infrastructures).

The post’s focus is scenarios where the Kafka clients AND the Kafka brokers are running on the edge. This enables edge processing, integration, decoupling, low latency, and cost-efficient data processing.

Kafka at the Edge - Use Cases and Architectures

Categories and Architectures for Kafka at the Edge

Some IoT projects are built like “normal Kafka projects”, i.e., built in the (edge) data center or cloud. For instance, bigger factories can provide infrastructure to deploy a reliable Kafka cluster with stable network connectivity to the cloud. Unfortunately, many IoT projects require real edge capabilities.

What’s different at the edge?

  • Offline business continuity is important, even if the connection to the central data center or cloud is not available. Disconnected / offline sites often do not require or provide high availability (because it is not worth the effort): local pre-processing and real-time analytics with low latency, with only occasional or low-bandwidth connectivity to the data center or cloud.

  • Often these projects need to deploy Kafka brokers across hundreds of locations. A single broker without high availability is often good enough to handle backpressure and local processing. Use cases exist across industries, including retail stores, trains, restaurants, cell towers, small factories, etc.

  • Low-footprint, low-touch, little-or-no-DevOps-required installations of Kafka brokers (not just clients) are mandatory for many of these use cases. In these cases, no IT experts are available “on-site” to operate Kafka. Hence, using certified OEM hardware is a great option to install and operate Kafka at the edge.

  • Many edge use cases are all around sensor and telemetry data. This is not transactional data where every single message counts. An application that processes millions of messages per second is fine with losing a few of them, as it does not affect the outcome of the calculation (see the producer configuration sketch after this list).
  • Hybrid and not cloud-only: Consumer IoT (CIoT) always includes the users (in their smart home, ride-share, retail store, etc.), and Industrial IoT (IIoT) always includes tangible goods (cars, food, energy, …).

  • Thousands and tens of thousands of connected interfaces: Sensors, machines, mobile devices, etc.
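As a sketch of how the telemetry trade-off can surface in configuration (broker address and parameter values are illustrative assumptions, not recommendations), a telemetry producer can be tuned for throughput and tolerate losing a few messages, while a transactional producer enables stronger delivery guarantees:

```java
import java.util.Properties;

public class EdgeProducerConfigs {

    // Lossy but fast: fine for sensor telemetry where losing a few messages
    // does not change the outcome of the calculation.
    static Properties telemetryConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "edge-broker:9092"); // hypothetical single edge broker
        props.put("acks", "0");                  // fire-and-forget, no broker acknowledgment
        props.put("enable.idempotence", "false"); // idempotence requires acks=all, so disable it here
        props.put("linger.ms", "100");           // batch aggressively for throughput
        props.put("compression.type", "lz4");    // save bandwidth on constrained links
        // serializers omitted for brevity
        return props;
    }

    // Stronger guarantees: for transactional data where every message counts.
    static Properties transactionalConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "edge-broker:9092");
        props.put("acks", "all");                 // wait until the write is acknowledged
        props.put("enable.idempotence", "true");  // no duplicates on producer retries
        // serializers omitted for brevity
        return props;
    }
}
```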

Using one single technical infrastructure enables building edge and hybrid architectures. There is no need for a ton of different frameworks and products. This is a huge benefit from a development, testing, operations, and support point of view!

Use Cases Across Industries

Industries for Kafka at the Edge include manufacturing, pharma, carmakers, telecommunications, retailing, energy, restaurants, gaming, healthcare, public sector, aerospace, transportation, and others.

Architectures and use cases include data integration, pre-processing and replication to the cloud, big and small data edge processing and analytics, disconnected offline scenarios, very low footprint scenarios with hundreds of locations, scenarios without high-availability requirements, and others.

Scenarios for Edge Computing with Kafka

Various examples of Kafka deployments at the edge exist. Almost all of these use cases are related to several of the above categories and requirements, such as low hardware footprint, disconnected offline processing, hundreds of locations, and hybrid architectures.

I have worked with enterprises across industries and the globe on the following scenarios:

  • Public Sector: Local administration in each city, smart city projects incl. public transportation, traffic management, integration of various connected car platforms from different carmakers, cybersecurity (including IoT use cases such as capturing and processing camera images)
  • Transportation / Logistics / Railway / Aviation: Track&Trace, Kafka in the trains for offline and local processing / storage, traveller information (delayed or canceled flight / train / bus), real-time loyalty platforms (class upgrade, lounge access)
  • Manufacturing (Automotive, Aerospace, Semiconductors, Chemical, Food, and others): IoT aftermarket customer services, OEM in machines and vehicles, embedding into standard software such as ERP or MES systems, cybersecurity, a digital twin of devices/machines/production lines/processes, production line monitoring in factories for predictive maintenance/quality control/production efficiency, operations dashboards and line wellness (on-site for the plant manager, and aggregated global KPIs for executive management), track&trace and geofencing on the shop floor
  • Energy / Utility / Oil&Gas: Smart home, smart buildings, smart meters, monitoring of remote machines (e.g., for drilling, windmills, mining), pipeline and refinery operations (e.g., predictive failure or anomaly detection)
  • Telecommunications / Media: OSS real-time monitoring/problem analysis/metrics reporting/root cause analysis/action response of the network devices and infrastructure (routers, switches, other network devices), BSS customer experience and OTT services (mobile app integration for millions of users), 5G edge (e.g., street sensors)
  • Healthcare: Track&trace in the hospital, remote monitoring, machine sensor analytics
  • Retailing / Food / Restaurants / Banking: Customer communication, cross-/up-selling, loyalty system, payments in retail stores, perpetual inventory, Point-of-Sale (PoS) integration for (local) payments and (remote) CRM integration, EFTPOS (Electronic funds transfer at point of sale)

A great practical example of edge computing in retailing is the fast-food chain Chick-fil-A. They deployed a Kubernetes cluster in each of their 2000 restaurants for real-time analytics at the edge without an internet connection. The hardware is pretty small and provides an Intel quad-core processor with 8 GB RAM and an SSD:

Chick-fil-A Restaurant Edge Hardware

Example Architecture: Kafka in Transportation and Logistics

Let’s make this “Kafka at the edge” thing more clear with a specific example. In this case, I use the railway and transportation industry. But this can easily be mapped to your industry and use case.

The following example shows an edge and hybrid solution for railways to improve the customer experience and increase the revenue of the railway company. It leverages offline edge processing for customer communication, replication to the cloud for analytics, and integration with 3rd party interfaces and APIs from partners.

Hybrid Architecture – From Edge to Cloud

Local processing at the edge happens on the train. But each train also replicates relevant data in real-time to the cloud, if internet connectivity and free network resources are available. If the train is not online, Kafka handles the backpressure and replicates to the cloud once online again:

Hybrid Architecture with Kafka at the edge and in the cloud

Event Streaming in the Train with Kafka

Kafka on the train is NOT just used for real-time messaging and handling backpressure. These are already great reasons for using Kafka at the edge. Still, even bigger value is created when Kafka is also used for data integration (restaurant, traveler information, loyalty system, etc.) and data processing (up-/cross-selling, real-time delay information, etc.) at the edge. This way, only one single platform is required to solve all the different problems:

Event Streaming with Apache Kafka at the Edge in a Train
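As a rough sketch of what such local stream processing on the train could look like (topic names, broker address, and the filtering logic are assumptions for illustration), a small Kafka Streams application can turn schedule updates into passenger notifications without the data ever leaving the train:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class TrainDelayNotifier {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "train-delay-notifier");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "train-broker:9092"); // hypothetical on-train broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> updates = builder.stream("schedule-updates");

        // Keep only updates that report a delay and turn them into passenger
        // notifications. Everything runs locally on the train; a real application
        // would parse the JSON payload instead of this simple string check.
        updates.filter((trainId, update) -> update.contains("\"delayMinutes\""))
               .mapValues(update -> "Delay notification: " + update)
               .to("passenger-notifications");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```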

Kafka for Disconnected / Offline Scenarios

Trains (and many other edge locations) are offline regularly. For instance, a train drives through a tunnel or reaches an area with no cell connectivity. Local processing is still possible. Business continuity is key to improving the customer experience and increasing sales, even if the train is disconnected from the internet. Passengers can still use the mobile app to see traveler information, buy food in the restaurant, or watch movies stored on the train’s local server. As soon as the train has internet connectivity again, the purchases from passengers are transferred to the loyalty system in the cloud, the latest delay information is consumed from the cloud and stored on the edge Kafka broker in the train, and so on:

Data Processing at the Edge with Kafka in offline and disconnected mode

Cross-Company Kafka Integration

Data processing does not stop with the hybrid integration between the edge (train) and cloud (CRM, loyalty system, etc.). Different divisions or partner companies need to integrate, too. Instead of using non-scalable, synchronous REST API calls / API Management for partner integration, streaming replication with Kafka-native technologies is a much better, scalable approach:

3rd Party and Partner Kafka Replication and API Management

I hope this story about Kafka at the edge helped you better understand how you can leverage event streaming in your industry and use cases to build an end-to-end streaming infrastructure from edge to cloud.

Infrastructure and Hardware Requirement for Deployment of Kafka at the Edge

Finally, it is important to discuss how to deploy Kafka at the edge. To be clear: Kafka still needs some computing power.

Obviously, this depends on many factors: the hardware vendors and infrastructure you are working with, specific SLAs and HA requirements, and so on. The good news is that Kafka can be deployed on many infrastructures, including bare metal, VMs, containers, Kubernetes, etc. The other good news is that new hardware for computing resources (even for the “edge”) typically comes with 4, 8, or even 16 GB of RAM, because these are the smallest configurations chip vendors produce these days for such environments (small factories, retail stores, etc.).

Minimum hardware requirements for running a very small footprint Kafka are a single-core processor and a few hundred MB of RAM. This already allows decent edge processing with 100+ Mb/sec throughput on a single Kafka node (with replication factor = 1). However, real values depend on the number of partitions, message size, network speed, and other characteristics. Don’t expect the same performance and scalability as in the data center or cloud!

Thus, you can deploy a Kafka broker on a Raspberry Pi, but not on some small embedded device! The latter is where the Kafka clients can run.
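For illustration, creating a topic on such a single-broker edge node could look like the following sketch with the Kafka AdminClient (broker address, topic name, and partition count are assumptions): the replication factor of 1 reflects the “no high availability” trade-off discussed above.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class EdgeTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "edge-node:9092"); // hypothetical single-broker edge node

        try (AdminClient admin = AdminClient.create(props)) {
            // Replication factor 1: no high availability, which is the usual
            // trade-off for a single-broker installation at the edge.
            NewTopic telemetry = new NewTopic("sensor-telemetry", 6, (short) 1);
            admin.createTopics(List.of(telemetry)).all().get();
        }
    }
}
```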

Check out the “Infrastructure Checklist for Apache Kafka at the Edge” if you plan to go that direction!

“The Confluent Way” to Deploy Kafka at the Edge

From a technical perspective, deployment of Kafka at the edge is the same as in a data center or cloud. However, the environment and requirements are a little bit different as we learned above. Some additional features definitely help with deploying and operating Kafka at the edge.

I work for Confluent. Hence, I provide you with “The Confluent Way” of deploying Kafka at the edge in your future projects, including innovative, differentiating features:

  1. Confluent Server includes the Kafka Broker and various enhancements such as self-balancing clusters, Tiered Storage, embedded REST API, server-side schema validation, and much more. It can be deployed as a single node (very lightweight, but no high availability) or cluster (for mission-critical workloads that require high availability at the edge).
  2. In 2020, ZooKeeper is still required as an additional component. However, ZooKeeper removal is coming in 2021! Many Confluent engineers work full-time on removing the (ugly) Zookeeper dependency from the Kafka project, which means it will be possible to run a standalone, single process edge solution powered by Kafka. The whole Kafka community is looking forward to this architectural change. You can already run Kafka at the edge today, of course. It works well with ZooKeeper, too.
  3. Cluster Linking allows all these small Kafka edge sites to connect to a bigger Kafka cluster in a data center or cloud using the Kafka protocol. No need to use additional tools and infrastructure such as Confluent Replicator or MirrorMaker. This significantly reduces development efforts and infrastructure costs.
  4. Monitoring and Proactive Support with Confluent’s toolset. For instance, one Control Center can monitor several different (remote) Kafka clusters. Confluent Telemetry Reporter collects data from edge sites. Monitoring includes the technical infrastructure, but also for applications and end-to-end integration.
  5. Kubernetes is becoming the edge orchestration platform of choice for many edge deployments. For instance, using the MicroK8s Kubernetes distribution. Confluent Operator (Custom Resource Definitions, CRD) provides the capabilities to provision and operate single-broker or cluster deployments at the edge. This includes rolling upgrades, security automation, and much more. Confluent Operator is battle-tested in Confluent Cloud and large deployments of Confluent Platform in data centers already. Chick-fil-A has a great success story about running Kubernetes in their 2000+ fast food stores to provide disconnected edge services and low latency. Kubernetes is clearly optional. Only use it when it really adds value and is not too complex or resource-hungry. Many edge deployments will probably disqualify Kubernetes as too heavyweight and complex.
  6. Data Integration and data processing is key at the edge. Confluent provides connectors to legacy and modern systems (including PLC4X for PLC, OPC-UA, and IIoT integration, database CDC connectors, MQ and MQTT integration, Data Diode connectors for uni-directional UDP networks in high security or dirty environments, cloud connectors, and much more). Kafka Streams and ksqlDB allow lightweight but powerful stream processing without the need for yet another technology.
  7. Lightweight edge clients (e.g., embedded devices) leverage the C or C++ client APIs for Kafka and the REST Proxy to communicate from any programming language via HTTP(S); see the minimal HTTP example after this list. This is important for many low-power edge devices with limited computing power where Java or similar “resource-hungry technologies” cannot be deployed.
  8. Optionally, certified, pre-configured OEM hardware is available. You put the box at the edge, connect it to LAN or Wifi, and use it. That’s it. Management and monitoring of the infrastructure occur via the remote software of the hardware vendor. “Feasible video processing on Hivecell” demonstrates edge analytics with a Kafka cluster, a Kafka Streams application, and embedded machine learning models for image recognition in real-time.
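As an illustration of the REST Proxy option mentioned in point 7 (host, port, topic, and payload fields are assumptions for this sketch, following the REST Proxy v2 JSON format), a device can produce an event with a plain HTTP POST. The example below uses Java’s built-in HTTP client, but any language with an HTTP library works the same way.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyProduce {
    public static void main(String[] args) throws Exception {
        // One record wrapped in the REST Proxy v2 JSON envelope.
        String body = "{\"records\":[{\"value\":{\"deviceId\":\"sensor-42\",\"temperature\":21.5}}]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://rest-proxy:8082/topics/sensor-telemetry")) // hypothetical host and topic
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```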

Kafka at the Edge (and in Hybrid Architectures) is the New Black

Kafka is a great solution for the edge. It enables deploying the same open, scalable, and reliable technology at the edge, data center, and the cloud. This is relevant across industries. Kafka is used in more and more places where nobody has seen it before. Edge sites include retail stores, restaurants, cell towers, trains, and many others. I hope the various use cases and architectures inspired you a little bit.

What are your experiences at the edge? What are your use cases? Did you or do you plan to use Apache Kafka and its ecosystem? What is your strategy? Let’s connect on LinkedIn and discuss it!

The post Use Cases and Architectures for Kafka at the Edge appeared first on Kai Waehner.

]]>