Stream Processing Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/stream-processing/

Confluent Data Streaming Platform vs. Databricks Data Intelligence Platform for Data Integration and Processing
https://www.kai-waehner.de/blog/2025/05/05/confluent-data-streaming-platform-vs-databricks-data-intelligence-platform-for-data-integration-and-processing/

This blog explores how Confluent and Databricks address data integration and processing in modern architectures. Confluent provides real-time, event-driven pipelines connecting operational systems, APIs, and batch sources with consistent, governed data flows. Databricks specializes in large-scale batch processing, data enrichment, and AI model development. Together, they offer a unified approach that bridges operational and analytical workloads. Key topics include ingestion patterns, the role of Tableflow, the shift-left architecture for earlier data validation, and real-world examples like Uniper’s energy trading platform powered by Confluent and Databricks.

Many organizations use both Confluent and Databricks. While these platforms serve different primary goals—real-time data streaming vs. analytical processing—there are areas where they overlap. This blog explores how the Confluent Data Streaming Platform (DSP) and the Databricks Data Intelligence Platform handle data integration and processing. It explains their different roles, where they intersect, and when one might be a better fit than the other.

Confluent and Databricks for Data Integration and Stream Processing

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Learn how these platforms will affect data use in businesses in future articles. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

Data Integration and Processing: Shared Space, Different Strengths

Confluent is focused on continuous, event-based data movement and processing. It connects to hundreds of real-time and non-real-time data sources and targets. It enables low-latency stream processing using Apache Kafka and Flink, forming the backbone of an event-driven architecture. Databricks, on the other hand, combines data warehousing, analytics, and machine learning on a unified, scalable architecture.

Confluent: Event-Driven Integration Platform

Confluent is increasingly used as modern operational middleware, replacing traditional message queues (MQ) and enterprise service buses (ESB) in many enterprise architectures.

Thanks to its event-driven foundation, it supports not just real-time event streaming but also integration with request/response APIs and batch-based interfaces. This flexibility allows enterprises to standardize on the Kafka protocol as the data hub—bridging asynchronous event streams, synchronous APIs, and legacy systems. The immutable event store and true decoupling of producers and consumers help maintain data consistency across the entire pipeline, regardless of whether data flows in real time, in scheduled batches, or via API calls.

Batch Processing vs Event-Driven Architecture with Continuous Data Streaming

Databricks: Batch-Driven Analytics and AI Platform

Databricks excels in batch processing and traditional ELT workloads. It is optimized for storing data first and then transforming it within its platform, but it’s not built as a real-time ETL tool for directly connecting to operational systems or handling complex, upstream data mappings.

Databricks enables data transformations at scale, supporting complex joins, aggregations, and data quality checks over large historical datasets. Its Medallion Architecture (Bronze, Silver, Gold layers) provides a structured approach to incrementally refine and enrich raw data for analytics and reporting. The engine is tightly integrated with Delta Lake and Unity Catalog, ensuring governed and high-performance access to curated datasets for data science, BI, and machine learning.

For most use cases, the right choice is simple.

  • Confluent is ideal for building real-time pipelines and unifying operational systems.
  • Databricks is optimized for batch analytics, warehousing, and AI development.

Together, Confluent and Databricks cover both sides of the modern data architecture—streaming and batch, operational and analytical. And Confluent’s Tableflow and a shift-left architecture enable native integration with earlier data validation, simplified pipelines, and faster access to AI-ready data.

Data Ingestion Capabilities

Databricks recently introduced LakeFlow Connect and acquired Arcion to strengthen its capabilities around Change Data Capture (CDC) and data ingestion into Delta Lake. These are good steps toward improving integration, particularly for analytical use cases.

However, Confluent is the industry leader in operational data integration, serving as modern middleware for connecting mainframes, ERP systems, IoT devices, APIs, and edge environments. Many enterprises have already standardized on Confluent to move and process operational data in real time with high reliability and low latency.

Introducing yet another tool—especially for ETL and ingestion—creates unnecessary complexity. It risks a return to Lambda-style architectures, where separate pipelines must be built and maintained for real-time and batch use cases. This increases engineering overhead, inflates cost, and slows time to market.

Lambda Architecture - Separate ETL Pipelines for Real Time and Batch Processing

In contrast, Confluent supports a Kappa architecture model: a single, unified event-driven data streaming pipeline that powers both operational and analytical workloads. This eliminates duplication, simplifies the data flow, and enables consistent, trusted data delivery from source to sink.

Kappa Architecture - Single Data Integration Pipeline for Real Time and Batch Processing
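
To make the Kappa idea concrete, here is a minimal Kafka Streams sketch. The topic names are hypothetical and configuration is reduced to the essentials: one pipeline validates events in motion, and operational consumers and analytical sinks read the same governed output topic.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();

// Validate raw events once, as they arrive
KStream<String, String> orders = builder.stream("orders.raw",
        Consumed.with(Serdes.String(), Serdes.String()));

orders.filter((key, value) -> value != null && !value.isBlank())
      .to("orders.validated", Produced.with(Serdes.String(), Serdes.String()));

// Operational microservices and analytical sinks (e.g., Tableflow into the
// lakehouse) both consume "orders.validated"; no second batch pipeline needed.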

Confluent for Data Ingestion into Databricks

Confluent’s integration capabilities provide:

  • 100+ enterprise-grade connectors, including SAP, Salesforce, and mainframe systems
  • Native CDC support for Oracle, SQL Server, PostgreSQL, MongoDB, Salesforce, and more
  • Flexible integration via Kafka Clients for any relevant programming language, REST/HTTP, MQTT, JDBC, and other APIs
  • Support for operational sinks (not just analytics platforms)
  • Built-in governance, durability, and replayability

A good example: Confluent’s Oracle CDC Connector uses Oracle’s XStream API and delivers “GoldenGate-level performance”, with guaranteed ordering, high throughput, and minimal latency. This enables real-time delivery of operational data into Kafka, Flink, and downstream systems like Databricks.

Bottom line: Confluent offers the most mature, scalable, and flexible ingestion capabilities into Databricks—especially for real-time operational data. For enterprises already using Confluent as the central nervous system of their architecture, adding another ETL layer specifically for the lakehouse integration with weaker coverage and SLAs only slows progress and increases cost.

Stick with a unified approach—fewer moving parts, faster implementation, and end-to-end consistency.

Real-Time vs. Batch: When to Use Each

Batch ETL is well understood. It works fine when data does not need to be processed immediately—e.g., for end-of-day reports, monthly audits, or historical analysis.

Streaming ETL is best when data must be processed in motion. This enables real-time dashboards, live alerts, or AI features based on the latest information.

Confluent DSP is purpose-built for streaming ETL. Kafka and Flink allow filtering, transformation, enrichment, and routing in real time.

Databricks supports batch ELT natively. Delta Live Tables offers a managed way to build data pipelines on top of Spark, letting you declaratively define how data should be transformed and processed using SQL or Python. On the other hand, Spark Structured Streaming can handle streaming data in near real time, but it still requires persistent clusters and infrastructure management.
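
For illustration, here is a minimal sketch of a Spark Structured Streaming job that continuously reads a Kafka topic into a Delta table. The broker address, topic, and paths are placeholders, and spark is an already created SparkSession:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> payments = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "payments")
        .load();

// Continuously append the raw payloads to a Delta table
payments.selectExpr("CAST(value AS STRING) AS payload")
        .writeStream()
        .format("delta")
        .option("checkpointLocation", "/tmp/checkpoints/payments")
        .start("/delta/payments");

The job itself is only a few lines, but it runs on an always-on cluster whose sizing, checkpointing, and upgrades you own. That operational burden is what Tableflow removes.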

If you’re already invested in Spark, Structured Streaming or Delta Live Tables might be sufficient. But if you’re starting fresh—or looking to simplify your architecture—Confluent’s Tableflow provides a more streamlined, Kafka-native alternative. Tableflow represents Kafka streams as Delta Lake tables. No cluster management. No offset handling. Just discoverable, governed data in Databricks Unity Catalog.

Real-Time and Batch: A Perfect Match at Walmart for Replenishment Forecasting in the Supply Chain

Walmart demonstrates how real-time and batch processing can work together to optimize a large-scale, high-stakes supply chain.

At the heart of this architecture is Apache Kafka, powering Walmart’s real-time inventory management and replenishment system.

Kafka serves as the central data hub, continuously streaming inventory updates, sales transactions, and supply chain events across Walmart’s physical stores and digital channels. This enables real-time replenishment to ensure product availability and timely fulfillment for millions of online and in-store customers.

Batch processing plays an equally important role. Apache Spark processes historical sales, seasonality trends, and external factors in micro-batches to feed forecasting models. These forecasts are used to generate accurate daily order plans across Walmart’s vast store network.

Replenishment Supply Chain Logistics at Walmart Retail with Apache Kafka and Spark
Source: Walmart

This hybrid architecture brings significant operational and business value:

  • Kafka provides not just low latency, but true decoupling between systems, enabling seamless integration across real-time streams, batch pipelines, and request-response APIs—ensuring consistent, reliable data flow across all environments
  • Spark delivers scalable, high-performance analytics to refine predictions and improve long-term planning
  • The result: reduced cycle times, better accuracy, increased scalability and elasticity, improved resiliency, and substantial cost savings

Walmart’s supply chain is just one of many use cases where Kafka powers real-time business processes, decisioning and workflow orchestration at global scale—proof that combining streaming and batch is key to modern data infrastructure.

Apache Flink supports both streaming and batch processing within the same engine. This enables teams to build unified pipelines that handle real-time events and batch-style computations without switching tools or architectures. In Flink, batch is treated as a special case of streaming—where a bounded stream (or a complete window of events) can be processed once all data has arrived.
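
A small sketch of this unification with Flink’s DataStream API: the same pipeline can run in streaming mode for unbounded input or in batch mode for bounded input, switched by a single setting (the in-memory source here is a stand-in for a real bounded source):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// BATCH processes a bounded stream once all data has arrived;
// STREAMING (the default) processes events continuously as they occur
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

env.fromElements("credit", "debit", "credit")
   .map(String::toUpperCase)
   .print();

env.execute("unified-batch-and-streaming-pipeline");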

This approach simplifies operations by avoiding the need for parallel pipelines or separate orchestration layers. It aligns with the principles of the shift-left architecture, allowing earlier processing, validation, and enrichment—closer to the data source. As a result, pipelines are more maintainable, scalable, and responsive.

That said, batch processing is not going away—nor should it. For many use cases, batch remains the most practical solution. Examples include:

  • Daily financial reconciliations
  • End-of-day retail reporting
  • Weekly churn model training
  • Monthly compliance and audit jobs

In these cases, latency is not critical, and workloads often involve large volumes of historical data or complex joins across datasets.

This is where Databricks excels—especially with its Delta Lake and Medallion architecture, which structures raw, refined, and curated data layers for high-performance analytics, BI, and AI/ML training.

In summary, Flink offers the flexibility to consolidate streaming and batch pipelines, making it ideal for unified data processing. But when batch is the right choice—especially at scale or with complex transformations—Databricks remains a best-in-class platform. The two technologies are not mutually exclusive. They are complementary parts of a modern data stack.

Streaming CDC and Lakehouse Analytics

Streaming CDC is a key integration pattern. It captures changes from operational databases and pushes them into analytics platforms. But CDC isn’t limited to databases. CDC is just as important for business applications like Salesforce, where capturing customer updates in real time enables faster, more responsive analytics and downstream actions.

Confluent is well suited for this. Kafka Connect and Flink can continuously stream changes. These change events are sent to Databricks as Delta tables using Tableflow. Streaming CDC ensures:

  • Data consistency across operational and analytical workloads leveraging a single data pipeline
  • Reduced ETL / ELT lag
  • Near real-time updates to BI dashboards
  • Timely training of AI/ML models

Streaming CDC also avoids data duplication, reduces latency, and minimizes storage costs.
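
As an illustrative sketch, a stream processing step can flatten Debezium-style CDC envelopes in flight before they reach the lakehouse. The topic names are hypothetical, and jsonSerde stands for any Jackson-based Serde<JsonNode>, which Kafka does not ship out of the box:

import com.fasterxml.jackson.databind.JsonNode;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, JsonNode> changes = builder.stream("customers.cdc",
        Consumed.with(Serdes.String(), jsonSerde));

// Keep only the post-change row state ("after" in the Debezium envelope),
// dropping tombstones and metadata before Tableflow materializes the events
changes.filter((key, envelope) -> envelope != null && envelope.hasNonNull("after"))
       .mapValues(envelope -> envelope.get("after"))
       .to("customers.flattened", Produced.with(Serdes.String(), jsonSerde));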

Reverse ETL: An (Anti) Pattern to Avoid with Confluent and Databricks

Some architectures push data from data lakes or warehouses back into operational systems using reverse ETL. While this may appear to bridge the analytical and operational worlds, it often leads to increased latency, duplicate logic, and fragile point-to-point workflows. These tools typically reprocess data that was already transformed once, leading to inefficiencies, governance issues, and unclear data lineage.

Reverse ETL is an architectural anti-pattern. It violates the principles of an event-driven system. Rather than reacting to events as they happen, reverse ETL introduces delays and additional moving parts—pushing stale insights back into systems that expect real-time updates.

Data at Rest and Reverse ETL

With the upcoming bidirectional integration of Tableflow with Delta Lake, these issues can be avoided entirely. Insights generated in Databricks—from analytics, machine learning, or rule-based engines—can be pushed directly back into Kafka topics.

This approach removes the need for reverse ETL tools, reduces system complexity, and ensures that both operational and analytical layers operate on a shared, governed, and timely data foundation.

It also brings lineage, schema enforcement, and observability into both directions of data flow—streamlining feedback loops and enabling true event-driven decisioning across the enterprise.

In short: Don’t pull data back into operational systems after the fact. Push insights forward at the speed of events.

Multi-Cloud and Hybrid Integration with an Event-Driven Architecture

Confluent is designed for distributed data movement across environments in real-time for operational and analytical use cases:

  • On-prem, cloud, and edge
  • Multi-region and multi-cloud
  • Support for SaaS, BYOC, and private networking

Features like Cluster Linking and Schema Registry ensure consistent replication and governance across environments.

Databricks runs only in the cloud. It supports hybrid access and partner integrations. But the platform is not built for event-driven data distribution across hybrid environments.

In a hybrid architecture, Confluent acts as the bridge. It moves operational data securely and reliably. Then, Databricks can consume it for analytics and AI use cases. Here is an example architecture for industrial IoT use cases:

Data Streaming and Lakehouse with Confluent and Databricks for Hybrid Cloud and Industrial IoT

Uniper: Real-Time Energy Trading with Confluent and Databricks

Uniper, a leading international energy company, leverages Confluent and Databricks to modernize its energy trading operations.

Uniper - The beating of energy

I covered the value of data streaming with Apache Kafka and Flink for energy trading in a dedicated blog post already.

Confluent Cloud with Apache Kafka and Apache Flink provides a scalable real-time data streaming foundation for Uniper, enabling efficient ingestion and processing of market data, IoT sensor inputs, and operational events. This setup supports the full trading lifecycle, improving decision-making, risk management, and operational agility.

Apache Kafka and Flink integrated into the Uniper IT landscape

Within its Azure environment, Uniper uses Databricks to empower business users to rapidly build trading decision-support tools and advanced analytics applications. By combining a self-service data platform with scalable processing power, Uniper significantly reduces the lead time for developing data apps—from weeks to just minutes.

To deliver real-time insights to its teams, Uniper also leverages Plotly’s Dash Enterprise, creating interactive dashboards that consolidate live data from Databricks, Kafka, Snowflake, and various databases. This end-to-end integration enables dynamic, collaborative workflows, giving analysts and traders fast, actionable insights that drive smarter, faster trading strategies.

By combining real-time data streaming, advanced analytics, and intuitive visualization, Uniper has built a resilient, flexible data architecture that meets the demands of today’s fast-moving energy markets.

From Ingestion to Insight: Modern Data Integration and Processing for AI with Confluent and Databricks

While both platforms can handle integration and processing, their roles are different:

  • Use Confluent when you need real-time ingestion and processing of operational and analytical workloads, or data delivery across systems and clouds.
  • Use Databricks for AI workloads, analytics and data warehousing.

When used together, Confluent and Databricks form a complete data integration and processing pipeline for AI and analytics:

  1. Confluent ingests and processes operational data in real time.
  2. Tableflow pushes this data into Delta Lake in a discoverable, secure format.
  3. Databricks performs analytics and model development.
  4. Tableflow (bidirectional) pushes insights or AI models back into Kafka for use in operational systems.

This is the foundation for modern data and AI architectures—real-time pipelines feeding intelligent applications.

Stay tuned for deep dives into how these platforms are shaping the future of data-driven enterprises. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

Apache Flink: Overkill for Simple, Stateless Stream Processing and ETL?
https://www.kai-waehner.de/blog/2025/01/14/apache-flink-overkill-for-simple-stateless-stream-processing/

Discover when Apache Flink is the right tool for your stream processing needs. Explore its role in stateful and stateless processing, the advantages of serverless Flink SaaS solutions like Confluent Cloud, and how it supports advanced analytics and real-time data integration together with Apache Kafka. Dive into the trade-offs, deployment options, and strategies for leveraging Flink effectively across cloud, on-premise, and edge environments, and when to use Kafka Streams or Single Message Transforms (SMT) within Kafka Connect for ETL instead of Flink.

When discussing stream processing engines, Apache Flink often takes center stage for its advanced capabilities in stateful stream processing and real-time data analytics. However, a common question arises: is Flink too heavyweight for simple, stateless stream processing and ETL tasks? The short answer for open-source Flink is often yes. But the story evolves significantly when looking at SaaS Flink products such as Confluent Cloud’s Flink offering, with its serverless architecture, multi-tenancy, consumption-based pricing, and no-code/low-code capabilities like Flink Actions. This post explores the considerations and trade-offs to help you decide when Flink is the right tool for your data streaming needs, and when Kafka Streams or Single Message Transforms (SMT) within Kafka Connect are the better choice.

Apache Flink - Overkill for Simple Stateless Stream Processing

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch.

The Nature of Stateless Stream Processing

Stateless stream processing, as the name implies, processes each event independently, with no reliance on prior events or context. This simplicity lends itself to use cases such as filtering, transformations, and simple ETL operations. Stateless tasks are:

  • Efficient: They don’t require state management, reducing overhead.
  • Scalable: Easily parallelized since there is no dependency between events.
  • Minimalistic: Often achievable with simpler, lightweight frameworks like Kafka Streams or Kafka Connect’s Single Message Transforms (SMT).

For example, filtering transactions above a certain amount or transforming event formats for downstream systems are classic stateless tasks that demand minimal computational complexity.

In these scenarios, deploying a robust and feature-rich framework like open-source Apache Flink might seem excessive. Flink’s rich API and state management features are unnecessary for such straightforward use cases. Instead, tools with smaller footprints and simpler deployment models, such as Kafka Streams, often suffice.
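
To illustrate the footprint difference, here is a complete stateless ETL application in Kafka Streams. It is an ordinary Java program (broker address and topic names are placeholders) that can be deployed like any other JVM app, with no dedicated processing cluster:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StatelessEtlApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stateless-etl");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("sensor-raw")
               .mapValues(value -> value.trim().toLowerCase())  // simple per-event transformation
               .to("sensor-clean");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}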

Apache Flink is a powerhouse. It’s designed for advanced analytics, stateful processing, and complex event patterns. But this sophistication of the open source framework comes with complexity:

  1. Operational Overhead: Setting up and maintaining Flink in an open-source environment can require significant infrastructure and expertise.
  2. Resource Intensity: Flink’s distributed architecture and stateful processing capabilities are resource-hungry, often overkill for tasks that don’t require stateful processing.
  3. Complexity in Development: The Flink API is robust but comes with a steeper learning curve. The combination with Kafka (or another streaming engine) requires understanding two frameworks. In contrast, Kafka Streams is Kafka-native, offering a single, unified framework for stream processing, which can reduce complexity for basic tasks.

For organizations that need to perform straightforward stateless operations, investing in the full Flink stack can feel like using a sledgehammer to crack a nut. Having said this, Flink SQL simplifies development for certain personas, providing a more accessible interface beyond just Java and Python.

The conversation shifts dramatically with serverless Flink cloud offerings such as Confluent Cloud, which address many of the challenges associated with running open-source Flink. Let’s unpack how serverless Flink becomes a more attractive choice, even for simpler use cases.

1. Serverless Architecture

With a Serverless stream processing service, Flink operates on a fully serverless model, eliminating the need for heavy infrastructure management. This means:

  • No Setup Hassles: Developers focus purely on application logic, not cluster provisioning or tuning.
  • Elastic Scaling: Resources automatically scale with the workload, ensuring efficient handling of varying traffic patterns without manual intervention. One of the biggest challenges of self-managing Flink is over-provisioning resources to handle anticipated peak loads. Elastic Scaling mitigates this inefficiency.

2. Multi-Tenancy

Multi-tenant design allows multiple applications, teams or organizations to share the same infrastructure securely. This reduces operational costs and complexity compared to managing isolated clusters for each workload.

3. Consumption-Based Pricing

One of the key barriers to adopting Flink for simple tasks is cost. A truly Serverless Flink offering mitigates this with a pay-as-you-go pricing model:

  • You only pay for the resources you use, making it cost-effective for both lightweight and high-throughput workloads.
  • It aligns with the scalability of stateless stream processing, where workloads may spike temporarily and then taper off.

4. Bridging the Gap with No-Code and Low-Code Solutions

The rise of citizen integrators and the demand for low-code/no-code solutions have reshaped how organizations approach data streaming. Less-technical users, such as business analysts or operational teams, often face challenges when trying to engage with technical platforms designed for developers.

Low-code/no-code tools address this by providing intuitive interfaces that allow users to build, deploy, and monitor pipelines without deep programming knowledge. These solutions empower business users to take charge of simple workflows and integrations, significantly reducing time-to-value while minimizing the reliance on technical teams.

For example, capabilities like Flink Actions in Confluent Cloud offer a user-friendly approach to deploying stream processing pipelines without coding. By simplifying the process and making it accessible to non-technical stakeholders, these tools enhance collaboration and ensure faster outcomes without compromising performance or scalability. For instance, you can apply ETL functions such as transformation, deduplication, or field masking:

Confluent Cloud - Apache Flink Action UI for No Code Low Code Streaming ETL Integration
Source: Confluent

Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Products

When choosing between SaaS and PaaS for data streaming, it’s essential to understand the key differences.

SaaS solutions, like Confluent Cloud, offer a fully managed, serverless experience with automatic scaling, low operational overhead, and pay-as-you-go pricing.

In contrast, PaaS requires users to manage infrastructure, configure scaling policies, and handle more operational complexity.

While many products are marketed as “serverless,” not all truly abstract infrastructure or eliminate idle costs—so scrutinize claims carefully.

SaaS is ideal for teams focused on rapid deployment and simplicity, while PaaS suits those needing deep customization and control. Ultimately, SaaS ensures scalability and ease of use, making it a compelling choice for most modern streaming needs. Always dive into the technical details to ensure the platform aligns with your goals. Don’t trust the marketing slogans of the vendors!

Stateless vs. Stateful Stream Processing: Blurring the Lines

Even if your current use case is stateless, it’s worth considering the potential for future needs. Stateless pipelines often evolve into more complex systems as businesses grow, requiring features like:

  • State Management: For event correlation and pattern detection.
  • Windows and Aggregations: To derive insights from time-series data.
  • Joins: To enrich data streams with contextual information.
  • Integrating Multiple Data Sources: To seamlessly combine information from various streams for a comprehensive and cohesive analysis.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless and Stateful Stream Processing with Kafka Streams and Apache Flink
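
As a sketch of that evolution, a Kafka Streams pipeline that starts out filtering single payments can grow into a windowed aggregation without switching frameworks. The topic names, the one-hour window, and the threshold are illustrative, and serde configuration is omitted for brevity:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> payments = builder.<String, String>stream("payments");  // keyed by card number

// Stateful step: count transactions per card in tumbling one-hour windows
payments.groupByKey()
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
        .count()
        .toStream()
        .filter((windowedCard, count) -> count > 10)  // unusually many transactions
        .map((windowedCard, count) -> KeyValue.pair(windowedCard.key(), count))
        .to("suspicious-cards", Produced.with(Serdes.String(), Serdes.Long()));

The count() call is where state enters the picture: the application now maintains a local, fault-tolerant state store per window, which is exactly the capability stateless pipelines tend to grow into.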

With a SaaS Flink service such as Confluent Cloud, you can start small with stateless tasks and seamlessly scale into stateful operations as needed, leveraging Flink’s full capabilities without a complete overhaul.

While Flink may feel like overkill for simple, stateless tasks in its open-source form, its potential is unmatched in these scenarios:

  • Enterprise Workloads: Scalable, reliable, and fault-tolerant systems for mission-critical applications.
  • Data Integration and Preparation (Streaming ETL): Flink enables preprocessing, cleansing, and enriching data at the streaming layer, ensuring high-quality data reaches downstream systems like data lakes and warehouses.
  • Complex Event Processing (CEP): Detecting patterns across events in real time.
  • Advanced Analytics: Stateful stream processing for aggregations, joins, and windowed computations.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless stream processing is often achieved using lightweight tools like Kafka Streams or Single Message Transforms (SMTs) within Kafka Connect. SMTs enable inline transformations, such as normalization, enrichment, or filtering, as events pass through the integration framework. This functionality is available in Kafka Connect (provided by Confluent, IBM/Red Hat, Amazon MSK and others) and tools like Benthos for Redpanda. SMTs are particularly useful for quick adjustments and filtering data before it reaches the Kafka cluster, optimizing resource usage and data flow.

While Kafka Streams and Kafka Connect’s SMTs handle many stateless workloads effectively, Apache Flink offers significant advantages for all types of workloads—whether simple or complex, stateless or stateful.

Stream processing in Flink enables true decoupling within the enterprise architecture (as it is not bound to the Kafka cluster like Kafka Streams and Kafka Connect). The benefits are separation of concerns with a domain-driven design (DDD) and improved data governance. And Flink provides interfaces for Java, Python, and SQL. Something for (almost) everyone. This makes Flink ideal for ensuring clean, modular architectures and easier scalability.

Stream Processing and ETL with Apache Kafka Streams Connect SMT and Flink

By processing events from diverse sources and preparing them for downstream consumption, Flink supports both lightweight and comprehensive workflows while aligning with domain boundaries and governance requirements. This brings us to the shift left architecture.

The Shift Left Architecture

No matter what specific use cases you have in mind: The Shift Left Architecture brings data processing upstream with real-time stream processing, transforming raw data into high-quality data products early in the pipeline.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Apache Flink plays a key role as part of a complete data streaming platform by enabling advanced streaming ETL, data curation, and on-the-fly transformations, ensuring consistent, reliable, and ready-to-use data for both operational and analytical workloads, while reducing costs and accelerating time-to-market.

Shift Left Architecture with Apache Kafka Flink and Iceberg

The decision to use Flink boils down to your use case, expertise, and growth trajectory:

  • For basic stateless tasks, consider lightweight options like Kafka Streams or SMTs within Kafka Connect unless you’re already invested in a SaaS such as Confluent Cloud where Flink is also the appropriate choice for simple ETL processes.
  • For evolving workloads or scenarios requiring scalability and advanced analytics, a Flink SaaS offers unparalleled flexibility and ease of use.
  • For on-premise or edge deployments, Flink’s flexibility makes it an excellent choice for environments where data processing must occur locally due to latency, security, or compliance requirements.

Understanding the deployment environment—cloud, on-premise, or edge—and the capabilities of the Flink product is crucial to choosing the right streaming technology. Flink’s adaptability ensures it can serve diverse needs across these contexts.

Kafka Streams is another excellent, Kafka-native stream processing alternative. Most importantly for this discussion, Kafka Streams is “just” a lightweight Java library, not a server infrastructure like Flink. Hence, it brings different trade-offs with it. I wrote a dedicated article about the trade-offs between Apache Flink and Kafka Streams for stream processing.

In its open-source form, Flink can seem excessive for simple, stateless tasks. However, a serverless Flink SaaS like Confluent Cloud changes the equation. Multi-tenancy and pay-as-you-go pricing make it suitable for a wider range of use cases, from basic ETL to advanced analytics. Serverless features like Confluent’s Flink Actions further reduce complexity, allowing non-technical users to harness the power of stream processing without coding.

Whether you’re just beginning your journey into stream processing or scaling up for enterprise-grade applications, Flink—as part of a complete data streaming platform such as Confluent Cloud—is a future-proof investment that adapts to your needs.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation.

Stateless vs. Stateful Stream Processing with Kafka Streams and Apache Flink
https://www.kai-waehner.de/blog/2024/12/27/stateless-vs-stateful-stream-processing-with-kafka-streams-and-apache-flink/

The rise of stream processing has changed how we handle and act on data. While traditional databases, data lakes, and warehouses are effective for many batch-based use cases, they fall short in scenarios demanding low latency, scalability, and real-time decision-making. This post explores the key concepts of stateless and stateful stream processing, using Kafka Streams and Apache Flink as examples.

In the world of data-driven applications, the rise of stream processing has changed how we handle and act on data. While traditional databases, data lakes, and warehouses are effective for many batch-based use cases, they fall short in scenarios demanding low latency, scalability, and real-time decision-making. This post explores the key concepts of stateless and stateful stream processing, using Kafka Streams and Apache Flink as examples. These principles apply to any stream processing engine, whether open-source or a cloud service. Let’s break down the differences, practical use cases, the relation to AI/ML, and the immense value stream processing offers compared to traditional data-at-rest methods.

Stateless and Stateful Stream Processing with Kafka Streams and Apache Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch.

Rethinking Data Processing: From Static to Dynamic

In traditional systems, data is typically stored first in a database or data lake and queried later for computation. This works well for batch processing tasks, like generating reports or dashboards. The process usually looks something like this:

  1. Store Data: Data arrives and is stored in a database or data lake.
  2. Query & Compute: Applications request data for analysis or processing at a later time with a web service, request-response API or SQL script.

However, this approach fails when you need:

  • Immediate Action: Real-time responses to events, such as fraud detection.
  • Scalability: Handling thousands or millions of events per second.
  • Continuous Insights: Ongoing analysis of data in motion.

Enter stream processing—a paradigm where data is continuously processed as it flows through the system. Instead of waiting to store data first, stream processing engines like Kafka Streams and Apache Flink enable you to act on data instantly as it arrives.

Use Case: Fraud Prevention in Real-Time

This blog post uses a fraud prevention scenario to illustrate the power of stream processing. In this example, transactions from various sources (e.g., credit card payments, mobile app purchases) are monitored in real time.

Fraud Detection and Prevention with Stream Processing in Real-Time

The system flags suspicious activities using three methods:

  1. Stateless Processing: Each transaction is evaluated independently, and high-value payments are flagged immediately.
  2. Stateful Processing: Transactions are analyzed over a time window (e.g., 1 hour) to detect patterns, such as an unusually high number of transactions.
  3. AI Integration: A pre-trained machine learning model is used for real-time fraud detection by predicting the likelihood of fraudulent activity.

This example highlights how stream processing enables instant, scalable, and intelligent fraud detection, something not achievable with traditional batch processing.

To avoid confusion: while I use Kafka Streams for stateless and Apache Flink for stateful in the example, both frameworks are capable of handling both types of processing.

Other Industry Examples of Stream Processing

  • Predictive Maintenance (Industrial IoT): Continuously monitor sensor data to predict equipment failures and schedule proactive maintenance.
  • Real-Time Advertisement (Retail): Deliver personalized ads based on real-time user interactions and behavior patterns.
  • Real-Time Portfolio Monitoring (Finance): Continuously analyze market data and portfolio performance to trigger instant alerts or automated trades during market fluctuations.
  • Supply Chain Optimization (Logistics): Track shipments in real time to optimize routing, reduce delays, and improve efficiency.
  • Condition Monitoring (Healthcare): Analyze patient vitals continuously to detect anomalies and trigger immediate alerts.
  • Network Monitoring (Telecom): Detect outages or performance issues in real time to improve service reliability.

These examples highlight how stream processing drives real-time insights and actions across diverse industries.

What is Stateless Stream Processing?

Stateless stream processing focuses on processing each event independently. In this approach, the system does not need to maintain any context or memory of previous events. Each incoming event is handled in isolation, meaning the logic applied depends solely on the data within that specific event.

This makes stateless processing highly efficient and easy to scale, as it doesn’t require state management or coordination between events. It is ideal for use cases such as filtering, transformations, and simple ETL operations where individual events can be processed with no need for historical data or context.

1. Example: Real-Time Payment Monitoring

Imagine a fraud prevention system that monitors transactions in real time to detect and prevent suspicious activities. Each transaction, whether from a credit card, mobile app, or payment gateway, is evaluated as it occurs. The system checks for anomalies such as unusually high amounts, transactions from unfamiliar locations, or rapid sequences of purchases.

Fraud Detection - Stateless Transaction Monitoring with Kafka Streams

By analyzing these attributes instantly, the system can flag high-risk transactions for further inspection or automatically block them. This real-time evaluation ensures potential fraud is caught immediately, reducing the likelihood of financial loss and enhancing overall security.

You want to flag high-value payments for further inspection. In the following Kafka Streams example:

  • Each transaction is evaluated as it arrives.
  • If the transaction amount exceeds 100 (in your chosen currency), it’s sent to a separate topic for further review.

Java Example (Kafka Streams):

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Payment> payments = builder.stream("payments");

// Route payments above the threshold to a separate topic for review
payments.filter((key, payment) -> payment.getAmount() > 100)
        .to("high-risk-payments");

Benefits of Stateless Processing

  • Low Latency: Immediate processing of individual events.
  • Simplicity: No need to track or manage past events.
  • Scalability: Handles large volumes of data efficiently.

This approach is ideal for use cases like filtering, data enrichment, and simple ETL tasks.

What is Stateful Stream Processing?

Stateful stream processing takes it a step further by considering multiple events together. The system maintains state across events, allowing for complex operations like aggregations, joins, and windowed analyses. This means the system can correlate data over a defined period, track patterns, and detect anomalies that emerge across multiple transactions or data points.

2. Example: Fraud Prevention through Continuous Pattern Detection

In fraud prevention, individual transactions may appear normal, but patterns over time can reveal suspicious behavior.

For example, a fraud prevention system might identify suspicious behavior by analyzing all transactions from a specific credit card within a one-hour window, rather than evaluating each transaction in isolation.

Fraud Detection - Stateful Anomaly Detection with Apache Flink SQL

Let’s detect anomalies by analyzing transactions with Apache Flink using Flink SQL. In this example:

  • The system monitors transactions for each credit card within a 1-hour window.
  • If a card is used over 10 times in an hour, it flags potential fraud.

SQL Example (Apache Flink):

SELECT card_number, COUNT(*) AS transaction_count
FROM payments
GROUP BY TUMBLE(transaction_time, INTERVAL '1' HOUR), card_number
HAVING COUNT(*) > 10;

Key Concepts in Stateful Processing

Stateful processing relies on maintaining context across multiple events, enabling the system to perform more sophisticated analyses. Here are the key concepts that make stateful stream processing possible:

  1. Windows: Define a time range to group events (e.g., sliding windows, tumbling windows).
  2. State Management: The system remembers past events within the defined window.
  3. Joins: Combine data from multiple sources for enriched analysis.

Benefits of Stateful Processing

Stateful processing is essential for advanced use cases like anomaly detection, real-time monitoring, and predictive analytics:

  • Complex Analysis: Detect patterns over time.
  • Event Correlation: Combine events from different sources.
  • Real-Time Decision-Making: Continuous monitoring without reprocessing data.

Bringing AI and Machine Learning into Stream Processing

Stream processing engines like Kafka Streams and Apache Flink also enable real-time AI and machine learning model inference. This allows you to integrate pre-trained models directly into your data processing pipelines.

3. Example: Real-Time Fraud Detection with AI/ML Models

Consider a payment fraud detection system that uses a TensorFlow model for real-time inference. In this system, transactions from various sources — such as credit cards, mobile apps, and payment gateways — are streamed continuously. Each incoming transaction is preprocessed and sent to the TensorFlow model, which evaluates it based on patterns learned during training.

Fraud Detection - Anomaly Detection with Predictive Al ML using Apache Flink Python API

The model analyzes features like transaction amount, location, device ID, and frequency to predict the likelihood of fraud. If the model identifies a high probability of fraud, the system can trigger immediate actions, such as flagging the transaction, blocking it, or alerting security teams. This real-time inference ensures that potential fraud is detected and addressed instantly, reducing risk and enhancing security.

Here is a code example using Apache Flink’s Python API for predictive AI:

Python Example (Apache Flink):

# "model" is a pre-trained classifier (e.g., TensorFlow) loaded when the job starts
def predict_fraud(payment):
    prediction = model.predict(payment.features)
    return prediction > 0.5  # flag payments with a fraud probability above 50%

stream = payments.map(predict_fraud)

Why Combine AI with Stream Processing?

Integrating AI with stream processing unlocks powerful capabilities for real-time decision-making, enabling businesses to respond instantly to data as it flows through their systems. Here are some key benefits of combining AI with stream processing:

  • Real-Time Predictions: Immediate fraud detection and prevention.
  • Automated Decisions: Integrate AI into critical business processes.
  • Scalability: Handle millions of predictions per second.

Apache Kafka and Flink deliver low-latency, scalable, and robust predictions. My article “Real-Time Model Inference with Apache Kafka and Flink for Predictive AI and GenAI” compares remote inference (via APIs) and embedded inference (within the stream processing application).

For large AI models (e.g., generative AI or large language models), inference is often done via remote calls to avoid embedding large models within the stream processor.
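
As a sketch of the remote-call pattern (the endpoint URL, payload encoding, and threshold are placeholders, not a specific product API), Flink’s async I/O operator keeps the pipeline non-blocking while an external model server scores each event:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class RemoteFraudScorer extends RichAsyncFunction<String, Boolean> {
    private transient HttpClient client;

    @Override
    public void open(Configuration parameters) {
        client = HttpClient.newHttpClient();
    }

    @Override
    public void asyncInvoke(String paymentJson, ResultFuture<Boolean> resultFuture) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://model-serving.example.com/score"))  // placeholder endpoint
                .POST(HttpRequest.BodyPublishers.ofString(paymentJson))
                .build();

        // Non-blocking call; the model server returns a fraud probability as text
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
              .thenAccept(response -> resultFuture.complete(
                      List.of(Double.parseDouble(response.body()) > 0.5)));
    }
}

Wired in with AsyncDataStream.unorderedWait(payments, new RemoteFraudScorer(), 1000, TimeUnit.MILLISECONDS, 100), the stream keeps flowing even though each prediction crosses the network.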

Stateless vs. Stateful Stream Processing: When to Use Each

Choosing between stateless and stateful stream processing depends on the complexity of your use case and whether you need to maintain context across multiple events. The following table outlines the key differences to help you determine the best approach for your specific needs.

Feature          | Stateless             | Stateful
Use Case         | Simple Filtering, ETL | Aggregations, Joins
Latency          | Very Low Latency      | Slightly Higher Latency due to State Management
Complexity       | Simple Logic          | Complex Logic Involving Multiple Events
State Management | Not Required          | Required for Context-aware Processing
Scalability      | High                  | Depends on the Framework

Read my article “Apache Kafka (including Kafka Streams) + Apache Flink = Match Made in Heaven” to learn more about choosing the right stream processing engine for your use case.

And to clarify again: while this article uses Kafka Streams for stateless and Flink for stateful stream processing, both frameworks are capable of handling both types.

Video Recording

Below, I summarize this content as a ten-minute video on my YouTube channel:

Why Stream Processing is a Fundamental Change

Whether stateless or stateful, stream processing with Kafka Streams, Apache Flink, and similar technologies unlocks real-time capabilities that traditional databases simply cannot offer. From simple ETL tasks to complex fraud detection and AI integration, stream processing empowers organizations to build scalable, low-latency applications.

Stream Processing with Apache Kafka Flink SQL Java Python and AI ML

Investing in stream processing means:

  • Faster Innovation: Real-time insights drive competitive advantage.
  • Operational Efficiency: Automate decisions and reduce latency.
  • Scalability: Handle millions of events seamlessly.

Stream processing isn’t just an evolution of data handling—it’s a revolution. If you’re not leveraging it yet, now is the time to explore this powerful paradigm. If you want to learn more, check out my light board video exploring the core value of Apache Flink:

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation.

Top Trends for Data Streaming with Apache Kafka and Flink in 2025
https://www.kai-waehner.de/blog/2024/12/02/top-trends-for-data-streaming-with-apache-kafka-and-flink-in-2025/

Apache Kafka and Apache Flink are leading open-source frameworks for data streaming that serve as the foundation for cloud services, enabling organizations to unlock the potential of real-time data. Over recent years, trends have shifted from batch-based data processing to real-time analytics, scalable cloud-native architectures, and improved data governance powered by these technologies. Looking ahead to 2025, the data streaming ecosystem is set to undergo even greater changes. Here are the top trends shaping the future of data streaming for businesses.

The evolution of data streaming has transformed modern business infrastructure, establishing real-time data processing as a critical asset across industries. At the forefront of this transformation, Apache Kafka and Apache Flink stand out as leading open-source frameworks that serve as the foundation for cloud services, enabling organizations to unlock the potential of real-time data. Over recent years, trends have shifted from batch-based data processing to real-time analytics, scalable cloud-native architectures, and improved data governance powered by these technologies. Looking ahead to 2025, the data streaming ecosystem is set to undergo even greater changes. Here are the top trends shaping the future of data streaming for businesses.

Data Streaming Trends for 2025 - Leading with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch.

The Top Data Streaming Trends

Some followers might notice that this became a series with articles about the top 5 data streaming trends for 2021, the top 5 for 2022, the top 5 for 2023, and the top 5 for 2024. Trends change over time, but the huge value of having a scalable real-time infrastructure as the central data hub stays. Data streaming with Apache Kafka is a journey and evolution to set data in motion.

I recently explored the past, present, and future of data streaming tools and strategies from the past decades. Data streaming is becoming more and more mature and standardized, but also innovative.

Let’s now look at the top trends coming up more regularly in conversations with customers, prospects, and the broader data streaming community across the globe:

  1. The Democratization of Kafka: Apache Kafka has transitioned from a specialized tool to a key pillar in modern data infrastructure.
  2. Kafka Protocol as the Standard: Vendors adopt the Kafka wire protocol, enabling flexibility with compatibility and performance trade-offs.
  3. BYOC Deployment Model: Bring Your Own Cloud gains traction for balancing security, compliance, and managed services.
  4. Flink Becomes the Standard for Stream Processing: Apache Flink rises as the premier framework for stream processing, building integration pipelines and business applications.
  5. Data Streaming for Real-Time Predictive AI and GenAI: Real-time model inference drives predictive and generative AI applications.
  6. Data Streaming Organizations: Companies unify real-time data strategies to standardize processes, tools, governance, and collaboration.

The following sections describe each trend in more detail. The trends are relevant for many scenarios; no matter if you use open-source frameworks like Apache Kafka and Flink, a commercial platform, or a fully managed cloud service like Confluent Cloud.

Trend 1: The Democratization of Kafka

In the last decade, Apache Kafka has become the standard for data streaming, evolving from a specialized tool to an essential utility in the modern tech stack. With over 150,000 organizations using Kafka today, it has become the de facto choice for stream processing. Yet, with a market crowded by offerings from AWS, Microsoft Azure, Google GCP, IBM, Oracle, Confluent, and various startups, companies can no longer rely solely on Kafka for differentiation. The vast array of Kafka-compatible solutions means that businesses face more choices than ever, but also new challenges in selecting the solution that balances cost, performance, and features.

The Challenge: Finding the Right Fit in a Crowded Kafka Market

For end users, choosing the right Kafka solution is becoming increasingly complex. Basic Kafka offerings cover standard streaming needs but may lack advanced features, such as enhanced security, data governance, or integration and processing capabilities, that are essential for specific industries. In such a diverse market, businesses must navigate trade-offs, considering whether a low-cost option meets their needs or whether investing in a premium solution with added capabilities provides better long-term value.

The Solution: Prioritizing Features for Your Strategic Needs

As Kafka solutions evolve, users must look beyond price and consider features that offer real strategic value. For example, companies handling sensitive customer data might benefit from Kafka products with top-tier security features. Those focused on analytics may look for solutions with strong integrations into data platforms and low cost for high throughput. By carefully selecting a Kafka product that aligns with industry-specific requirements, businesses can leverage the full potential of Kafka while optimizing for cost and capabilities.

For instance, look at Confluent’s various cluster types for different requirements and use cases in the cloud:

Confluent Cloud Cluster Types for Different Requirements and Use Cases
Source: Confluent

As an example, Freight Clusters were introduced to provide an offering with up to 90 percent lower cost. The major trade-off is higher latency, but this is perfect for high-volume log analytics at GB/sec scale.

The Business Value: Affordable and Customized Data Streaming

Kafka’s commoditization means more affordable, customizable options for businesses of all sizes. This competition reduces costs, making high-performance data streaming more accessible, even to smaller organizations. By choosing a tailored solution, businesses can enhance customer satisfaction, speed up decision-making, and innovate faster in a competitive landscape.

Trend 2: The Kafka Protocol, not Apache Kafka, is the New Standard for Data Streaming

With the rise of cloud-native architectures, many vendors have shifted to supporting the Kafka protocol rather than the open-source Kafka framework itself, allowing for greater flexibility and cloud optimization. This change enables businesses to choose Kafka-compatible tools that better align with specific needs, moving away from a one-size-fits-all approach.

Confluent introduced its KORA engine, i.e., Kafka re-architected to be cloud-native. A deep technical whitepaper goes into the details (this is not a marketing document but really for software engineers).

Confluent KORA - Apache Kafka Re-Architected to be Cloud Native
Source: Confluent

Other players followed Confluent and introduced their own cloud-native “data streaming engines”. For instance, StreamNative has URSA powered by Apache Pulsar, Redpanda talks about its R1 Engine implementing the Kafka protocol, and Ververica recently announced VERA for its Flink-based platform.

Some vendors rely only on the Kafka protocol with a proprietary engine from the beginning. For instance, Azure Event Hubs or WarpStream. Amazon MSK also goes in this direction by adding proprietary features like Tiered Storage or even introducing completely new product options such as Amazon MSK Express brokers.

The Challenge: Limited Compatibility Across Kafka Solutions

When vendors implement the Kafka protocol instead of the entire Kafka framework, it can lead to compatibility issues, especially if the solution doesn’t fully support Kafka APIs. For end users, this can complicate integration, particularly for advanced features like Exactly-Once Semantics, the Transaction API, Compacted Topics, Kafka Connect, or Kafka Streams, which may not be supported or may not work as expected.

The Solution: Evaluating Kafka Protocol Solutions Critically

To fully leverage the flexibility of Kafka protocol-based solutions, a thorough evaluation is essential. Businesses should carefully assess the capabilities and compatibility of each option, ensuring it meets their specific needs. Key considerations include verifying the support of required features and APIs (such as the Transaction API, Kafka Streams, or Connect).

It is also crucial to evaluate the level of product support provided, including 24/7 availability, uptime SLAs, and compatibility with the latest versions of open-source Apache Kafka. This detailed evaluation ensures that the chosen solution integrates seamlessly into existing architectures and delivers the reliability and performance required for modern data streaming applications.
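
A pragmatic starting point for such an evaluation is a small smoke test against the Kafka-compatible endpoint. The sketch below, with a placeholder bootstrap address and topic, only checks whether the Transaction API (required for Exactly-Once Semantics) is implemented; it is not an exhaustive compatibility suite:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionSmokeTest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-compatible-service:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // These two settings are only honored if the service implements the Transaction API.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "compat-check-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions(); // throws if transactions are unsupported
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("compat-test", "key", "value"));
            producer.commitTransaction();
            System.out.println("Exactly-once semantics appear to be supported.");
        } catch (Exception e) {
            System.out.println("Transaction API not fully supported: " + e.getMessage());
        }
    }
}
```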

The Business Value: Expanded Options and Cost-Efficiency

Kafka protocol-based solutions offer greater flexibility, allowing businesses to select Kafka-compatible services optimized for their specific environments. This flexibility opens doors for innovation, enabling companies to experiment with new tools without vendor lock-in.

For instance, consider innovations such as a “direct write to S3 object store” architecture, as seen in WarpStream, Confluent Freight Clusters, and other data streaming startups that build proprietary engines around the Kafka protocol. The result is a more cost-effective approach to data streaming, though it may come with trade-offs, such as increased latency. Check out this video about the evolution of Kafka storage to learn more.

Trend 3: BYOC (Bring Your Own Cloud) as a New Deployment Model for Security and Compliance

As data security and compliance concerns grow, the Bring Your Own Cloud (BYOC) model is gaining traction as a new way to deploy Apache Kafka. BYOC allows businesses to host Kafka in their own Virtual Private Cloud (VPC) while the vendor manages the control plane to handle complex orchestration tasks like partitioning, replication, and failover.

This BYOC approach offers organizations enhanced control over their data while retaining the operational benefits of a managed service. BYOC provides a middle ground between self-managed and fully managed solutions, addressing specific regulatory and security needs without sacrificing scalability or flexibility.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

The Challenge: Balancing Security and Ease of Use

Ensuring data sovereignty and compliance is non-negotiable for organizations in highly regulated industries. However, traditional fully managed cloud solutions can pose risks due to vendor access to sensitive data and infrastructure. Many BYOC solutions claim to address these issues but fall short when it comes to minimizing external access to customer environments. Common challenges include:

  • Vendor Access to VPCs: Many BYOC offerings require vendors to have access to customer VPCs for deployment, cluster management, and troubleshooting. This introduces potential security vulnerabilities.
  • IAM Roles and Elevated Privileges: Cross-account Identity and Access Management (IAM) roles are often necessary for managing BYOC clusters, which can expose sensitive systems to unnecessary risks.
  • VPC Peering Complexity: Traditional BYOC solutions often rely on VPC peering, a complex and expensive setup that increases operational overhead and opens additional points of failure.

These limitations create significant challenges for security-conscious organizations, as they undermine the core promise of BYOC: control over the data environment.

The Solution: Gaining Control with a “Zero Access” BYOC Model

WarpStream redefines the BYOC model with a “zero access” architecture, addressing the challenges of traditional BYOC solutions. Unlike other BYOC offerings using the Kafka protocol, WarpStream ensures that no data leaves the customer’s environment, delivering a truly secure-by-default platform. Hence, this section discusses WarpStream specifically, not BYOC Kafka offerings in general.

WarpStream BYOC Zero Access Kafka Architecture with Control and Data Plane
Source: WarpStream

Key features of WarpStream include:

  • Zero Access to Customer VPCs: WarpStream eliminates vendor access by deploying stateless agents within the customer’s environment, handling compute operations locally without requiring cross-account IAM roles or elevated privileges to reduce security risks.
  • Data/Metadata Separation: Raw data remains entirely within the customer’s network for full sovereignty, while only metadata is sent to WarpStream’s control plane for centralized management, ensuring data security and compliance.
  • Simplified Infrastructure: WarpStream avoids complex setups like VPC peering and cross-IAM roles, minimizing operational overhead while maintaining high performance.

Comparison with Other BYOC Solutions using the Kafka protocol:

Unlike most other BYOC offerings (e.g., Redpanda), WarpStream doesn’t require direct VPC access or elevated permissions, avoiding risks like data exposure or remote troubleshooting vulnerabilities. Its “zero access” architecture ensures unparalleled security and compliance.

The Business Value: Secure, Compliant, and Scalable Data Streaming

WarpStream’s innovative approach to BYOC delivers exceptional business value by addressing security and compliance concerns while maintaining operational simplicity and scalability:

  • Uncompromised Security: The zero-access architecture ensures that raw data remains entirely within the customer’s environment, meeting the strictest security and compliance requirements for regulated industries like finance, healthcare, and government.
  • Operational Efficiency: By eliminating the need for VPC peering, cross-IAM roles, and remote vendor access, WarpStream simplifies BYOC deployments and reduces operational complexity.
  • Cost Optimization: WarpStream’s reliance on cloud-native technologies like object storage reduces infrastructure costs compared to traditional disk-based approaches. Stateless agents also enable efficient scaling without unnecessary overhead.
  • Data Sovereignty: The data/metadata split guarantees that data never leaves the customer’s environment, ensuring compliance with regulations such as GDPR and HIPAA.
  • Peace of Mind for Security Teams: With no vendor access to the VPC or object storage, WarpStream’s zero-access model eliminates concerns about external breaches or elevated privileges, making it easier to gain buy-in from security and infrastructure teams.

BYOC Strikes the Balance Between Control and Managed Services

BYOC offers businesses the ability to strike a balance between control and managed services, but not all BYOC solutions are created equal. WarpStream’s “zero access” architecture sets a new standard, addressing the critical challenges of security, compliance, and operational simplicity. By ensuring that raw data never leaves the customer’s environment and eliminating the need for vendor access to VPCs, WarpStream delivers a BYOC model that meets the highest standards of security and performance. For organizations seeking a secure, scalable, and compliant approach to data streaming, WarpStream represents the future of BYOC data streaming.

But just to be clear: If a data streaming project goes to the cloud, fully managed Kafka (and Flink) should always be the first option as it is much easier to manage and operate to focus on fast time-to-market and business innovation. Choose BYOC only if fully managed does not work for you because of security requirements.

Trend 4: Apache Flink Becomes the De Facto Standard for Stream Processing

Apache Flink has emerged as the premier choice for organizations seeking a robust and versatile framework for continuous stream processing. Its ability to handle complex data pipelines with high throughput, low latency, and advanced stateful operations has solidified its position as the de facto standard for stream processing. Flink’s support for Java, Python, and SQL further enhances its appeal, enabling developers to build powerful data-driven applications using familiar tools.

Apache Flink Adoption Curve Compared to Kafka

As Flink adoption grows, it increasingly complements Apache Kafka as part of the modern data streaming ecosystem, while the Kafka Streams (Java-only) library remains relevant for lightweight, application-embedded use cases.

The Challenge: Handling Complex, High-Throughput Data Streams

Modern businesses increasingly rely on real-time data for both operational and analytical needs, spanning mission-critical applications like fraud detection, predictive maintenance, and personalized customer experiences, as well as Streaming ETL for integrating and transforming data. These diverse use cases demand robust stream processing capabilities that can address challenges around latency, throughput, and state management.

Apache Flink’s versatility makes it uniquely positioned to meet the demands of both streaming ETL for data integration and building real-time business applications. Flink provides the following capabilities, illustrated by the minimal job sketch after this list:

  • Low Latency: Near-instantaneous processing is crucial for enabling real-time decision-making in business applications, timely updates in analytical systems, and supporting transactional workloads where rapid processing and immediate consistency are essential for ensuring smooth operations and seamless user experiences.
  • High Throughput and Scalability: The ability to process millions of events per second, whether for aggregating operational metrics or moving massive volumes of data into data lakes or warehouses, without bottlenecks.
  • Stateful Processing: Support for maintaining and querying the state of data streams, essential for performing complex operations like aggregations, joins, and pattern detection in business applications, as well as data transformations and enrichment in ETL pipelines.
  • Multiple Programming Languages: Support for Java, Python, and SQL ensures accessibility for a wide range of developers, enabling efficient implementation across various use cases.
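
Here is a minimal Flink DataStream job in Java that sketches a low-latency streaming ETL pipeline using these capabilities. The broker address, topic name, and transformation logic are placeholders:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingEtlJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")           // placeholder endpoint
                .setTopics("payments")                        // placeholder topic
                .setGroupId("streaming-etl")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "payments-source")
           .filter(event -> !event.isEmpty())      // drop empty records
           .map(String::toUpperCase)               // stand-in for real enrichment logic
           .print();                               // replace with a Kafka or lakehouse sink

        env.execute("Low-Latency Streaming ETL");
    }
}
```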

The rise of cloud services has further propelled Flink’s adoption, with offerings from major providers like Confluent, Amazon, IBM, and emerging startups. These cloud-native solutions simplify Flink deployments, making it easier for organizations to operationalize real-time analytics.

While Apache Flink has emerged as the de facto standard for stream processing, other frameworks like Apache Spark and its streaming module, Structured Streaming, continue to compete in this space. However, Spark Streaming has notable limitations that make it less suitable for many of the complex, high-throughput workloads modern enterprises demand.

The Challenges with Spark Streaming

Apache Spark, originally designed as a batch processing framework, introduced Spark Streaming and later Structured Streaming to address real-time processing needs. However, its batch-oriented roots present inherent challenges:

  • Micro-Batch Architecture: Spark Structured Streaming relies on micro-batches, where data is divided into small time intervals for processing. This approach, while effective for certain workloads, introduces higher latency compared to Flink’s true streaming architecture. Applications requiring millisecond-level processing or transactional workloads may find Spark unsuitable.
  • Limited Stateful Processing: While Spark supports stateful operations, its reliance on micro-batches adds complexity and latency. This makes Spark Streaming less efficient for use cases that demand continuous state updates, such as fraud detection or complex event processing (CEP).
  • Fault Tolerance Complexity: Spark’s recovery model is rooted in its lineage-based approach to fault tolerance, which can be less efficient for long-running streaming applications. Flink, by contrast, uses checkpointing and savepoints to handle failures more gracefully to ensure state consistency with minimal overhead.
  • Performance Overhead: Spark’s general-purpose design often results in higher resource consumption compared to Flink, which is purpose-built for stream processing. This can lead to increased infrastructure costs for high-throughput workloads.
  • Scalability Challenges for Stateful Workloads: While Spark scales effectively for batch jobs, its scalability for complex stateful stream processing is more limited, as distributed state management in micro-batches can become a bottleneck under heavy load.

By addressing these limitations, Apache Flink provides a more versatile and efficient solution than Apache Spark for organizations looking to handle complex, real-time data processing at scale.

Flink’s architecture is purpose-built for streaming, offering native support for stateful processing, low-latency event handling, and fault-tolerant operation, making it the preferred choice for modern real-time applications. But to be clear: Apache Spark, including Spark Streaming, has its place in data lakes and lakehouses to process analytical workloads.

Flink’s technical capabilities bring tangible business benefits, making it an essential tool for modern enterprises. By providing real-time insights, Flink enables businesses to respond to events as they occur, such as detecting and mitigating fraudulent transactions instantly, reducing losses, and enhancing customer trust.

The support of Flink for both transactional workloads (e.g., fraud detection or payment processing) and analytical workloads (e.g., real-time reporting or trend analysis) ensures versatility across a range of critical business functions. Scalability and resource optimization keep infrastructure costs manageable, even for demanding, high-throughput workloads, while features like checkpointing streamline failure recovery and upgrades, minimizing operational overhead.

Flink stands out with its dual focus on streaming ETL for data integration and building business applications powered by real-time analytics. Its rich APIs for Java, Python, and SQL make it easy for developers to implement complex workflows, accelerating time-to-market for new applications.

Trend 5: Real-Time Model Inference for Predictive and Generative AI

Data streaming has powered AI/ML infrastructure for many years because of its capabilities to scale to high volumes, process data in real-time, and integrate with transactional (payments, orders, ERP, etc.) and analytical (data warehouse, data lake, lakehouse) systems. My first article about Apache Kafka and Machine Learning was published in 2017: “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka“.

As AI continues to evolve, real-time model inference powered by data streaming is opening up new possibilities for predictive and generative AI applications. By integrating model inference with stream processors such as Apache Flink, businesses can perform on-demand predictions for fraud detection, customer personalization, and more.

The Challenge: Provide Context for AI Applications In Real-Time

Traditional batch-based AI inference is too slow for many applications, delaying responses and leading to missed opportunities or wrong business decisions. To fully harness AI in real-time, businesses need to embed model inference directly within streaming pipelines.

Generative AI (GenAI) demands new design patterns like Retrieval Augmented Generation (RAG) to ensure accuracy, relevance, and reliability in its outputs. Without data streaming, RAG struggles to provide large language models (LLMs) with the real-time, domain-specific context they need, leading to outdated or hallucinated responses. Context is essential to ensure that LLMs deliver accurate and trustworthy outputs by grounding them in up-to-date and precise information.

The Solution: Real-Time Model Inference with Apache Flink

Apache Flink enables real-time model inference by connecting data streams to external AI models through APIs. This setup allows companies to use centralized model servers for inference, providing flexibility and scalability while keeping data streams fast and responsive.

GenAI Remote Model Inference with Stream Processing using Apache Kafka and Flink

By processing data in real-time, Flink ensures that generative AI models operate with the most current and relevant context, reducing errors and hallucinations.

Flink’s real-time processing capabilities also support advanced machine learning workflows. This enables use cases like predictive analytics, anomaly detection, and generative AI applications that require instantaneous decision-making. The ability to join live data streams with historical or external datasets enriches the context for model inference, enhancing both accuracy and relevance.

Additionally, Flink facilitates feature extraction and data preprocessing directly within the stream to ensure that the inputs to AI models are optimized for performance. This seamless integration with model servers and vector databases allows organizations to scale their AI systems effectively, leveraging real-time insights to drive innovation and deliver immediate business value.
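
To make this concrete, here is a minimal sketch of remote model inference with Flink’s Async I/O API. The model server URL and request payload are hypothetical placeholders; any REST-based inference endpoint could be plugged in the same way:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Calls a remote model server for each event without blocking the stream.
public class ModelInference extends RichAsyncFunction<String, String> {
    private transient HttpClient client;

    @Override
    public void open(Configuration parameters) {
        client = HttpClient.newHttpClient();
    }

    @Override
    public void asyncInvoke(String event, ResultFuture<String> resultFuture) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://model-server:8080/predict")) // hypothetical endpoint
                .POST(HttpRequest.BodyPublishers.ofString(event))
                .build();
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
              .thenAccept(response -> resultFuture.complete(
                      Collections.singleton(response.body())));
    }
}

// Wiring, e.g.: AsyncDataStream.unorderedWait(events, new ModelInference(),
//                                             5000, TimeUnit.MILLISECONDS, 100);
```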

The Business Value: Immediate, Actionable AI Insights

Real-time AI model inference with Flink enables businesses to provide personalized customer experiences, detect fraud as it happens, and perform predictive maintenance with minimal latency. This real-time responsiveness empowers companies to make AI-driven decisions in milliseconds, improving customer satisfaction and operational efficiency.

By integrating Flink with event-driven architectures like Apache Kafka, businesses can ensure that AI systems are always fed with up-to-date and trustworthy data, further enhancing the reliability of predictions.

The integration of Flink and data streaming offers a clear path to measurable business impact. By aligning real-time AI capabilities with organizational goals, they can drive innovation while reducing operational costs, such as automating customer support to lower reliance on service agents.

Furthermore, Flink’s ability to process and enrich data streams at scale supports strategic initiatives like hyper-personalized marketing or optimizing supply chains in real-time. These benefits directly translate into enhanced competitive positioning, faster time-to-market for AI-driven solutions, and the ability to make more confident, data-driven decisions at the speed of business.

Trend 6: Becoming a Data Streaming Organization

To harness the full potential of data streaming, companies are shifting toward structured, enterprise-wide data streaming strategies. Moving from a tactical, ad-hoc approach to a cohesive top-down strategy enables businesses to align data streaming with organizational goals, driving both efficiency and innovation.

The Challenge: Fragmented Data Streaming Efforts

Many companies face challenges due to disjointed streaming efforts, leading to data silos and inconsistencies that prevent them from reaping the full benefits of real-time data processing. At Confluent, we call this the enterprise adoption barrier:

Data Streaming Maturity Model - The Enterprise Adoption Barrier
Source: Confluent

This fragmentation results in inefficiencies, duplication of efforts, and a lack of standardized processes. Without a unified approach, organizations struggle with:

  • Data Silos: Limited data sharing across teams creates bottlenecks for broader use cases.
  • Inconsistent Standards: Different teams often use varying schemas, patterns, and practices, leading to integration challenges and data quality issues.
  • Governance Gaps: A lack of defined roles, responsibilities, and policies results in limited oversight, increasing the risk of data misuse and compliance violations.

These challenges prevent organizations from scaling their data streaming capabilities and realizing the full value of their real-time data investments.

The Solution: Building an Integrated Data Streaming Organization

By adopting a comprehensive data streaming strategy, businesses can create a unified data platform with standardized tools and practices. A dedicated streaming platform team, often called the Center of Excellence (CoE), ensures consistent operations. An internal developer platform provides governed, self-serve access to streaming resources.

Key elements of a data streaming organization include:

  • Unified Platform: Move from disparate tools and approaches to a single, standardized data streaming platform. This includes consistent policies for cluster management, multi-tenancy, and topic naming, ensuring a reliable foundation for data streaming initiatives.
  • Self-Service: Provide APIs, UIs, and other interfaces for teams to onboard, create, and manage data streaming resources. Self-service capabilities ensure governed access to topics, schemas, and streaming capabilities, empowering developers while maintaining compliance and security (see the provisioning sketch after this list).
  • Data as a Product: Adopt a product-oriented mindset where data streams are treated as reusable assets. This includes formalizing data products with clear contracts, ownership, and metadata, making them discoverable and consumable across the organization.
  • Alignment: Define clear roles and responsibilities, from platform operators and developers to data product owners. Establishing an enterprise-wide data streaming function ensures coordination and alignment across teams.
  • Governance: Implement automated guardrails for compliance, quality, and access control. This ensures that data streaming efforts remain secure, trustworthy, and scalable.
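
As a rough sketch of what such governed self-service could look like under the hood, an internal platform API might wrap Kafka’s AdminClient to enforce naming conventions and replication defaults. The topic name pattern, partition count, and retention policy below are hypothetical examples, not prescribed values:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class SelfServiceTopicProvisioner {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Enforce organizational guardrails: naming convention, replication, retention.
            String topicName = "payments.orders.v1"; // hypothetical <domain>.<dataset>.<version> convention
            NewTopic topic = new NewTopic(topicName, 6, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "604800000",   // 7 days, a hypothetical default policy
                            "min.insync.replicas", "2"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```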

The Business Value: Consistent, Scalable, and Agile Data Streaming

Becoming a Data Streaming Organization unlocks significant value by turning data streaming into a strategic asset. The benefits include:

  • Enhanced Agility: A unified platform reduces time-to-market for new data-driven products and services, allowing businesses to respond quickly to market trends and customer demands.
  • Operational Efficiency: Streamlined processes and self-service capabilities reduce the overhead of managing multiple tools and teams, improving productivity and cost-effectiveness.
  • Scalable Innovation: Standardized patterns and reusable data products enable the rapid development of new use cases, fostering a culture of innovation across the enterprise.
  • Improved Governance: Clear policies and automated controls ensure data quality, security, and compliance, building trust with customers and stakeholders.
  • Cross-Functional Collaboration: By breaking down silos, organizations can leverage data streams across teams, creating a network effect that accelerates value creation.

To successfully adopt a Data Streaming Organization model, companies must combine technical capabilities with cultural and structural change. This involves not just deploying tools but establishing shared goals, metrics, and education to bring teams together around the value of real-time data. As organizations embrace data streaming as a strategic function, they position themselves to thrive in a data-driven world.

Embracing the Future of Data Streaming

As data streaming continues to mature, it has become the backbone of modern digital enterprises. It enables real-time decision-making, operational efficiency, and transformative AI applications. Trends such as the commoditization of Kafka, the adoption of the Kafka protocol, BYOC deployment models, and the rise of Flink as the standard for stream processing demonstrate the rapid evolution and growing importance of this technology. These innovations not only streamline infrastructure but also empower organizations to harness real-time insights, foster agility, and remain competitive in the ever-changing digital landscape.

These advancements in data streaming present a unique opportunity to redefine data strategy. Leveraging data streaming as a central pillar of IT architecture allows businesses to break down silos, integrate machine learning into critical workflows, and deliver unparalleled customer experiences. The convergence of data streaming with generative AI, particularly through frameworks like Flink, underscores the importance of embracing a real-time-first approach to data-driven innovation.

Looking ahead, organizations that invest in scalable, secure, and strategic data streaming infrastructures will be positioned to lead in 2025 and beyond. By adopting these trends, enterprises can unlock the full potential of their data, drive business transformation, and solidify their place as leaders in the digital era. The journey to set data in motion is not just about technology—it’s about building the foundation for a future where real-time intelligence powers every decision and every experience.

What trends do you see for data streaming? Which ones are your favorites? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Top Trends for Data Streaming with Apache Kafka and Flink in 2025 appeared first on Kai Waehner.

A New Era in Dynamic Pricing: Real-Time Data Streaming with Apache Kafka and Flink https://www.kai-waehner.de/blog/2024/11/14/a-new-era-in-dynamic-pricing-real-time-data-streaming-with-apache-kafka-and-flink/ Thu, 14 Nov 2024 12:09:57 +0000 https://www.kai-waehner.de/?p=6968 In the age of digitization, the concept of pricing is no longer fixed or manual. Instead, companies increasingly use dynamic pricing — a flexible model that adjusts prices based on real-time market changes to enable real-time responsiveness, giving companies the tools they need to respond instantly to demand, competitor prices, and customer behaviors. This blog post explores the fundamentals of dynamic pricing, its link to data streaming, and real-world examples across different industries such as retail, logistics, gaming and the energy sector.

In the age of digitization, the concept of pricing is no longer fixed or manual. Instead, companies increasingly use dynamic pricing — a flexible model that adjusts prices based on real-time market changes. Data streaming technologies like Apache Kafka and Apache Flink have become integral to enabling this real-time responsiveness, giving companies the tools they need to respond instantly to demand, competitor prices, and customer behaviors. This blog post explores the fundamentals of dynamic pricing, its link to data streaming, and real-world examples of how different industries such as retail, logistics, gaming and the energy sector leverage this powerful approach to get ahead of the competition.

Dynamic Pricing with Data Streaming using Apache Kafka and Flink

What is Dynamic Pricing?

Dynamic pricing is a strategy where prices are adjusted automatically based on real-time data inputs, such as demand, customer behavior, supply levels, and competitor actions. This model allows companies to optimize profitability, boost sales, and better meet customer expectations.

Relevant Industries and Examples

Dynamic pricing has applications across many industries:

  • Retail and eCommerce: Dynamic pricing in eCommerce helps adjust product prices based on stock levels, competitor actions, and customer demand. Companies like Amazon frequently update prices on millions of products, using dynamic pricing to maximize revenue.
  • Transportation and Mobility: Ride-sharing companies like Uber and Grab adjust fares based on real-time demand and traffic conditions. This is commonly known as “surge pricing.”
  • Gaming: Context-specific in-game add-ons or virtual items are offered at varying prices based on player engagement, time spent in-game, and special events or levels.
  • Energy Markets: Dynamic pricing in energy adjusts rates in response to demand fluctuations, energy availability, and wholesale costs. This approach helps to stabilize the grid and manage resources.
  • Sports and Entertainment Ticketing: Ticket prices for events are adjusted based on seat availability, demand, and event timing to allow venues and ticketing platforms to balance occupancy and maximize ticket revenue.
  • Hospitality: Adaptive room rates and promotions in real-time based on demand, seasonality, and guest behavior, using dynamic pricing models.

These industries have adopted dynamic pricing to maintain profitability, manage supply-demand balance, and enhance customer satisfaction through personalized, responsive pricing.

The Link Between Dynamic Pricing and Data Streaming

Dynamic pricing relies on up-to-the-minute data on market and customer conditions, making real-time data streaming critical to its success. Traditional batch processing, where data is collected and processed periodically, is insufficient for dynamic pricing. It introduces delays that could mean lost revenue opportunities or suboptimal pricing. This scenario is where data streaming technologies come into play.

  • Apache Kafka serves as the real-time data pipeline, collecting and distributing data streams from diverse sources. For instance, user behaviour on websites, competitor pricing, social media signals, IoT data, and more. Kafka’s capability to handle high throughput and low latency makes it ideal for ingesting large volumes of data continuously.
  • Apache Flink processes the data in real-time, applying complex algorithms to identify pricing opportunities as conditions change. With Flink’s support for stream processing and complex event processing, businesses can apply sophisticated logic to assess and adjust prices based on multiple real-time factors.

Dynamic Pricing with Apache Kafka and Flink in Retail eCommerce

Together, Kafka and Flink create a powerful foundation for dynamic pricing, enabling real-time data ingestion, analysis, and action. This empowers companies to implement pricing models that are not only highly responsive but also resilient and scalable.
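
As a rough illustration of such a pipeline, the following Flink sketch derives a per-zone surge multiplier from the volume of ride requests in a sliding window. The demand source, zone names, and pricing formula are hypothetical simplifications:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SurgePricingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In production this stream would come from a Kafka topic of ride requests.
        DataStream<String> rideRequestZones = env.fromElements("downtown", "airport", "downtown");

        rideRequestZones
            .map(zone -> Tuple2.of(zone, 1L))
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .keyBy(t -> t.f0)
            // Demand per zone over the last 5 minutes, refreshed every minute.
            .window(SlidingProcessingTimeWindows.of(Time.minutes(5), Time.minutes(1)))
            .sum(1)
            // Hypothetical pricing rule: +1% per request, capped at a 3x multiplier.
            .map(demand -> Tuple2.of(demand.f0, Math.min(3.0, 1.0 + demand.f1 * 0.01)))
            .returns(Types.TUPLE(Types.STRING, Types.DOUBLE))
            .print(); // replace with a sink that publishes price updates back to Kafka

        env.execute("Dynamic Surge Pricing");
    }
}
```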

Clickstream Analytics in Real-Time with Data Streaming Replacing Batch with Hadoop and Spark

Years ago, companies relied on Hadoop and Spark to run batch-based clickstream analytics. Data engineers ingested logs from websites, online stores, and mobile apps to gather insights. Processing took hours. Therefore, any promotional offer or discount often arrived a day later — by which time the customer may have already made their purchase elsewhere, like on Amazon.

With today’s data streaming platforms like Kafka and Flink, clickstream analytics has evolved to support real-time, context-specific engagement and dynamic pricing. Instead of waiting on delayed insights, businesses can now analyze customer behavior as it happens, instantly adjusting prices and delivering personalized offers in the moment. This dynamic pricing capability allows companies to respond immediately to high-intent customers, presenting tailored prices or promotions when they’re most likely to convert. Dynamic pricing with Kafka and Flink creates a seamless and timely shopping experience that maximizes sales and customer satisfaction.

Here’s how businesses across various sectors are harnessing Kafka and Flink for dynamic pricing.

  • Retail: Hyper-Personalized Promotions and Discounts
  • Logistics and Transportation: Intelligent Tolling
  • Technology: Surge Pricing
  • Energy Markets: Manage Supply-Demand and Stabilize Grid Loads
  • Gaming: Context-Specific In-Game Add-Ons
  • Sports and Entertainment: Optimize Ticketing Sales

Learn more about data streaming with Kafka and Flink for dynamic pricing in the following success stories:

AO: Hyper-Personalized Promotions and Discounts (Retail and eCommerce)

AO, a major UK eCommerce retailer, leverages data streaming for dynamic pricing to stay competitive and drive higher customer engagement. By ingesting real-time data on competitor prices, customer demand, and inventory stock levels, AO’s system processes this information instantly to adjust prices in sync with market conditions. This approach allows AO to seize pricing opportunities and align closely with customer expectations. The result is a 30% increase in customer conversion rates.

AO Retail eCommerce Hyper Personalized Online and Mobile Experience

Dynamic pricing has also allowed AO to provide a hyper-personalized shopping experience, delivering relevant product recommendations and timely promotions. This real-time responsiveness has enhanced customer satisfaction and loyalty, as customers receive offers that feel customized to their needs. During high-traffic periods like holiday sales, AO’s dynamic pricing ensures competitiveness and optimizes margins. This drives both profitability and customer retention. The company has applied this real-time approach not just to pricing, but also to other areas like delivery, making operations run more smoothly. The retailer is now much more efficient and provides better customer service.

Quarterhill: Intelligent Tolling (Logistics and Transportation)

Quarterhill, a leader in tolling and intelligent transportation systems, uses Kafka and Flink to implement dynamic toll pricing. Kafka ingests real-time data from traffic sensors and road usage patterns. Flink processes this data to determine congestion levels and calculate the optimal toll based on real-time conditions.

Quarterhill – Intelligent Roadside Enforcement and Compliance

This dynamic pricing strategy allows Quarterhill to manage road congestion effectively, reward off-peak travel, and optimize toll revenues. This system not only improves travel efficiency but also helps regulate traffic flows in high-density areas, providing value both to drivers and the city infrastructure.

Uber, Grab, and FreeNow: Surge Pricing (Technology)

Ride-sharing companies like Uber, Grab, and FreeNow are widely known for their dynamic pricing or “surge pricing” models. With data streaming, these platforms capture data on demand, supply (available drivers), location, and traffic in real time. This data is processed continuously by Apache Flink, Kafka Streams or other stream processing engines to calculate optimal pricing, balancing supply with demand, while considering variables like route distance and current traffic.

Dynamic Surge Pricing at Mobility Service MaaS Freenow with Kafka and Stream Processing
Source: FreeNow

Surge pricing enables these companies to provide incentives for drivers to operate in high-demand areas, maintaining service availability and ensuring customer needs are met during peak times. This real-time pricing model improves revenue while optimizing customer satisfaction through prompt service availability.

Uber’s Kappa Architecture is an excellent example of how to build a data pipeline for dynamic pricing and many other use cases with Kafka and Flink:

Kappa Architecture with Apache Kafka at Mobility Service Uber
Source: Uber

2K Games / Take-Two Interactive: Context-Specific In-Game Purchases (Gaming Industry)

In the gaming industry, dynamic pricing is becoming a strategy to improve player engagement and monetize experiences. Many gaming companies use Kafka and Flink to capture real-time data on player interactions, time spent in specific game sections, and in-game events. This data enables companies to offer personalized pricing for in-game items, bonuses, or add-ons, adjusting prices based on the player’s current engagement level and recent activities.

For instance, if players are actively taking part in a particular game event, they may be offered special discounts or dynamic prices on related in-game assets. In this way, gaming companies improve conversion rates and player engagement while maximizing revenue.

2K Games, a leading video game publisher and a subsidiary of Take-Two Interactive, has shifted from batch to real-time analytics to enhance player engagement across popular franchises like BioShock, NBA 2K, and Borderlands. By leveraging Confluent Cloud as a fully managed data streaming platform, the publisher scales dynamically to handle high traffic, processing up to 3000 MB per second to serve 4 million concurrent users.

2K Games Take Two Interactive - Bridging the Gap And Overcoming Tech Hurdles to Activate Data
Source: 2K Games

Real-time telemetry analytics now allow them to analyze player actions and context instantly, enabling personalized, context-specific promotions and enhancing the gaming experience. Cost efficiencies are achieved through data compression, tiered storage, and reduced data transfer, making real-time engagement both effective and economical.

50hertz: Manage Supply-Demand and Stabilize Grid Loads (Energy Markets)

Dynamic pricing in energy markets is essential for managing supply-demand fluctuations and stabilizing grid loads. With Kafka, energy providers ingest data from smart meters, renewable energy sources, and weather feeds. Flink processes this data in real-time, adjusting energy prices based on grid conditions, demand levels, and renewable supply availability.

50Hertz, as a leading electricity transmission system operator, indirectly (!) affects dynamic pricing in the energy market by sharing real-time grid data with partners and energy providers. This allows energy providers and market operators to adjust prices dynamically based on real-time insights into supply-demand fluctuations and grid stability.

To support this, 50Hertz is modernizing its SCADA systems with data streaming technologies to enable real-time data capture and distribution that enhances grid monitoring and responsiveness.

Data Streaming with Apache Kafka and Flink to Modernize SCADA Systems

This real-time pricing approach encourages consumption when renewable energy is abundant and discourages usage during peak times, leading to optimized energy distribution, grid stability, and improved sustainability.

Ticketmaster: Optimize Ticketing Sales (Sports and Entertainment)

In ticketing, dynamic pricing allows for optimized revenue management based on demand and availability. Companies like Ticketmaster use Kafka to collect data on ticket availability, sales velocity, and even social media sentiment surrounding events. Flink processes this data to adjust prices based on real-time market conditions, such as proximity to the event date and current demand.

By dynamically pricing tickets, event organizers can maximize seat occupancy, boost revenue, and respond to last-minute demand surges, ensuring that prices reflect real-time interest while enhancing fan satisfaction.

Real-time inventory data streams allow Ticketmaster to monitor ticket availability, pricing, and demand as they change moment-to-moment. With data streaming through Apache Kafka and Confluent Platform, Ticketmaster tracks sales, venue capacity, and customer behavior in a single, live inventory stream. This enables quick responses, such as adjusting prices for high-demand events or boosting promotions where conversions lag. Teams gain actionable insights to forecast demand accurately and optimize inventory. This approach ensures fans have timely access to tickets. The result is a dynamic, data-driven approach that enhances customer experience and maximizes event success.

Conclusion: Business Value of Dynamic Pricing Built with Data Streaming

Dynamic pricing powered by data streaming with Apache Kafka and Flink brings transformative business value by:

  • Maximizing Revenue and Margins: Real-time price adjustments enable companies to capture value during demand surges, optimize for competitive conditions, and maintain healthy margins.
  • Improving Operational Efficiency: By automating pricing decisions based on real-time data, organizations can reduce manual intervention, speed up reaction times, and allocate resources more effectively.
  • Boosting Customer Satisfaction: Responsive pricing models allow companies to meet customer expectations in real time, leading to improved customer loyalty and engagement.
  • Supporting Sustainability Goals: In energy and transportation, dynamic pricing helps manage resources and reward environmentally friendly behaviors. Examples include off-peak travel and renewable energy usage.
  • Empowering Strategic Decision-Making: Real-time data insights provide business leaders with the information needed to adjust strategies and respond to developing market demands quickly.

Building a dynamic pricing system with Kafka and Flink represents a strategic investment in business agility and competitive resilience. By using data streaming to set prices instantly, businesses can stay ahead of competitors, improve customer service, and become more profitable. Dynamic pricing powered by data streaming is more than just a revenue tool; it’s a vital lever for driving growth, differentiation, and long-term success.

Did you already implement dynamic pricing? What is your data platform and strategy? Do you use Apache Kafka and Flink? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post A New Era in Dynamic Pricing: Real-Time Data Streaming with Apache Kafka and Flink appeared first on Kai Waehner.

The Past, Present and Future of Stream Processing https://www.kai-waehner.de/blog/2024/03/20/the-past-present-and-future-of-stream-processing/ Wed, 20 Mar 2024 06:47:53 +0000 https://www.kai-waehner.de/?p=6222 Stream processing has existed for decades. The adoption grows with open source frameworks like Apache Kafka and Flink in combination with fully managed cloud services. This blog post explores the past, present and future of stream processing, including the relation of machine learning and GenAI, streaming databases, and the integration between data streaming and data lakes with Apache Iceberg.

Stream processing has existed for decades. However, it really took off in the 2020s thanks to the adoption of open source frameworks like Apache Kafka and Flink. Fully managed cloud services make it easy to configure and deploy stream processing in a cloud-native way, even without the need to write any code. This blog post explores the past, present and future of stream processing. The discussion includes various technologies and cloud services, low code / no code trade-offs, an outlook on the support of machine learning and GenAI, streaming databases, and the integration between data streaming and data lakes with Apache Iceberg.

The Past Present and Future of Stream Processing

In December 2023, the research company Forrester confirmed that data streaming is a new software category and not just yet another integration or data platform by publishing “The Forrester Wave™: Streaming Data Platforms, Q4 2023“. Get free access to the report here. The leaders are Microsoft, Google and Confluent, followed by Oracle, Amazon, Cloudera, and a few others. A great time to review the past, present and future of stream processing as a key component in a data streaming architecture.

The Past of Stream Processing: The Move from Batch to Real-Time

The evolution of stream processing began as industries sought more timely insights from their data. Initially, batch processing was the norm. Data was collected over a period, stored, and processed at intervals. This method, while effective for historical analysis, proved inefficient for real-time decision-making.

In parallel to batch processing, message queues were created to provide real-time communication for transactional data. Message brokers like IBM MQ or TIBCO EMS were a common way to decouple applications. Applications send and receive data in an event-driven architecture without worrying about whether the recipient is ready, how to handle backpressure, and so on. The stream processing journey began.

Slow Batch Processing vs Real-Time Stream Processing

Stream Processing is a Journey Over Decades…

… and we are still in a very early stage at most enterprises. Here is an excellent timeline from TimePlus about the journey of stream processing across open source frameworks, proprietary platforms, and SaaS cloud services:

30 Year Journey Into Streaming Analytics with Open Source Frameworks Proprietary Products and Cloud
Source: TimePlus

The stream processing journey started decades ago with research and first purpose-built proprietary products for specific use cases like stock trading.

Open source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Today, most enterprises are at least starting to understand the value of stream processing for analytical and transactional use cases across industries. The cloud is a fundamental change, as you can start streaming and processing data with a button click, leveraging fully managed SaaS and simple UIs (if you don’t want to operate infrastructure or write low-level source code).

TIBCO StreamBase, Software AG Apama, IBM Streams

The advent of message queue technologies like IBM MQ and TIBCO EMS moved many critical applications to real-time message brokers. Real-time messaging enabled the consumption of data in real-time to store it in a database, mainframe, or application for further processing.

However, only true stream processing capabilities included in tools like TIBCO StreamBase, Software AG Apama or IBM (InfoSphere) Streams marked a significant shift towards real-time data processing. These products enabled businesses to react to information as it arrived by processing and correlating the data in motion.

Visual coding in tools like StreamBase or Apama represents an innovative approach to developing stream processing solutions. These tools provide a graphical interface that allows developers and analysts to design, build, and test applications by connecting various components and logic blocks visually, rather than writing code manually. Under the hood, the code generation worked with a Streaming SQL language.

Here is a screenshot of the TIBCO StreamBase IDE for visual drag & drop of streaming pipelines:

TIBCO StreamBase IDE
TIBCO StreamBase IDE

Some drawbacks of these early stream processing solutions include high cost, vendor lock-in, no flexibility regarding tools or APIs, and missing communities. These platforms are monolithic and were built long before cloud-native elasticity and scalability became a requirement in most RFIs and RFPs when evaluating vendors.

Open Source Event Streaming with Apache Kafka

The truly significant change for stream processing came with the introduction of Apache Kafka, a distributed streaming platform that allowed for high-throughput, fault-tolerant handling of real-time data feeds. Kafka, alongside other technologies like Apache Flink, revolutionized the landscape by providing the tools necessary to move from batch to real-time stream processing seamlessly.

The adoption of open source technologies changed all industries. Openness, flexibility, and community-driven development enabled easier influence on the features and faster innovation.

Over 100,000 organizations use Apache Kafka. The massive adoption came from a unique combination of capabilities: messaging, storage, data integration, and stream processing, all in one scalable and distributed infrastructure.

Various open source stream processing engines emerged. Kafka Streams was added to the Apache Kafka project. Other examples include Apache Storm, Spark Streaming, and Apache Flink.

The Present of Stream Processing: Architectural Evolution and Mass Adoption

The fundamental change to processing data in motion has enabled the development of data products and data mesh. Decentralizing data ownership and management with domain-driven design and technology-independent microservices promotes a more collaborative and flexible approach to data architecture. Each business unit can choose its own technology, API, cloud service, and communication paradigm like real-time, batch, or request-response.

From Lambda Architecture to Kappa Architecture

Today, stream processing is at the heart of modern data architecture, thanks in part to the emergence of the Kappa architecture. This model simplifies the traditional Lambda Architecture by using a single stream processing system to handle both real-time and historical data analysis, reducing complexity and increasing efficiency.

Lambda architecture with separate real-time and batch layers:

Lambda Architecture for Batch and Real Time Data

Kappa architecture with a single pipeline for real-time and batch processing:

Kappa Architecture as a Unified Data Streaming Pipeline for Batch and Real-Time Events

For more details about the pros and cons of Kappa vs. Lambda, check out my “Kappa Architecture is Mainstream Replacing Lambda“. It explores case studies from Uber, Twitter, Disney and Shopify.

Apache Kafka has become synonymous with building scalable and fault-tolerant streaming data pipelines. Because Kafka facilitates true decoupling of domains and applications, it is integral to microservices and data mesh architectures.

Plenty of stream processing frameworks, products, and cloud services emerged in the past years. This includes open source frameworks like Kafka Streams, Apache Storm, Samza, Flume, Apex, Flink, Spark Streaming, and cloud services like Amazon Kinesis, Google Cloud Dataflow, Azure Stream Analytics. The “Data Streaming Landscape 2024” gives an overview of relevant technologies and vendors.

Apache Flink seems to be becoming the de facto standard for many enterprises (and vendors). Its adoption curve resembles Kafka’s four years ago:

The Rise of Open Source Streaming Processing with Apache Kafka and Apache Flink
Source: Confluent

This does not mean other frameworks and solutions are bad. For instance, Kafka Streams is complementary to Apache Flink, as it suits different use cases.

No matter what technology enterprises choose, the mass adoption of stream processing is in progress right now. This includes modernizing existing batch processes AND building innovative new business models that only work in real time. As a concrete example, think about ride-hailing apps like Uber, Lyft, FREENOW, Grab. They are only possible because events are processed and correlated in real-time. Otherwise, everyone would still prefer a traditional taxi.

Stateless and Stateful Stream Processing

In data streaming, stateless and stateful stream processing are two approaches that define how data is handled and processed over time:

Stateless and Stateful Stream Processing for Fraud Detection

The choice between stateless and stateful processing depends on the specific requirements of the application, including the nature of the data, the complexity of the processing needed, and the performance and scalability requirements.

Stateless Stream Processing

Stateless Stream Processing refers to the handling of each data point or event independently from others. In this model, the processing of an event does not depend on the outcomes of previous events or require keeping track of the state between events. Each event is processed based on the information it contains, without the need for historical context or future data points. This approach is simpler and can be highly efficient for tasks that don’t require knowledge beyond the current event being processed.

The implementation could be a stream processor (like Kafka Streams or Flink), functionality in a connector (like Kafka Connect Single Message Transforms), or a WebAssembly (WASM) module embedded into a streaming platform.
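
Here is a minimal Kafka Streams sketch of stateless processing, assuming hypothetical payments topics: each record is filtered and masked on its own, with no state store involved:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StatelessMaskingApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stateless-masking");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments"); // hypothetical topic
        payments
            .filter((key, value) -> value != null && !value.isEmpty()) // each record judged on its own
            .mapValues(value -> value.replaceAll("\\d{16}", "****"))   // mask card numbers, no state needed
            .to("payments-masked");

        new KafkaStreams(builder.build(), props).start();
    }
}
```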

Stateful Stream Processing

Stateful Stream Processing involves keeping track of information (state) across multiple events to perform computations that depend on data beyond the current event. This model allows for more complex operations like windowing (aggregating events over a specific time frame), joining streams of data based on keys, and tracking sequences of events or patterns over time. Stateful processing is essential for scenarios where the outcome depends on accumulated knowledge or trends derived from a series of data points, not just on a single input.

The implementation is much more complex and challenging than stateless stream processing. A dedicated stream processing implementation is required. Dedicated distributed engines (like Apache Flink) handle stateful computations, memory usage, and scalability better than Kafka-native tools like Kafka Streams or KSQL (because the latter are bound to Kafka topics).
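
In contrast to the stateless sketch above, the following Flink snippet illustrates stateful processing with an explicit, fault-tolerant state primitive: a per-card payment counter for fraud detection. The event type and alert threshold are hypothetical; Flink checkpoints the state so counts survive failures and restarts:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Counts payments per card and raises an alert above a hypothetical threshold.
public class FraudCounter extends KeyedProcessFunction<String, Payment, String> {
    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("payments-per-card", Long.class));
    }

    @Override
    public void processElement(Payment payment, Context ctx, Collector<String> out) throws Exception {
        Long count = countState.value() == null ? 0L : countState.value();
        count++;
        countState.update(count); // state is checkpointed by Flink for fault tolerance
        if (count > 10) { // hypothetical fraud threshold
            out.collect("ALERT: card " + ctx.getCurrentKey() + " has " + count + " payments");
        }
    }
}

class Payment { String cardId; } // minimal hypothetical event type

// Usage: payments.keyBy(p -> p.cardId).process(new FraudCounter())
```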

Low Code, No Code, AND A Lot of Code!

No-code and low-code tools are software platforms that enable users to develop applications quickly and with minimal coding knowledge. These tools provide graphical user interfaces with drag-and-drop capabilities, allowing users to assemble and configure applications visually rather than writing extensive lines of code.

Common features and benefits of visual coding:

  • Rapid Development: Both types of platforms significantly reduce development time, enabling faster delivery of applications.
  • User-Friendly Interface: The graphical interface and drag-and-drop functionality make it easy for users to design, build, and iterate on applications.
  • Cost Reduction: By enabling quicker development with fewer resources, these platforms can lower the cost of software creation and maintenance.
  • Accessibility: They make application development accessible to a broader range of people, reducing the dependency on skilled developers for every project.

So far, the theory.

Disadvantages of Visual Coding Tools

Actually, StreamBase, Apama, et al., had great visual coding offerings. However, no-code / low-code tools have many drawbacks and disadvantages, too:

  1. Limited Customization and Flexibility: While these platforms can speed up development for standard applications, they may lack the flexibility needed for highly customized solutions. Developers might find it challenging to implement specific functionalities that aren’t supported out of the box.
  2. Dependency on Vendors: Using no-code/low-code platforms often means relying on third-party vendors for the platform’s stability, updates, and security. This dependency can lead to potential issues if the vendor cannot maintain the platform or goes out of business. And often the platform team is the bottleneck for implementing new business or integration logic.
  3. Performance Concerns: Applications built with no-code/low-code platforms may not be as optimized as those developed with traditional coding, potentially leading to lower performance or inefficiencies, especially for complex applications.
  4. Scalability Issues: As businesses grow, applications might need to scale up to support increased loads. No-code/low-code platforms might not always support this level of scalability or might require significant workarounds, affecting performance and user experience.
  5. Over-reliance on Non-Technical Users: While empowering citizen developers is a key advantage of these platforms, it can also lead to governance challenges. Without proper oversight, non-technical users might create inefficient workflows or data structures, leading to technical debt and maintenance issues.
  6. Cost Over Time: Initially, no-code/low-code platforms can reduce development costs. However, as applications grow and evolve, the ongoing subscription costs or fees for additional features and scalability can become significant.

Flexibility is King: Stream Processing for Everyone!

Microservices, domain-driven design, data mesh… All these modern design approaches taught us to provide flexible enterprise architectures. Each business unit and persona should be able to choose its own technology, API, or SaaS. And no matter if you do real-time, near real-time, batch or request response communication.

Apache Kafka provides the true decoupling out-of-the-box. Therefore, low-code or no-code tools are one option. However, a data scientist, data engineer, software developer, or citizen integrator can also choose their own technology for stream processing.

The past, present, and future of stream processing show different frameworks, visual coding tools, and even applied generative AI. One solution does NOT replace but complements the other alternatives:

Stream Processing with Clients Low Code No Code Tools or GenAI

The Future of Stream Processing: Serverless SaaS, GenAI and Streaming Databases

Stream processing is set to grow exponentially in the future, thanks to advancements in cloud computing, SaaS, and AI. Let’s explore the future of stream processing and look at the expected short, mid and long-term developments.

SHORT TERM: Fully Managed Serverless SaaS for Stream Processing

The cloud’s scalability and flexibility offer an ideal environment for stream processing applications, reducing the overhead and resources required for on-premise solutions. As SaaS models continue to evolve, stream processing capabilities will become more accessible to a broader range of businesses, democratizing real-time data analytics.

For instance, look at the serverless Flink Actions in Confluent Cloud. You can configure and deploy stream processing for use cases like deduplication or masking without any code:

Serverless Flink Actions in Confluent Cloud
Source: Confluent

MID TERM: Automated Tooling and the Help of GenAI

Integrating AI and machine learning with stream processing will enable more sophisticated predictive analytics. This opens new frontiers for automated decision-making and intelligent applications while continuously processing incoming event streams. The full potential of embedding AI into stream processing still has to be explored and implemented in the upcoming years.

For instance, automated data profiling is one use case where GenAI can support stream processing significantly. Software tools analyze and understand the quality, structure, and content of a dataset without manual intervention as the events flow through the data pipeline in real-time. This process typically involves examining the data to identify patterns, anomalies, missing values, and inconsistencies. A perfect fit for stream processing!

Automated data profiling in the stream processor can provide insights into data types, frequency distributions, relationships between columns, and other metadata information crucial for data quality assessment, governance, and preparation for further analysis or processing.
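To make this more concrete, here is a minimal, self-contained Python sketch of what such in-stream profiling logic could look like. It is illustrative only: in a production pipeline, this logic would run inside a stream processor such as Kafka Streams or Apache Flink, and the event fields (device_id, temperature) are hypothetical examples.

from collections import defaultdict

class StreamingProfiler:
    """Maintains simple per-field statistics while events flow through the pipeline."""

    def __init__(self):
        self.count = 0
        self.null_counts = defaultdict(int)
        self.min_values = {}
        self.max_values = {}

    def observe(self, event: dict) -> dict:
        self.count += 1
        for field, value in event.items():
            if value is None:
                self.null_counts[field] += 1
            elif isinstance(value, (int, float)):
                self.min_values[field] = min(self.min_values.get(field, value), value)
                self.max_values[field] = max(self.max_values.get(field, value), value)
        return event  # pass the event downstream unchanged

    def report(self) -> dict:
        return {
            "events": self.count,
            "null_counts": dict(self.null_counts),
            "min": self.min_values,
            "max": self.max_values,
        }

profiler = StreamingProfiler()
for event in [{"device_id": "d1", "temperature": 21.5},
              {"device_id": "d2", "temperature": None},
              {"device_id": "d1", "temperature": 98.0}]:
    profiler.observe(event)

print(profiler.report())  # e.g. flags the missing temperature value for review

In a GenAI-assisted setup, the collected statistics could additionally be passed to a model that suggests data quality rules or flags anomalies automatically.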

MID TERM: Streaming Storage and Analytics with Apache Iceberg

Apache Iceberg is an open-source table format for huge analytic datasets that provides powerful capabilities in managing large-scale data in data lakes. Its integration with streaming data sources like Apache Kafka and analytics platforms, such as Snowflake, Starburst, Dremio, AWS Athena or Databricks, can significantly enhance data management and analytics workflows.

Integration between Streaming Data from Kafka and Analytics on Databricks or Snowflake using Apache Iceberg

Supporting the Apache Iceberg table format might be a crucial strategic move by streaming and analytics frameworks, vendors and cloud services. Here are some key benefits from the enterprise architecture perspective:

  • Unified Batch and Stream Processing: Iceberg tables can serve as a bridge between streaming data ingestion from Kafka and downstream analytic processing. By treating streaming data as an extension of a batch-based table, Iceberg enables a seamless transition from real-time to batch analytics, allowing organizations to analyze data with minimal latency.
  • Schema Evolution: Iceberg supports schema evolution without breaking downstream systems. This is useful when dealing with streaming data from Kafka, where the schema might evolve. Consumers can continue reading data using the schema they understand, ensuring compatibility and reducing the need for data pipeline modifications.
  • Time Travel and Snapshot Isolation: Iceberg’s time travel feature allows analytics on data as it looked at any point in time, providing snapshot isolation for consistent reads. This is crucial for reproducible reporting and debugging, especially when dealing with continuously updating streaming data from Kafka (see the sketch after this list).
  • Cross-Platform Compatibility: Iceberg provides a unified data layer accessible by different compute engines, including those used by Databricks and Snowflake. This enables organizations to maintain a single copy of their data that is queryable across different platforms, facilitating a multi-tool analytics ecosystem without data silos.
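The following hedged sketch shows the time travel capability with PyIceberg, the Python client for Iceberg. The catalog configuration and table name (kafka_db.orders) are placeholder assumptions; in practice, the table would be populated continuously from Kafka by a connector.

from pyiceberg.catalog import load_catalog

# A REST catalog is assumed here; catalog type, URI, and table name are placeholders.
catalog = load_catalog("default", type="rest", uri="http://localhost:8181")
table = catalog.load_table("kafka_db.orders")

# Latest view of the table: streamed and batch data, unified.
latest = table.scan().to_arrow()

# Time travel: re-read the table as of an earlier snapshot for
# reproducible reporting or debugging.
snapshots = list(table.snapshots())
if len(snapshots) > 1:
    older = table.scan(snapshot_id=snapshots[0].snapshot_id).to_arrow()
    print(f"latest rows: {latest.num_rows}, first snapshot rows: {older.num_rows}")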

LONG TERM: Transactional + Analytics = Streaming Database?

Streaming databases, like RisingWave or Materialize, are designed to handle real-time data processing and analytics. This offers a way to manage and query data that is continuously generated from sources like IoT devices, online transactions, and application logs. Unlike traditional databases, which are optimized for static data stored on disk, streaming databases are built to process and analyze data in motion. They provide insights almost instantaneously as the data flows through the system.

Streaming databases offer the ability to perform complex queries and analytics on streaming data, further empowering organizations to harness real-time insights.
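As a hedged illustration: RisingWave and Materialize both speak the PostgreSQL wire protocol, so a standard Python Postgres client can define and query incrementally maintained materialized views. The connection parameters, the Kafka-backed orders source, and the windowing syntax (which follows RisingWave's TUMBLE function) are placeholder assumptions.

import psycopg2

# Port 4566 is RisingWave's default; adjust for your deployment.
conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True

with conn.cursor() as cur:
    # A materialized view that is incrementally updated as new events
    # arrive from the upstream, e.g. Kafka-backed, "orders" source.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS orders_per_minute AS
        SELECT window_start, COUNT(*) AS order_count
        FROM TUMBLE(orders, order_time, INTERVAL '1 MINUTE')
        GROUP BY window_start
    """)
    # Querying the view returns fresh results without recomputing from scratch.
    cur.execute("SELECT * FROM orders_per_minute ORDER BY window_start DESC LIMIT 5")
    for row in cur.fetchall():
        print(row)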

The ongoing innovation in streaming databases will probably lead to more advanced, efficient, and user-friendly solutions, facilitating broader adoption and more creative applications of stream processing technologies.

Having said this, we are still at a very early stage. It is not clear yet when you really need a streaming database instead of a mature and scalable stream processor like Apache Flink. The future will show us, and competition is great for innovation.

The Future of Stream Processing is Open Source and Cloud

The journey from batch to real-time processing has transformed how businesses interact with their data. The continued evolution couples technologies like Apache Kafka, Kafka Streams, and Apache Flink with the growth of cloud computing and SaaS. Stream processing will redefine the future of data analytics and decision-making.

As we look ahead, the future possibilities for stream processing are boundless, promising more agile, intelligent, and real-time insights into the ever-increasing streams of data.

If you want to learn more, listen to the following on-demand webinar about the past, present and future of stream processing with me, joined by the two streaming industry veterans Richard Tibbets (founder of StreamBase) and Michael Benjamin (TimePlus). I had the pleasure of working with them for a few years at TIBCO, where we deployed StreamBase at many financial services companies for stock trading and similar use cases:

What does your stream processing journey look like? In which decade did you join? Or are you just getting started with the latest open-source frameworks or cloud services? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Past, Present and Future of Stream Processing appeared first on Kai Waehner.

]]>
Top 5 Trends for Data Streaming with Kafka and Flink in 2024 https://www.kai-waehner.de/blog/2023/12/02/top-5-trends-for-data-streaming-with-apache-kafka-and-flink-in-2024/ Sat, 02 Dec 2023 10:54:38 +0000 https://www.kai-waehner.de/?p=5885 Do you wonder about my predicted TOP 5 data streaming trends with Apache Kafka and Flink in 2024 to set data in motion? Discover new technology trends and best practices for event-driven architectures, including data sharing, data contracts, serverless stream processing, multi-cloud architectures, and GenAI.

The post Top 5 Trends for Data Streaming with Kafka and Flink in 2024 appeared first on Kai Waehner.

]]>
Data Streaming is one of the most relevant buzzwords in tech to build scalable real-time applications and innovative business models. Do you wonder about my predicted TOP 5 data streaming trends in 2024 to set data in motion? Learn what role Apache Kafka and Apache Flink play. Discover new technology trends and best practices for event-driven architectures, including data sharing, data contracts, serverless stream processing, multi-cloud architectures, and GenAI.

Some followers might notice that this became a series with past posts about the top 5 data streaming trends for 2021, the top 5 for 2022, and the top 5 for 2023. Trends change over time, but the huge value of having a scalable real-time infrastructure as the central data hub remains. Data streaming with Apache Kafka is a journey and evolution to set data in motion.

Top 5 Trends for Data Streaming with Apache Kafka and Flink in 2024

The research and consulting company Gartner defines the top strategic technology trends every year. This time, the trends are around building new (AI) platforms and delivering value by automation, but also protecting investment. On a higher level, it is all about automating, scaling, and pioneering. Here is what Gartner expects for 2024:

Gartner Top Strategic Technology Trends 2024

It is funny (but not surprising): Gartner’s predictions overlap and complement the five trends I focus on for data streaming with Apache Kafka looking forward to 2024. I explore how data streaming enables faster time to market, good data quality across independent data products, and innovation with technologies like Generative AI.

The top 5 data streaming trends for 2024

I see the following topics coming up more regularly in conversations with customers, prospects, and the broader data streaming community across the globe:

  1. Data sharing for faster innovation with independent data products
  2. Data contracts for better data governance and policy enforcement
  3. Serverless stream processing for easier building of scalable and elastic streaming apps
  4. Multi-cloud deployments for cost-efficiently delivering value where the customers sit
  5. Reliable Generative AI (GenAI) with embedded accurate, up-to-date information to avoid hallucination

The following sections describe each trend in more detail. The trends are relevant for many scenarios; no matter if you use open source Apache Kafka or Apache Flink, a commercial platform, or a fully managed cloud service like Confluent Cloud. I start each section with a real-world case study. The end of the article contains the complete slide deck and video recording.

Data sharing across business units and organizations

Data sharing refers to the process of exchanging or providing access to data among different individuals, organizations, or systems. This can involve sharing data within an organization or sharing data with external entities. The goal of data sharing is to make information available to those who need it, whether for collaboration, analysis, decision-making, or other purposes. Obviously, real-time data beats slow data for almost all data sharing use cases.

NASA: Real-time data sharing with Apache Kafka

NASA enables real-time data sharing between space- and ground-based observatories. The General Coordinates Network (GCN) allows real-time alerts in the astronomy community. With this system, NASA researchers, private space companies, and even backyard astronomy enthusiasts can publish and receive information about current activity in the sky.

NASA enables real-time data from Mars with Apache Kafka

Apache Kafka plays an essential role in astronomy research for data sharing. Particularly where black holes and neutron stars are involved, astronomers are increasingly seeking out the “time domain” and want to study explosive transients and variability. In response, observatories are increasingly adopting streaming technologies to send alerts to astronomers and to get their data to their science users in real time.

The talk “General Coordinates Network: Harnessing Kafka for Real-Time Open Astronomy at NASA” explores architectural choices, challenges, and lessons learned in adapting Kafka for open science and open data sharing at NASA.

NASA’s approach to OpenID Connect / OAuth2 in Kafka is designed to securely scale Kafka from access inside a single organization to access by the general public.

Stream data exchange with Kafka using cluster linking, stream sharing, and AsyncAPI

The Kafka ecosystem provides various functions to share data in real-time at any scale. Some are vendor-specific. I look at this from the perspective of Confluent, so that you see a lot of innovative options (even if you want to build it by yourself with open source Kafka):

  • Kafka Connect connector ecosystem to integrate with other data sources and sinks out-of-the-box
  • HTTP/REST proxies and connectors for Kafka to use simple and well understood request-response (HTTP is, unfortunately, also an anti-pattern for streaming data)
  • Cluster Linking for replication between Kafka clusters using the native Kafka protocol (instead of separate infrastructure like MirrorMaker)
  • Stream Sharing for exposing a Kafka Topic through a simple button click with access control, encryption, quotas, and chargeback billing APIs
  • Generation of AsyncAPI specs to share data with non-Kafka applications (like other message brokers or API gateways that support AsyncAPI). AsyncAPI is an open data contract for asynchronous, event-based messaging (similar to Swagger/OpenAPI for HTTP/REST APIs)

Here is an example for Cluster Linking for bi-directional replication between Kafka clusters in the automotive industry:

Stream Data Exchange with Apache Kafka and Confluent Cluster Linking

And another example of stream sharing for easy access to a Kafka Topic in financial services:

Confluent Stream Sharing for Data Sharing Beyond Apache Kafka

To learn more, check out the article “Streaming Data Exchange with Kafka and a Data Mesh in Motion“.

Data contracts for data governance and policy enforcement

A data contract is an agreement or understanding that defines the terms and conditions governing the exchange or sharing of data between parties. It is a formal arrangement specifying how data will be handled, used, protected, and shared among entities. Data contracts are crucial when multiple parties need to interact with and utilize shared data, ensuring clarity and compliance with agreed-upon rules.

Raiffeisen Bank International: Data contracts for data sharing across countries

Raiffeisen Bank International (RBI) is scaling an event-driven architecture across the group as part of a bank-wide transformation program. This includes the creation of a reference architecture and the re-use of technology and concepts across 12 countries.

Data Mesh powered by Data Streaming at Raiffeisen Bank International

Learn more in the article “Decentralized Data Mesh with Data Streaming in Financial Services“.

Policy enforcement and data quality for Apache Kafka with Schema Registry

Good data quality is one of the most critical requirements in decoupled architectures like microservices or data mesh. Apache Kafka became the de facto standard for these architectures. But Kafka is a dumb broker that only stores byte arrays. The Schema Registry for Apache Kafka enforces message structures.
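As a minimal illustration, here is a hedged sketch using the confluent-kafka Python client with Schema Registry: the registered Avro schema acts as the contract, so a producer cannot publish payloads that violate it. The broker and registry URLs, topic name, and schema are placeholder assumptions.

from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Hypothetical payment schema acting as the data contract.
schema_str = """
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "payment_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_serializer = AvroSerializer(registry, schema_str)

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "value.serializer": avro_serializer,
})

# A payload violating the schema raises a serialization error here instead
# of landing as a malformed byte array in the topic.
producer.produce(topic="payments", value={"payment_id": "p-42", "amount": 99.95})
producer.flush()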

This blog post examines Schema Registry enhancements to leverage data contracts for policies and rules to enforce good data quality on the field level and to support advanced use cases like routing malicious messages to a dead letter queue.

Data Governance and Policy Enforcement with Data Contracts for Apache Kafka

For more details: Building a data mesh with decoupled data products and good data quality, governance, and policy enforcement.

Serverless stream processing for scalable and elastic streaming apps

Serverless stream processing refers to a computing architecture where developers can build and deploy applications without having to manage the underlying infrastructure.

In the context of stream processing, it involves the real-time processing of data streams without the need to provision or manage servers explicitly. This approach allows developers to focus on writing code and building applications. The cloud service takes care of the operational aspects, such as scaling, provisioning, and maintaining servers.

Sencrop: Serverless stream processing for smart agriculture

Designed to answer professional farmers’ needs, Sencrop offers a range of connected weather stations that bring you precision agricultural weather data straight from your plots.

  • Over 20,000 connected ag-weather stations throughout Europe.
  • An intuitive, user-friendly application: Access accurate, ultra-local data to optimize your daily actions.
  • Prevent risks, reduce costs: Streamline inputs and reduce your environmental impact and associated costs.

Smart Agriculture with Kafka and Flink at Sencrop

Apache Kafka and Apache Flink increasingly join forces to build innovative real-time stream processing applications.

The Rise of Open Source Streaming Processing with Apache Kafka and Apache Flink

The Y-axis in the diagram shows the monthly unique users (based on statistics of Maven downloads).

Apache Kafka + Apache Flink = Match Made in Heaven” explores the benefits of combining both open-source frameworks. The article shows unique differentiators of Flink versus Kafka, and discusses when to use a Kafka-native streaming engine like Kafka Streams instead of Flink.

Unfortunately, operating a Flink cluster is really hard, even harder than operating Kafka, because Flink is not just a distributed system: it also has to keep application state for hours or even longer. Hence, serverless stream processing takes over the operations burden and makes the developer’s life easier, too.

Stay tuned for exciting cloud products offering serverless Flink in 2024. But be aware that some vendors use the same trick as for Kafka: provisioning a Flink cluster and handing it over to you is NOT a serverless or fully managed offering! For that reason, I compared Kafka products as self-driven vs. self-driving cars, i.e., cloud-based Kafka clusters you operate yourself vs. truly fully managed services.

Multi-cloud for cost-efficient and reliable customer experience

Multi-cloud refers to a cloud computing strategy that uses services from multiple cloud providers to meet specific business or technical requirements. In a multi-cloud environment, organizations distribute their workloads across two or more cloud platforms, including public clouds, private clouds, or a combination of both.

The goal of a multi-cloud strategy is to avoid dependence on a single cloud provider and to leverage the strengths of different providers for various needs. Cost efficiency and regional laws (like operating in the United States or China) require different deployment strategies. Some countries do not provide a public cloud; a private cloud is then the only option.

New Relic: Multi-cloud Kafka deployments at extreme scale for real-time observability

New Relic is a software analytics company that provides monitoring and performance management solutions for applications and infrastructure. It’s designed to help organizations gain insights into the performance of their software and systems, allowing them to optimize and troubleshoot issues efficiently.

Observability has two key requirements: first, monitor data in real-time at any scale. Second, deploy the monitoring solution where the applications are running. The obvious consequence for New Relic is to process data with Apache Kafka and to deploy multi-cloud, where the customers are.

Multi Cloud Observability in Real-Time at extreme Scale with Apache Kafka at New Relic

Hybrid and multi-cloud data replication for cost-efficiency, low latency, or disaster recovery

Multi-cloud deployments of Apache Kafka have become the norm rather than an exception. Several scenarios require multi-cluster solutions with specific requirements and trade-offs:

  • Regional separation because of legal requirements
  • Independence of a single cloud provider
  • Disaster recovery
  • Aggregation for analytics
  • Cloud migration
  • Mission-critical stretched deployments

Hybrid Cloud Architecture with Apache Kafka

The blog post “Architecture Patterns for Distributed, Hybrid, Edge and Global Apache Kafka Deployments” explores various architectures and best practices.

Reliable Generative AI (GenAI) with accurate context to avoid hallucination

Generative AI is a class of artificial intelligence systems that generate new content, such as images, text, or even entire datasets, often by learning patterns and structures from existing data. These systems use techniques such as neural networks to create content that is not explicitly programmed but is instead generated based on the patterns and knowledge learned during training.

Elemental Cognition: GenAI platform powered by Apache Kafka

Elemental Cognition’s AI platform develops responsible and transparent AI that helps solve problems and deliver expertise that can be understood and trusted.

Confluent Cloud powers the AI platform to enable scalable real-time data and data integration use cases. I recommend looking at their website to learn from various impressive use cases.

Elemental Cognition - Real Time GenAI Platform powered by Apache Kafka and Confluent Cloud

Apache Kafka serves thousands of enterprises as the mission-critical and scalable real-time data fabric for machine learning infrastructures. The evolution of Generative AI (GenAI) with large language models (LLM) like ChatGPT changed how people think about intelligent software and automation. The relationship between data streaming and GenAI has enormous opportunities.

Apache Kafka as Mission Critical Data Fabric for GenAI” explores the use cases for combining data streaming with Generative AI.

An excellent example, especially for Generative AI, is context-specific customer service. The following diagram shows an enterprise architecture leveraging event-driven data streaming for data ingestion and processing across the entire GenAI pipeline:

Event-driven Architecture with Apache Kafka and Flink as Data Fabric for GenAI

Stream processing with Kafka and Flink enables data correlation of real-time and historical data. A stateful stream processor takes existing customer information from the CRM, loyalty platform, and other applications, correlates it with the customer’s query coming into the chatbot, and makes an RPC call to an LLM.

Stream Processing with Apache Flink SQL UDF and GenAI with OpenAI LLM
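As a hedged sketch of such a pipeline step, the following PyFlink example registers a Python UDF that enriches a customer question with context and makes an RPC call to an LLM. The OpenAI client usage, the model name, and the table and field names are illustrative assumptions, not the exact implementation of the referenced architecture. An OPENAI_API_KEY environment variable is expected.

from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

@udf(result_type=DataTypes.STRING())
def ask_llm(customer_context: str, question: str) -> str:
    # RPC call to an LLM from within the stream processor.
    from openai import OpenAI
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model name is an assumption
        messages=[
            {"role": "system", "content": f"Customer context: {customer_context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.create_temporary_function("ask_llm", ask_llm)

# In a real pipeline, this table would be a Kafka-backed source with CRM
# and loyalty data joined in via stateful stream processing.
requests = t_env.from_elements(
    [("gold-tier customer since 2019", "Why was my payment declined?")],
    DataTypes.ROW([DataTypes.FIELD("customer_context", DataTypes.STRING()),
                   DataTypes.FIELD("question", DataTypes.STRING())]))
t_env.create_temporary_view("chat_requests", requests)

result = t_env.sql_query(
    "SELECT ask_llm(customer_context, question) AS answer FROM chat_requests")
result.execute().print()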

The article “Apache Kafka + Vector Database + LLM = Real-Time GenAI” explores possible architectures, examples, and trade-offs between event streaming and traditional request-response APIs and databases.

Slides and video recording for the data streaming trends in 2024 with Kafka and Flink

Do you want to look at more details? This section provides the entire slide deck and a video walking you through the content.

Slide deck

Here is the slide deck from my presentation:


Video recording

And here is the video recording of my presentation:

Video Recording: Top 5 Use Cases and Architectures for Data Streaming with Apache Kafka and Flink in 2024

2024 makes data streaming more mature, and Apache Flink becomes mainstream

I have two conclusions for data streaming trends in 2024:

  • Data streaming goes up in the maturity curve. More and more projects build streaming applications instead of just leveraging Apache Kafka as a dumb data pipeline between databases, data warehouses, and data lakes.
  • Apache Flink becomes mainstream. The open source framework shines with a scalable engine, multiple APIs like SQL, Java, and Python, and serverless cloud offerings from various software vendors. The latter makes building applications much more accessible.

Data sharing with data contracts is mandatory for a successful enterprise architecture with microservices or a data mesh. And data streaming is the foundation for innovation with technology trends like Generative AI. Therefore, we are just at the tipping point of adopting data streaming technologies such as Apache Kafka and Apache Flink.

What are your most relevant and exciting data streaming trends with Apache Kafka and Apache Flink in 2024 to set data in motion? What are your strategy and timeline? Do you use serverless cloud offerings or self-managed infrastructure? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Top 5 Trends for Data Streaming with Kafka and Flink in 2024 appeared first on Kai Waehner.

]]>
Quix Streams – Stream Processing with Kafka and Python https://www.kai-waehner.de/blog/2023/05/28/quix-streams-stream-processing-with-kafka-and-python/ Sun, 28 May 2023 10:22:38 +0000 https://www.kai-waehner.de/?p=5441 Over 100,000 organizations use Apache Kafka for data streaming. However, there is a problem: The broad ecosystem lacks a mature client framework and managed cloud service for Python data engineers. Quix Streams is a new technology on the market trying to close this gap. This blog post discusses this Python library, its place in the Kafka ecosystem, and when to use it instead of Apache Flink or other Python- or SQL-based substitutes.

The post Quix Streams – Stream Processing with Kafka and Python appeared first on Kai Waehner.

]]>
Over 100,000 organizations use Apache Kafka for data streaming. However, there is a problem: The broad ecosystem lacks a mature client framework and managed cloud service for Python data engineers. Quix Streams is a new technology on the market trying to close this gap. This blog post discusses this Python library, its place in the Kafka ecosystem, and when to use it instead of Apache Flink or other Python- or SQL-based substitutes.

Python Kafka Quix Streams and Flink for Open Source Stream Processing

Why Python and Apache Kafka together?

Python is a high-level, general-purpose programming language. It has many use cases for scripting and development. But there is one fundamental reason for its success: Data engineers and data scientists use Python. Period.

Yes, there is R as another excellent programming language for statistical computing. And many low-code/no-code visual coding platforms for machine learning (ML).

SQL usage is ubiquitous amongst data engineers and data scientists, but it’s a declarative formalism that isn’t expressive enough to specify all necessary business logic. When data transformation or non-trivial processing is required, data engineers and data scientists use Python.

Hence: Data engineers and data scientists use Python. If you don’t give them Python, you will find either shadow IT or Python scripts embedded into the coding box of a low-code tool.

Apache Kafka is the de facto standard for data streaming. It combines real-time messaging, storage for true decoupling and replayability of historical data, data integration with connectors, and stream processing for data correlation. All in a single platform. At scale for transactions and analytics.

Python and Apache Kafka for Data Engineering and Machine Learning

In 2017, I wrote a blog post about “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka“. The article is still accurate and explores how data streaming and AI/ML are complementary:

Apache Kafka Open Source Ecosystem as Infrastructure for Machine Learning

Machine Learning requires a lot of infrastructure for data collection, data engineering, model training, model scoring, monitoring, and so on. Data streaming with the Kafka ecosystem enables these capabilities in real-time, reliable, and at scale.

DevOps, microservices, and other modern deployment concepts merged the job roles of software developers and data engineers/data scientists. The focus is much more on data products solving a business problem, operated by the team that develops it. Therefore, the Python code needs to be production-ready and scalable.

As mentioned above, the data engineering and ML tasks are usually realized with Python APIs and frameworks. Here is the problem: The Kafka ecosystem is built around Java and the JVM. Therefore, it lacks good Python support.

Let’s explore the options and why Quix Streams is a brilliant opportunity for data engineering teams for machine learning and similar tasks.

What options exist for Python and Kafka?

Many alternatives exist for data engineers and data scientists to leverage Python with Kafka.

Python integration for Kafka

Here are a few common alternatives for integrating Python with Kafka and their trade-offs:

  • Python Kafka client libraries: Produce and consume via Python. I wrote an example of using a Python Kafka client and TensorFlow to replay historical data from the Kafka log and train an analytic model; a minimal consumer sketch also follows after this list. This is solid but insufficient for advanced data engineering, as it lacks processing primitives, such as the filtering and joining operations found in Kafka Streams and other stream processing libraries.
  • Kafka REST APIs: Confluent REST Proxy and similar components enable producing and consuming to/from Kafka. They work well for gluing interfaces together but are not ideal for ML workloads with low latency and critical SLAs.
  • SQL: Stream processing engines like ksqlDB or FlinkSQL allow querying of data in SQL. I wrote an example of Model Training with Python, Jupyter, KSQL and TensorFlow. It works. But ksqlDB and Flink are other systems that need to be operated. And SQL isn’t expressive enough for all use cases.
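Here is the minimal consumer sketch mentioned above: a plain Python Kafka client replaying a topic from the beginning, e.g., to feed historical events into model training. The broker address and topic name are placeholder assumptions.

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-training-job",
    "auto.offset.reset": "earliest",  # replay the full retained history
})
consumer.subscribe(["sensor-events"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        # Hand the raw event to feature engineering / model training code.
        print(msg.value().decode("utf-8"))
finally:
    consumer.close()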

Instead of just integrating Python and Kafka via APIs, native stream processing provides the best of both worlds: the simplicity and flexibility of dynamic Python code for rapid prototyping with Jupyter notebooks and serious data engineering, AND stream processing for stateful data correlation at scale, whether for data ingestion or model scoring.

Stream processing with Python and Kafka

In the past, we had two suboptimal open-source options for stream processing with Kafka and Python:

  • Faust: A stream processing library, porting the ideas from Kafka Streams (a Java library and part of Apache Kafka) to Python. The feature set is much more limited compared to Kafka Streams. Robinhood open-sourced Faust. But it lacks maturity and community adoption. I saw several customers evaluating it but then moving to other options.
  • Apache Flink’s Python API: Flink’s adoption grows significantly every year in the stream processing community. This API is a Python version of the DataStream API, which allows Python users to write Python DataStream API jobs. Developers can also use the Table API, including SQL directly in there. It is an excellent option if you have a Flink cluster and some folks want to run Python instead of Java or SQL against it for data engineering. The Kafka-Flink integration is very mature and battle-tested. A minimal PyFlink job sketch follows after this list.
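Here is the minimal PyFlink DataStream job mentioned above. To stay self-contained and runnable without connector dependencies, it reads from an in-memory collection instead of a Kafka source; a real job would use a Kafka source connector.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Placeholder events; a production job would consume these from Kafka.
events = env.from_collection(["click", "click", "purchase"])

# Simple stateless transformation; stateful operators (windows, joins)
# are where Flink shines compared to plain Kafka client libraries.
events.map(lambda e: f"processed:{e}").print()

env.execute("python_datastream_demo")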

As you see, all the alternatives for combining Kafka and Python have trade-offs. They work for some use cases but are imperfect for most data engineering and data science projects.

A new open-source framework to the rescue? Introducing a brand new stream processing library for Python: Quix Streams…

What is Quix Streams?

Quix Streams is a stream processing library focused on ease of use for Python data engineers. The library is open-source under Apache 2.0 license and available on GitHub.

Instead of a database, Quix Streams uses a data streaming platform such as Apache Kafka. You can process data with high performance and save resources without introducing a delay.

Some of the Quix Streams differentiators are defined as being lightweight and powerful, with no JVM and no need for separate clusters or orchestrators. It sounds like the pitch for why to use Kafka Streams in the Java ecosystem, minus the JVM – this is a positive comment! 🙂

Quix Streams does not use any domain-specific language or embedded framework. It’s a library that you can use in your code base. This means that with Quix Streams, you can use any external library for your chosen language. For example, data engineers can leverage Pandas, NumPy, PyTorch, TensorFlow, Transformers, and OpenCV in Python.

So far, so good. This was more or less the copy & paste of Quix Streams marketing (it makes sense to me)… Now let’s dig deeper into the technology.
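Before going deeper, here is a hedged sketch of what a Quix Streams application looks like in Python. Note that the library’s API has evolved considerably since this post was written; the sketch follows the newer pure-Python Application API, and the broker address, topic names, and event fields are placeholder assumptions.

from quixstreams import Application

app = Application(broker_address="localhost:9092", consumer_group="enrichment")

input_topic = app.topic("raw-events", value_deserializer="json")
output_topic = app.topic("enriched-events", value_serializer="json")

sdf = app.dataframe(input_topic)

# Plain Python per event: any external library (Pandas, NumPy, PyTorch, ...)
# could be called here, which is the core appeal for data engineers.
sdf = sdf.apply(lambda event: {**event, "flagged": event.get("amount", 0) > 1000})

sdf = sdf.to_topic(output_topic)

if __name__ == "__main__":
    app.run(sdf)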

The Quix Streams API and developer experience

The following is the first feedback after playing around, doing code analysis, and speaking with some Confluent colleagues and the Quix Streams team.

The good

  • The Quix API and tooling persona is the data engineer (that’s at least my understanding). Hence, it does not directly compete with other offerings, say a Java developer using Kafka Streams. Again, the beauty of microservices and data mesh is the focus of an application or data product per use case. Choose the right tool for the job!
  • The API is mostly sleek, with some weirdness / unintuitive parts. But it is still in beta, so hopefully, it will get more refined in the subsequent releases. No worries at this early stage of the project.
  • The integration with other data engineering and machine learning Python frameworks is excellent. Being able to combine stream processing with Pandas, NumPy, and similar libraries is a massive benefit for the developer experience.
  • The Quix library and SaaS platform are compatible with open-source Kafka and commercial offerings and cloud services like Cloudera, Confluent Cloud, or Amazon MSK. Quix’s commercial UI provides out-of-the-box integration with self-managed Kafka and Confluent Cloud. The cloud platform also provides a managed Kafka for testing purposes (for a few cents per Kafka topic, and not meant for production).

The improvable

  • The stream processing capabilities (like powerful sliding windows) are still pretty limited and not comparable to advanced engines like Kafka Streams or Apache Flink. The roadmap includes enhanced features.
  • The architecture is complex since executing the Python API jumps through three languages: Python -> C# -> C++. Does it matter to the end user? It depends on the use case, security requirements, and more. The reasoning for this architecture is Quix’s background: the team came from McLaren F1 and ultra-low-latency use cases and built a polyglot platform for different programming environments.
  • It would be interesting to see a benchmark for throughput and latency versus Faust, which is Python top to bottom. There is a trade-off between inter-language marshalling/unmarshalling and the performance boost of lower-level compiled languages. This should be fine if we trust Quix’s marketing and business model. I expect they will provide some public content soon, as this question will arise regularly.

The Quix Streams Data Pipeline Low Code GUI

The commercial product provides a user interface for building data pipelines and code, MLOps, and a production infrastructure for operating and monitoring the built applications.

Here is an example:

Quix Streams Data Pipeline for Stream Processing

  • Tiles are Kubernetes containers; each purple (transformation) and orange (destination) node is backed by a Git project containing the application code.
  • The three blue (source) nodes on the left are replay services used to test the pipeline by replaying specific streams of data.
  • Arrows are individual Kafka topics in Confluent Cloud (green = live data).
  • The first visible pipeline node (bottom left) is joining data from different physical sites (see the three input topics, one was receiving data when I took the image).
  • There are three modular transformations in the visible pipeline (two rolling windows and one interpolation).
  • There are two real-time apps (one real-time Streamlit dashboard and the other is an integration with a Twilio SMS service).

The Quix team wrote a detailed comparison of Apache Flink and Quix Streams. I don’t think it’s an entirely fair comparison as it compares open-source Apache Flink to a Quix SaaS offering. Nevertheless, for the most part, it is a good comparison.

Flink was always Java-first and added support for Python for its DataStream and Table APIs at a later stage. In contrast, Quix Streams is brand new. Hence, it lacks maturity and customer case studies.

Having said all this, I think Quix Streams is a great choice for some stream processing projects in the Python ecosystem!

TL;DR: There is a place for both! Choose the right tool… Modern enterprise architectures built with concepts like data mesh, microservices, and domain-driven design allow this flexibility per use case and problem.

I recommend using Flink if the use case makes sense with SQL or Java. And if the team is willing to operate its own Flink cluster or has a platform team or a cloud service taking over the operational burden and complexity.

In contrast, I would use Quix Streams for Python projects if I want to go to production with a more microservice-like architecture building Python applications. However, beware that Quix currently only has a few built-in stateful functions or JOINs. More advanced stream processing use cases cannot be done with Quix (yet). This is likely to change in the next months as more capabilities are added.

Hence, make sure to read Quix’s comparison with Flink. But keep in mind whether you want to evaluate the open-source Quix Streams library or the Quix SaaS platform. If you are in the public cloud, you might combine the Quix Streams SaaS with other fully managed cloud services like Confluent Cloud for Kafka. On the other side, in your own private VPC or on-premise, you need to build your own platform with technologies like the Quix Streams library, Kafka or Confluent Platform, and so on.

The current state and future of Quix Streams

If you build a new framework or product for data streaming, you need to make sure that it does not overlap with existing established offerings. You need differentiators and/or innovation in a new domain that does not exist today.

Quix Streams accomplishes this essential requirement to be successful: The target audience is data engineers with Python backgrounds. No serious and mature tool or vendor exists in this space today. And the demand for Python will grow more and more with the focus on leveraging data for solving business problems in every company.

Maturity: Making the right (marketing) choices in the early stage

Quix Streams is in the early maturity stage. Hence, a lot of decisions can still be strengthened or revamped.

The following buzzwords come into my mind when I think about Quix Streams: Python, data streaming, stream processing, Python, data engineering, Machine Learning, open source, cloud, Python, .NET, C#, Apache Kafka, Apache Flink, Confluent, MSK, DevOps, Python, governance, UI, time series, IoT, Python, and a few more.

TL;DR: I see a massive opportunity for Quix Streams to become a great data engineering framework (and SaaS offering) for Python users.

I am not a fan of polyglot platforms. They require finding the lowest common denominator. I was never a fan of Apache Beam for that reason. The Kafka Streams community did not choose to implement the Beam API because of too many limitations.

Similarly, most people do not care about the underlying technology. Yes, Quix Streams’ core is C++. But is the goal to roll out stream processing for various programming languages, only starting with Python, then going to .NET, and then to another one? I am skeptical.

Hence, I am glad to see a change in the marketing strategy already: Quix Streams started with the pitch of being designed for high-frequency telemetry services when you must process high volumes of time-series data with up to nanosecond precision. It is now being revamped to focus mainly on Python and data engineering.

Competition: Friends or enemies?

Getting market adoption is still hard. Intuitive use of the product, building a broad community, and the right integrations and partnerships (can) make a new product such as Quix Streams successful. Quix Streams is on a good way here. For instance, integrating serverless Confluent Cloud and other Kafka deployments works well:

Quix Streams Integration with Apache Kafka and Confluent Cloud

This is a native integration, not a connector. Everything in the pipeline image runs as a direct Kafka protocol connection using raw TCP/IP packets to produce and consume data to topics in Confluent Cloud. The Quix platform orchestrates the management of the Confluent Cloud Kafka cluster (creating/deleting topics, topic sizing, topic monitoring, etc.) using Confluent APIs.

However, one challenge of these kinds of startups is the decision to complement versus compete with existing solutions, cloud services, and vendors. For instance, how much time and money do you invest in data governance? Should you build this or use the complementing streaming platform or a separate independent tool (like Collibra)? We will see where Quix Streams will go here. Building its cloud platform for addressing Python engineers or overlapping with other streaming platforms?

My advice is the proper integration with partners that lead in their space. Working with Confluent for over six years, I know what I am talking about: We do one thing, data streaming, but we are the best in that one. We don’t even try to compete with other categories. Yes, a few overlaps always exist, but instead of competing, we strategically partner and integrate with other vendors like Snowflake (data warehouse), MongoDB (transactional database), HiveMQ (IoT with MQTT), Collibra (enterprise-wide data governance), and many more. Additionally, we extend our offering with more data streaming capabilities, i.e., improving our core functionality and business model. The latest example is our integration of Apache Flink into the fully-managed cloud offering.

Kafka for Python? Look at Quix Streams!

In the end, a data engineer or developer has several options for stream processing deeply integrated into the Kafka ecosystem:

  • Kafka Streams: Java client library
  • ksqlDB: SQL service
  • Apache Flink: Java, SQL, Python service
  • Faust: Python client library
  • Quix Streams: Python client library

All have their pros and cons. The persona of the data engineer or developer is a crucial criterion. Quix Streams is a nice new open-source framework for the broader data streaming community. If you cannot or do not want to use just SQL, but native Python, then watch the project (and the company/cloud service behind it).

UPDATE – May 30th, 2023: bytewax is another open-source stream processing library for Python integrating with Kafka. It is implemented in Rust under the hood. I have not seen it in the field yet, but a few comments mentioned it after I shared this blog post on social networks, and I think it is worth a mention. Let’s see if it gets more traction in the following months.

Do you already use stream processing, or is Kafka just your data hub and pipeline? How do you combine Python and Kafka today? Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Quix Streams – Stream Processing with Kafka and Python appeared first on Kai Waehner.

]]>
Data Streaming is not a Race, it is a Journey! https://www.kai-waehner.de/blog/2023/04/13/data-streaming-is-not-a-race-it-is-a-journey/ Thu, 13 Apr 2023 07:15:31 +0000 https://www.kai-waehner.de/?p=5352 Data Streaming is not a race, it is a journey! Event-driven architectures and technologies like Apache Kafka or Apache Flink require a mind shift in architecting, developing, deploying, and monitoring applications. This blog post explores success stories from data streaming journeys across industries, including banking, retail, insurance, manufacturing, healthcare, energy & utilities, and software companies.

The post Data Streaming is not a Race, it is a Journey! appeared first on Kai Waehner.

]]>
Data Streaming is not a race, it is a Journey! Event-driven architectures and technologies like Apache Kafka or Apache Flink require a mind shift in architecting, developing, deploying, and monitoring applications. Legacy integration, cloud-native microservices, and data sharing across hybrid and multi-cloud setups are the norm, not an exception. This blog post explores success stories from data streaming journeys across industries, including banking, retail, insurance, manufacturing, healthcare, energy & utilities, and software companies.

Data Streaming is not a Race it is a Journey

Data Streaming is a Journey, not a Race!

Confluent’s maturity model is used across thousands of customers to analyze the status quo, deploy real-time infrastructure and applications, and plan for a strategic event-driven architecture to ensure success and flexibility in the future of the multi-year data streaming journey:

Data Streaming Maturity Model
Source: Confluent

The following sections show success stories from various companies across industries that moved through the data streaming journey. Each journey looks different. Each company has different technologies, vendors, strategies, and legal requirements.

Before we begin, I must stress that these journeys are NOT just successful because of the newest technologies like Apache Kafka, Kubernetes, and public cloud providers like AWS, Azure, and GCP. Success comes with a combination of technologies and expertise.

Consulting – Expertise from Software Vendors and System Integrators

All the below success stories combined open source technologies, software vendors, cloud providers, internal business and technical experts, system integrators, and consultants of software vendors.

Long story short: Technology and expertise are required to make your data streaming journey successful.

We not only sell data streaming products and cloud services but also offer advice and essential support. Note that the bundle numbers (1 to 5) in the following diagram are related to the above data streaming maturity curve:

Professional Services along the Data Streaming Journey
Source: Confluent

Other vendors have similar strategies to support you. The same is true for the system integrators. Learn together. Bring people from different companies into the room to solve your business problems.

With this background, let’s look at the fantastic data streaming journeys we heard about at past Kafka Summits, Data in Motion events, Confluent blog posts, or other similar public knowledge-sharing alternatives.

Customer Journeys across Industries powered by Apache Kafka

Apache Kafka is the DE FACTO standard for data streaming. However, each data streaming journey looks different. Don’t underestimate how much you can learn from other industries. These companies might have different legal or compliance requirements, but the overall IT challenges are often very similar.

We cover stories from various industries, including banks, retailers, insurers, manufacturers, healthcare organizations, energy providers, and software companies.

The following data streaming journeys are explored in the below sections:

  1. AO.com: Real-Time Clickstream Analytics
  2. Nordstrom: Event-driven Analytics Platform
  3. Migros: End-to-End Supply Chain with IoT
  4. NordLB: Bank-wide Data Streaming for Core Banking
  5. Raiffeisen Bank International: Strategic Real-Time Data Integration Platform Across 13 Countries
  6. Allianz: Legacy Modernization and Cloud-Native Innovation
  7. Optum (UnitedHealth Group): Self-Service Data Streaming
  8. Intel: Real-Time Cyber Intelligence at Scale with Kafka and Splunk
  9. Bayer: Transition from On-Premise to Multi-Cloud
  10. Tesla: Migration from a Message Broker (RabbitMQ) to Data Streaming (Kafka)
  11. Siemens: Integration between On-Premise SAP and Salesforce CRM in the Cloud
  12. 50Hertz: Transition from Proprietary SCADA to Cloud-Native Industrial IoT

There is no specific order beyond grouping by industry. If you read the stories, you see that all are different, but you can still learn a lot from them, no matter what industry your business is in.

For more information about data streaming in a specific industry, just search my blog for case studies and architectures. Recent blog posts focus on the state of data streaming in manufacturing, in financial services, and so on.

AO.com – Real-Time Clickstream Analytics

AO.com is an electrical retailer. The hyper-personalized online retail experience turns each customer visit into a one-on-one marketing opportunity. The technical implementation correlates historical customer data with real-time digital signals.

Years ago, the Apache Hadoop and Apache Spark communities referred to this kind of clickstream analytics as the “Hello World” example. AO.com does real-time clickstream analytics powered by data streaming.

AO.com’s journey started with relatively simple data streaming pipelines leveraging the Schema Registry for decoupling and API contracts. Over time, the platform thinking of data streaming added more business value, and operations shifted to the fully managed Confluent Cloud to focus on implementing business applications instead of operating infrastructure.

Apache Kafka at AO for Retail and Customer 360 Omnichannel Clickstream Analytics
Source: AO.com

Nordstrom – Event-driven Analytics Platform

Nordstrom is an American luxury department store chain. In the meantime, 50+% of revenue comes from online sales. They built the Nordstrom Analytical Platform (NAP) as the heart of its event-driven architecture. Nordstrom can use a singular event for analytical, functional, operational, and model-building purposes.

As Nordstrom said at Kafka Summit: “If it’s not in NAP, it didn’t happen.”

Nordstrom’s journey started in 1901 with the first stores. The last 25 years brought the retailer into the digital world and online retail. NAP has been a core component since 2017 for real-time analytics. The future is all about automated decision-making in real-time.

Nordstrom Analytics Platform with Apache Kafka and Data Lake for Online Retail
Source: Nordstrom

Migros – End-to-End Supply Chain with IoT

Migros is Switzerland’s largest retail company, the largest supermarket chain, and the largest employer. They leverage data streaming with Confluent to distribute master data across many systems.

Migros’ supply chain is optimized with a single data streaming pipeline (including replaying entire days of events). For instance, real-time transportation information is visualized with MQTT and Kafka. Afterward, more advanced business logic was implemented, like forecasting truck arrival times and planning or rescheduling truck tours accordingly.

The data streaming journey started with a single Kafka cluster. And grew to various independent Kafka clusters and a strategic Kafka-powered enterprise integration platform.

History of Apache Kafka for Retail and Logistics at Migros
Source: Migros

NordLB – Bank-wide Data Streaming for Core Banking

Norddeutsche Landesbank (NordLB) is one of the largest commercial banks in Germany. They implemented an enterprise-wide transformation. The new Confluent-powered core banking platform enables event-based and truly decoupled stream processing. Use cases include improved real-time analytics, fraud detection, and customer retention.

Unfortunately, NordLB’s slide is only available in German. But I guess you can still follow their data streaming journey moving over the years from on-premise big data batch processing with Hadoop to real-time data streaming and analytics in the cloud with Kafka and Confluent:

Enterprise Wide Apache Kafka Adoption at NordLB for Banking and Financial Services
Source: Norddeutsche Landesbank

Raiffeisen Bank International – Strategic Real-Time Data Integration Platform across 13 Countries

Raiffeisen Bank International (RBI) operates as a corporate and investment bank in Austria and as a universal bank in Central and Eastern Europe (CEE).

RBI’s bank transformation across 13 countries includes various components:

  • Bank-wide transformation program
  • Realtime Integration Center of Excellence (“RICE“)
  • Central platform and reference architecture for self-service re-use
  • Event-driven integration platform (fully-managed cloud and on-premise)
  • Group-wide API contracts (=schemas) and data governance

While I don’t have a nice diagram of RBI’s data streaming journey over the past few years, I can show you one of the most impressive migration stories. The context is horrible: RBI had to move from Ukrainian data centers to the public cloud after the Russian invasion started in early 2022. The journey was super impressive from the IT perspective, as the migration happened without downtime or data loss.

RBI’s CTO recapped:

“We did this migration in three months, because we didn’t have a choice.”

Migration from On Premise to Cloud in Ukraine at Raiffeisen Bank International
Source: McKinsey

Allianz – Legacy Modernization and Cloud-Native Innovation

Allianz is a European multinational financial services company headquartered in Munich, Germany. Its core businesses are insurance and asset management. The company is one of the largest insurers and financial services groups.

A large organization like Allianz does not have just one enterprise architecture. This section has two independent stories of two separate Allianz business units.

One team of Allianz has built a so-called Core Insurance Service Layer (CISL). The Kafka-powered enterprise architecture is flexible and evolutionary. Why? Because it needs to integrate with hundreds of old and new applications. Some are running on the mainframe or use a batch file transfer. Others are cloud-native, running in containers or as a SaaS application in the public cloud.

Data streaming ensures the decoupling of applications via events and the reliable integration of different backend applications, APIs, and communication paradigms. Legacy and new applications connect, but can also run in parallel (old V1 and new V2) before migrating away from and shutting down the legacy component.

Allianz Insurance Core Integration Layer with Data Streaming and Kafka
Source: Allianz

Allianz Direct is a separate Kafka deployment. This business unit started with a greenfield approach to build insurance as a service in the public cloud. The cloud-native platform is elastic, scalable, and open. Hence, one platform can be used across countries with different legal, compliance, and data governance requirements.

This data streaming journey is best described by looking at a quote from Allianz Direct’s CTO:

Data Streaming is a Cloud Journey at Allianz Direct
Source: Linkedin (November 2022)

Optum – Self-Service Data Streaming

Optum is an American pharmacy benefit manager and healthcare provider (a UnitedHealth Group subsidiary). Optum started with a single data source connecting to Kafka. Today, they provide Kafka as a Service within the UnitedHealth Group. The service is centrally managed and used by over 200 internal application teams.

The benefits of this data streaming approach: it is a repeatable, scalable, and cost-efficient way to standardize data. Optum leverages data streaming for many use cases, from mainframe integration via change data capture (CDC) to modern data processing and analytics tools.

IT Modernization with Apache Kafka and Data Streaming in Healthcare at Optum
Source: Optum

Intel – Real-Time Cyber Intelligence at Scale with Kafka and Splunk

Intel Corporation (commonly known as Intel) is an American multinational corporation and technology company headquartered in Santa Clara, California. It is one of the world’s largest semiconductor chip manufacturers by revenue.

Intel’s Cyber Intelligence Platform leverages the entire Kafka-native ecosystem, including Kafka Connect, Kafka Streams, Multi-Region Clusters (MRC), and more…

Their Kafka maturity curve shows how Intel started with a few data pipelines, for instance, getting data from various sources into Splunk for situational awareness and threat intelligence use cases. Later, Intel added Kafka-native stream processing for streaming ETL (to pre-process, filter, and aggregate data) instead of ingesting raw data (with high $$$ bills) and for advanced analytics by combining streaming analytics with Machine Learning.

Cybersecurity Intelligence Platform with Data Streaming Apache Kafka Confluent and Splunk at Intel
Source: Intel

Brent Conran (Vice President and Chief Information Security Officer, Intel) described the benefits of data streaming:

“Kafka enables us to deploy more advanced techniques in-stream, such as machine learning models that analyze data and produce new insights. This helps us reduce the meantime to detect and respond.”

Bayer AG – Transition from On-Premise to Multi-Cloud

Bayer AG is a German multinational pharmaceutical and biotechnology company and one of the largest pharmaceutical companies in the world. They adopted a cloud-first strategy and started a multi-year transition to the cloud.

While Bayer AG has various data streaming projects powered by Kafka, my favorite is the success story of their Monsanto acquisition. The Kafka-based cross-data center DataHub was created to facilitate migration and to drive a shift to real-time stream processing.

Instead of using many words, let’s look at Bayer’s impressive data streaming journey from on-premise to multi-cloud, connecting various cloud-native Kafka and non-Kafka technologies over the years.

Hybrid Multi Cloud Data Streaming Journey at Bayer in Healthcare and Pharma
Source: Bayer AG

Tesla – Migration from a Message Broker (RabbitMQ) to Data Streaming (Kafka)

Tesla is a carmaker, utility company, insurance provider, and more. The company processes trillions of messages per day for IoT use cases.

Tesla’s data streaming journey is fascinating because it focuses on migrating from a message broker to stream processing. The initial reasons for Kafka were the scalability and reliability of processing high volumes of IoT sensor data.

But as you can see, more use cases were added quickly, such as Kafka Connect for data integration.

Tesla Automotive Journey from RabbitMQ to Kafka Connect
Source: Tesla

Tesla’s blog post details its advanced stream processing use cases. I often call streaming analytics the “secret sauce” (as data is only valuable if you correlate it in real-time instead of just ingesting it into a batch data warehouse or data lake).

Siemens – Integration between On-Premise SAP and Salesforce CRM in the Cloud

Siemens is a German multinational conglomerate corporation and the largest industrial manufacturing company in Europe.

One strategic data streaming use case allowed Siemens to move “from batch to faster” data processing. For instance, Siemens connected its SAP PMD to Kafka on-premise. The SAP infrastructure is very complex, with 80% “customized ERP”. They improved business processes and integration workflows from daily or weekly batches to real-time communication by integrating SAP’s proprietary IDoc messages with the event streaming platform.

Siemens later migrated from self-managed on-premise Kafka to Confluent Cloud via Confluent Replicator. Integrating Salesforce CRM via Kafka Connect was the first step of Siemens’ cloud strategy. As usual, more and more projects and applications join the data streaming journey as it is super easy to tap into the event stream and connect it to your favorite tools, APIs, and SaaS products.

Siemens Apache Kafka and Data Streaming Journey to the Cloud across the Supply Chain
Source: Siemens

A few important notes on this migration:

  • Confluent Cloud was chosen because it enables focusing on business logic and handing over complex operations to the experts, including mission-critical SLAs and support.
  • Migration from one Kafka cluster to another (in this case, from open source to Confluent Cloud) is possible with zero downtime and no data loss. Confluent Replicator and Cluster Linking, in conjunction with the expertise and skills of consultants, make such a migration simple and safe.

50Hertz – Transition from Proprietary SCADA to Cloud-Native Industrial IoT

50Hertz is a transmission system operator for electricity in Germany. They built a cloud-native 24/7 SCADA system with Confluent. The solution is developed and tested in the public cloud but deployed in safety-critical air-gapped edge environments. A unidirectional hardware gateway helps replicate data from the edge into the cloud safely and securely.

This data streaming journey is one of my favorites as it combines so many challenges well known in any OT/IT environment across industries like manufacturing, energy and utilities, automotive, healthcare, etc.:

  • From monolithic and proprietary systems to open architectures and APIs
  • From batch with mainframe, files, and old Windows servers to real-time data processing and analytics
  • From on-premise data centers (and air-gapped production facilities) to a hybrid edge and cloud strategy, often with a cloud-first focus for new use cases.

The following graphic is fantastic. It shows the data streaming journey from left to right, including the bridge in the middle (as such a project is not a big bang but usually takes years, at a minimum):

50Hertz Journey with Data Streaming for OT IT Modernization
Source: 50Hertz

Technical Evolution during the Data Streaming Journey

These data streaming journeys show a few common threads: it is not a big bang; old and new technologies and processes need to work together; and only the combination of technology AND expertise drives successful projects.

Therefore, I want to share a few more thoughts on this from a technical perspective:

The data streaming journey never ends. Apache Kafka and Apache Flink increasingly join forces to build innovative real-time stream processing applications. We are just at the beginning.

What does your data streaming journey look like? Where are you right now, and what are your plans for the next few years? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Data Streaming is not a Race, it is a Journey! appeared first on Kai Waehner.

]]>
Apache Kafka (including Kafka Streams) + Apache Flink = Match Made in Heaven https://www.kai-waehner.de/blog/2023/01/23/apache-kafka-and-apache-flink-a-match-made-in-heaven/ Mon, 23 Jan 2023 09:35:05 +0000 https://www.kai-waehner.de/?p=5160 Apache Kafka and Apache Flink are increasingly joining forces to build innovative real-time stream processing applications. This blog post explores the benefits of combining both open-source frameworks, shows unique differentiators of Flink versus Kafka, and discusses when to use a Kafka-native streaming engine like Kafka Streams instead of Flink.

The post Apache Kafka (including Kafka Streams) + Apache Flink = Match Made in Heaven appeared first on Kai Waehner.

]]>
Apache Kafka and Apache Flink are increasingly joining forces to build innovative real-time stream processing applications. This blog post explores the benefits of combining both open-source frameworks, shows unique differentiators of Flink versus Kafka, and discusses when to use a Kafka-native streaming engine like Kafka Streams instead of Flink.

Apache Kafka and Apache Flink for Open Source and Cloud-native Data Streaming

Apache Kafka became the de facto standard for data streaming. The core of Kafka is messaging at any scale in combination with distributed storage (a commit log) for durability, decoupling of applications, and replayability of historical data.

Kafka also includes a stream processing engine with Kafka Streams. And KSQL is another successful Kafka-native streaming SQL engine built on top of Kafka Streams. Both are fantastic tools. In parallel, Apache Flink became a very successful stream processing engine.

The first prominent Kafka + Flink case study I remember is the fraud detection use case of ING Bank. The first publications came up in 2017, i.e., over five years ago: “StreamING Machine Learning Models: How ING Adds Fraud Detection Models at Runtime with Apache Kafka and Apache Flink“. This is just one of many Kafka fraud detection case studies.

One of the last case studies I blogged about goes in the same direction: “Why DoorDash migrated from Cloud-native Amazon SQS and Kinesis to Apache Kafka and Flink“.

The adoption of Kafka is already outstanding. And Flink is adopted by more and more enterprises, very often in combination with Kafka. This article is not an introduction to Apache Kafka or Apache Flink. Instead, I explore why these two technologies are a perfect match for many use cases and when other Kafka-native tools are the appropriate choice instead of Flink.

Stream processing is a paradigm that continuously correlates events of one or more data sources. Data is processed in motion, in contrast to traditional processing at rest with a database and request-response API (e.g., a web service or a SQL query). Stream processing is either stateless (e.g., filter or transform a single message) or stateful (e.g., an aggregation or sliding window). Especially state management is very challenging in a distributed stream processing application.
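
To make the distinction concrete, here is a minimal Flink DataStream sketch in Java (the sample data and job name are made up for illustration): a stateless filter, where each event is evaluated on its own, followed by a stateful per-key windowed aggregation, which requires managed state.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class StatelessVsStateful {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // In production this stream would come from a Kafka source; bounded sample data here.
    env.fromElements(
            Tuple2.of("card-1", 42.0), Tuple2.of("card-2", 7.5), Tuple2.of("card-1", 19.0))
        // Stateless: each event is filtered independently, no state required.
        .filter(payment -> payment.f1 > 10.0)
        // Stateful: summing amounts per card within a 5-minute tumbling window
        // requires Flink to keep state per key and window.
        .keyBy(payment -> payment.f0)
        .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
        .sum(1)
        .print();

    env.execute("stateless-vs-stateful");
  }
}
```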

A vital advantage of the Apache Flink engine is its efficiency in stateful applications. Flink has expressive APIs, advanced operators, and low-level control. Flink also scales well for stateful applications, even for relatively complex streaming JOIN queries.

Flink’s scalable and flexible engine is fundamental to providing a tremendous stream processing framework for big data workloads. But there is more. The following aspects are my favorite features and design principles of Apache Flink:

  • Unified streaming and batch APIs
  • Connectivity to one or multiple Kafka clusters
  • Transactions across Kafka and Flink
  • Complex Event Processing
  • Standard SQL support
  • Machine Learning with Kafka, Flink, and Python

But keep in mind that every design approach has pros and cons. While there are a lot of advantages, some characteristics can also be a drawback, depending on the use case.

Unified streaming and batch APIs

Apache Flink’s DataStream API unifies batch and streaming APIs. It supports different runtime execution modes for stream processing and batch processing, from which you can choose the right one for your use case and the characteristics of your job. In the case of the SQL/Table API, the switch happens automatically based on the characteristics of the sources: if all sources are bounded, the job runs in BATCH execution mode; at least one unbounded source means STREAMING execution mode.

The unification of streaming and batch brings a lot of advantages:

  • Reuse of logic/code for real-time and historical processing
  • Consistent semantics across stream and batch processing
  • A single system to operate
  • Applications mixing historical and real-time data processing

This sounds similar to Apache Spark. But there is a significant difference: Contrary to Spark, the foundation of Flink is data streaming, not batch processing. Hence, streaming is the default execution runtime mode in Apache Flink.

Continuous stateless or stateful processing enables real-time streaming analytics using an unbounded stream of events. Batch execution is more efficient for bounded jobs (i.e., a bounded subset of a stream) for which you have a known fixed input and which do not run continuously. This executes jobs in a way that is more reminiscent of batch processing frameworks, such as MapReduce in the Hadoop and Spark ecosystems.
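
A small sketch of how the execution mode is selected with the DataStream API; the AUTOMATIC mode shown here lets Flink decide based on the boundedness of all sources:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExecutionModeExample {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // STREAMING is the default. BATCH is only valid for bounded sources.
    // AUTOMATIC picks BATCH if all sources are bounded, STREAMING otherwise.
    env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);

    // A bounded source, so this particular job would run in batch mode.
    env.fromElements(1, 2, 3, 4, 5)
        .filter(n -> n % 2 == 1)
        .print();

    env.execute("execution-mode-demo");
  }
}
```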

Apache Flink makes moving from a Lambda to a Kappa enterprise architecture easier. The foundation of the architecture is real-time, with Kafka as its heart. But batch processing is still possible out-of-the-box with Kafka and Flink using consistent semantics. However, this combination will likely not (try to) replace traditional ETL batch tools, e.g., for a one-time lift-and-shift migration of large workloads.

Connectivity to one or multiple Kafka clusters

Apache Flink is a separate infrastructure from the Kafka cluster. This has various pros and cons. First, I often emphasize the vast benefit of Kafka-native applications: you only need to operate, scale and support one infrastructure for end-to-end data processing. A second infrastructure adds additional complexity, cost, and risk. However, imagine a cloud vendor taking over that burden, so you consume the end-to-end pipeline as a single cloud service.

With that in mind, let’s look at a few benefits of separate clusters for the data hub (Kafka) and the stream processing engine (Flink):

  • Focus on data processing in a separate infrastructure with dedicated APIs and features independent of the data streaming platform.
  • More efficient streaming pipelines before hitting the Kafka Topics again; the data exchange happens directly between the Flink workers.
  • Data processing across different Kafka topics of independent Kafka clusters of different business units (see the sketch after this list). If it makes sense from a technical and organizational perspective, you can connect directly to non-Kafka sources and sinks. But be careful: this can quickly become an anti-pattern in the enterprise architecture and create complex and unmanageable “spaghetti integrations”.
  • Implement new fail-over strategies for applications.
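
As referenced in the list above, here is a minimal Java sketch of a Flink job that reads from one Kafka cluster and writes to another; the bootstrap servers, topic names, and the placeholder map step are hypothetical:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CrossClusterJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Source: Kafka cluster of business unit A.
    KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("kafka-unit-a:9092")
        .setTopics("orders")
        .setGroupId("order-enricher")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

    // Sink: a different Kafka cluster of business unit B.
    KafkaSink<String> sink = KafkaSink.<String>builder()
        .setBootstrapServers("kafka-unit-b:9092")
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
            .setTopic("orders-enriched")
            .setValueSerializationSchema(new SimpleStringSchema())
            .build())
        .build();

    env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders-source")
        .map(order -> order.toUpperCase()) // placeholder for real enrichment logic
        .sinkTo(sink);

    env.execute("cross-cluster-enrichment");
  }
}
```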

I emphasize Flink is usually NOT the recommended choice for implementing your aggregation, migration, or hybrid integration scenario. Multiple Kafka clusters for hybrid and global architectures are the norm, not an exception. Flink does not change these architectures.

Kafka-native replication tools like MirrorMaker 2 or Confluent Cluster Linking are still the right choice for disaster recovery. It is still easier to do such a scenario with just one technology. Tools like Cluster Linking solve challenges like offset management out-of-the-box.

Transactions across Kafka and Flink

Workloads for analytics and transactions have very different characteristics and requirements. The use cases differ significantly. SLAs are very different, too. Many people think that data streaming is not built for transactions and should only be used for big data analytics.

However, Apache Kafka and Apache Flink are deployed in many resilient, mission-critical architectures. The concept of exactly-once semantics (EOS) allows stream processing applications to process data through Kafka without loss or duplication. This ensures that computed results are always accurate.

Transactions are possible across Kafka and Flink. The feature is mature and battle-tested in production. Operating separate clusters is still challenging for transactional workloads. However, a cloud service can take over this risk and burden. 

Many companies already use EOS in production with Kafka Streams. But EOS can even be used if you combine Kafka and Flink. That is a massive benefit if you choose Flink for transactional workloads. So, to be clear: EOS is not a differentiator in Flink (vs. Kafka Streams), but it is an excellent option to use EOS across Kafka and Flink, too.
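
To sketch what this looks like in code, assuming the Flink Kafka connector and hypothetical topic and prefix names, the sink can be configured for exactly-once delivery; Flink then wraps writes in Kafka transactions that are committed when a checkpoint completes:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class ExactlyOnceSinkExample {

  // Builds a Kafka sink with end-to-end exactly-once semantics.
  public static KafkaSink<String> buildSink() {
    return KafkaSink.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
            .setTopic("payments-processed")
            .setValueSerializationSchema(new SimpleStringSchema())
            .build())
        // EXACTLY_ONCE uses Kafka transactions tied to Flink checkpoints.
        .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        // A unique prefix is required for EXACTLY_ONCE so transactional IDs
        // of different applications on the same cluster do not clash.
        .setTransactionalIdPrefix("payments-app")
        .build();
  }
}
```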

Complex Event Processing with FlinkCEP

The goal of complex event processing (CEP) is to identify meaningful events in real-time situations and respond to them as quickly as possible. CEP usually does not send continuous events to other systems but detects when something significant occurs. A common use case for CEP is handling late-arriving events or the non-occurrence of events.

The big difference between CEP and event stream processing (ESP) is that CEP generates new events to trigger actions based on situations it detects across multiple event streams with events of different types (situations that build up over time and space). ESP detects patterns over event streams with homogeneous events (i.e., patterns over time). Pattern matching is a technique to implement either approach, but the features look different.

FlinkCEP is an add-on for Flink to do complex event processing. The powerful pattern API of FlinkCEP allows you to define complex pattern sequences you want to extract from your input stream. After specifying the pattern sequence, you apply them to the input stream to detect potential matches. This is also possible with SQL via the MATCH_RECOGNIZE clause.
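
For illustration, here is a minimal FlinkCEP sketch; the Payment event type and the fraud thresholds are hypothetical. It detects a tiny “probe” charge followed by a large charge on the same card within ten minutes:

```java
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ProbeThenDrainPattern {

  // Hypothetical event type.
  public static class Payment {
    public String cardId;
    public double amount;
  }

  public static PatternStream<Payment> detect(DataStream<Payment> payments) {
    Pattern<Payment, ?> probeThenDrain = Pattern.<Payment>begin("probe")
        .where(new SimpleCondition<Payment>() {
          @Override
          public boolean filter(Payment p) { return p.amount < 1.00; }   // tiny test charge
        })
        .next("drain")
        .where(new SimpleCondition<Payment>() {
          @Override
          public boolean filter(Payment p) { return p.amount > 500.00; } // large follow-up
        })
        .within(Time.minutes(10));

    // Apply the pattern per card; matches can then be selected downstream.
    return CEP.pattern(payments.keyBy(p -> p.cardId), probeThenDrain);
  }
}
```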

Standard SQL support

Structured Query Language (SQL) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS). However, it is so predominant that other technologies like non-relational databases (NoSQL) and streaming platforms adopt it, too.

SQL became a standard of the American National Standards Institute (ANSI) in 1986 and the International Organization for Standardization (ISO) in 1987. Hence, if a tool supports ANSI SQL, it ensures that any 3rd party tool can easily integrate using standard SQL queries (at least in theory).

Apache Flink supports ANSI SQL, including the Data Definition Language (DDL), Data Manipulation Language (DML), and Query Language. Flink’s SQL support is based on Apache Calcite, which implements the SQL standard. This is great because many personas, including developers, architects, and business analysts, already use SQL in their daily job.

The SQL integration is based on the so-called Flink SQL Gateway, which is part of the Flink framework and allows other applications to interact with a Flink cluster easily through a REST API. User applications (e.g., a Java/Python/shell program or Postman) can use the REST API to submit queries, cancel jobs, retrieve results, and so on. This enables integration of Flink SQL with traditional business intelligence tools like Tableau, Microsoft Power BI, or Qlik.

However, to be clear, ANSI SQL was not built for stream processing. Incorporating Streaming SQL functionality into the official SQL standard is still in the works. The Streaming SQL working group includes database vendors like Microsoft, Oracle, and IBM, cloud vendors like Google and Alibaba, and data streaming vendors like Confluent. More details: “The History and Future of SQL: Databases Meet Stream Processing“.

Having said this, Flink supports continuous sliding windows and various streaming joins via ANSI SQL. Some features require additional non-standard SQL keywords, but continuous sliding windows and streaming joins are, in general, possible.
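
As an illustration of windowed streaming SQL, here is a hedged sketch using the Java Table API; the orders table, its columns, and the connector settings are all hypothetical:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HoppingWindowRevenue {
  public static void main(String[] args) {
    TableEnvironment tEnv =
        TableEnvironment.create(EnvironmentSettings.inStreamingMode());

    // Hypothetical source table backed by a Kafka topic; ts is the event-time attribute.
    tEnv.executeSql(
        "CREATE TABLE orders (" +
        "  order_id STRING," +
        "  amount   DECIMAL(10, 2)," +
        "  ts       TIMESTAMP(3)," +
        "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
        ") WITH (" +
        "  'connector' = 'kafka'," +
        "  'topic' = 'orders'," +
        "  'properties.bootstrap.servers' = 'localhost:9092'," +
        "  'format' = 'json'" +
        ")");

    // Sliding (hopping) windows: 10-minute windows advancing every minute.
    tEnv.executeSql(
        "SELECT window_start, window_end, SUM(amount) AS revenue " +
        "FROM TABLE(HOP(TABLE orders, DESCRIPTOR(ts), " +
        "               INTERVAL '1' MINUTE, INTERVAL '10' MINUTE)) " +
        "GROUP BY window_start, window_end")
        .print();
  }
}
```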

Machine Learning with Kafka, Flink, and Python

In conjunction with data streaming, machine learning solves the impedance mismatch of reliably bringing analytic models into production for real-time scoring at any scale. I explored ML deployments within Kafka applications in various blog posts, e.g., embedded models in Kafka Streams applications or using a machine learning model server with streaming capabilities like Seldon.

PyFlink is a Python API for Apache Flink that allows you to build scalable batch and streaming workloads, such as real-time data processing pipelines, large-scale exploratory data analysis, Machine Learning (ML) pipelines, and ETL processes. If you’re already familiar with Python and libraries such as Pandas, then PyFlink makes it simpler to leverage the full capabilities of the Flink ecosystem.

PyFlink is the missing piece for an ML-powered data streaming infrastructure, as almost every data engineer uses Python. The combination of Tiered Storage in Kafka and Data Streaming with Flink in Python is excellent for model training without the need for a separate data lake.

When to use Kafka Streams instead of Apache Flink?

Don’t underestimate the power and use cases of Kafka-native stream processing with Kafka Streams. The adoption rate is massive, as Kafka Streams is easy to use. And it is part of Apache Kafka. To be clear: Kafka Streams is already included if you download Kafka from the Apache website.

The most significant difference between Kafka Streams and Apache Flink is that Kafka Streams is a Java library, while Flink is a separate cluster infrastructure. Developers can deploy the Flink infrastructure in session mode for bigger workloads (e.g., many small, homogenous workloads like SQL queries) or application mode for fewer bigger, heterogeneous data processing tasks (e.g., isolated applications running in a Kubernetes cluster).

No matter your deployment option, you still need to operate a complex cluster infrastructure for Flink (including separate metadata management on a ZooKeeper cluster or an etcd cluster in a Kubernetes environment).

TL;DR: Apache Flink is a fantastic stream processing framework and a top #5 Apache open-source project. But it is also complex to deploy and difficult to manage.

Benefits of using the lightweight library of Kafka Streams

Kafka Streams is a single Java library, as shown in the sketch after this list. This adds a few benefits:

  • Kafka-native integration supports critical SLAs and low latency for end-to-end data pipelines and applications with a single cluster infrastructure instead of operating separate messaging and processing engines with Kafka and Flink. Kafka Streams apps still run in their VMs or Kubernetes containers, but high availability and persistence are guaranteed via Kafka Topics.
  • Very lightweight with no other dependencies (Flink needs S3 or similar storage as the state backend)
  • Easy integration into testing / CI / DevOps pipelines
  • Embedded stream processing into any existing JVM application, like a lightweight Spring Boot app or a legacy monolith built with old Java EE technologies like EJB.
  • Interactive Queries allow leveraging the state of your application from outside your application. The Kafka Streams API enables your applications to be queryable. Flink’s similar feature, “queryable state”, is approaching the end of its life due to a lack of maintainers.
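
As referenced above, here is a minimal Kafka Streams topology sketch; the topic names and the filter predicate are hypothetical. It shows that the whole application is just a plain JVM process plus the Kafka cluster itself, with no separate processing engine:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class PaymentsFilterApp {
  public static void main(String[] args) {
    // Topic names are hypothetical.
    StreamsBuilder builder = new StreamsBuilder();
    builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()))
        .filter((key, value) -> value.contains("EUR")) // stateless example transformation
        .to("payments-eur", Produced.with(Serdes.String(), Serdes.String()));

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-filter");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    // High availability and persistence are guaranteed via Kafka topics,
    // not via a separate cluster infrastructure.
    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}
```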

Kafka Streams is well-known for building independent, decoupled, lightweight microservices. This differs from submitting a processing job into the Flink (or Spark) cluster; each data product team controls its destiny (e.g., don’t depend on the central Flink team for upgrades or get forced to upgrade). Flink’s application mode enables a similar deployment style for microservices. But:

Today, Kafka Streams and Flink are usually used for different applications. While Flink provides an application mode to build microservices, most people use Kafka Streams for this today. Interactive queries are available in both Kafka Streams and Flink, but the feature was deprecated in Flink as there is not much demand from the community. These two examples show that there is no clear winner. Sometimes Flink is the better choice, and sometimes Kafka Streams makes more sense.

“In summary, while there certainly is an overlap between the Streams API in Kafka and Flink, they live in different parts of a company, largely due to differences in their architecture and thus we see them as complementary systems.” That’s the quote of a “Kafka Streams vs. Flink comparison” article written in 2016 (!) by Stephan Ewen, former CTO of data Artisans, and Neha Narkhede, former CTO of Confluent. While some details changed over time, this old blog post is still pretty accurate today and a good read for a more technical audience.

The domain-specific language (DSL) of Kafka Streams differs from Flink but is also very similar. How can both be true? It depends on who you ask. This (legitimate) subject for debate often divides the Kafka Streams and Flink communities. Kafka Streams has Stream and Table APIs. Flink has DataStream, Table, and SQL APIs. I guess 95% of use cases can be built with both technologies. APIs, infrastructure, experience, history, and many other factors are relevant for choosing the proper stream processing framework.

Some architectural aspects are very different in Kafka Streams and Flink. These need to be understood and can be a pro or con for your use case. For instance, Flink’s checkpointing has the advantage of getting a consistent snapshot, but the disadvantage is that every local error stops the whole job, and everything has to be rolled back to the last checkpoint. Kafka Streams does not have this concept. Local errors can be recovered locally (the corresponding tasks are moved somewhere else; the tasks/threads without errors just continue normally). Another example is Kafka Streams’ hot standby for high availability versus Flink’s fault-tolerant checkpointing system.

Apache Kafka is the de facto standard for data streaming. It includes Kafka Streams, a widely used Java library for stream processing. Apache Flink is an independent and successful open-source project offering a stream processing engine for real-time and batch workloads. The combination of Kafka (including Kafka Streams) and Flink is already widespread in enterprises across all industries.

Both Kafka Streams and Flink have benefits and tradeoffs for stream processing. The freedom of choice of these two leading open-source technologies and the tight integration of Kafka with Flink enables any kind of stream processing use case. This includes hybrid, global and multi-cloud deployments, mission-critical transactional workloads, and powerful analytics with embedded machine learning. As always, understand the different options and choose the right tool for your use case and requirements.

What is your favorite for streaming processing, Kafka Streams, Apache Flink, or another open-source or proprietary engine? In which use cases do you leverage stream processing? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka (including Kafka Streams) + Apache Flink = Match Made in Heaven appeared first on Kai Waehner.

]]>