Fraud Detection in Mobility Services (Ride-Hailing, Food Delivery) with Data Streaming using Apache Kafka and Flink
https://www.kai-waehner.de/blog/2025/04/28/fraud-detection-in-mobility-services-ride-hailing-food-delivery-with-data-streaming-using-apache-kafka-and-flink/
Mon, 28 Apr 2025

Mobility services like Uber, Grab, FREE NOW (Lyft), and DoorDash are built on real-time data. Every trip, delivery, and payment relies on accurate, instant decision-making. But as these services scale, they become prime targets for sophisticated fraud—GPS spoofing, fake accounts, payment abuse, and more. Traditional, batch-based fraud detection can’t keep up. It reacts too late, misses complex patterns, and creates blind spots that fraudsters exploit. To stop fraud before it happens, mobility platforms need data streaming technologies like Apache Kafka and Apache Flink for fraud detection. This blog explores how leading platforms are using real-time event processing to detect and block fraud as it happens—protecting revenue, user trust, and platform integrity at scale.

Fraud Prevention in Mobility Services with Data Streaming using Apache Kafka and Flink with AI Machine Learning

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

The Business of Mobility Services (Ride-Hailing, Food Delivery, Taxi Aggregators, etc.)

Mobility services have become an essential part of modern urban life. They offer convenience and efficiency through ride-hailing, food delivery, car-sharing, e-scooters, taxi aggregators, and micro-mobility options. Companies such as Uber, Lyft, FREE NOW (formerly MyTaxi; recently acquired by Lyft), Grab, Careem, and DoorDash connect millions of passengers, drivers, restaurants, retailers, and logistics partners to enable seamless transactions through digital platforms.

Taxis and Delivery Services in a Modern Smart City

These platforms operate in highly dynamic environments where real-time data is crucial for pricing, route optimization, customer experience, and fraud detection. However, this very nature of mobility services also makes them prime targets for fraudulent activities. Fraud in this sector can lead to financial losses, reputational damage, and deteriorating customer trust.

To effectively combat fraud, mobility services must rely on real-time data streaming with technologies such as Apache Kafka and Apache Flink. These technologies enable continuous event processing and allow platforms to detect and prevent fraud before transactions are finalized.

Why Fraud is a Major Challenge in Mobility Services

Fraudsters continually exploit weaknesses in digital mobility platforms. Some of the most common fraud types include:

  1. Fake Rides and GPS Spoofing: Drivers manipulate GPS data to simulate trips that never occurred. Passengers use location spoofing to receive cheaper fares or exploit promotions.
  2. Payment Fraud and Stolen Credit Cards: Fraudsters use stolen payment methods to book rides or order food.
  3. Fake Drivers and Passengers: Fraudsters create multiple accounts and pretend to be both the driver and passenger to collect incentives. Some drivers manipulate fares by manually adjusting distances in their favor.
  4. Promo Abuse: Users create multiple fake accounts to exploit referral bonuses and promo discounts.
  5. Account Takeovers and Identity Fraud: Hackers gain access to legitimate accounts, misusing stored payment information. Fraudsters use fake identities to bypass security measures.

Fraud not only impacts revenue but also creates risks for legitimate users and drivers. Without proper fraud prevention measures, ride-hailing and delivery companies could face serious losses, both financially and operationally.

The Unseen Enemy: Core Challenges in Mobility Fraud Detection

Traditional fraud detection relies on batch processing and manual rule-based systems. These approaches are no longer effective against the speed and complexity of modern, real-time mobile apps combined with evolving fraud schemes.

Payment Fraud – The Hidden Enemy in a Digital World

Key challenges in mobility fraud detection include:

  • Fraud occurs in real-time, requiring instant detection and prevention before transactions are completed.
  • Millions of events per second must be processed, requiring scalable and efficient systems.
  • Fraud patterns constantly evolve, making static rule-based approaches ineffective.
  • Platforms operate across hybrid and multi-cloud environments, requiring seamless integration of fraud detection systems.

To overcome these challenges, real-time streaming analytics powered by Apache Kafka and Apache Flink provide an effective solution.

Event-driven Architecture for Mobility Services with Data Streaming using Apache Kafka and Flink

Apache Kafka: The Backbone of Event-Driven Fraud Detection

Kafka serves as the core event streaming platform. It captures and processes real-time data from multiple sources such as:

  • GPS location data
  • Payment transactions
  • User and driver behavior analytics
  • Device fingerprints and network metadata

Kafka provides:

  • High-throughput data streaming, capable of processing millions of events per second to support real-time decision-making.
  • An event-driven architecture that enables decoupled, flexible systems—ideal for scalable and maintainable mobility platforms.
  • Seamless scalability across hybrid and multi-cloud environments to meet growing demand and regional expansion.
  • Always-on reliability, ensuring 24/7 data availability and consistency for mission-critical services such as fraud detection, pricing, and trip orchestration.
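
To make the ingestion side concrete, here is a minimal sketch of a Kafka producer publishing a trip event. The topic name, JSON payload, and broker address are illustrative assumptions, not the setup of any specific mobility platform:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TripEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // wait for full replication: trip and payment events must not be lost

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String event = "{\"tripId\":\"t-42\",\"driverId\":\"d-7\",\"lat\":52.52,\"lon\":13.40,\"ts\":1714280000000}";
            // Keying by driver id keeps one driver's events ordered within a partition,
            // which downstream fraud logic (e.g., GPS plausibility checks) relies on.
            producer.send(new ProducerRecord<>("trip-events", "d-7", event));
        }
    }
}
```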

An excellent success story about the transition to data streaming comes from DoorDash: Why DoorDash migrated from Cloud-native Amazon SQS and Kinesis to Apache Kafka and Flink.

Apache Flink enables real-time fraud detection through advanced event correlation and applied AI:

  • Detects anomalies in GPS data, such as sudden jumps, route manipulation, or unrealistic movement patterns (see the sketch after this list).
  • Analyzes historical user behavior to surface signs of account takeovers or other forms of identity misuse.
  • Joins multiple real-time streams—including payment events, location updates, and account interactions—to generate accurate, low-latency fraud scores.
  • Applies machine learning models in-stream, enabling the system to flag and stop suspicious transactions before they are processed.
  • Continuously adapts to new fraud patterns, updating models with fresh data in near real-time to reflect evolving user behavior and emerging threats.
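
As a hedged sketch of the GPS anomaly detection mentioned in the first bullet, the following Flink job keeps the previous position per driver in keyed state and flags physically implausible jumps. The event and alert types, the 200 km/h threshold, and the haversine check are assumptions for illustration, not the production logic of any platform discussed below:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical event and alert types for the sketch.
class GpsEvent { public String driverId; public double lat, lon; public long ts; }
class FraudAlert { public String driverId, reason;
    FraudAlert(String d, String r) { driverId = d; reason = r; } }

public class GpsJumpDetector extends RichFlatMapFunction<GpsEvent, FraudAlert> {
    private transient ValueState<GpsEvent> last; // previous position per driver (keyed state)

    @Override
    public void open(Configuration cfg) {
        last = getRuntimeContext().getState(new ValueStateDescriptor<>("last-gps", GpsEvent.class));
    }

    @Override
    public void flatMap(GpsEvent cur, Collector<FraudAlert> out) throws Exception {
        GpsEvent prev = last.value();
        if (prev != null) {
            double km = haversineKm(prev.lat, prev.lon, cur.lat, cur.lon);
            double h = (cur.ts - prev.ts) / 3_600_000.0; // timestamps in milliseconds
            if (h > 0 && km / h > 200.0) { // assumed plausibility threshold for city traffic
                out.collect(new FraudAlert(cur.driverId, "implausible speed: " + (km / h) + " km/h"));
            }
        }
        last.update(cur);
    }

    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1), dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.pow(Math.sin(dLon / 2), 2);
        return 6371.0 * 2 * Math.asin(Math.sqrt(a));
    }
}
// Usage: gpsStream.keyBy(e -> e.driverId).flatMap(new GpsJumpDetector());
```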

With Kafka and Flink, fraud detection can shift from reactive to proactive, stopping fraudulent transactions before they are completed.

I already covered various data streaming success stories from financial services companies such as PayPal, Capital One, and ING Bank in a dedicated blog post. And a separate case study covers "Fraud Prevention in Under 60 Seconds with Apache Kafka: How A Bank in Thailand is Leading the Charge".

Real-World Fraud Prevention Stories from Mobility Leaders

Fraud is not just a technical issue—it’s a business-critical challenge that impacts trust, revenue, and operational stability in mobility services. The following real-world examples from industry leaders like FREE NOW (Lyft), Grab, and Uber show how data streaming with advanced stream processing and AI is used around the world to detect and stop fraud in real time, at massive scale.

FREE NOW (Lyft): Detecting Fraudulent Trips in Real Time by Analyzing GPS Data of Cars

FREE NOW operates in more than 150 cities across Europe with 48 million users. It integrates multiple mobility services, including taxis, private vehicles, car-sharing, e-scooters, and bikes.

The company was recently acquired by Lyft, the U.S.-based ride-hailing giant known for its focus on multimodal urban transport and strong presence in North America. This acquisition marks Lyft’s strategic entry into the European mobility ecosystem, expanding its footprint beyond the U.S. and Canada.

FREE NOW - former MyTaxi - Company Overview
Source: FREE NOW

Fraud Prevention Approach leveraging Data Streaming (presented at Kafka Summit)

  • Uses Kafka Streams and Kafka Connect to analyze GPS trip data in real-time.
  • Deploys fraud detection models that identify anomalies in trip routes and fare calculations.
  • Operates data streaming on fully managed Confluent Cloud and applications on Kubernetes for scalable fraud detection.
Fraud Prevention in Mobility Services with Data Streaming using Kafka Streams and Connect at FREE NOW
Source: FREE NOW

Example: Detecting Fake Rides

  1. A driver inputs trip details into the app.
  2. Kafka Streams predicts expected trip fare based on distance and duration.
  3. GPS anomalies and unexpected route changes are flagged.
  4. Fraud alerts are triggered for suspicious transactions.
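
A minimal Kafka Streams sketch of steps 2 and 3 could look as follows. The topic names, the CSV value format, and the linear tariff formula are illustrative assumptions; FREE NOW's production system predicts fares with more sophisticated models:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FareCheckApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fare-check");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Value format assumed: "distanceKm,durationMin,reportedFare"
        KStream<String, String> trips = builder.stream("completed-trips");
        trips.filter((driverId, csv) -> {
                String[] f = csv.split(",");
                double expected = 2.50 + 1.20 * Double.parseDouble(f[0])   // base + per-km (made-up tariff)
                                       + 0.35 * Double.parseDouble(f[1]);  // per-minute component
                return Double.parseDouble(f[2]) > 1.5 * expected;          // reported fare far above model
            })
            .to("fare-fraud-alerts");

        new KafkaStreams(builder.build(), props).start();
    }
}
```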

By implementing real-time fraud detection with Kafka Streams and Kafka Connect, FREE NOW (Lyft) has significantly reduced fraudulent trips and improved platform security.

Grab: AI-Powered Fraud Detection for Ride-Hailing and Delivery with Data Streaming and AI/ML

Grab is a leading mobility platform in Southeast Asia, handling millions of transactions daily. Fraud losses in the region amount to 1.6 percent of total revenue.

To address these significant fraud numbers, Grab developed GrabDefence—an AI-powered fraud detection engine that leverages real-time data and machine learning to detect and block suspicious activity across its platform.

Fraud Detection and Presentation with Kafka and AI ML at Grab in Asia
Source: Grab

Fraud Detection Approach

  • Uses Kafka Streams and machine learning for fraud risk scoring.
  • Leverages Flink for feature aggregation and anomaly detection.
  • Detects fraudulent transactions before they are completed.
GrabDefence - Fraud Prevention with Data Streaming and AI / Machine Learning in Grab Mobility Service
Source: Grab

Example: Fake Driver and Passenger Fraud

  1. Fraudsters create accounts as both driver and passenger to claim rewards.
  2. Kafka ingests device fingerprints, payment transactions, and ride data.
  3. Flink aggregates historical fraud behavior and assigns risk scores.
  4. High-risk transactions are blocked instantly.
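
Step 3 of this flow, aggregating behavior per device and flagging suspicious patterns, can be sketched as a Flink windowed aggregation. The event type, the threshold, and the window size are illustrative assumptions, not GrabDefence internals:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Hypothetical signup event: which account was created on which device.
class SignupEvent { public String deviceId; public String accountId; }

public class DeviceCollusionJob {
    public static DataStream<String> detect(DataStream<SignupEvent> signups) {
        return signups
            .keyBy(e -> e.deviceId) // group all signups per device fingerprint
            .window(TumblingProcessingTimeWindows.of(Time.hours(24)))
            .process(new ProcessWindowFunction<SignupEvent, String, String, TimeWindow>() {
                @Override
                public void process(String deviceId, Context ctx,
                                    Iterable<SignupEvent> events, Collector<String> out) {
                    int count = 0;
                    for (SignupEvent e : events) count++;
                    if (count > 3) { // assumed threshold: many accounts from one device look like collusion
                        out.collect("suspicious device " + deviceId + ": " + count + " new accounts in 24h");
                    }
                }
            });
    }
}
```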

With GrabDefence built with data streaming, Grab reduced fraud rates to 0.2 percent, well below the industry average. Learn more about GrabDefence in the Kafka Summit talk.

Uber: Project RADAR – AI-Powered Fraud Detection with Human Oversight

Uber processes millions of payments per second globally. Fraud detection is complex due to chargebacks and uncollected payments.

To combat this, Uber launched Project RADAR—a hybrid system that combines machine learning with human reviewers to continuously detect, investigate, and adapt to evolving fraud patterns in near real time. Low latency is not required in this scenario, and humans are in the loop of the business process, so Apache Spark is sufficient for Uber.

Uber Project Radar for Scam Detection with Humans in the Loop
Source: Uber

Fraud Prevention Approach

  • Uses Kafka and Spark for multi-layered fraud detection.
  • Implements machine learning models to detect chargeback fraud.
  • Incorporates human analysts for rule validation.
Uber Project RADAR with Apache Kafka and Spark for Scam Detection with AI and Machine Learning
Source: Uber

Example: Chargeback Fraud Detection

  1. Kafka collects all ride transactions in real time.
  2. Stream processing detects anomalies in payment patterns and disputes.
  3. AI-based fraud scoring identifies high-risk transactions.
  4. Uber’s RADAR system allows human analysts to validate fraud alerts.

Uber’s combination of AI-driven detection and human oversight has significantly reduced chargeback-related fraud.

Fraud in mobility services is a real-time challenge that requires real-time solutions that work 24/7, even at extreme scale for millions of events. Traditional batch processing systems are too slow, and static rule-based approaches cannot keep up with evolving fraud tactics.

By leveraging data streaming with Apache Kafka in conjunction with Kafka Streams or Apache Flink, mobility platforms can:

  • Process millions of events per second to detect fraud in real time.
  • Prevent fraudulent transactions before they occur.
  • Use AI-driven real-time fraud scoring for accurate risk assessment.
  • Adapt dynamically through continuous learning to evolving fraud patterns.

Mobility platforms such as Uber, Grab, and FREE NOW (Lyft) are leading the way in using real-time streaming analytics to protect their platforms from fraud. By implementing similar approaches, other mobility businesses can enhance security, reduce financial losses, and maintain customer trust.

Real-time fraud prevention in mobility services is not an option; it is a necessity. The ability to detect and stop fraud in real time will define the future success of ride-hailing, food delivery, and urban mobility platforms.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And download my free book about data streaming use cases.

Virta’s Electric Vehicle (EV) Charging Platform with Real-Time Data Streaming: Scalability for Large Charging Businesses
https://www.kai-waehner.de/blog/2025/04/22/virtas-electric-vehicle-ev-charging-platform-with-real-time-data-streaming-scalability-for-large-charging-businesses/
Tue, 22 Apr 2025

The Electric Vehicle (EV) revolution is here, but scaling charging infrastructure and integrating it with the energy system present challenges—rapid power supply and demand fluctuations, billing complexity, and real-time availability updates. Virta, a global leader in smart EV charging, is leveraging real-time data streaming to optimize operations, improve user experience, and drive sustainability. By integrating Apache Kafka and Confluent Cloud, Virta ensures seamless energy distribution, predictive maintenance, and dynamic pricing for a smarter, greener future. Read how data streaming is transforming EV charging and enabling scalable, intelligent infrastructure.

Electric Vehicle (EV) Charging - Automotive and ESG with Data Streaming at Virta

I spoke with Jussi Ahtikari (Chief AI Officer at Virta) at a HotTopics C-Suite Exchange about Virta’s business model around EV charging networks and how they leverage data streaming. The following is a summary of this excellent success story about an innovative EV charging platform.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including several success stories around Kafka and Flink to improve ESG.

The Evolution and Challenges of Electric Vehicle (EV) Charging

The global shift towards electric vehicles (EVs) is accelerating, driven by the surge in variable renewable energy (wind, solar) production, the need for sustainable and more cost-efficient transportation solutions, government incentives, and rapid advancements in battery technology. EV charging infrastructure plays a critical role in making this transition successful. It ensures that drivers have access to reliable and efficient charging options while keeping the costs of energy and charging operations in check and the energy system in balance.

The innovation in EV charging goes beyond simply providing power to vehicles. Intelligent charging networks, dynamic pricing models, and energy management solutions are transforming the industry. Sustainability is also a key factor, as efficient energy consumption and integration with the renewable energy system contribute to environmental, social, and governance (ESG) goals.

While user numbers and charged energy volumes grow, the real-time interplay with the energy system, demand fluctuations, complex billing systems, and real-time station availability updates require a scalable and resilient data infrastructure. Delays in processing real-time data can lead to inefficient energy distribution, poor user experience, and lost revenue.

Virta: Innovating the Future of EV Charging

Virta is a digital cloud platform for electric vehicle (EV) charging businesses and a global leader in connecting smart charging infrastructure and EV battery capacity with the renewable energy system via bi-directional charging (V2G) and demand response (V1G).

The digital Virta EV Energy platform provides a comprehensive suite of solutions for charging businesses to launch and manage their own EV charging networks. Virta’s full-service charging platform enables Charging Network and Business Management, Transactions, Pricing, Payments and Invoicing, EV Driver and Fleet Services, Roaming, Energy Management, and Virtual Power Plant services.

Its Charge Point Management System (CPMS) supports over 450 charger models, allowing seamless integration with third-party infrastructure. Virta is the only provider combining a CPMS with an energy flexibility platform.

Virta EV Charging Platform
Source: Virta

Virta Platform Connecting 100,000+ Charging Stations Serving Millions of EV Drivers

The Virta platform is utilised by professional charge point operators (CPOs) and e-mobility service providers (EMPs) across energy, petrol, retail, automotive and real estate industries in 36 countries in Europe and South-East Asia. Virta is headquartered in Helsinki, Finland.

Virta manages real-time data from well over 100,000 EV charging stations, serving millions of EV drivers, and processes approximately 40 GB of real-time data every hour. Including roaming partnerships, the platform offers EV drivers access to a total of over 620,000 public charging stations in over 60 countries.

With this scale, real-time responsiveness is critical. Each time a charging station sends a signal—for example, when a driver starts charging—the platform must immediately trigger a series of actions:

  • Start billing
  • Update real-time status in mobile apps
  • Notify roaming networks
  • Update metrics and statistics
  • Conduct fraud checks

In the early days of electric mobility, all of these operations could be handled in a monolithic system using tightly coupled and synchronized code. According to Jussi Ahtikari, Chief AI Officer at Virta, this would have made the system “complex, difficult to maintain, and hard to scale” as data volumes grew. The team therefore identified early on the need for a more modular, scalable, and real-time architecture to support its rapid growth and evolving service portfolio.

Innovative Industry Partnerships: Virta and Valeo

Virta is also exploring new opportunities in the EV ecosystem through its partnership with Valeo, a leader in automotive and energy solutions. The companies are working on integrating Valeo’s Ineez charging technology with Virta’s CPMS platform to enhance fleet charging, leasing services, and vehicle-to-grid (V2G) capabilities.

Vehicle-to-grid technology enables EVs to act as distributed energy storage, feeding excess power back into the grid during peak demand. This innovation is expected to play a critical role in balancing electricity supply and demand, contributing to cheaper electricity and a more stable renewables-based energy system.

The Role of Data Streaming in ESG and EV Charging

Sustainability and environmental responsibility are key drivers of ESG initiatives in industries such as energy, transportation, and manufacturing. Data streaming plays a crucial role in achieving ESG goals by enabling real-time monitoring, predictive maintenance, and energy efficiency improvements.

In the EV charging industry, real-time data streaming supports use cases ranging from optimized energy distribution and predictive maintenance to dynamic pricing and live station availability updates. Foreseeing the growing need for these real-time insights led Virta to adopt a data streaming approach with Confluent.

Virta’s Data Streaming Transformation

To maintain its rapid growth and provide an exceptional customer experience, Virta needed a scalable, real-time data streaming solution. The company turned to Confluent’s data streaming platform (DSP), powered by Apache Kafka, to process millions of messages per hour and ensure seamless operations.

Scaling Challenges and the Need for Real-Time Processing

Virta’s rapid growth to a scale of millions of charging events and tens of gigawatt hours of charged energy per month across Europe and South-East Asia resulted in massive volumes of data that needed to be processed instantly, something legacy systems based on sequential authorization would have struggled with.

Without real-time updates, large scale charging operations would face issues such as:

  • Unclear station availability
  • Slow transaction processing
  • Inaccurate billing information

Initially, Virta worked with open-source Apache Kafka but found managing high-volume data streams at scale to be increasingly resource-intensive. The team therefore sought an enterprise-grade solution that would remove operational complexities while providing robust real-time capabilities.

Deploying A Data Streaming Platform for Scalable EV Charging

Confluent has become the backbone of Virta’s real-time data architecture. With Confluent’s event streaming platform, Virta is able to maintain a modern event-driven microservices architecture. Instead of tightly coupling all business logic into one system, each charging event—such as a driver starting a session—is published as a single, centralized event. Independent microservices subscribe to that event to trigger specific actions like billing, mobile app updates, roaming notifications, fraud detection, and more.
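
This publish-once, subscribe-many pattern maps directly to Kafka consumer groups: each microservice reads the same event stream independently under its own group id. A minimal sketch of one such subscriber follows; the topic name and business logic are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BillingService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Each microservice (billing, mobile app, roaming, fraud) uses its OWN group.id,
        // so every service independently receives a full copy of the charging events.
        props.put("group.id", "billing-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("charging-session-events")); // assumed topic name
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    System.out.println("start billing for session " + rec.key()); // placeholder logic
                }
            }
        }
    }
}
```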

Here is a diagram of Virta’s cloud-Native microservices architecture powered by AWS, Confluent Cloud, Snowflake, Redis, OpenSearch, and other technologies:

Virta Cloud-Native Microservices Architecture for EV Charging Platform powered by AWS, Confluent Cloud, Snowflake, Redis, OpenSearch
Source: Virta

This architectural shift with an event-driven architecture and the data streaming platform as central nervous system has significantly improved scalability, maintainability, and fault isolation. It has also accelerated innovation with fast roll-out times of new services, including audit trails, improved data governance through schemas, and the foundation for AI-powered capabilities—all built on clean, real-time data streams.

Key Benefits of a SaaS Data Streaming Platform for Virta

As a fully managed data streaming platform, Confluent Cloud has eliminated the need for Virta to maintain Kafka clusters manually, allowing its engineering teams to focus on innovation rather than infrastructure management:

  • Elastic scalability: Automatically scales up to handle peak loads, ensuring uninterrupted service.
  • Real-time processing: Supports 45 million messages per hour, enabling immediate updates on charging status and availability.
  • Simplified development: Tools such as Schema Registry and pre-built APIs provide a standardized approach for developers, speeding up feature deployment.

Data Streaming Landscape: Spoilt for Choice – Open Source Kafka, Confluent, and many other Vendors

To navigate the evolving data streaming landscape, Virta chose a cloud-native, enterprise-grade platform that balances reliability, scalability, cost-efficiency, and ease of use. While many streaming technologies exist, Confluent offered the right trade-offs between operational simplicity and real-time performance at scale.

Read more about the different data streaming frameworks, platforms and cloud services in the data streaming landscape overview: The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Business Impact of a Data Streaming Platform

By leveraging Confluent Cloud as its cloud-native and serverless data streaming platform, Virta has realized significant business benefits:

1. Faster Time to Market

Virta’s teams can now deploy new app features, charge points, and business services more quickly. The company has regained the agility of a startup, rolling out improvements without infrastructure bottlenecks.

2. Instant Updates for Customers and Operators

With real-time data streaming, Virta can update station availability and configuration changes in less than a second. This ensures that customers always have the latest information at their fingertips.

3. Cost Savings through Usage-Based Pricing

Virta’s shift to a usage-based pricing model has optimized its operational expenses. Instead of maintaining excess capacity, the company only pays for the resources it consumes.

4. Future-Ready Infrastructure for Advanced Analytics

Virta is building the future of real-time analytics, predictive maintenance, and smart billing by integrating Confluent with Snowflake’s AI-powered data cloud.

By decoupling data streams with Kafka, Virta ensures data consistency, scalability, and agility—enabling advanced analytics without operational bottlenecks.

Beyond EV Charging: Broader Energy and ESG Use Cases

Virta’s success with real-time data streaming highlights broader applications across the energy and ESG sectors. Similar data-driven solutions are being deployed for:

  • Smart grids: Real-time monitoring of electricity distribution to optimize supply and demand.
  • Renewable energy integration: Managing wind and solar power fluctuations with predictive analytics.
  • Industrial sustainability: Tracking carbon emissions and optimizing resource utilization.

The transition to electric mobility requires more than just an increase in charging stations. The ability to process and act on data in real time is critical to optimizing the use and costs of energy and infrastructure, enhancing user experience, and driving sustainability.

Virta’s use of a serverless data streaming platform demonstrates the power of real-time data streaming in enabling scalable, efficient, and future-ready EV charging solutions. By eliminating infrastructure constraints, improving responsiveness, and reducing operational costs, Virta is setting new industry standards for innovation in mobility and energy management.

The EV charging landscape will grow tenfold within the next ten years and, especially with the mass adoption of bi-directional charging (V2G), integrate seamlessly with the energy system. Real-time data streaming will serve as the cornerstone for this evolution, helping businesses navigate challenges while unlocking new opportunities for sustainability and profitability.

For more data streaming success stories and use cases, make sure to download my free ebook. Please let me know your thoughts, feedback and use cases on LinkedIn and stay in touch via my newsletter.

How Data Streaming with Apache Kafka and Flink Drives the Top 10 Innovations in FinServ
https://www.kai-waehner.de/blog/2025/02/09/how-data-streaming-with-apache-kafka-and-flink-drives-the-top-10-innovations-in-finserv/
Sun, 09 Feb 2025

The FinServ industry is undergoing a major transformation, driven by emerging technologies that enhance efficiency, security, and customer experience. At the heart of these innovations is real-time data streaming, enabled by Apache Kafka and Apache Flink. These technologies allow financial institutions to process and analyze data instantly to make finance smarter, more secure, and more accessible. This blog post explores the top 10 emerging financial technologies and how data streaming plays a critical role in making them a reality.

Top 10 Real Time Innovations in FinServ with Data Streaming using Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases across all industries.

Data Streaming in the FinServ Industry

This article builds on FinTechMagazine.com’s “Top 10 Emerging Technologies in Finance” by mapping each of these innovations to real-time data streaming concepts, possibilities, and real-world success stories.

Event-driven Architecture with Data Streaming using Apache Kafka and Flink in Financial Services

By leveraging Apache Kafka and Apache Flink, financial institutions can process transactions instantly, detect fraud proactively, and enhance customer experiences with real-time insights. Each emerging technology—whether IoT payment networks, AI-powered banking, or embedded finance—relies on the ability to stream, analyze, and act on data in real time, making data streaming a foundational enabler of the future of finance.

10. IoT Payment Networks: Real-time Processing for Seamless Payments

IoT payment networks enable automated, contactless transactions using connected devices like smartwatches, cars, and home appliances. Whether it’s a fridge restocking groceries or a car paying for tolls, these interactions generate massive real-time data streams that must be processed instantly and securely.

  • Fraud Detection in Milliseconds – Flink analyzes streaming transaction data to detect anomalies, flagging fraudulent activity before payments are approved.
  • Reliable Connectivity – Kafka ensures payment events from IoT devices are securely transmitted and processed, preventing dropped or duplicate transactions.
  • Dynamic Pricing & Offers – Flink processes sensor and market data to adjust prices dynamically (e.g., surge pricing for EV charging stations) and deliver real-time personalized discounts.
  • Edge Processing for Low-Latency Payments – Kafka enables local transaction validation on IoT devices, reducing lag in autonomous vehicle payments and retail checkout systems.
  • Compliance & Security – Streaming pipelines support real-time monitoring, encryption, and anomaly detection, ensuring IoT payments meet financial regulations like PSD2 and PCI DSS.

In financial services, don’t make the mistake of only looking inward for lessons—other industries have been solving similar challenges for years. Consumer IoT and Apache Kafka have long been used together in sectors like retail, where real-time data integration is critical for unified commerce, rewards programs, social selling, and many other use cases.

9. Voice-First Banking: Turning Conversations into Transactions

Voice-first banking enables customers to interact with financial services using smart speakers, virtual assistants, and mobile voice recognition. Whether checking an account balance, making a payment, or applying for a loan, these interactions require instant access to multiple backend systems—from core banking and CRM to fraud detection and credit scoring systems.

To make voice banking seamless, fast, and secure, banks must integrate real-time data streaming between AI-powered voice assistants and backend financial systems. This is where Apache Kafka and Apache Flink come in.

  • Seamless Integration Across Banking Systems – Voice assistants need real-time access to core banking (account balances, transactions), CRM (customer history), risk systems (fraud checks), and AI analytics. Kafka acts as a high-speed messaging and integration layer (aka ESB/middleware), ensuring that voice requests are instantly routed to the right backend services (including legacy technologies, such as mainframe) and responses are processed in milliseconds.
  • Instant Voice Query Processing – When a customer asks, “What’s my balance?”, Flink streams real-time transaction data from Kafka to retrieve the latest balance, rather than relying on outdated batch data.
  • Secure Authentication & Fraud Detection – Streaming pipelines analyze voice patterns in real time to detect fraud and trigger multi-factor authentication (MFA) if needed.
  • Personalized & Context-Aware Banking and Advertising – Flink continuously enriches customer profiles by analyzing past transactions, spending habits, and preferences—allowing the system to offer real-time financial insights (e.g., suggesting a savings plan based on spending trends).
  • Asynchronous Processing for Long-Running Requests – For complex tasks like loan applications, Kafka handles asynchronous processing—initiating background workflows across multiple systems while keeping the customer engaged.

For instance, Northwestern Mutual presented at Kafka Summit how the bank leverages Apache Kafka as a database for real-time transaction processing.

8. Autonomous Finance Platforms: AI-Driven Financial Decision Making

Autonomous finance platforms use AI, machine learning, and multi-agent systems to optimize savings, investments, and budgeting for consumers. These platforms act as digital financial advisors to make real-time decisions based on market data, user spending habits, and risk models.

  • Multi-Agent AI System Coordination – Autonomous finance platforms use multiple AI agents to handle different aspects of financial decision-making (e.g., portfolio optimization, credit assessment, fraud detection). Kafka streams data between these AI agents, ensuring they can collaborate in real time to refine investment and savings strategies.
  • Streaming Market Data Integration – Kafka ingests live stock prices, interest rates, and macroeconomic data, making it instantly available for AI models to adjust financial strategies.
  • Real-Time Customer Insights – Flink continuously analyzes customer transactions and spending behavior to enable AI-driven recommendations (e.g., automatically moving surplus funds into an interest-bearing account).
  • Predictive Portfolio Management – By combining real-time stock market data with AI-driven risk models, Flink helps adjust portfolio allocations based on current trends, ensuring maximum returns while minimizing exposure.
  • Automated Risk Mitigation – Autonomous finance systems must react instantly to market shifts. Flink’s real-time monitoring detects economic downturns or sudden market crashes, triggering immediate adjustments to investment portfolios or loan interest rates.
  • Event-Driven Financial Automation – Kafka enables real-time triggers (e.g., an AI agent detecting high inflation can automatically adjust a savings strategy).

7. RegTech 3.0: Automating Compliance and Risk Monitoring

RegTech is modernizing compliance by replacing slow batch audits with continuous real-time monitoring, automated reporting, and proactive fraud detection.

Financial institutions need instant insights into transactions, risk exposure, and regulatory changes—Kafka and Flink make this possible by streaming, analyzing, and automating compliance at scale.

  • Continuous Transaction Monitoring – Kafka streams every transaction in real time, enabling Flink to detect fraud, money laundering, or unusual patterns instantly—ensuring compliance with AML and KYC regulations (see the sketch after this list).
  • Automated Regulatory Reporting – Flink processes compliance events as they happen, ensuring regulatory bodies receive up-to-date reports without delays. Kafka integrates compliance data across banking systems for audit-ready records.
  • Real-Time Fraud Prevention – Flink analyzes transaction behavior in milliseconds, detecting anomalies and triggering security actions like transaction blocking or multi-factor authentication.
  • Event-Driven Compliance Alerts – Kafka ensures instant alerts when regulations change, allowing banks to adapt in real time instead of relying on manual updates.
  • Proactive Risk Management – By analyzing live risk factors across transactions, users, and markets, Flink helps financial institutions identify and prevent compliance violations before they occur.
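
As an illustration of the continuous transaction monitoring bullet above, the following sketch uses Flink's CEP library (a separate flink-cep dependency) to flag a classic AML pattern: many just-below-threshold transfers from one account in a short window ("structuring"). The event type, thresholds, and window are assumptions for the example:

```java
import java.util.List;
import java.util.Map;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

// Hypothetical transaction type for the sketch.
class Txn { public String accountId; public double amount; }

public class StructuringDetector {
    public static DataStream<String> detect(DataStream<Txn> txns) {
        // "Structuring": many just-below-threshold transfers in a short window, a classic AML red flag.
        Pattern<Txn, ?> structuring = Pattern.<Txn>begin("small")
            .where(new SimpleCondition<Txn>() {
                @Override
                public boolean filter(Txn t) {
                    return t.amount > 8_000 && t.amount < 10_000; // assumed 10k reporting threshold
                }
            })
            .timesOrMore(5)             // five or more such transfers ...
            .within(Time.minutes(30));  // ... within 30 minutes per account

        return CEP.pattern(txns.keyBy(t -> t.accountId), structuring)
            .select(new PatternSelectFunction<Txn, String>() {
                @Override
                public String select(Map<String, List<Txn>> match) {
                    List<Txn> hits = match.get("small");
                    return "possible structuring on account " + hits.get(0).accountId
                         + " (" + hits.size() + " transfers)";
                }
            });
    }
}
```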

Continuous Regulatory Reporting and Compliance in FinServ with Data Streaming using Kafka and Flink

For example, KOR leverages data streaming to revolutionize compliance and regulatory reporting in the derivatives market by enabling on-demand historical reporting and real-time insights that were previously difficult to achieve with traditional batch processing. By using Kafka as a persistent state store, KOR ensures an immutable log of data that allows regulators to track changes over time, reconcile historical corrections, and meet compliance requirements more efficiently than legacy ETL-based big data systems. Read the entire KOR success story in my ebook.

6. Central Bank Digital Currencies (CBDC): The Future of Government-Backed Digital Money

Central Bank Digital Currencies (CBDC) are digital versions of national currencies, designed to enable faster, more secure, and highly scalable financial transactions.

Unlike cryptocurrencies, CBDCs are government-backed, meaning they require robust, real-time infrastructure capable of handling millions of transactions per second. They also need instant settlement, fraud detection, and cross-border interoperability—all of which depend on real-time data streaming.

  • Instant Settlement – Kafka ensures that CBDC transactions are processed and confirmed in real time, eliminating delays in digital payments. This allows central banks to enable 24/7 instant transactions, even in cross-border scenarios.
  • Scalability for Nationwide Adoption – Flink dynamically processes millions of transactions per second, ensuring that a CBDC system can handle high demand without bottlenecks or downtime.
  • Cross-Border Payments & Exchange Rate Optimization – Flink analyzes foreign exchange markets in real time and ensures optimized B2B data exchange for currency conversion and detecting suspicious cross-border activities for fraud prevention.
  • Regulatory Monitoring & Compliance – Kafka continuously streams transaction data to regulatory bodies. This ensures governments have real-time visibility into the movement of digital currencies.

At Kafka Summit Bangalore 2024, Mindgate Solutions presented its successful integration of Central Bank Digital Currency (CBDC) into banking apps, leveraging real-time data streaming to enable seamless digital payments. Mindgate utilized a Kafka-based microservices architecture to ensure scalability, security, and reliability, reinforcing its leadership in India’s real-time payments ecosystem while processing over 8 billion transactions per month.

5. Green Fintech Infrastructure: Sustainability and ESG in Finance

Green fintech focuses on tracking carbon footprints, ESG (Environmental, Social, and Governance) investments, and climate risks in real time.

As financial institutions shift towards sustainable investment strategies, they need accurate, real-time data on environmental impact, regulatory compliance, and green investment opportunities.

  • Real-Time Carbon Tracking – Kafka streams emissions and sustainability data from supply chains to enable instant carbon footprint analysis.
  • Automated ESG Compliance – Flink analyzes sustainability reports and investment portfolios, automatically flagging non-compliant companies or assets.
  • Green Investment Insights – Real-time analytics match investors with eco-friendly projects, funds, and companies, helping financial institutions promote sustainable investments.

Event-Driven Architecture for Continuous ESG Optimization

More details about optimizing the ESG footprint with data streaming: “Green Data, Clean Insights: How Kafka and Flink Power ESG Transformations“.

4. AI-Powered Personalized Banking: Hyper-Personalized Customer Experiences

AI-driven banking solutions are transforming how customers interact with financial institutions to provide real-time insights, spending recommendations, and fraud alerts based on user behavior.

  • Real-Time Spending Analysis – Flink continuously processes live transaction data, identifying spending patterns to provide instant budgeting recommendations.
  • Personalized Alerts & Recommendations – Kafka streams transaction events to banking apps, notifying users of unusual spending, low balances, or savings opportunities.
  • Automated Financial Planning – Flink enables AI-driven financial assistance, helping users optimize savings, credit usage, and investments based on real-time insights.

Personalized Omnichannel Customer Experience in FinServ with Data Streaming using Kafka and Flink

A good example is how Erste Group Bank modernized its mobile banking experience with a hyper-personalized approach to ensure that customers receive tailored financial insights while prioritizing data consistency over real-time updates. By offloading data from expensive mainframes to a cloud-native, microservices-driven architecture, Erste Group Bank reduced costs, maintained compliance, and improved operational efficiency—ensuring a seamless flow of consistent, high-quality data across its legacy and modern banking applications. Read the entire Erste Group Bank success story in my ebook.

3. Decentralized Identity Solutions: Secure Identity Without Central Authorities

Decentralized identity solutions allow users to control their personal data, eliminating the need for centralized databases that are vulnerable to hacks. These systems use blockchain and zero-knowledge proofs for secure, passwordless authentication, but require real-time verification and fraud prevention measures.

  • Cybersecurity in Real Time – Kafka streams biometric and identity verification data to fraud detection engines, ensuring instant risk analysis.
  • Passwordless Authentication – Kafka integrates blockchain and biometric authentication to enable real-time identity validation without traditional passwords.
  • Secure KYC (Know Your Customer) Processing – Flink processes identity verification requests instantly, ensuring faster onboarding and fraud-proof financial transactions.

2. Quantum-Resistant Cryptography: Securing Financial Data in the Quantum Era

Quantum computing poses a major risk to traditional encryption methods, requiring financial institutions to adopt post-quantum cryptography to secure sensitive financial transactions and user data.

  • Scalable Cryptographic Upgrades – Streaming data pipelines allow banks to deploy cryptographic updates instantly, ensuring financial systems remain secure without downtime.
  • Threat Detection & Security Analysis – Flink analyzes live transaction patterns to identify potential vulnerabilities in encryption algorithms before they are exploited.

Nobody knows where quantum computing is headed. Frankly, this is the only section of the top 10 finance innovations where I am not sure how much data streaming will be able to help or whether completely new paradigms will emerge.

1. Embedded Finance: Banking Services in Every Digital Experience

Embedded finance integrates banking, payments, lending, and insurance into non-financial platforms, allowing companies like Uber, Shopify, and Apple to offer seamless financial services within their ecosystems.

To function smoothly, embedded finance requires real-time data integration between payment processors, credit scoring systems, fraud detection tools, and regulatory bodies.

  • Instant Payments & Transactions – Kafka streams payment data in real time, enabling seamless in-app purchases and instant money transfers.
  • Real-Time Credit Scoring & Lending – Flink analyzes transaction histories to provide instant credit approvals for loans and BNPL (Buy Now, Pay Later) services.
  • Fraud Prevention & Compliance – Streaming analytics detect suspicious behavior in real time, ensuring secure embedded financial transactions.

Tech giants like Uber and Shopify have embedded financial services directly into their platforms using event-driven architectures powered by Kafka, enabling real-time payments, lending, and fraud detection. By integrating finance seamlessly into their ecosystems, these companies enhance customer experience, create new revenue streams, and redefine how consumers interact with financial services.

Just like Uber and Shopify use event-driven architectures for real-time payments and financial services, Stripe and many similar FinTech companies power embedded finance for businesses by providing seamless, scalable payment infrastructure. To ensure six-nines (99.9999%) availability, Stripe relies on Apache Kafka as its financial source of truth to enable ultra-reliable transaction processing and real-time financial insights.

The Future of FinServ Is Real-Time: Are You Ready for Data Streaming?

The future of finance is real-time, intelligent, and seamlessly integrated into digital ecosystems. The ability to process massive amounts of financial data instantly is no longer optional—it’s a competitive necessity for operational and analytical use cases.

Data streaming with Apache Kafka and Apache Flink provides the foundation for scalability, security, and real-time analytics that modern financial services demand. By embracing data streaming, financial institutions can deliver:

  • Faster transactions
  • Proactive fraud prevention
  • Better customer experiences
  • Regulatory compliance

Finance is evolving from batch processing to real-time intelligence—and the companies that adopt streaming-first architectures will lead the industry into the future.

How do you leverage data streaming with Kafka and Flink in financial services? Let’s discuss on LinkedIn or X (formerly Twitter). Also join the data streaming community and stay informed about new blog posts by subscribing to my newsletter. And make sure to download my free book about data streaming use cases across all industries.

Apache Flink: Overkill for Simple, Stateless Stream Processing and ETL?
https://www.kai-waehner.de/blog/2025/01/14/apache-flink-overkill-for-simple-stateless-stream-processing/
Tue, 14 Jan 2025

When discussing stream processing engines, Apache Flink often takes center stage for its advanced capabilities in stateful stream processing and real-time data analytics. However, a common question arises: is Flink too heavyweight for simple, stateless stream processing and ETL tasks? The short answer for open-source Flink is often yes. But the story evolves significantly when looking at SaaS Flink products such as Confluent Cloud’s Flink offering, with its serverless architecture, multi-tenancy, consumption-based pricing, and no-code/low-code capabilities like Flink Actions. This post explores the considerations and trade-offs to help you decide when Flink is the right tool for your data streaming needs, and when Kafka Streams or Single Message Transforms (SMT) within Kafka Connect are the better choice.

Apache Flink - Overkill for Simple Stateless Stream Processing

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch.

The Nature of Stateless Stream Processing

Stateless stream processing, as the name implies, processes each event independently, with no reliance on prior events or context. This simplicity lends itself to use cases such as filtering, transformations, and simple ETL operations. Stateless tasks are:

  • Efficient: They don’t require state management, reducing overhead.
  • Scalable: Easily parallelized since there is no dependency between events.
  • Minimalistic: Often achievable with simpler, lightweight frameworks like Kafka Streams or Kafka Connect’s Single Message Transforms (SMT).

For example, filtering transactions above a certain amount or transforming event formats for downstream systems are classic stateless tasks that demand minimal computational complexity.
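
To make this concrete, a complete stateless filter fits in a few lines of Kafka Streams, with no state stores and no cluster beyond Kafka itself. The topic names and the value encoding (the amount as a plain decimal string) are assumptions for this sketch:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class LargePaymentFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "large-payment-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("payments")
               // Stateless: each record is judged on its own, no state store required.
               .filter((txnId, amount) -> Double.parseDouble(amount) > 10_000)
               .to("large-payments");

        new KafkaStreams(builder.build(), props).start();
    }
}
```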

In these scenarios, deploying a robust and feature-rich framework like open-source Apache Flink might seem excessive. Flink’s rich API and state management features are unnecessary for such straightforward use cases. Instead, tools with smaller footprints and simpler deployment models, such as Kafka Streams, often suffice.

Apache Flink is a powerhouse. It’s designed for advanced analytics, stateful processing, and complex event patterns. But this sophistication of the open source framework comes with complexity:

  1. Operational Overhead: Setting up and maintaining Flink in an open-source environment can require significant infrastructure and expertise.
  2. Resource Intensity: Flink’s distributed architecture and stateful processing capabilities are resource-hungry, often overkill for tasks that don’t require stateful processing.
  3. Complexity in Development: The Flink API is robust but comes with a steeper learning curve. The combination with Kafka (or another streaming engine) requires understanding two frameworks. In contrast, Kafka Streams is Kafka-native, offering a single, unified framework for stream processing, which can reduce complexity for basic tasks.

For organizations that need to perform straightforward stateless operations, investing in the full Flink stack can feel like using a sledgehammer to crack a nut. Having said this, FlinkSQL simplifies development for certain personas, providing a more accessible interface beyond just Java and Python.
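
For illustration, here is roughly how the same kind of stateless filter looks in Flink SQL, submitted from Java via the Table API. The table definitions and connector options are illustrative assumptions:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkSqlFilter {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        tEnv.executeSql(
            "CREATE TABLE payments (txn_id STRING, amount DOUBLE) WITH (" +
            " 'connector' = 'kafka', 'topic' = 'payments'," +
            " 'properties.bootstrap.servers' = 'localhost:9092'," +
            " 'scan.startup.mode' = 'latest-offset', 'format' = 'json')");

        tEnv.executeSql(
            "CREATE TABLE large_payments (txn_id STRING, amount DOUBLE) WITH (" +
            " 'connector' = 'kafka', 'topic' = 'large-payments'," +
            " 'properties.bootstrap.servers' = 'localhost:9092', 'format' = 'json')");

        // The whole "application" is one declarative statement; no Java stream code at all.
        tEnv.executeSql("INSERT INTO large_payments SELECT * FROM payments WHERE amount > 10000");
    }
}
```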

The conversation shifts dramatically with serverless Flink cloud offerings, such as Confluent Cloud, which address many of the challenges associated with running open-source Flink. Let’s unpack what makes serverless Flink a more attractive choice, even for simpler use cases.

1. Serverless Architecture

With a Serverless stream processing service, Flink operates on a fully serverless model, eliminating the need for heavy infrastructure management. This means:

  • No Setup Hassles: Developers focus purely on application logic, not cluster provisioning or tuning.
  • Elastic Scaling: Resources automatically scale with the workload, ensuring efficient handling of varying traffic patterns without manual intervention. One of the biggest challenges of self-managing Flink is over-provisioning resources to handle anticipated peak loads. Elastic scaling mitigates this inefficiency.

2. Multi-Tenancy

Multi-tenant design allows multiple applications, teams or organizations to share the same infrastructure securely. This reduces operational costs and complexity compared to managing isolated clusters for each workload.

3. Consumption-Based Pricing

One of the key barriers to adopting Flink for simple tasks is cost. A truly Serverless Flink offering mitigates this with a pay-as-you-go pricing model:

  • You only pay for the resources you use, making it cost-effective for both lightweight and high-throughput workloads.
  • It aligns with the scalability of stateless stream processing, where workloads may spike temporarily and then taper off.

4. Bridging the Gap with No-Code and Low-Code Solutions

The rise of citizen integrators and the demand for low-code/no-code solutions have reshaped how organizations approach data streaming. Less-technical users, such as business analysts or operational teams, often face challenges when trying to engage with technical platforms designed for developers.

Low-code/no-code tools address this by providing intuitive interfaces that allow users to build, deploy, and monitor pipelines without deep programming knowledge. These solutions empower business users to take charge of simple workflows and integrations, significantly reducing time-to-value while minimizing the reliance on technical teams.

For example, capabilities like Flink Actions in Confluent Cloud offer a user-friendly approach to deploying stream processing pipelines without coding. By simplifying the process and making it accessible to non-technical stakeholders, these tools enhance collaboration and ensure faster outcomes without compromising performance or scalability. For instance, you can apply ETL functions such as transformation, deduplication, or field masking:

Confluent Cloud - Apache Flink Action UI for No Code Low Code Streaming ETL Integration
Source: Confluent

Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Products

When choosing between SaaS and PaaS for data streaming, it’s essential to understand the key differences.

SaaS solutions, like Confluent Cloud, offer a fully managed, serverless experience with automatic scaling, low operational overhead, and pay-as-you-go pricing.

In contrast, PaaS requires users to manage infrastructure, configure scaling policies, and handle more operational complexity.

While many products are marketed as “serverless,” not all truly abstract infrastructure or eliminate idle costs—so scrutinize claims carefully.

SaaS is ideal for teams focused on rapid deployment and simplicity, while PaaS suits those needing deep customization and control. Ultimately, SaaS ensures scalability and ease of use, making it a compelling choice for most modern streaming needs. Always dive into the technical details to ensure the platform aligns with your goals. Don’t trust the marketing slogans of the vendors!

Stateless vs. Stateful Stream Processing: Blurring the Lines

Even if your current use case is stateless, it’s worth considering the potential for future needs. Stateless pipelines often evolve into more complex systems as businesses grow, requiring features like:

  • State Management: For event correlation and pattern detection.
  • Windows and Aggregations: To derive insights from time-series data.
  • Joins: To enrich data streams with contextual information.
  • Integrating Multiple Data Sources: To seamlessly combine information from various streams for a comprehensive and cohesive analysis.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless and Stateful Stream Processing with Kafka Streams and Apache Flink

With a SaaS Flink service such as Confluent Cloud, you can start small with stateless tasks and seamlessly scale into stateful operations as needed, leveraging Flink’s full capabilities without a complete overhaul.

While Flink may feel like overkill for simple, stateless tasks in its open-source form, its potential is unmatched in these scenarios:

  • Enterprise Workloads: Scalable, reliable, and fault-tolerant systems for mission-critical applications.
  • Data Integration and Preparation (Streaming ETL): Flink enables preprocessing, cleansing, and enriching data at the streaming layer, ensuring high-quality data reaches downstream systems like data lakes and warehouses.
  • Complex Event Processing (CEP): Detecting patterns across events in real time.
  • Advanced Analytics: Stateful stream processing for aggregations, joins, and windowed computations.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless stream processing is often achieved using lightweight tools like Kafka Streams or Single Message Transforms (SMTs) within Kafka Connect. SMTs enable inline transformations, such as normalization, enrichment, or filtering, as events pass through the integration framework. This functionality is available in Kafka Connect (provided by Confluent, IBM/Red Hat, Amazon MSK and others) and tools like Benthos for Redpanda. SMTs are particularly useful for quick adjustments and filtering data before it reaches the Kafka cluster, optimizing resource usage and data flow.
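
As a concrete illustration, here is a minimal sketch of connector properties that apply Kafka Connect’s built-in MaskField SMT to blank out a field in-flight (the field name is an assumption):

# Sketch: mask the card_number field on every record passing through the connector
transforms=mask
transforms.mask.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.mask.fields=card_number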

While Kafka Streams and Kafka Connect’s SMTs handle many stateless workloads effectively, Apache Flink offers significant advantages for all types of workloads—whether simple or complex, stateless or stateful.

Stream processing in Flink enables true decoupling within the enterprise architecture, as it is not bound to the Kafka cluster like Kafka Streams and Kafka Connect. The benefits are separation of concerns with a domain-driven design (DDD) and improved data governance. And Flink provides interfaces for Java, Python, and SQL. Something for (almost) everyone. This makes Flink ideal for ensuring clean, modular architectures and easier scalability. The short PyFlink sketch below shows the Python interface in action.
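
Here is a minimal stateless filter using Flink’s Python DataStream API; the in-memory collection is a stand-in for a real Kafka source:

from pyflink.datastream import StreamExecutionEnvironment

# Sketch: a stateless high-value filter with PyFlink.
env = StreamExecutionEnvironment.get_execution_environment()
payments = env.from_collection([{"id": "p1", "amount": 250.0}, {"id": "p2", "amount": 20.0}])
high_risk = payments.filter(lambda payment: payment["amount"] > 100)
high_risk.print()
env.execute("high-value-filter")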

Stream Processing and ETL with Apache Kafka Streams Connect SMT and Flink

By processing events from diverse sources and preparing them for downstream consumption, Flink supports both lightweight and comprehensive workflows while aligning with domain boundaries and governance requirements. This brings us to the shift left architecture.

The Shift Left Architecture

No matter what specific use cases you have in mind: The Shift Left Architecture brings data processing upstream with real-time stream processing, transforming raw data into high-quality data products early in the pipeline.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Apache Flink plays a key role as part of a complete data streaming platform by enabling advanced streaming ETL, data curation, and on-the-fly transformations, ensuring consistent, reliable, and ready-to-use data for both operational and analytical workloads, while reducing costs and accelerating time-to-market.

Shift Left Architecture with Apache Kafka Flink and Iceberg

The decision to use Flink boils down to your use case, expertise, and growth trajectory:

  • For basic stateless tasks, consider lightweight options like Kafka Streams or SMTs within Kafka Connect, unless you’re already invested in a SaaS such as Confluent Cloud, where Flink is also an appropriate choice for simple ETL processes.
  • For evolving workloads or scenarios requiring scalability and advanced analytics, a Flink SaaS offers unparalleled flexibility and ease of use.
  • For on-premise or edge deployments, Flink’s flexibility makes it an excellent choice for environments where data processing must occur locally due to latency, security, or compliance requirements.

Understanding the deployment environment (cloud, on-premise, or edge) and the capabilities of the Flink product is crucial to choosing the right streaming technology. Flink’s adaptability ensures it can serve diverse needs across these contexts.

Kafka Streams is another excellent, Kafka-native stream processing alternative. Most importantly for this discussion, Kafka Streams is “just” a lightweight Java library, not a server infrastructure like Flink. Hence, it brings different trade-offs with it. I wrote a dedicated article about the trade-offs between Apache Flink and Kafka Streams for stream processing.

In its open-source form, Flink can seem excessive for simple, stateless tasks. However, a serverless Flink SaaS like Confluent Cloud changes the equation. Multi-tenancy and pay-as-you-go pricing make it suitable for a wider range of use cases, from basic ETL to advanced analytics. Serverless features like Confluent’s Flink Actions further reduce complexity, allowing non-technical users to harness the power of stream processing without coding.

Whether you’re just beginning your journey into stream processing or scaling up for enterprise-grade applications, Flink—as part of a complete data streaming platform such as Confluent Cloud—is a future-proof investment that adapts to your needs.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation.

Stateless vs. Stateful Stream Processing with Kafka Streams and Apache Flink https://www.kai-waehner.de/blog/2024/12/27/stateless-vs-stateful-stream-processing-with-kafka-streams-and-apache-flink/ Fri, 27 Dec 2024 08:48:54 +0000 https://www.kai-waehner.de/?p=6857 The rise of stream processing has changed how we handle and act on data. While traditional databases, data lakes, and warehouses are effective for many batch-based use cases, they fall short in scenarios demanding low latency, scalability, and real-time decision-making. This post explores the key concepts of stateless and stateful stream processing, using Kafka Streams and Apache Flink as examples.

In the world of data-driven applications, the rise of stream processing has changed how we handle and act on data. While traditional databases, data lakes, and warehouses are effective for many batch-based use cases, they fall short in scenarios demanding low latency, scalability, and real-time decision-making. This post explores the key concepts of stateless and stateful stream processing, using Kafka Streams and Apache Flink as examples. These principles apply to any stream processing engine, whether open-source or a cloud service. Let’s break down the differences, practical use cases, the relation to AI/ML, and the immense value stream processing offers compared to traditional data-at-rest methods.

Stateless and Stateful Stream Processing with Kafka Streams and Apache Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch.

Rethinking Data Processing: From Static to Dynamic

In traditional systems, data is typically stored first in a database or data lake and queried later for computation. This works well for batch processing tasks, like generating reports or dashboards. The process usually looks something like this:

  1. Store Data: Data arrives and is stored in a database or data lake.
  2. Query & Compute: Applications request data for analysis or processing at a later time with a web service, request-response API or SQL script.

However, this approach fails when you need:

  • Immediate Action: Real-time responses to events, such as fraud detection.
  • Scalability: Handling thousands or millions of events per second.
  • Continuous Insights: Ongoing analysis of data in motion.

Enter stream processing—a paradigm where data is continuously processed as it flows through the system. Instead of waiting to store data first, stream processing engines like Kafka Streams and Apache Flink enable you to act on data instantly as it arrives.

Use Case: Fraud Prevention in Real-Time

The blog post uses a fraud prevention scenario to illustrate the power of stream processing. In this example, transactions from various sources (e.g., credit card payments, mobile app purchases) are monitored in real time.

Fraud Detection and Prevention with Stream Processing in Real-Time

The system flags suspicious activities using three methods:

  1. Stateless Processing: Each transaction is evaluated independently, and high-value payments are flagged immediately.
  2. Stateful Processing: Transactions are analyzed over a time window (e.g., 1 hour) to detect patterns, such as an unusually high number of transactions.
  3. AI Integration: A pre-trained machine learning model is used for real-time fraud detection by predicting the likelihood of fraudulent activity.

This example highlights how stream processing enables instant, scalable, and intelligent fraud detection, something not achievable with traditional batch processing.

To avoid confusion: while I use Kafka Streams for stateless and Apache Flink for stateful in the example, both frameworks are capable of handling both types of processing.

Other Industry Examples of Stream Processing

  • Predictive Maintenance (Industrial IoT): Continuously monitor sensor data to predict equipment failures and schedule proactive maintenance.
  • Real-Time Advertisement (Retail): Deliver personalized ads based on real-time user interactions and behavior patterns.
  • Real-Time Portfolio Monitoring (Finance): Continuously analyze market data and portfolio performance to trigger instant alerts or automated trades during market fluctuations.
  • Supply Chain Optimization (Logistics): Track shipments in real time to optimize routing, reduce delays, and improve efficiency.
  • Condition Monitoring (Healthcare): Analyze patient vitals continuously to detect anomalies and trigger immediate alerts.
  • Network Monitoring (Telecom): Detect outages or performance issues in real time to improve service reliability.

These examples highlight how stream processing drives real-time insights and actions across diverse industries.

What is Stateless Stream Processing?

Stateless stream processing focuses on processing each event independently. In this approach, the system does not need to maintain any context or memory of previous events. Each incoming event is handled in isolation, meaning the logic applied depends solely on the data within that specific event.

This makes stateless processing highly efficient and easy to scale, as it doesn’t require state management or coordination between events. It is ideal for use cases such as filtering, transformations, and simple ETL operations, where individual events can be processed with no need for historical data or context.

1. Example: Real-Time Payment Monitoring

Imagine a fraud prevention system that monitors transactions in real time to detect and prevent suspicious activities. Each transaction, whether from a credit card, mobile app, or payment gateway, is evaluated as it occurs. The system checks for anomalies such as unusually high amounts, transactions from unfamiliar locations, or rapid sequences of purchases.

Fraud Detection - Stateless Transaction Monitoring with Kafka Streams

By analyzing these attributes instantly, the system can flag high-risk transactions for further inspection or automatically block them. This real-time evaluation ensures potential fraud is caught immediately, reducing the likelihood of financial loss and enhancing overall security.

You want to flag high-value payments for further inspection. In the following Kafka Streams example:

  • Each transaction is evaluated as it arrives.
  • If the transaction amount exceeds 100 (in your chosen currency), it’s sent to a separate topic for further review.

Java Example (Kafka Streams):

// 'builder' is a StreamsBuilder; read the stream of payment events from the "payments" topic.
KStream<String, Payment> payments = builder.stream("payments");

// Flag any payment above the threshold and route it to a separate topic for review.
payments.filter((key, payment) -> payment.getAmount() > 100)
        .to("high-risk-payments");

Benefits of Stateless Processing

  • Low Latency: Immediate processing of individual events.
  • Simplicity: No need to track or manage past events.
  • Scalability: Handles large volumes of data efficiently.

This approach is ideal for use cases like filtering, data enrichment, and simple ETL tasks.

What is Stateful Stream Processing?

Stateful stream processing takes it a step further by considering multiple events together. The system maintains state across events, allowing for complex operations like aggregations, joins, and windowed analyses. This means the system can correlate data over a defined period, track patterns, and detect anomalies that emerge across multiple transactions or data points.

2. Example: Fraud Prevention through Continuous Pattern Detection

In fraud prevention, individual transactions may appear normal, but patterns over time can reveal suspicious behavior.

For example, a fraud prevention system might identify suspicious behavior by analyzing all transactions from a specific credit card within a one-hour window, rather than evaluating each transaction in isolation.

Fraud Detection - Stateful Anomaly Detection with Apache Flink SQL

Let’s detect anomalies by analyzing transactions with Apache Flink using Flink SQL. In this example:

  • The system monitors transactions for each credit card within a 1-hour window.
  • If a card is used over 10 times in an hour, it flags potential fraud.

SQL Example (Apache Flink):

SELECT card_number, COUNT(*) AS transaction_count
FROM payments
GROUP BY TUMBLE(transaction_time, INTERVAL '1' HOUR), card_number
HAVING COUNT(*) > 10;

Key Concepts in Stateful Processing

Stateful processing relies on maintaining context across multiple events, enabling the system to perform more sophisticated analyses. Here are the key concepts that make stateful stream processing possible:

  1. Windows: Define a time range to group events (e.g., sliding windows, tumbling windows).
  2. State Management: The system remembers past events within the defined window.
  3. Joins: Combine data from multiple sources for enriched analysis (see the sketch below).
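
As a rough illustration of an enrichment join, here is a PyFlink sketch. The table and column names are hypothetical, and both tables are assumed to be registered against Kafka topics beforehand:

from pyflink.table import EnvironmentSettings, TableEnvironment

# Sketch: enrich payment events with customer context via a join.
# Assumes 'payments' and 'customers' tables were registered beforehand.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

enriched = t_env.sql_query("""
    SELECT p.card_number, p.amount, c.risk_score
    FROM payments AS p
    JOIN customers AS c
      ON p.customer_id = c.customer_id
""")

Note that such a regular stream join keeps state for both inputs, which is exactly why state management and windowing matter for long-running pipelines.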

Benefits of Stateful Processing

Stateful processing is essential for advanced use cases like anomaly detection, real-time monitoring, and predictive analytics:

  • Complex Analysis: Detect patterns over time.
  • Event Correlation: Combine events from different sources.
  • Real-Time Decision-Making: Continuous monitoring without reprocessing data.

Bringing AI and Machine Learning into Stream Processing

Stream processing engines like Kafka Streams and Apache Flink also enable real-time AI and machine learning model inference. This allows you to integrate pre-trained models directly into your data processing pipelines.

3. Example: Real-Time Fraud Detection with AI/ML Models

Consider a payment fraud detection system that uses a TensorFlow model for real-time inference. In this system, transactions from various sources — such as credit cards, mobile apps, and payment gateways — are streamed continuously. Each incoming transaction is preprocessed and sent to the TensorFlow model, which evaluates it based on patterns learned during training.

Fraud Detection - Anomaly Detection with Predictive AI ML using Apache Flink Python API

The model analyzes features like transaction amount, location, device ID, and frequency to predict the likelihood of fraud. If the model identifies a high probability of fraud, the system can trigger immediate actions, such as flagging the transaction, blocking it, or alerting security teams. This real-time inference ensures that potential fraud is detected and addressed instantly, reducing risk and enhancing security.

Here is a code example using Apache Flink’s Python API for predictive AI:

Python Example (Apache Flink):

# 'model' is a pre-trained TensorFlow model loaded beforehand;
# 'payments' is the incoming stream of payment events.
def predict_fraud(payment):
    prediction = model.predict(payment.features)
    return prediction > 0.5

stream = payments.map(predict_fraud)

Why Combine AI with Stream Processing?

Integrating AI with stream processing unlocks powerful capabilities for real-time decision-making, enabling businesses to respond instantly to data as it flows through their systems. Here are some key benefits of combining AI with stream processing:

  • Real-Time Predictions: Immediate fraud detection and prevention.
  • Automated Decisions: Integrate AI into critical business processes.
  • Scalability: Handle millions of predictions per second.

Apache Kafka and Flink deliver low-latency, scalable, and robust predictions. My article “Real-Time Model Inference with Apache Kafka and Flink for Predictive AI and GenAI” compares remote inference (via APIs) and embedded inference (within the stream processing application).

For large AI models (e.g., generative AI or large language models), inference is often done via remote calls to avoid embedding large models within the stream processor.
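
Reusing the payments stream from the example above, a remote-inference variant could look like this sketch; the endpoint URL and response schema are assumptions:

import requests

# Hypothetical sketch of remote model inference: every event is scored by an
# external model-serving endpoint instead of an embedded model.
MODEL_ENDPOINT = "https://model-serving.example.com/v1/fraud:predict"

def predict_fraud_remote(payment):
    response = requests.post(
        MODEL_ENDPOINT,
        json={"features": payment.features},
        timeout=2.0,  # keep the pipeline responsive if the model service is slow
    )
    return response.json()["fraud_probability"] > 0.5

stream = payments.map(predict_fraud_remote)

The trade-off: remote calls add network latency and a runtime dependency, while embedded models increase the memory footprint of the stream processing application.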

Stateless vs. Stateful Stream Processing: When to Use Each

Choosing between stateless and stateful stream processing depends on the complexity of your use case and whether you need to maintain context across multiple events. The following table outlines the key differences to help you determine the best approach for your specific needs.

Feature | Stateless | Stateful
Use Case | Simple Filtering, ETL | Aggregations, Joins
Latency | Very Low Latency | Slightly Higher Latency due to State Management
Complexity | Simple Logic | Complex Logic Involving Multiple Events
State Management | Not Required | Required for Context-aware Processing
Scalability | High | Depends on the Framework

Read my article “Apache Kafka (including Kafka Streams) + Apache Flink = Match Made in Heaven” to learn more about choosing the right stream processing engine for your use case.

And to clarify again: while this article uses Kafka Streams for stateless and Flink for stateful stream processing, both frameworks are capable of handling both types.

Video Recording

Below, I summarize this content as a ten-minute video on my YouTube channel:

Why Stream Processing is a Fundamental Change

Whether stateless or stateful, stream processing with Kafka Streams, Apache Flink, and similar technologies unlocks real-time capabilities that traditional databases simply cannot offer. From simple ETL tasks to complex fraud detection and AI integration, stream processing empowers organizations to build scalable, low-latency applications.

Stream Processing with Apache Kafka Flink SQL Java Python and AI ML

Investing in stream processing means:

  • Faster Innovation: Real-time insights drive competitive advantage.
  • Operational Efficiency: Automate decisions and reduce latency.
  • Scalability: Handle millions of events seamlessly.

Stream processing isn’t just an evolution of data handling—it’s a revolution. If you’re not leveraging it yet, now is the time to explore this powerful paradigm. If you want to learn more, check out my light board video exploring the core value of Apache Flink:

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation.

GenAI Demo with Kafka, Flink, LangChain and OpenAI https://www.kai-waehner.de/blog/2024/01/29/genai-demo-with-kafka-flink-langchain-and-openai/ Mon, 29 Jan 2024 14:32:13 +0000 https://www.kai-waehner.de/?p=6105 Generative AI (GenAI) enables automation and innovation across industries. This blog post explores a simple but powerful architecture and demo for the combination of Python, and LangChain with OpenAI LLM, Apache Kafka for event streaming and data integration, and Apache Flink for stream processing. The use case shows how data streaming and GenAI help to correlate data from Salesforce CRM, searching for lead information in public datasets like Google and LinkedIn, and recommending ice-breaker conversations for sales reps.

Generative AI (GenAI) enables automation and innovation across industries. This blog post explores a simple but powerful architecture and demo for the combination of Python, and LangChain with OpenAI LLM, Apache Kafka for event streaming and data integration, and Apache Flink for stream processing. The use case shows how data streaming and GenAI help to correlate data from Salesforce CRM, searching for lead information in public datasets like Google and LinkedIn, and recommending ice-breaker conversations for sales reps.

GenAI Demo with Kafka, Flink, LangChain and OpenAI

The Emergence of Generative AI

Generative AI (GenAI) refers to a class of artificial intelligence (AI) systems and models that generate new content, often as images, text, audio, or other types of data. These models can understand and learn the underlying patterns, styles, and structures present in the training data and then generate new, similar content on their own.

Generative AI has applications in various domains, including:

  • Image Generation: Generating realistic images, art, or graphics.
  • Text Generation: Creating human-like text, including natural language generation.
  • Music Composition: Generating new musical compositions or styles.
  • Video Synthesis: Creating realistic video content.
  • Data Augmentation: Generating additional training data for machine learning models.
  • Drug Discovery: Generating molecular structures for new drugs.

A key challenge of Generative AI is the deployment in production infrastructure with context, scalability, and data privacy in mind. Let’s explore an example of using CRM and customer data to integrate GenAI into an enterprise architecture to support sales and marketing.

This article shows a demo that combines real-time data streaming powered by Apache Kafka and Flink with a large language model from OpenAI within LangChain. If you want to learn more about data streaming with Kafka and Flink in conjunction with Generative AI, check out these two articles:

The following demo is about supporting sales reps or automated tools with Generative AI:
  • The Salesforce CRM creates new leads through other interfaces or by the human manually.
  • The sales rep / SDR receives lead information in real time to call the prospect.
  • A special GenAI service leverages the lead information (name and company) to search the web (mainly LinkedIn) to generate helpful content for the cold call of the lead, including: a summary, two interesting facts, a topic of interest, and two creative ice-breakers for initiating a conversation.

Kudos to my colleague Carsten Muetzlitz, who built the demo. The code is available on GitHub. Here is the architecture of the demo:

GenAI Demo with Kafka, Flink, LangChain, OpenAI

Technologies and Infrastructure in the Demo

The following technologies and infrastructure are used to implement and deploy the GenAI demo.

  • Python: The programming language almost every data engineer and data scientist uses.
  • LangChain: The Python framework implements the application to support sales conversations.
  • OpenAI: The language model and API help to build simple but powerful GenAI applications.
  • Salesforce: The cloud CRM tool stores the lead information and other sales and marketing data.
  • Apache Kafka: Scalable real-time data hub decoupling the data sources (CRM) and data sinks (GenAI application and other services).
  • Kafka Connect: Data integration via Change Data Capture (CDC) from Salesforce CRM.
  • Apache Flink: Stream processing for enrichment and data quality improvements of the CRM data.
  • Confluent Cloud: Fully managed Kafka (stream and store), Flink (process), and Salesforce connector (integrate).
  • SerpAPI: Scrape Google and other search engines with the lead information.
  • proxyCurl: Pull rich data about the lead from LinkedIn without worrying about scaling a web scraping and data-science team.

Here is a 15 minute video walking you through the demo:

  • Use case
  • Technical architecture
  • GitHub project with Python code using Kafka and LangChain
  • Fully managed Kafka and Flink in the Confluent Cloud UI
  • Push new leads in real-time from Salesforce CRM via CDC using Kafka Connect
  • Streaming ETL with Apache Flink
  • Generative AI with Python, LangChain and OpenAI

Missing: No Vector DB and RAG with Model Embeddings in the LangChain Demo

This demo does NOT use advanced GenAI technologies for RAG (retrieval augmented generation), model embeddings, or vector search via a Vector database (Vector DB) like Pinecone, Weaviate, MongoDB or Oracle.

The principle of the demo is KISS (“keep it as simple as possible”). These technologies can and will be integrated into many real-world architectures.

The demo has limitations regarding latency and scale. Kafka and Flink run as fully managed and elastic SaaS. But the AI/ML part around LangChain could improve latency by using a SaaS for hosting and integrating with other dedicated AI platforms. Data-intensive applications especially will need a vector database and advanced retrieval and semantic search technologies like RAG.

Fun fact: The demo breaks when I search for my name instead of Carsten’s, because the web scraper finds too much content on the web about me and, as a result, the LangChain app crashes… This is a compelling event for complementary technologies like Pinecone or MongoDB that can do indexing, RAG, and semantic search at scale. These technologies provide fully managed integration with Confluent Cloud, so the demo could easily be extended.

The Role of LangChain in GenAI

LangChain is an open-source framework for developing applications powered by language models. LangChain is also the name of the commercial vendor behind the framework. The tool provides the needed “glue code” for data engineers to build GenAI applications with intuitive APIs for chaining together large language models (LLM), prompts with context, agents that drive decision making with stateful conversations, and tools that integrate with external interfaces.

LangChain supports:

  • Context-awareness: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)
  • Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)

The main value props of the LangChain packages are:

  1. Components: composable tools and integrations for working with language models. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not.
  2. Off-the-shelf chains: built-in assemblages of components for accomplishing higher-level tasks.

LangChain Architecture and Components

Together, these products simplify the entire application lifecycle:

  • Develop: Write your applications in LangChain/LangChain.js. Hit the ground running using Templates for reference.
  • Productionize: Use LangSmith to inspect, test and monitor your chains, so that you can constantly improve and deploy with confidence.
  • Deploy: Turn any chain into an API with LangServe.

LangChain in the Demo

The demo uses several LangChain concepts, such as Prompts, Chat Models, Chains using the LangChain Expression Language (LCEL), and Agents that use a language model to choose a sequence of actions to take.

Here is the logical flow of the LangChain business process, followed by a minimal code sketch:

  1. Get new leads: Collect full name and company of the lead from Salesforce CRM in real-time from a Kafka Topic.
  2. Find LinkedIn profile: Use the Google Search API “SerpAPI” to search for the URL of the lead’s LinkedIn profile.
  3. Collect information about the lead: Use Proxycurl to collect the required information about the lead from LinkedIn.
  4. Create cold call recommendations for the sales rep or automated script: Ingest all information into the ChatGPT LLM via OpenAI API and send the generated text to a Kafka Topic.
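
The following Python sketch condenses this flow. It is not the demo’s actual code: the topic names, the prompt, the model name, and lookup_linkedin() are assumptions, and the LangChain package layout varies by version:

import json

from confluent_kafka import Consumer, Producer
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Hypothetical sketch of the lead-to-icebreaker flow (not the demo's code).
llm = ChatOpenAI(model="gpt-4o-mini")  # model name is an assumption
prompt = ChatPromptTemplate.from_template(
    "Write a summary, two interesting facts, and two ice-breakers for a cold "
    "call with {name} from {company}, based on this profile: {profile}"
)
chain = prompt | llm  # LangChain Expression Language (LCEL)

consumer = Consumer({"bootstrap.servers": "localhost:9092", "group.id": "genai-demo"})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["salesforce-leads"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    lead = json.loads(msg.value())   # e.g. {"name": "...", "company": "..."}
    profile = lookup_linkedin(lead)  # placeholder for the SerpAPI + Proxycurl steps
    result = chain.invoke({**lead, "profile": profile})
    producer.produce("icebreakers", result.content.encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks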

The following screenshot shows a snippet of the generated content. It includes context-specific icebreaker conversations based on the LinkedIn profile. For context, Carsten worked at Oracle for 24 years before joining Confluent. The LLM uses this context of the LangChain prompt to generate related content:

LLM Text Generated with Python, LangChain, GoogleSERP, Proxycurl and OpenAI

The Role of Apache Kafka in GenAI

Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. It plays a crucial role in handling and managing large volumes of data streams efficiently and reliably.

Generative AI typically involves models and algorithms for creating new data, such as images, text, or other types of content. Apache Kafka supports Generative AI by providing a scalable and resilient infrastructure for managing data streams. In a Generative AI context, Kafka can be used for:

  • Data Ingestion: Kafka can handle the ingestion of large datasets, including the diverse and potentially high-volume data needed to train Generative AI models.
  • Real-time Data Processing: Kafka’s real-time data processing capabilities help in scenarios where data is constantly changing, allowing for the rapid updating and training of Generative AI models.
  • Event Sourcing: Event sourcing with Kafka captures and stores events that occur over time, providing a historical record of data changes. This historical data is valuable for training and improving Generative AI models.
  • Integration with other Tools: Kafka can be integrated into larger data processing and machine learning pipelines, facilitating the flow of data between different components and tools involved in Generative AI workflows.

While Apache Kafka itself is not a tool specifically designed for Generative AI, its features and capabilities contribute to the overall efficiency and scalability of the data infrastructure. Kafka’s capabilities are crucial when working with large datasets and complex machine learning models, including those used in Generative AI applications.

Apache Kafka in the Demo

Kafka is the data fabric connecting all the different applications. Ensuring data consistency is a sweet spot of Kafka, no matter whether a data source or sink is real-time, batch, or a request-response API.

In this demo, Kafka consumes events from Salesforce CRM as the main data source of customer data. Different applications (Flink, LangChain, Salesforce) consume the data in different steps of the business process. Kafka Connect provides the capability for data integration with no need for another ETL, ESB or iPaaS tool. This demo uses Confluent’s Change Data Capture (CDC) connector to consume changes from the Salesforce database in real-time for further processing.

Fully managed Confluent Cloud is the infrastructure for the entire Kafka and Flink ecosystem in this demo. The focus of the developer should always be on building business logic, not on operating infrastructure.

While the heart of Kafka is event-based, real-time and scalable, it also enables domain-driven design and data mesh enterprise architectures out-of-the-box.

The Role of Apache Flink in GenAI

Apache Flink is an open-source distributed stream processing framework for real-time analytics and event-driven applications. Its primary focus is on processing continuous streams of data efficiently and at scale. While Apache Flink itself is not a specific tool for Generative AI, it plays a role in supporting certain aspects of Generative AI workflows. Here are a few ways in which Apache Flink is relevant:

  1. Real-time Data Processing: Apache Flink can process and analyze data in real-time, which can be useful for scenarios where Generative AI models need to operate on streaming data, adapting to changes and generating responses in real-time.
  2. Event Time Processing: Flink has built-in support for event time processing, allowing for the handling of events in the order they occurred, even if they arrive out of order. This can be beneficial in scenarios where temporal order is crucial, such as in sequences of data used for training or applying Generative AI models.
  3. Stateful Processing: Flink supports stateful processing, enabling the maintenance of state across events. This can be useful in scenarios where the Generative AI business process needs to maintain context or memory of past events to generate coherent and context-aware outputs.
  4. Integration with Machine Learning Libraries: While Flink itself is not a machine learning framework, it can be integrated with other tools and libraries that are used in machine learning, including those relevant to Generative AI. This integration can facilitate the deployment and execution of machine learning models within Flink-based streaming applications.

The specific role of Apache Flink in Generative AI depends on the particular use case and the architecture of the overall system.

Apache Flink in the Demo

This demo leverages Apache Flink for streaming ETL (enrichment, data quality improvements) of the incoming Salesforce CRM events.

Flink SQL provides a simple and intuitive way to implement ETL without writing any Java or Python code. Fully managed Confluent Cloud is the infrastructure for Kafka and Flink in this demo. Serverless Flink SQL allows you to scale up as much as needed, but also scale down to zero if no events are consumed and processed.

The demo is just the starting point. Many powerful applications can be built with Apache Flink. This includes streaming ETL, but also business applications like you find them at Netflix, Uber and many other tech giants.

LangChain is an easy-to-use AI/ML framework to connect large language models to other data sources and create valuable applications. The flexibility and open approach enables developers and data engineers to build all sorts of applications, from chatbots to smart systems that answer your questions.

Data streaming with Apache Kafka and Flink provides a reliable and scalable data fabric for data pipelines and stream processing. The event store of Kafka ensures data consistency across real-time, batch, and request-response APIs. Domain-driven designs, microservice architectures, and data products built in a data mesh increasingly leverage Kafka for these reasons.

The combination of LangChain, GenAI technologies like OpenAI, and data streaming with Kafka and Flink makes a powerful foundation for context-specific decisions in real time, powered by AI.

Most enterprises have a cloud-first strategy for AI use cases. Data streaming infrastructure is available in SaaS like Confluent Cloud so that the developers can focus on business logic with much faster time-to-market. Plenty of alternatives exist for building AI applications with Python (the de facto standard for AI). For instance, you could build a user-defined function (UDF) in a FlinkSQL application executing the Python code and consuming from Kafka. Or use a separate application development framework and cloud platform like Quix Streams or Bytewax for Python apps instead of a framework like LangChain.

How do you combine Python, LangChain and LLMs with data streaming technologies like Kafka and Flink? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Why Tiered Storage for Apache Kafka is a BIG THING… https://www.kai-waehner.de/blog/2023/12/05/why-tiered-storage-for-apache-kafka-is-a-big-thing/ Tue, 05 Dec 2023 05:38:36 +0000 https://www.kai-waehner.de/?p=5787 Apache Kafka added Tiered Storage to separate compute and storage. The capability enables more scalable, reliable and cost-efficient enterprise architectures. This blog post explores the architecture, use cases, benefits, and a case study for storing Petabytes of data in the Kafka commit log. The end discusses why Tiered Storage does NOT replace other databases and how Apache Iceberg might change future Kafka architectures even more.

Apache Kafka added Tiered Storage to separate compute and storage. The capability enables more scalable, reliable and cost-efficient enterprise architectures. This blog post explores the architecture, use cases, benefits, and a case study for storing Petabytes of data in the Kafka commit log. The end discusses why Tiered Storage does NOT replace other databases and how Apache Iceberg might change future Kafka architectures even more.

Tiered Storage for Apache Kafka - Use Cases Architecture Benefits


If you prefer watching a ten minute video, check out this summary about the “Evolution of Storage for Apache Kafka covering Tiered Storage, Direct Write to Object Storage and the relation to Open Table Formats such as Apache Iceberg”:

Now, let’s explore why Tiered Storage for Apache Kafka is a BIG THING:

Compute vs. Storage vs. Tiered Storage

Let’s define the terms compute, storage, and tiered storage to have the same understanding when exploring this in the context of the data streaming platform Apache Kafka.

Compute and Storage

Two fundamental components of a computing system are compute and storage. They serve different purposes in information processing.

Compute refers to the processing power and capability of a computer system to perform tasks, execute instructions, and carry out computations. The compute component includes the CPU (Central Processing Unit) and GPU (Graphics Processing Unit).

Storage refers to the components and systems for storing and retrieving data over the long term. It is where data is persistently maintained for later use. Storage includes devices such as hard disk drives (HDDs), solid-state drives (SSDs), and other types of non-volatile memory, as well as databases that keep data even when the power is turned off.

Tiered Storage

Tiered storage refers to a storage architecture that uses different classes or tiers of storage (e.g., Object Storage on S3) to efficiently manage and store data based on its access patterns, performance requirements, and cost considerations.

The goal of tiered storage is to optimize the use of storage resources, balancing performance and cost, by placing data on the most suitable storage media based on its characteristics and the organization’s policies.

Data placement and movement between these tiers can be automated based on policies and algorithms that analyze usage patterns, access frequency, and other factors. This ensures that the most critical and frequently accessed data lives in high-performance storage, while less critical or infrequently accessed data is moved to lower-cost, lower-performance storage.

Long-Term Storage in Apache Kafka

Apache Kafka is an open-source distributed streaming platform that is used for building real-time data pipelines and streaming applications. Kafka is the established de facto standard for data streaming. The event streaming platform handles large volumes of data, providing a scalable and fault-tolerant architecture.

Applications and data stores use Kafka for ingesting, storing, and processing real-time data streams, making it a fundamental component in building event-driven architectures and systems that require the processing of continuous data flows. Additionally, many use cases leverage Kafka not just for real-time data but to ensure data consistency across real-time, batch, and request-response APIs.

Use Cases for Apache Kafka as Storage System

While most people think about Kafka as a message broker, real-time analytics platform, or big data ingestion system, the distributed commit log with ordering guarantees and timestamps enables plenty of use cases for accessing data long after its creation or replaying historical data.

Use Cases for Replaying Historical Events with Apache Kafka

Here are a few examples of use cases that leverage the long-term storage of data in Kafka (a small replay sketch follows the list):

  • New consumer: Deploy a new application / database / data warehouse, data lake and synchronize the state of the business objects.
  • Offloading: Reducing cost significantly by NOT consuming again and again from expensive or non-scalable systems (e.g. mainframe and MIPS)
  • Error-handling: Re-process historical data after fixing an issue in the business logic.
  • Compliance / regulatory processing: Replay historical data to analyze an incident.
  • Query and analyze existing events: Consume data from a notebook for data engineering, analytics, or reporting.
  • Schema changes in analytics platform: Re-process data after updating data contracts.
  • Model training: Batch ingestion into an AI framework to apply a machine learning algorithm
  • Disaster recovery: Operational data stores replay data again from the persistent commit log in the case of a failure.
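
Replaying historical events boils down to resetting consumer offsets. Here is a minimal Python sketch using confluent-kafka that replays a topic from a point in time; the topic name, partition count, and timestamp are assumptions:

from confluent_kafka import Consumer, TopicPartition

# Sketch: replay all events of a topic since a given timestamp by looking up
# the matching offsets and assigning them to the consumer.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-demo",
    "auto.offset.reset": "earliest",
})

start_ms = 1701388800000  # replay everything since this epoch timestamp (ms)
partitions = [TopicPartition("payments", p, start_ms) for p in range(6)]
offsets = consumer.offsets_for_times(partitions, timeout=10.0)
consumer.assign(offsets)

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    handle(msg)  # hypothetical handler re-processing the historical event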

Objections for Storing Data Long-Term in Kafka

Storing data long term in Kafka has a few drawbacks. The following arguments are valid concerns:

  • Cost: Storing large volumes of data on attached disks is much more expensive than external storage systems like an object store.
  • Scalability: Operating Kafka brokers with lots of data (say many gigabytes, or even terabytes, and more) is challenging, especially in the case of failures when you need to rebalance partitions.
  • Risk: Downtime or data inconsistencies happen if operations struggle with large volumes or when hardware needs to be migrated.

Therefore, you should NOT store big data sets in Kafka without Tiered Storage! With this in mind, let’s explore how Tiered Storage for Kafka solves these problems.

Introducing Tiered Storage for Apache Kafka

Apache Kafka’s backend is a distributed system running Kafka brokers. Each Kafka broker has processing and storage capabilities.

The applications are producers and consumers of events. Many interfaces communicate with Kafka brokers (a minimal producer sketch follows the list):

  • An application written in Java, Python, C++, Go, or any other programming language
  • A Kafka Connect source or sink connector connecting to IBM MQ, Spark, Snowflake, or any other data store or SaaS application
  • A stream processor built with Kafka-native Kafka Streams, KSQL, or external infrastructures like Apache Flink
  • Any other endpoint, like an HTTP interface or an out-of-the-box integration of another middleware or data platform
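
For instance, a minimal producer written with the confluent-kafka Python client looks like this (broker address and topic name are assumptions):

from confluent_kafka import Producer

# Sketch: an application producing a single event to a Kafka topic.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("payments", key="card-123", value='{"amount": 42.0}')
producer.flush()  # block until the broker has acknowledged the event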

What is Tiered Storage for Kafka?

Tiered storage for Apache Kafka refers to the capability of configuring different storage tiers to optimize the storage infrastructure based on the access patterns and requirements of the data stored in Kafka brokers.

A Kafka cluster stores data in Kafka Topics. These topics can have different characteristics in terms of importance, access frequency, and retention policies.

The concept is like the general idea of tiered storage in storage systems, but it is adapted to the specific needs of Kafka. Tiered Storage is one critical component in making the Kafka architecture cloud-native.

Kafka Architecture without Tiered Storage

Kafka applications communicate with logical Kafka Topics to produce messages to or consume messages from partitions:

Apache Kafka Architecture without Tiered Storage

The storage is a disk attached to the broker. This can be HDD or SSD disks on-premise or, for example, EBS volumes in the AWS cloud.

Kafka Architecture with Tiered Storage

Tiered Storage for Kafka does NOT change how applications communicate with Kafka brokers. Tiered Storage is an implementation detail:

Apache Kafka Architecture with Tiered Storage

Besides the disks attached to the broker, Kafka offloads data to external storage. Most times, this is an object store like Amazon S3, Azure Blob Storage, Google Cloud Storage, or MinIO for Kubernetes.

Serverless cloud offerings handle the offloading for the operator. Self-managed solutions allow operators to configure hot and cold storage durations for each Kafka Topic.
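
For example, with the KIP-405 implementation in open-source Apache Kafka, the hot/cold split can be sketched with broker and topic configurations like the following; exact property names and defaults may differ across versions and vendor offerings:

# Broker configuration (sketch): enable the tiered storage subsystem
remote.log.storage.system.enable=true

# Topic configuration (sketch): keep roughly one day on local broker disks and
# 90 days overall; older segments are served from the remote object store
remote.storage.enable=true
local.retention.ms=86400000
retention.ms=7776000000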

Benefits of Tiered Storage for Apache Kafka

Let’s review the above-discussed objections to storing big data sets long-term in Kafka and how Tiered Storage helps:

  • Reduced cost: Most data is offloaded to external storage. This reduces the storage cost significantly.
  • Improved scalability: Only data on the disks attached to the Kafka brokers must be rebalanced. As most data is offloaded, rebalancing only takes seconds or minutes, even if the external storage holds petabytes.
  • Reduced risk: Better scalability and separation of compute and storage makes operations much easier and significantly reduces the risk of downtime or data inconsistency.

The Implementation of Tiered Storage in Apache Kafka

Tiered Storage for Apache Kafka is available. However, be aware that different implementations exist with different features, maturity, and support levels.

And open source Apache Kafka only provides the interface for tiered storage. You must choose an open source implementation, build your own integration into an external storage system, or leverage a commercial product or cloud service that embeds tiered storage into its offering.

Keep in mind that the interface alone is not helpful. The implementation needs to be battle-tested and guarantee data consistency across the hot storage on the broker and cold storage in the external storage; even in the case of failure, network issues, etc.

Kafka consumers do not see the implementation details of Kafka’s Tiered Storage. They just consume as if there were no tiered storage implementation (and still expect the same behavior). There are no API or code changes needed in Kafka client applications. Hence, you can easily migrate an existing deployment to a Kafka cluster leveraging Tiered Storage.

Many people ask about the performance impact of tiered storage for Kafka. The short answer: There is no performance impact for most scenarios. Real-time consumers consume from the memory / page cache as before. And replaying historical data from the event log does not differ much from the local disk or the remote object-store.

AK 3.6 Release Makes Tiered Storage Available

When writing this blog post (December 2023), KIP-405: Kafka Tiered Storage is available as early access in Apache Kafka 3.6. This release introduces Tiered Storage to Kafka. This release is only for non-production environments (see the early access notes for more information).

GA of this feature is just a foreseeable matter of time. The bulk of KIP-405 was part of early access in release 3.6. But there are a few additional features that are slated for 3.7. And GA likely comes after that in 3.8+.

KIP-405 Provides a Pluggable Storage API for Tiering

KIP-405 separates compute and storage in the Kafka broker and makes storage tiering natively pluggable in Kafka, bringing a seamless storage extension to remote object stores with minimal operational changes.

Apache Kafka’s LocalTieredStorage default implementation is a local file-based RemoteStorageManager. LocalTieredStorage facilitates the simulation of remote storage behavior in a controlled and isolated environment during testing. This is not meant for production use cases! Enterprises need to write their own implementation, embed an open-source alternative, or trust a software vendor or cloud service.

How Confluent, Uber, and Others use Tiered Storage

KIP-405 is only available in preview with Kafka 3.6. But some proprietary implementations have already existed in production for years. This also helped to define the KIP with lessons learned from running Kafka with tiered storage under the hood in production.

Implementation details of tiered storage for Kafka vary, and there may be different approaches or tools available to achieve this, depending on the specific Kafka distribution or storage infrastructure being used. Organizations might also use external systems or cloud storage solutions to implement tiered storage strategies for Kafka.

Confluent pioneered tiered storage for Kafka and has provided the capability for several years already. It is available for the self-managed Confluent Platform and the fully managed Confluent Cloud in AWS, Azure, and GCP. Confluent chose the S3 interface to implement storage support for the cloud providers (AWS, Azure, GCP) and several on-premise solutions like PureStorage Flash Blade, Nutanix Objects, Netapp Object Storage, Dell EMC ECS, Hitachi Content Platform Object Storage, or MinIO for Kubernetes.

Uber, which led the implementation of KIP-405 in open-source Apache Kafka, runs its tiered storage against HDFS. Confluent and AWS contributed to refactoring, best practices, and performance and integration testing. Satish Duggana, tech lead for Data and Streaming Infrastructure at Uber, presented the details of their implementation and deployment in a talk at Current 2023.

Other vendors like AWS with MSK and Aiven are adopting KIP-405 and provide their own tiered storage implementations these days.

Case Study: KOR Financial stores 160 Petabytes in Kafka for Regulatory Reporting

KOR is a cloud-native family of global trade repositories and regulatory reporting services that has adopted Confluent Cloud and a data streaming architecture to improve compliance processes.

Regulatory reporting is obviously a perfect use case for Tiered Storage in Kafka to replay historical data. As the Kafka log provides guaranteed ordering and timestamps, there is no need for another database or data lake besides Kafka.

Daan Gerits, Chief Data Officer, KOR Financial, explains at Diginomica: “At KOR Financial, we have a very specific problem that we are trying to solve, which is collecting trading information for regulators. And we decided to do it in a totally different way to the way that most people are doing it. Where others would be using data storage or big data technologies, we decided to go all in on Kafka. We are building our system to store 160 petabytes in Confluent Cloud and then work on top of that. We don’t have any other database. So it’s a long retention use case.”

Kafka is NOT a Database (Replacement)

Apache Kafka is a database. It provides ACID guarantees. Hundreds of companies deploy Kafka for mission-critical workloads, including transactional ones. However, most of the time, Kafka does NOT compete with other databases.

Can Apache Kafka Replace a Database like Oracle Hadoop NoSQL MongoDB Elastic MySQL et al

Kafka is an event streaming platform for messaging, storage, processing and integration at scale in real-time with zero downtime and zero data loss. Almost all deployments connect Kafka to database sources and sinks for data integration, decoupling and data consistency, where the heart of the cloud-native enterprise architecture is real-time, scalable and reliable.

Apache Kafka is complementary to database, data warehouse, data lake, and lakehouse architectures. I wrote a blog series about use cases and architectures for data streaming together with other storage platforms.

The Future: Apache Iceberg for Kafka?

The adoption of Tiered Storage for Apache Kafka is just getting started. Many teams will store (some) data longer in Kafka to offload data from expensive systems or replay historical data without needing another database.

However, most analytics platforms do NOT use the Kafka protocol to consume and query data. The trend across most data platforms goes towards Apache Iceberg as a standardized abstraction layer for storing and querying (non-real-time) data in an object store or other storage.

Apache Iceberg is an open-source table format and processing framework for big data. It aims to provide the best of both worlds: the performance of a traditional table format with the flexibility of a schema-on-read approach. Iceberg addresses the challenges of managing large-scale and evolving data sets in distributed storage environments.

Apache Iceberg supports popular data processing frameworks, such as Apache Spark, Apache Flink, Apache Hive, Presto, and more. With Kafka’s Tiered Storage, and especially the S3 support by some vendors, I can see how this becomes a complete game changer for storing and processing events in real-time with the Kafka protocol or with other analytics engines and databases in near-real-time or batch.

The future will show us. For now, let’s be excited about how Tiered Storage for Kafka is the next big thing around data streaming.

Tiered Storage makes Kafka more Scalable, Cost-Efficient and Reliable

Tiered Storage for Apache Kafka makes event-driven architectures more scalable, cost-efficient, and reliable. It enables new use cases that required another database or data lake in the past.

However, Kafka’s goal is still NOT to replace other data and analytics platforms. Design patterns like microservices and data mesh enable a true decoupling of applications and data stores. Kafka provides this decoupling. With tiered storage in mind for various use cases such as offloading, new consumers, or error-handling, you can consider new approaches for your cloud-native enterprise architecture.

Are you excited about Tiered Storage for Apache Kafka? How will you use it? Or do you already use an existing implementation, like Confluent Cloud? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

Top 5 Trends for Data Streaming with Kafka and Flink in 2024 https://www.kai-waehner.de/blog/2023/12/02/top-5-trends-for-data-streaming-with-apache-kafka-and-flink-in-2024/ Sat, 02 Dec 2023 10:54:38 +0000 https://www.kai-waehner.de/?p=5885 Do you wonder about my predicted TOP 5 data streaming trends with Apache Kafka and Flink in 2024 to set data in motion? Discover new technology trends and best practices for event-driven architectures, including data sharing, data contracts, serverless stream processing, multi-cloud architectures, and GenAI.

Data Streaming is one of the most relevant buzzwords in tech to build scalable real-time applications and innovative business models. Do you wonder about my predicted TOP 5 data streaming trends in 2024 to set data in motion? Learn what role Apache Kafka and Apache Flink play. Discover new technology trends and best practices for event-driven architectures, including data sharing, data contracts, serverless stream processing, multi-cloud architectures, and GenAI.

Some followers might notice that this became a series with past posts about the top 5 data streaming trends for 2021, the top 5 for 2022, and the top 5 for 2023. Trends change over time, but the huge value of having a scalable real-time infrastructure as the central data hub stays. Data streaming with Apache Kafka is a journey and evolution to set data in motion.

Top 5 Trends for Data Streaming with Apache Kafka and Flink in 2024

The research and consulting company Gartner defines the top strategic technology trends every year. This time, the trends are around building new (AI) platforms and delivering value by automation, but also protecting investment. On a higher level, it is all about automating, scaling, and pioneering. Here is what Gartner expects for 2024:

Gartner Top Strategic Technology Trends 2024

It is funny (but not surprising): Gartner’s predictions overlap with and complement the five trends I focus on for data streaming with Apache Kafka in 2024. I explore how data streaming enables faster time to market, good data quality across independent data products, and innovation with technologies like Generative AI.

The top 5 data streaming trends for 2024

I see the following topics coming up more regularly in conversations with customers, prospects, and the broader data streaming community across the globe:

  1. Data sharing for faster innovation with independent data products
  2. Data contracts for better data governance and policy enforcement
  3. Serverless stream processing for easier building of scalable and elastic streaming apps
  4. Multi-cloud deployments for cost-efficiently delivering value where the customers sit
  5. Reliable Generative AI (GenAI) with embedded accurate, up-to-date information to avoid hallucination

The following sections describe each trend in more detail. The trends are relevant for many scenarios; no matter if you use open source Apache Kafka or Apache Flink, a commercial platform, or a fully managed cloud service like Confluent Cloud. I start each section with a real-world case study. The end of the article contains the complete slide deck and video recording.

Data sharing across business units and organizations

Data sharing refers to the process of exchanging or providing access to data among different individuals, organizations, or systems. This can involve sharing data within an organization or sharing data with external entities. The goal of data sharing is to make information available to those who need it, whether for collaboration, analysis, decision-making, or other purposes. Obviously, real-time data beats slow data for almost all data sharing use cases.

NASA: Real-time data sharing with Apache Kafka

NASA enables real-time data sharing between space- and ground-based observatories. The General Coordinates Network (GCN) allows real-time alerts in the astronomy community. With this system, NASA researchers, private space companies, and even backyard astronomy enthusiasts can publish and receive information about current activity in the sky.

NASA enables real-time data from Mars with Apache Kafka

Apache Kafka plays an essential role in astronomy research for data sharing. Particularly where black holes and neutron stars are involved, astronomers are increasingly seeking out the “time domain” and want to study explosive transients and variability. In response, observatories are increasingly adopting streaming technologies to send alerts to astronomers and to get their data to their science users in real time.

The talk “General Coordinates Network: Harnessing Kafka for Real-Time Open Astronomy at NASA” explores architectural choices, challenges, and lessons learned in adapting Kafka for open science and open data sharing at NASA.

NASA’s approach to OpenID Connect / OAuth2 in Kafka is designed to securely scale Kafka from access inside a single organization to access by the general public.
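
To make this tangible, here is a minimal, hedged sketch of what OIDC-based authentication can look like with the confluent-kafka Python client. This is not NASA’s actual GCN configuration; the broker address, token endpoint, topic, and credentials below are placeholder assumptions:

```python
# Hedged sketch: a Kafka consumer authenticating via OpenID Connect / OAuth2.
# All addresses and credentials are placeholders, not NASA's GCN setup.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka.example.org:9092",
    "group.id": "gcn-alerts-demo",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "OAUTHBEARER",
    "sasl.oauthbearer.method": "oidc",
    "sasl.oauthbearer.client.id": "my-client-id",          # placeholder
    "sasl.oauthbearer.client.secret": "my-client-secret",  # placeholder
    "sasl.oauthbearer.token.endpoint.url": "https://idp.example.org/oauth2/token",
})

consumer.subscribe(["astro-alerts"])  # hypothetical topic name
while True:
    msg = consumer.poll(1.0)
    if msg is not None and msg.error() is None:
        print(msg.value())
```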

Stream data exchange with Kafka using cluster linking, stream sharing, and AsyncAPI

The Kafka ecosystem provides various functions to share data in real-time at any scale. Some are vendor-specific. I look at this from the perspective of Confluent, so that you see a lot of innovative options (even if you want to build it by yourself with open source Kafka):

  • Kafka Connect connector ecosystem to integrate with other data sources and sinks out-of-the-box
  • HTTP/REST proxies and connectors for Kafka to use simple and well understood request-response (HTTP is, unfortunately, also an anti-pattern for streaming data)
  • Cluster Linking for replication between Kafka clusters using the native Kafka protocol (instead of separate infrastructure like MirrorMaker)
  • Stream Sharing for exposing a Kafka Topic through a simple button click with access control, encryption, quotas, and chargeback billing APIs
  • Generation of AsyncAPI specs to share data with non-Kafka applications (like other message brokers or API gateways that support AsyncAPI, an open data contract for asynchronous event-based messaging, similar to Swagger for HTTP/REST APIs)

Here is an example for Cluster Linking for bi-directional replication between Kafka clusters in the automotive industry:

Stream Data Exchange with Apache Kafka and Confluent Cluster Linking

And another example of stream sharing for easy access to a Kafka Topic in financial services:

Confluent Stream Sharing for Data Sharing Beyond Apache Kafka

To learn more, check out the article “Streaming Data Exchange with Kafka and a Data Mesh in Motion“.

Data contracts for data governance and policy enforcement

A data contract is an agreement or understanding that defines the terms and conditions governing the exchange or sharing of data between parties. It is a formal arrangement specifying how data will be handled, used, protected, and shared among entities. Data contracts are crucial when multiple parties need to interact with and utilize shared data, ensuring clarity and compliance with agreed-upon rules.

Raiffeisen Bank International: Data contracts for data sharing across countries

Raiffeisen Bank International (RBI) is scaling an event-driven architecture across the group as part of a bank-wide transformation program. This includes the creation of a reference architecture and the re-use of technology and concepts across 12 countries.

Data Mesh powered by Data Streaming at Raiffeisen Bank International

Learn more in the article “Decentralized Data Mesh with Data Streaming in Financial Services“.

Policy enforcement and data quality for Apache Kafka with Schema Registry

Good data quality is one of the most critical requirements in decoupled architectures like microservices or data mesh. Apache Kafka became the de facto standard for these architectures. But Kafka is a dumb broker that only stores byte arrays. The Schema Registry for Apache Kafka enforces message structures.

This blog post examines Schema Registry enhancements to leverage data contracts for policies and rules to enforce good data quality on field-level and advanced use cases like routing malicious messages to a dead letter queue.
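
As a minimal illustration (not the full data contract feature set), here is a sketch of registering an Avro schema with Schema Registry via the confluent-kafka Python client; the URL, subject name, and schema are hypothetical:

```python
# Hedged sketch: registering an Avro schema so producers and consumers agree
# on the message structure - the first building block of a data contract.
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

schema_str = """
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "payment_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

sr = SchemaRegistryClient({"url": "http://localhost:8081"})  # placeholder URL
schema_id = sr.register_schema("payments-value", Schema(schema_str, "AVRO"))
print(f"Registered schema with id {schema_id}")
```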

Data Governance and Policy Enforcement with Data Contracts for Apache Kafka

For more details: Building a data mesh with decoupled data products and good data quality, governance, and policy enforcement.

Serverless stream processing for scalable and elastic streaming apps

Serverless stream processing refers to a computing architecture where developers can build and deploy applications without having to manage the underlying infrastructure.

In the context of stream processing, it involves the real-time processing of data streams without the need to provision or manage servers explicitly. This approach allows developers to focus on writing code and building applications. The cloud service takes care of the operational aspects, such as scaling, provisioning, and maintaining servers.

Sencrop: Smart agriculture with Kafka and Flink

Designed to answer professional farmers’ needs, Sencrop offers a range of connected weather stations that bring you precision agricultural weather data straight from your plots.

  • Over 20,000 connected ag-weather stations throughout Europe.
  • An intuitive, user-friendly application: Access accurate, ultra-local data to optimize your daily actions.
  • Prevent risks, reduce costs: Streamline inputs and reduce your environmental impact and associated costs.

Smart Agriculture with Kafka and Flink at Sencrop

Apache Kafka and Apache Flink increasingly join forces to build innovative real-time stream processing applications.

The Rise of Open Source Stream Processing with Apache Kafka and Apache Flink

The Y-axis in the diagram shows the monthly unique users (based on statistics of Maven downloads).

“Apache Kafka + Apache Flink = Match Made in Heaven” explores the benefits of combining both open-source frameworks. The article shows unique differentiators of Flink versus Kafka, and discusses when to use a Kafka-native streaming engine like Kafka Streams instead of Flink.

Unfortunately, operating a Flink cluster is really hard, even harder than operating Kafka, because Flink is not just a distributed system: it also has to keep application state for hours or even longer. Hence, serverless stream processing takes over the operations burden. And it makes the life of the developer easier, too.

Stay tuned for exciting cloud products offering serverless Flink in 2024. But be aware that some vendors use the same trick as for Kafka: Provisioning a Flink cluster and handing it over to you is NOT a serverless or fully-managed offering! For that reason, I compared Kafka products as cars you drive yourself vs. self-driving cars, i.e. cloud-based Kafka clusters you operate vs. truly fully managed services.

Multi-cloud for cost-efficient and reliable customer experience

Multi-cloud refers to a cloud computing strategy that uses services from multiple cloud providers to meet specific business or technical requirements. In a multi-cloud environment, organizations distribute their workloads across two or more cloud platforms, including public clouds, private clouds, or a combination of both.

The goal of a multi-cloud strategy is to avoid dependence on a single cloud provider and to leverage the strengths of different providers for various needs. Cost efficiency and regional laws (like operating in the United States or China) require different deployment strategies. Some countries do not provide a public cloud; a private cloud is the only option then.

New Relic: Multi-cloud Kafka deployments at extreme scale for real-time observability

New Relic is a software analytics company that provides monitoring and performance management solutions for applications and infrastructure. It’s designed to help organizations gain insights into the performance of their software and systems, allowing them to optimize and troubleshoot issues efficiently.

Observability has two key requirements: first, monitor data in real-time at any scale; second, deploy the monitoring solution where the applications are running. The obvious consequence for New Relic is to process data with Apache Kafka and to deploy multi-cloud, wherever its customers are.

Multi Cloud Observability in Real-Time at extreme Scale with Apache Kafka at New Relic

Hybrid and multi-cloud data replication for cost-efficiency, low latency, or disaster recovery

Multi-cloud deployments of Apache Kafka have become the norm rather than an exception. Several scenarios require multi-cluster solutions with specific requirements and trade-offs:

  • Regional separation because of legal requirements
  • Independence of a single cloud provider
  • Disaster recovery
  • Aggregation for analytics
  • Cloud migration
  • Mission-critical stretched deployments

Hybrid Cloud Architecture with Apache Kafka

The blog post “Architecture Patterns for Distributed, Hybrid, Edge and Global Apache Kafka Deployments” explores various architectures and best practices.

Reliable Generative AI (GenAI) with accurate context to avoid hallucination

Generative AI is a class of artificial intelligence systems that generate new content, such as images, text, or even entire datasets, often by learning patterns and structures from existing data. These systems use techniques such as neural networks to create content that is not explicitly programmed but is instead generated based on the patterns and knowledge learned during training.

Elemental Cognition: GenAI platform powered by Apache Kafka

Elemental Cognition develops a responsible and transparent AI platform that helps solve problems and deliver expertise that can be understood and trusted.

Confluent Cloud powers the AI platform to enable scalable real-time data and data integration use cases. I recommend looking at their website to learn from various impressive use cases.

Elemental Cognition - Real Time GenAI Platform powered by Apache Kafka and Confluent Cloud

Apache Kafka serves thousands of enterprises as the mission-critical and scalable real-time data fabric for machine learning infrastructures. The evolution of Generative AI (GenAI) with large language models (LLM) like ChatGPT changed how people think about intelligent software and automation. The relationship between data streaming and GenAI has enormous opportunities.

“Apache Kafka as Mission Critical Data Fabric for GenAI” explores the use cases for combining data streaming with Generative AI.

An excellent example, especially for Generative AI, is context-specific customer service. The following diagram shows an enterprise architecture leveraging event-driven data streaming for data ingestion and processing across the entire GenAI pipeline:

Event-driven Architecture with Apache Kafka and Flink as Data Fabric for GenAI

Stream processing with Kafka and Flink enables data correlation of real-time and historical data. A stateful stream processor takes existing customer information from the CRM, loyalty platform, and other applications, correlates it with the query from the customer into the chatbot, and makes an RPC call to an LLM.

Stream Processing with Apache Flink SQL UDF and GenAI with OpenAI LLM
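
Here is a hedged PyFlink sketch of this pattern: a Python UDF wraps the LLM call so Flink SQL can invoke it per event. The table, columns, and the call_llm helper are hypothetical stand-ins, not a blueprint of the architecture in the diagram:

```python
# Hedged sketch: exposing an LLM call as a Flink SQL scalar function.
# Table and column names plus the call_llm body are hypothetical.
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

def call_llm(prompt: str) -> str:
    # Placeholder for the actual RPC to an LLM endpoint (e.g., via an HTTP client).
    return "llm-response-for: " + prompt

ask_llm = udf(call_llm, result_type=DataTypes.STRING())

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.create_temporary_function("ask_llm", ask_llm)

# 'customer_queries' is assumed to be a table backed by a Kafka topic, already
# enriched with CRM and loyalty context by an upstream stream processing job.
t_env.execute_sql("""
    SELECT customer_id, ask_llm(CONCAT(context, ' ', question)) AS answer
    FROM customer_queries
""").print()
```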

The article “Apache Kafka + Vector Database + LLM = Real-Time GenAI” explores possible architectures, examples, and trade-offs between event streaming and traditional request-response APIs and databases.

Slides and video recording for the data streaming trends in 2024 with Kafka and Flink

Do you want to look at more details? This section provides the entire slide deck and a video walking you through the content.

Slide deck

Here is the slide deck from my presentation:


Video recording

And here is the video recording of my presentation:

Video Recording: Top 5 Use Cases and Architectures for Data Streaming with Apache Kafka and Flink in 2024

2024 makes data streaming more mature, and Apache Flink becomes mainstream

I have two conclusions for data streaming trends in 2024:

  • Data streaming goes up in the maturity curve. More and more projects build streaming applications instead of just leveraging Apache Kafka as a dumb data pipeline between databases, data warehouses, and data lakes.
  • Apache Flink becomes mainstream. The open source framework shines with a scalable engine, multiple APIs like SQL, Java, and Python, and serverless cloud offerings from various software vendors. The latter makes building applications much more accessible.

Data sharing with data contracts is mandatory for a successful enterprise architecture with microservices or a data mesh. And data streaming is the foundation for innovation with technology trends like Generative AI. Therefore, we are just at the tipping point of adopting data streaming technologies such as Apache Kafka and Apache Flink.

What are your most relevant and exciting data streaming trends with Apache Kafka and Apache Flink in 2024 to set data in motion? What are your strategy and timeline? Do you use serverless cloud offerings or self-managed infrastructure? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Top 5 Trends for Data Streaming with Kafka and Flink in 2024 appeared first on Kai Waehner.

How to Build a Real-Time Advertising Platform with Apache Kafka and Flink https://www.kai-waehner.de/blog/2023/09/15/how-to-build-a-real-time-advertising-platform-with-apache-kafka-and-flink/ Fri, 15 Sep 2023 06:49:27 +0000 https://www.kai-waehner.de/?p=4930 An advertising platform requires real-time capabilities to provide dynamic targeting, ad personalization, ad fraud detection, budget allocation, and event-driven marketing. This blog post explores how data streaming with Apache Kafka and Apache Flink enables context-specific advertising at any scale. Real-world success stories from Pinterest, Uber, Unity, Buzzvil, and TV-Insight show different solutions and architectures for serving ads in marketing campaigns, embedded into mobile apps, and as SaaS software products.

The post How to Build a Real-Time Advertising Platform with Apache Kafka and Flink appeared first on Kai Waehner.

An advertising platform requires real-time capabilities to provide dynamic targeting, ad personalization, ad fraud detection, budget allocation, and event-driven marketing. This blog post explores how data streaming with Apache Kafka and Apache Flink enables context-specific advertising at any scale. Real-world success stories from Pinterest, Uber, Reddit, Unity, Buzzvil, and TV-Insight show different solutions and architectures for serving ads in marketing campaigns, embedded into mobile apps, and as SaaS software products.

Real-Time Advertising Platform with Apache Kafka and Flink

What is a digital advertising platform?

An advertising (ads) platform is a digital system or service that allows businesses and advertisers to create, manage, and optimize their advertising campaigns across various channels. These platforms provide tools and features to target specific audiences, allocate budgets, track performance, and measure the effectiveness of advertising efforts.

Digital Marketing

Examples of advertising platforms include Google Ads, Facebook Ads, and programmatic advertising platforms that automate ad placement across websites and apps. These platforms play a crucial role in digital marketing, enabling advertisers to reach their target audience online and achieve their marketing objectives.

Challenges of an advertising platform

  1. Competition: Advertisers often face fierce competition for ad space. This can lead to higher costs and the need for effective targeting strategies.
  2. Ad Fraud: Digital advertising is susceptible to various forms of ad fraud, including click and impression fraud. Advertisers need to implement measures to protect their campaigns from fraudulent activity.
  3. Data Privacy Regulations: Stricter data privacy regulations, such as GDPR and CCPA, impact how advertisers collect and use customer data for targeting. Advertisers must comply with these regulations to avoid legal consequences.
  4. Ad Quality and Relevance: Ensuring that ads are of high quality and relevance to the target audience is essential. Poorly designed or irrelevant ads can lead to wasted ad spend and a negative user experience.
  5. Ad Fatigue: Showing the same ads repeatedly to users can lead to ad fatigue, causing users to ignore or block the ads. Advertisers need to manage frequency and creative refresh to combat this.
  6. Measurement and Attribution: Accurately measuring the impact of advertising campaigns and attributing conversions to specific ads or channels can be challenging, especially in a multi-channel marketing environment.
  7. Platform Changes: Advertising platforms frequently update their algorithms and policies. Advertisers need to adapt to these changes and stay informed to maintain campaign effectiveness.
  8. Budget Management: Effective budget allocation across various channels and campaigns can be complex. Balancing the budget to achieve the best results is an ongoing challenge.
  9. Creative Variation: Creating and testing different ad creatives to find the most effective ones requires ongoing effort and creativity.
  10. Ad Placement: Choosing the right placements on websites, apps, and social media is crucial. Advertisers must consider where their target audience spends their time.

Navigating these challenges requires a data-driven platform and a deep understanding of the digital advertising landscape, constant monitoring, and optimization.

Why does an ads platform need to be real-time?

An advertising platform should be real-time for several important reasons:

  1. Timely Campaign Adjustments: Real-time data allows advertisers to adjust their advertising campaigns promptly. They can respond quickly to market conditions, user behavior, or campaign performance changes. For example, if a particular ad is not performing well or if a sudden surge in user interest occurs, advertisers can pause or modify their campaigns immediately to optimize results.
  2. Dynamic Targeting: Real-time data enables dynamic and precise targeting. Advertisers can adjust their targeting criteria on the fly based on real-time user actions and data, ensuring that ads are delivered to the most relevant audience at the right moment.
  3. Optimized Bidding: Real-time bidding (RTB) is a crucial component of programmatic advertising. Advertisers can bid on ad inventory in real-time based on user data, maximizing their chances of winning ad placements at the best prices.
  4. Ad Personalization: Real-time data allows for highly personalized ad experiences. Advertisers can serve ads tailored to individual user preferences and behavior, increasing the likelihood of engagement and conversion.
  5. Ad Fraud Detection: Real-time monitoring and analysis of ad traffic can help detect and prevent ad fraud as it occurs. Ad platforms can identify suspicious patterns and take action to mitigate fraud, protecting advertisers’ investments.
  6. Budget Allocation: Real-time data informs budget allocation decisions. Advertisers can allocate more budget to high-performing campaigns and reduce spending on underperforming ones in real-time, ensuring efficient use of resources.
  7. Competitive Advantage: Real-time capabilities can provide a significant advantage in a competitive advertising landscape. Advertisers who can react swiftly to market changes and trends can capture opportunities that slower competitors might miss.
  8. User Engagement: Real-time advertising can engage users at the most opportune moments. For example, an e-commerce platform can display retargeting ads to users who abandoned their shopping carts in real-time, encouraging them to complete their purchases.
  9. Event-Driven Marketing: Real-time capabilities enable event-driven marketing. Advertisers can trigger ads based on specific user actions or external events, such as holidays or significant news events, making their campaigns more relevant and timely.
  10. Measurement and Attribution: Real-time data allows for immediate measurement of ad performance and attribution of conversions. Advertisers can track which ads and channels drive results and adjust their strategies accordingly.
  11. User Experience: Real-time ads can enhance the user experience by delivering current and contextually relevant content and offers. This can improve user engagement and satisfaction.

In today’s digital advertising landscape, where user behavior and market conditions can change rapidly, real-time capabilities are essential for advertisers to stay competitive, make data-driven decisions, and maximize the impact of their advertising campaigns. Real-time advertising platforms empower advertisers to be more agile, responsive, and effective in reaching their target audience.

How does Apache Kafka help build an advertising platform?

Apache Kafka combines real-time messaging at any scale with true decoupling through its event store. The data streaming platform collects data, correlates real-time and historical events with stream processing, and shares created information with downstream consumers.

Data Streaming with Apache Kafka and Apache Flink for Advertisement Platform and Ads

One of the most underestimated capabilities is the out-of-the-box capability of Apache Kafka to ensure data consistency across real-time and non-real-time systems. The heart of the enterprise architecture is real-time, scalable, and reliable. But any near real-time, batch or request-response communication can produce or consume at its own pace with its own API or programming language.

Apache Flink is ideal for data correlation. No matter if the task is data integration (aka streaming ETL) or advanced stateful business and application logic. Apache Kafka and Apache Flink are a match made in heaven for data streaming.

Real Time Bidding and Fraud Detection Advertisement Platform with Apache Kafka and Flink

Real-world success stories show how data streaming with Kafka and Flink helps build a next-generation advertising platform. These technologies solve the abovementioned challenges to provide real-time and consistent information across all applications.

Advertising platforms are either directly embedded into customer-facing applications or built as software or SaaS products that other companies buy and leverage.

The following success stories explore ad platforms built with data streaming:

  • Pinterest: Image-sharing and natural engagement with ads (Kafka Streams).
  • Buzzvil: Lock screen advertisement for smartphones (Kafka and Confluent Cloud).
  • TV-Insight: Live decisions of regular TV ad blocks (Kafka, Flink, and Confluent Cloud).
  • Unity: Monetization network for gaming (Kafka and Confluent).
  • Uber Eats: Ads in the mobile food delivery app (Kafka, Flink, Pinot).
  • Reddit: Ad placement including real-time budget planning without over-delivery or under-delivery (Kafka, Flink, Druid).

Pinterest – Social media natural engagement with ads

Pinterest is an American image-sharing and social media service designed to enable the saving and discovery of information (specifically “ideas”) like recipes, home, style, motivation, and inspiration on the internet.

The content of ads is very close to the actual content. Naturally, users engage with the content and ads:

Pinterest Mobile App Home Feed Search and Ads

Pinterest talked about its Kafka-powered advertising platform for the first time in 2018 at a Kafka Summit. The Ad platform leverages Kafka for the data ingestion pipeline and stream processing with Kafka Streams to enable a real-time feedback loop. Recommendation engine (via machine learning), budgeting, and new ads exploration are some of the critical use cases.

Pinterest Ads Engine built with Kafka Streams

The continuous feedback loop enables real-time updates in seconds. Stateful stream processing with Kafka Streams correlates events from users, ads, budget, and other interfaces to decide on ads serving.

Stateful Stream Processing with Kafka Streams at Pinterest

Real-time (even at an extreme scale) is critical for Pinterest. When a new ad is created, the ads platform does not know about the user engagement with this ad on different surfaces. The faster the ads platform knows about the performance of the newly created ad, the better value can be provided to the user.

There is a balance between exploiting good ads and exploring new ads. The solution was adding a boosting factor to new ads to increase the probability of winning an auction.

Listen to the talk from Pinterest for more details, best practices, and lessons learned in developing and operating a scalable, real-time advertising platform with stateful stream processing using Kafka Streams.

Buzzvil – Lock screen advertising platform

Buzzvil provides a lock screen advertising platform that connects partners and advertisers:

buzzvil – AdTech for Publishers and Advertisers

Buzzvil’s advertising platform is data-driven and built with Apache Kafka in the cloud. It optimizes ad spending through automation, behavioral analytics, audience targeting, rewards programs, and more. Data streaming enables a single source of truth for real-time ad transaction data.

buzzvil - Advertisement Platform built with Apache Kafka in Confluent Cloud

They built the ad platform with Apache Kafka in a fully managed Confluent Cloud to focus on business logic and faster time-to-market.

Data streaming with Apache Kafka enables 18x faster data updates for ad bidding. Confluent Cloud saves 20-30% infrastructure cost.

TV-Insight – Live decisions of regular TV ad blocks

TV-Insight developed a solution to help Joint Industry Committees (JIC), Broadcasters, and Advertisers to improve and evolve the data quality of existing TV measurement panels using return path data of connected devices.

The problem of monitoring classical TV

The essential difference between TV-Insight and all other “panel boosting” initiatives and products is that TV-Insight uses real-time data. Therefore, it can provide live TV reach for live decisions on regular TV ad blocks.

TVI Insight Live Reach Prediction in Real-Time

The TV-Insight application collects data from the Smart TV or Set-Top Box via GDPR-compliant device tracking. The live extrapolation enables advertising optimization:

TV Insight Enterprise Architecture for Real-Time Ads

The technical architecture and data pipeline look like the following. Apache Kafka is the real-time messaging platform and event store. Stateful stream processing with Apache Flink correlates events to calculate real-time ad serving in the advertising platform.

Apache Kafka and Apache Flink for Advertisement Platform at TV Insight

Unity Ads – Monetization network for gaming

Unity is a cross-platform game engine developed by Unity Technologies. The engine has since been gradually extended to support a variety of desktop, mobile, console, and virtual reality platforms. The engine can create three-dimensional (3D) and two-dimensional (2D) games, interactive simulations, and other experiences. Industries outside video gaming have adopted the engine, such as film, automotive, architecture, engineering, and construction.

In 2019, Unity apps and content were installed 33 billion times, reaching 3 billion devices worldwide.

The 3D development platform and game engine is not the only product of Unity Technologies. Unity Ads is one of the largest monetization networks in the world:

  • Reward players for watching ads
  • Incorporate banner ads
  • Incorporate Augmented Reality (AR) ads
  • Playable ads
  • Cross-Promotions
  • IAPs (in-app purchases)

Unity is a data-driven company:

  • Average about half a million events per second
  • Handles millions of dollars in monetary transactions
  • Data infrastructure based on Apache Kafka

A single data pipeline provides the foundational infrastructure for analytics, R&D, monetization, cloud services, etc., for real-time and batch processing leveraging Apache Kafka:

  • Real-time monetization network
  • Feed machine learning models in real-time
  • Data lake went from two-day latency down to 15 minutes

If you want to learn about Unity’s success story of migrating this platform from self-managed Kafka to the cloud, read the post on the Confluent Blog: “How Unity uses Confluent for real-time event streaming at scale“.

Uber Eats – Ads embedded into food delivery app

Uber provides an exciting food delivery app capability: Uber Eats allows embedding ads. With this ability came new challenges that needed to be solved at Uber, such as systems for ad auctions, bidding, attribution, reporting, and more.

Uber wrote an excellent article that focuses on how they leveraged open source technology to build Uber’s first near real-time exactly-once event processing system. Uber leverages Kafka, Flink, and Pinot for its advertising platform – a perfect combination of the right technologies.

Uber Eats Architecture of the Advertisement Platform using Kafka Flink and Pinot

As Uber writes: “With every ad served, there are corresponding events per user (impressions, clicks). The responsibility of the ad events processing system is to manage the flow of events, cleanse them, aggregate clicks and impressions, attribute them to orders, and provide this data in an accessible format for reporting and analytics as well as dependent clients (e.g., other ads systems).”

While speed, scale, and reliability are always crucial for such a system, I want to emphasize the part about accuracy and why exactly-once processing with Kafka and Flink was a critical piece of the architecture.

The Aggregation Job implemented with Apache Flink does a lot of the heavy lifting: Data cleansing, persistence for order attribution, aggregation, and record UUID generation.

Uber Aggregation Job with Apache Flink

Exactly-once with Kafka and Flink is very important, as their blog post explains: “Uber can’t afford to overcount events. Double counting clicks results in overcharging advertisers and overreporting the success of ads. Both being poor customer experiences, this requires processing events exactly-once. Uber is the marketplace in which ads are being served, therefore our ad attribution must be 100% accurate.”
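
Uber’s implementation is Java-based, but the same exactly-once guarantee can be sketched with PyFlink: enable checkpointing in EXACTLY_ONCE mode and configure the Kafka sink with a transactional delivery guarantee. The broker address, topic, and transactional id prefix below are placeholders:

```python
# Hedged sketch: exactly-once delivery from a Flink job into Kafka.
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import CheckpointingMode, StreamExecutionEnvironment
from pyflink.datastream.connectors.base import DeliveryGuarantee
from pyflink.datastream.connectors.kafka import (
    KafkaRecordSerializationSchema,
    KafkaSink,
)

env = StreamExecutionEnvironment.get_execution_environment()
# Exactly-once sinks require checkpointing: Kafka transactions commit on checkpoint.
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)

sink = (
    KafkaSink.builder()
    .set_bootstrap_servers("localhost:9092")
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
        .set_topic("aggregated-ad-events")
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    )
    .set_delivery_guarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .set_transactional_id_prefix("ad-events-agg")
    .build()
)
# A stream of aggregated click/impression counts would then use: stream.sink_to(sink)
```

Because Kafka transactions commit on each Flink checkpoint, downstream consumers should read with isolation.level=read_committed to see only committed results.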

Reddit – Ad Placement with Budget Planning Avoiding Over- or Under-Delivery

Reddit is an American social news aggregation, content rating, and discussion website. Registered users submit content to the site such as links, text posts, images, and videos, which other members then vote up or down.

Reddit’s ads platform allows advertisers to create ad campaigns and set both daily and lifetime budgets for a campaign. Here is Reddit’s decision tree to place advertisements:

Reddit Decision Tree to Place Advertisements

The data pipeline leverages Kafka, Flink, and Druid to analyze campaign budgets in real-time. The platform leverages real-time plus historical user activity data to decide which ad to place. All within 30 milliseconds to avoid over-delivery and under-delivery (budget spent too quickly / slowly).
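
Reddit has not published its pipeline as code, but the core pacing signal resembles a windowed aggregation. Here is a hedged Flink SQL sketch (embedded in Python) with hypothetical table and column names:

```python
# Hedged sketch: per-campaign ad spend aggregated in one-minute windows - the
# kind of signal a pacing service compares against the remaining budget.
# 'ad_impressions', 'campaign_id', 'cost', and 'event_time' are hypothetical.
pacing_query = """
    SELECT
        campaign_id,
        window_start,
        window_end,
        SUM(cost) AS spend_per_minute
    FROM TABLE(
        TUMBLE(TABLE ad_impressions, DESCRIPTOR(event_time), INTERVAL '1' MINUTE)
    )
    GROUP BY campaign_id, window_start, window_end
"""
# Executed against a Flink TableEnvironment where 'ad_impressions' is a table
# backed by a Kafka topic with an event-time watermark defined on event_time.
```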

Reddit Ads Serving Platform using Apache Kafka, Flink and Druid

Watch Reddit’s talk from Druid Summit “Low Latency Real-Time Ads Pacing Queries” to learn more about their ads platform and use cases.

Real-world success stories from Pinterest, Uber, Unity, Buzzvil, Reddit, and TV-Insight showed how to embed real-time advertising into your applications or build a dedicated marketing product.

Data streaming with Apache Kafka and Apache Flink enables context-specific advertising at scale in real time. The cloud makes it possible to focus on business logic and faster time-to-market with a fully managed data streaming platform.

How do you leverage data streaming in marketing and advertising use cases? Do you deploy at the edge, in the cloud, or both? Or do you integrate 3rd party marketing platforms into your advertising platforms? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post How to Build a Real-Time Advertising Platform with Apache Kafka and Flink appeared first on Kai Waehner.

Quix Streams – Stream Processing with Kafka and Python https://www.kai-waehner.de/blog/2023/05/28/quix-streams-stream-processing-with-kafka-and-python/ Sun, 28 May 2023 10:22:38 +0000 https://www.kai-waehner.de/?p=5441 Over 100,000 organizations use Apache Kafka for data streaming. However, there is a problem: The broad ecosystem lacks a mature client framework and managed cloud service for Python data engineers. Quix Streams is a new technology on the market trying to close this gap. This blog post discusses this Python library, its place in the Kafka ecosystem, and when to use it instead of Apache Flink or other Python- or SQL-based substitutes.

The post Quix Streams – Stream Processing with Kafka and Python appeared first on Kai Waehner.

Over 100,000 organizations use Apache Kafka for data streaming. However, there is a problem: The broad ecosystem lacks a mature client framework and managed cloud service for Python data engineers. Quix Streams is a new technology on the market trying to close this gap. This blog post discusses this Python library, its place in the Kafka ecosystem, and when to use it instead of Apache Flink or other Python- or SQL-based substitutes.

Python Kafka Quix Streams and Flink for Open Source Stream Processing

Why Python and Apache Kafka together?

Python is a high-level, general-purpose programming language. It has many use cases for scripting and development. But there is one fundamental reason for its success: Data engineers and data scientists use Python. Period.

Yes, R is another excellent programming language for statistical computing. And there are many low-code/no-code visual coding platforms for machine learning (ML).

SQL usage is ubiquitous amongst data engineers and data scientists, but it’s a declarative formalism that isn’t expressive enough to specify all necessary business logic. When data transformation or non-trivial processing is required, data engineers and data scientists use Python.

Hence: Data engineers and data scientists use Python. If you don’t give them Python, you will find either shadow IT or Python scripts embedded into the coding box of a low-code tool.

Apache Kafka is the de facto standard for data streaming. It combines real-time messaging, storage for true decoupling and replayability of historical data, data integration with connectors, and stream processing for data correlation. All in a single platform. At scale for transactions and analytics.

Python and Apache Kafka for Data Engineering and Machine Learning

In 2017, I wrote a blog post about “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka“. The article is still accurate and explores how data streaming and AI/ML are complementary:

Apache Kafka Open Source Ecosystem as Infrastructure for Machine Learning

Machine Learning requires a lot of infrastructure for data collection, data engineering, model training, model scoring, monitoring, and so on. Data streaming with the Kafka ecosystem enables these capabilities in real-time, reliable, and at scale.

DevOps, microservices, and other modern deployment concepts merged the job roles of software developers and data engineers/data scientists. The focus is much more on data products solving a business problem, operated by the team that develops it. Therefore, the Python code needs to be production-ready and scalable.

As mentioned above, the data engineering and ML tasks are usually realized with Python APIs and frameworks. Here is the problem: The Kafka ecosystem is built around Java and the JVM. Therefore, it lacks good Python support.

Let’s explore the options and why Quix Streams is a brilliant opportunity for data engineering teams for machine learning and similar tasks.

What options exist for Python and Kafka?

Many alternatives exist for data engineers and data scientists to leverage Python with Kafka.

Python integration for Kafka

Here are a few common alternatives for integrating Python with Kafka and their trade-offs:

  • Python Kafka client libraries: Produce and consume via Python. I wrote an example of using a Python Kafka client and TensorFlow to replay historical data from the Kafka log and train an analytic model. This is solid but insufficient for advanced data engineering, as it lacks processing primitives such as the filtering and joining operations found in Kafka Streams and other stream processing libraries (see the minimal client sketch after this list).
  • Kafka REST APIs: Confluent REST Proxy and similar components enable producing and consuming to/from Kafka. They work well for gluing interfaces together, but are not ideal for ML workloads with low latency and critical SLAs.
  • SQL: Stream processing engines like ksqlDB or FlinkSQL allow querying of data in SQL. I wrote an example of Model Training with Python, Jupyter, KSQL and TensorFlow. It works. But ksqlDB and Flink are other systems that need to be operated. And SQL isn’t expressive enough for all use cases.
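
For reference, here is roughly what that client baseline looks like with the confluent-kafka Python library; broker address and topic are placeholders. Producing and consuming is straightforward, but everything beyond that (windows, joins, state) is up to you:

```python
# Minimal produce/consume with the confluent-kafka client. There are no stream
# processing primitives here: filtering, joining, and state are up to you.
from confluent_kafka import Consumer, Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker
producer.produce("clicks", key="user-42", value='{"ad_id": "a1", "event": "click"}')
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["clicks"])
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```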

Instead of just integrating Python and Kafka via APIs, native stream processing provides the best of both worlds: The simplicity and flexibility of dynamic Python code for rapid prototyping with Jupyter notebooks and serious data engineering AND stream processing for stateful data correlation at scale, both for data ingestion and model scoring.

Stream processing with Python and Kafka

In the past, we had two suboptimal open-source options for stream processing with Kafka and Python:

  • Faust: A stream processing library, porting the ideas from Kafka Streams (a Java library and part of Apache Kafka) to Python; a minimal agent sketch follows this list. The feature set is much more limited compared to Kafka Streams. Robinhood open-sourced Faust, but it lacks maturity and community adoption. I saw several customers evaluating it but then moving to other options.
  • Apache Flink’s Python API: Flink’s adoption grows significantly yearly in the stream processing community. This API is a Python version of DataStream API, which allows Python users to write Python DataStream API jobs. Developers can also use the Table API, including SQL directly in there. It is an excellent option if you have a Flink cluster and some folks want to run Python instead of Java or SQL against it for data engineering. The Kafka-Flink integration is very mature and battle-tested.
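
To give a feel for Faust’s Kafka Streams-inspired model, here is a minimal agent sketch; the app name, broker address, topics, and validation logic are placeholder assumptions:

```python
# Minimal Faust sketch: an agent consumes a Kafka topic and forwards events
# that pass a (placeholder) validation to another topic.
import faust

app = faust.App("clicks-app", broker="kafka://localhost:9092")

class Click(faust.Record):
    ad_id: str
    user_id: str

clicks = app.topic("clicks", value_type=Click)
valid_clicks = app.topic("valid-clicks", value_type=Click)

@app.agent(clicks)
async def filter_clicks(stream):
    async for click in stream:
        if click.ad_id:  # placeholder validation logic
            await valid_clicks.send(value=click)

if __name__ == "__main__":
    app.main()
```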

As you see, all the alternatives for combining Kafka and Python have trade-offs. They work for some use cases but are imperfect for most data engineering and data science projects.

A new open-source framework to the rescue? Introducing a brand new stream processing library for Python: Quix Streams…

What is Quix Streams?

Quix Streams is a stream processing library focused on ease of use for Python data engineers. The library is open-source under Apache 2.0 license and available on GitHub.

Instead of a database, Quix Streams uses a data streaming platform such as Apache Kafka. You can process data with high performance and save resources without introducing a delay.

Some of the Quix Streams differentiators are being lightweight and powerful, with no JVM and no need for separate clusters or orchestrators. It sounds like the pitch for why to use Kafka Streams in the Java ecosystem minus the JVM – this is a positive comment! 🙂

Quix Streams does not use any domain-specific language or embedded framework. It’s a library that you can use in your code base. This means that with Quix Streams, you can use any external library for your chosen language. For example, data engineers can leverage Pandas, NumPy, PyTorch, TensorFlow, Transformers, and OpenCV in Python.

So far, so good. This was more or less the copy & paste of Quix Streams marketing (it makes sense to me)… Now let’s dig deeper into the technology.
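
As a first taste, here is a minimal sketch. Be aware that the API shown reflects the later, pure-Python rewrite of Quix Streams (the Application / StreamingDataFrame interface); the beta discussed in this post exposed a different interface, and the broker address and topic names are placeholders:

```python
# Hedged sketch using the later pure-Python Quix Streams API. Names may differ
# from the beta version discussed in this post; topics and broker are placeholders.
from quixstreams import Application

app = Application(broker_address="localhost:9092", consumer_group="demo")
readings = app.topic("sensor-readings")  # JSON values deserialize to dicts
alerts = app.topic("sensor-alerts")

sdf = app.dataframe(readings)
sdf = sdf.filter(lambda row: row["temperature"] > 30)  # plain Python predicate
sdf = sdf.to_topic(alerts)

if __name__ == "__main__":
    app.run(sdf)
```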

The Quix Streams API and developer experience

The following is the first feedback after playing around, doing code analysis, and speaking with some Confluent colleagues and the Quix Streams team.

The good

  • The Quix API and tooling persona is the data engineer (that’s at least my understanding). Hence, it does not directly compete with other offerings, say a Java developer using Kafka Streams. Again, the beauty of microservices and data mesh is the focus of an application or data product per use case. Choose the right tool for the job!
  • The API is mostly sleek, with some weirdness / unintuitive parts. But it is still in beta, so hopefully, it will get more refined in the subsequent releases. No worries at this early stage of the project.
  • The integration with other data engineering and machine learning Python frameworks is excellent. Being able to combine stream processing with Pandas, NumPy, and similar libraries is a massive benefit for the developer experience.
  • The Quix library and SaaS platform are compatible with open-source Kafka and commercial offerings and cloud services like Cloudera, Confluent Cloud, or Amazon MSK. Quix’s commercial UI provides out-of-the-box integration with self-managed Kafka and Confluent Cloud. The cloud platform also provides a managed Kafka for testing purposes (for a few cents per Kafka topic, and not meant for production).

The improvable

  • The stream processing capabilities (like powerful sliding windows) are still pretty limited and not comparable to advanced engines like Kafka Streams or Apache Flink. The roadmap includes enhanced features.
  • The architecture is complex since executing the Python API jumps through three languages: Python -> C# -> C++. Does it matter to the end user? It depends on the use case, security requirements, and more. The reasoning for this architecture is Quix’s background coming from the McLaren F1 team and ultra-low latency use cases and building a polyglot platform for different programming environments.
  • It would be interesting to see a benchmark for throughput and latency versus Faust, which is Python top to bottom. There is a trade-off between inter-language marshaling/unmarshalling versus the performance boost of lower-level compiled languages. This should be fine if we trust Quix’s marketing and business model. I expect they will provide some public content soon, as this question will arise regularly.

The Quix Streams Data Pipeline Low Code GUI

The commercial product provides a user interface for building data pipelines and code, MLOps, and a production infrastructure for operating and monitoring the built applications.

Here is an example:

Quix Streams Data Pipeline for Stream Processing

  • Tiles are Kubernetes (K8s) containers; each purple (transformation) and orange (destination) node is backed by a Git project containing the application code.
  • The three blue (source) nodes on the left are replay services used to test the pipeline by replaying specific streams of data.
  • Arrows are individual Kafka topics in Confluent Cloud (green = live data).
  • The first visible pipeline node (bottom left) is joining data from different physical sites (see the three input topics, one was receiving data when I took the image).
  • There are three modular transformations in the visible pipeline (two rolling windows and one interpolation).
  • There are two real-time apps (one real-time Streamlit dashboard and the other is an integration with a Twilio SMS service).

The Quix team wrote a detailed comparison of Apache Flink and Quix Streams. I don’t think it’s an entirely fair comparison as it compares open-source Apache Flink to a Quix SaaS offering. Nevertheless, for the most part, it is a good comparison.

Flink was always Java-first and added support for Python for its DataStream and Table APIs at a later stage. In contrast, Quix Streams is brand new. Hence, it lacks maturity and customer case studies.

Having said all this, I think Quix Streams is a great choice for some stream processing projects in the Python ecosystem!

TL;DR: There is a place for both! Choose the right tool… Modern enterprise architectures built with concepts like data mesh, microservices, and domain-driven design allow this flexibility per use case and problem.

I recommend using Flink if the use case makes sense with SQL or Java. And if the team is willing to operate its own Flink cluster or has a platform team or a cloud service taking over the operational burden and complexity.

In contrast, I would use Quix Streams for Python projects if I want to go to production with a more microservice-like architecture building Python applications. However, beware that Quix currently only has a few built-in stateful functions or JOINs. More advanced stream processing use cases cannot be done with Quix (yet). This is likely changing in the next months by adding more capabilities.

Hence, make sure to read Quix’ comparison with Flink. But keep in mind whether you want to evaluate the open-source Quix Streams library or the Quix SaaS platform. If you are in the public cloud, you might combine Quix Streams SaaS with other fully-managed cloud services like Confluent Cloud for Kafka. On the other side, in your own private VPC or on premise, you need to build your own platform with technologies like the Quix Streams library, Kafka or Confluent Platform, and so on.

The current state and future of Quix Streams

If you build a new framework or product for data streaming, you need to make sure that it does not overlap with existing established offerings. You need differentiators and/or innovation in a new domain that does not exist today.

Quix Streams accomplishes this essential requirement to be successful: The target audience is data engineers with Python backgrounds. No serious, mature tool or vendor exists in this space today. And the demand for Python will grow more and more with the focus on leveraging data for solving business problems in every company.

Maturity: Making the right (marketing) choices in the early stage

Quix Streams is in the early maturity stage. Hence, a lot of decisions can still be strengthened or revamped.

The following buzzwords come into my mind when I think about Quix Streams: Python, data streaming, stream processing, Python, data engineering, Machine Learning, open source, cloud, Python, .NET, C#, Apache Kafka, Apache Flink, Confluent, MSK, DevOps, Python, governance, UI, time series, IoT, Python, and a few more.

TL;DR: I see a massive opportunity for Quix Streams to become a great data engineering framework (and SaaS offering) for Python users.

I am not a fan of polyglot platforms. It requires finding the lowest common denominator. I was never a fan of Apache Beam for that reason. The Kafka Streams community did not choose to implement the Beam API because of too many limitations.

Similarly, most people do not care about the underlying technology. Yes, Quix Streams’ core is C++. But is the goal to roll out stream processing for various programming languages, only starting with Python, then going to .NET, and then to another one? I am skeptical.

Hence, I like to see a change in the marketing strategy already: Quix Streams started with the pitch of being designed for high-frequency telemetry services when you must process high volumes of time-series data with up to nanosecond precision. It is now being revamped to focus mainly on Python and data engineering.

Competition: Friends or enemies?

Getting market adoption is still hard. Intuitive use of the product, building a broad community, and the right integrations and partnerships (can) make a new product such as Quix Streams successful. Quix Streams is on a good way here. For instance, integrating serverless Confluent Cloud and other Kafka deployments works well:

Quix Streams Integration with Apache Kafka and Confluent Cloud

This is a native integration, not a connector. Everything in the pipeline image runs as a direct Kafka protocol connection using raw TCP/IP packets to produce and consume data to topics in Confluent Cloud. The Quix platform orchestrates the management of the Confluent Cloud Kafka cluster (create/delete topics, topic sizing, topic monitoring, etc.) using Confluent APIs.

However, one challenge of these kinds of startups is the decision to complement versus compete with existing solutions, cloud services, and vendors. For instance, how much time and money do you invest in data governance? Should you build this or use the complementing streaming platform or a separate independent tool (like Collibra)? We will see where Quix Streams will go here. Building its cloud platform for addressing Python engineers or overlapping with other streaming platforms?

My advice is the proper integration with partners that lead in their space. Working with Confluent for over six years, I know what I am talking about: We do one thing, data streaming, but we are the best in that one. We don’t even try to compete with other categories. Yes, a few overlaps always exist, but instead of competing, we strategically partner and integrate with other vendors like Snowflake (data warehouse), MongoDB (transactional database), HiveMQ (IoT with MQTT), Collibra (enterprise-wide data governance), and many more. Additionally, we extend our offering with more data streaming capabilities, i.e., improving our core functionality and business model. The latest example is our integration of Apache Flink into the fully-managed cloud offering.

Kafka for Python? Look at Quix Streams!

In the end, a data engineer or developer has several options for stream processing deeply integrated into the Kafka ecosystem:

  • Kafka Streams: Java client library
  • ksqlDB: SQL service
  • Apache Flink: Java, SQL, Python service
  • Faust: Python client library
  • Quix Streams: Python client library

All have their pros and cons. The persona of the data engineer or developer is a crucial criterion. Quix Streams is a nice new open-source framework for the broader data streaming community. If you cannot or do not want to use just SQL, but native Python, then watch the project (and the company/cloud service behind it).

UPDATE – May 30th, 2023: bytewax is another open-source stream processing library for Python integrating with Kafka. It is implemented in Rust under the hood. I have not seen it in the field yet, but a few comments mentioned it after I shared this blog post on social networks, so I think it is worth a mention. Let’s see if it gets more traction in the following months.

Do you already use stream processing, or is Kafka just your data hub and pipeline? How do you combine Python and Kafka today? Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Quix Streams – Stream Processing with Kafka and Python appeared first on Kai Waehner.
