Fraud Detection in Mobility Services (Ride-Hailing, Food Delivery) with Data Streaming using Apache Kafka and Flink
https://www.kai-waehner.de/blog/2025/04/28/fraud-detection-in-mobility-services-ride-hailing-food-delivery-with-data-streaming-using-apache-kafka-and-flink/
Mon, 28 Apr 2025

Mobility services like Uber, Grab, FREE NOW (Lyft), and DoorDash are built on real-time data. Every trip, delivery, and payment relies on accurate, instant decision-making. But as these services scale, they become prime targets for sophisticated fraud—GPS spoofing, fake accounts, payment abuse, and more. Traditional, batch-based fraud detection can’t keep up. It reacts too late, misses complex patterns, and creates blind spots that fraudsters exploit. To stop fraud before it happens, mobility platforms need data streaming technologies like Apache Kafka and Apache Flink for fraud detection. This blog explores how leading platforms are using real-time event processing to detect and block fraud as it happens—protecting revenue, user trust, and platform integrity at scale.

Fraud Prevention in Mobility Services with Data Streaming using Apache Kafka and Flink with AI Machine Learning

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

The Business of Mobility Services (Ride-Hailing, Food Delivery, Taxi Aggregators, etc.)

Mobility services have become an essential part of modern urban life. They offer convenience and efficiency through ride-hailing, food delivery, car-sharing, e-scooters, taxi aggregators, and micro-mobility options. Companies such as Uber, Lyft, FREE NOW (formerly MyTaxi; recently acquired by Lyft), Grab, Careem, and DoorDash connect millions of passengers, drivers, restaurants, retailers, and logistics partners to enable seamless transactions through digital platforms.

Taxis and Delivery Services in a Modern Smart City

These platforms operate in highly dynamic environments where real-time data is crucial for pricing, route optimization, customer experience, and fraud detection. However, this very nature of mobility services also makes them prime targets for fraudulent activities. Fraud in this sector can lead to financial losses, reputational damage, and deteriorating customer trust.

To effectively combat fraud, mobility services must rely on real-time data streaming with technologies such as Apache Kafka and Apache Flink. These technologies enable continuous event processing and allow platforms to detect and prevent fraud before transactions are finalized.

Why Fraud is a Major Challenge in Mobility Services

Fraudsters continually exploit weaknesses in digital mobility platforms. Some of the most common fraud types include:

  1. Fake Rides and GPS Spoofing: Drivers manipulate GPS data to simulate trips that never occurred. Passengers use location spoofing to receive cheaper fares or exploit promotions.
  2. Payment Fraud and Stolen Credit Cards: Fraudsters use stolen payment methods to book rides or order food.
  3. Fake Drivers and Passengers: Fraudsters create multiple accounts and pretend to be both the driver and passenger to collect incentives. Some drivers manipulate fares by manually adjusting distances in their favor.
  4. Promo Abuse: Users create multiple fake accounts to exploit referral bonuses and promo discounts.
  5. Account Takeovers and Identity Fraud: Hackers gain access to legitimate accounts, misusing stored payment information. Fraudsters use fake identities to bypass security measures.

Fraud not only impacts revenue but also creates risks for legitimate users and drivers. Without proper fraud prevention measures, ride-hailing and delivery companies could face serious losses, both financially and operationally.

The Unseen Enemy: Core Challenges in Mobility Fraud Detection

Traditional fraud detection relies on batch processing and manual rule-based systems. However, these approaches are no longer effective against modern fraud schemes, given the speed and complexity of today's mobile apps and their real-time user experiences.

Payment Fraud – The Hidden Enemy in a Digital World

Key challenges in mobility fraud detection include:

  • Fraud occurs in real-time, requiring instant detection and prevention before transactions are completed.
  • Millions of events per second must be processed, requiring scalable and efficient systems.
  • Fraud patterns constantly evolve, making static rule-based approaches ineffective.
  • Platforms operate across hybrid and multi-cloud environments, requiring seamless integration of fraud detection systems.

To overcome these challenges, real-time streaming analytics powered by Apache Kafka and Apache Flink provide an effective solution.

Event-driven Architecture for Mobility Services with Data Streaming using Apache Kafka and Flink

Apache Kafka: The Backbone of Event-Driven Fraud Detection

Kafka serves as the core event streaming platform. It captures and processes real-time data from multiple sources such as:

  • GPS location data
  • Payment transactions
  • User and driver behavior analytics
  • Device fingerprints and network metadata

Kafka provides:

  • High-throughput data streaming, capable of processing millions of events per second to support real-time decision-making.
  • An event-driven architecture that enables decoupled, flexible systems—ideal for scalable and maintainable mobility platforms.
  • Seamless scalability across hybrid and multi-cloud environments to meet growing demand and regional expansion.
  • Always-on reliability, ensuring 24/7 data availability and consistency for mission-critical services such as fraud detection, pricing, and trip orchestration.
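To make the ingestion side concrete, here is a minimal sketch of how a mobility backend might publish trip telemetry to Kafka. The topic name `gps-events`, the JSON payload, and the broker address are illustrative assumptions for this example, not details from any of the platforms discussed here:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class GpsEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all for durability: fraud signals must not be lost
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by trip ID so all events of one trip land in the same partition,
            // preserving ordering for downstream fraud logic
            String tripId = "trip-4711";
            String gpsEvent = "{\"tripId\":\"trip-4711\",\"lat\":52.52,\"lon\":13.405,\"ts\":1714286000000}";
            producer.send(new ProducerRecord<>("gps-events", tripId, gpsEvent));
        }
    }
}
```

Keying by trip ID matters here: downstream fraud detection often compares consecutive GPS points of the same trip, which only works reliably if those events stay in order.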

An excellent success story about the transition to data streaming comes from DoorDash: Why DoorDash migrated from Cloud-native Amazon SQS and Kinesis to Apache Kafka and Flink.

Apache Flink: Real-Time Fraud Detection with Stream Processing and AI

Apache Flink enables real-time fraud detection through advanced event correlation and applied AI:

  • Detects anomalies in GPS data, such as sudden jumps, route manipulation, or unrealistic movement patterns.
  • Analyzes historical user behavior to surface signs of account takeovers or other forms of identity misuse.
  • Joins multiple real-time streams—including payment events, location updates, and account interactions—to generate accurate, low-latency fraud scores.
  • Applies machine learning models in-stream, enabling the system to flag and stop suspicious transactions before they are processed.
  • Continuously adapts to new fraud patterns, updating models with fresh data in near real-time to reflect evolving user behavior and emerging threats.
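As an illustration of the first point, the following Flink sketch flags physically impossible jumps between consecutive GPS points of a trip. The `GpsEvent` type and the 250 km/h threshold are assumptions invented for this example; a production job would use calibrated models rather than a fixed threshold:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical event type: trip ID, latitude, longitude, epoch millis
record GpsEvent(String tripId, double lat, double lon, long ts) {}

// Keyed by trip ID: flags implausible jumps between consecutive GPS points of a trip
public class GpsJumpDetector extends KeyedProcessFunction<String, GpsEvent, String> {
    private transient ValueState<GpsEvent> lastEvent;

    @Override
    public void open(Configuration parameters) {
        lastEvent = getRuntimeContext().getState(
            new ValueStateDescriptor<>("last-gps-event", GpsEvent.class));
    }

    @Override
    public void processElement(GpsEvent current, Context ctx, Collector<String> out) throws Exception {
        GpsEvent previous = lastEvent.value();
        if (previous != null) {
            double km = haversineKm(previous.lat(), previous.lon(), current.lat(), current.lon());
            double hours = (current.ts() - previous.ts()) / 3_600_000.0;
            // > 250 km/h between pings is implausible for a car: likely GPS spoofing
            if (hours > 0 && km / hours > 250.0) {
                out.collect("Possible GPS spoofing on trip " + current.tripId());
            }
        }
        lastEvent.update(current);
    }

    // Great-circle distance between two coordinates in kilometers
    private static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }
}
```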

With Kafka and Flink, fraud detection can shift from reactive to proactive to stop fraudulent transactions before they are completed.

I already covered various data streaming success stories from financial services companies such as PayPal, Capital One, and ING Bank in a dedicated blog post. There is also a separate case study about “Fraud Prevention in Under 60 Seconds with Apache Kafka: How A Bank in Thailand is Leading the Charge”.

Real-World Fraud Prevention Stories from Mobility Leaders

Fraud is not just a technical issue—it’s a business-critical challenge that impacts trust, revenue, and operational stability in mobility services. The following real-world examples from industry leaders like FREE NOW (Lyft), Grab, and Uber show how data streaming with advanced stream processing and AI is used around the world to detect and stop fraud in real time, at massive scale.

FREE NOW (Lyft): Detecting Fraudulent Trips in Real Time by Analyzing GPS Data of Cars

FREE NOW operates in more than 150 cities across Europe with 48 million users. It integrates multiple mobility services, including taxis, private vehicles, car-sharing, e-scooters, and bikes.

The company was recently acquired by Lyft, the U.S.-based ride-hailing giant known for its focus on multimodal urban transport and strong presence in North America. This acquisition marks Lyft’s strategic entry into the European mobility ecosystem, expanding its footprint beyond the U.S. and Canada.

FREE NOW - former MyTaxi - Company Overview
Source: FREE NOW

Fraud Prevention Approach leveraging Data Streaming (presented at Kafka Summit)

  • Uses Kafka Streams and Kafka Connect to analyze GPS trip data in real time.
  • Deploys fraud detection models that identify anomalies in trip routes and fare calculations.
  • Operates data streaming on fully managed Confluent Cloud and applications on Kubernetes for scalable fraud detection.
Fraud Prevention in Mobility Services with Data Streaming using Kafka Streams and Connect at FREE NOW
Source: FREE NOW

Example: Detecting Fake Rides

  1. A driver inputs trip details into the app.
  2. Kafka Streams predicts expected trip fare based on distance and duration.
  3. GPS anomalies and unexpected route changes are flagged.
  4. Fraud alerts are triggered for suspicious transactions.
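A simplified Kafka Streams topology for this pattern might look as follows. This is not FREE NOW's actual code: the `trips` and `fraud-alerts` topic names, the `Trip` record, and the naive linear fare model are invented for illustration, and serde configuration is omitted:

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FareAnomalyApp {
    // Hypothetical trip record: distance, duration, and the fare actually charged
    record Trip(String tripId, double distanceKm, double durationMin, double chargedFare) {}

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Assumes a "trips" topic with a custom Serde for the Trip type (omitted here)
        KStream<String, Trip> trips = builder.stream("trips");

        trips.filter((tripId, trip) -> {
                // Naive fare model as a stand-in for the real prediction logic:
                // base fare + per-km + per-minute components
                double expected = 3.50 + 1.20 * trip.distanceKm() + 0.35 * trip.durationMin();
                // Flag trips charged more than 50% above the expected fare
                return trip.chargedFare() > expected * 1.5;
            })
            .mapValues(trip -> "Suspicious fare on trip " + trip.tripId())
            .to("fraud-alerts");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fare-anomaly-detector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```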

By implementing real-time fraud detection with Kafka and Flink, FREE NOW (Lyft) has significantly reduced fraudulent trips and improved platform security.

Grab: AI-Powered Fraud Detection for Ride-Hailing and Delivery with Data Streaming and AI/ML

Grab is a leading mobility platform in Southeast Asia, handling millions of transactions daily. Fraud losses account for 1.6 percent of total revenue in the region.

To address these significant fraud numbers, Grab developed GrabDefence—an AI-powered fraud detection engine that leverages real-time data and machine learning to detect and block suspicious activity across its platform.

Fraud Detection and Presentation with Kafka and AI ML at Grab in Asia
Source: Grab

Fraud Detection Approach

  • Uses Kafka Streams and machine learning for fraud risk scoring.
  • Leverages Flink for feature aggregation and anomaly detection.
  • Detects fraudulent transactions before they are completed.
GrabDefence - Fraud Prevention with Data Streaming and AI / Machine Learning in Grab Mobility Service
Source: Grab

Example: Fake Driver and Passenger Fraud

  1. Fraudsters create accounts as both driver and passenger to claim rewards.
  2. Kafka ingests device fingerprints, payment transactions, and ride data.
  3. Flink aggregates historical fraud behavior and assigns risk scores.
  4. High-risk transactions are blocked instantly.
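A hedged sketch of step 3 in Flink: keyed state per device fingerprint remembers which roles a device has been seen in, and an alert is raised when the same device acts as both driver and passenger. The `AccountEvent` type and role values are hypothetical, not taken from GrabDefence:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical account event: device fingerprint plus role ("DRIVER" or "PASSENGER")
record AccountEvent(String deviceFingerprint, String role) {}

// Keyed by device fingerprint: alerts when one device acts as both driver and passenger
public class DeviceCollusionDetector extends KeyedProcessFunction<String, AccountEvent, String> {
    private transient ValueState<Boolean> seenDriver;
    private transient ValueState<Boolean> seenPassenger;

    @Override
    public void open(Configuration parameters) {
        seenDriver = getRuntimeContext().getState(new ValueStateDescriptor<>("seen-driver", Boolean.class));
        seenPassenger = getRuntimeContext().getState(new ValueStateDescriptor<>("seen-passenger", Boolean.class));
    }

    @Override
    public void processElement(AccountEvent event, Context ctx, Collector<String> out) throws Exception {
        if ("DRIVER".equals(event.role())) seenDriver.update(true);
        if ("PASSENGER".equals(event.role())) seenPassenger.update(true);
        if (Boolean.TRUE.equals(seenDriver.value()) && Boolean.TRUE.equals(seenPassenger.value())) {
            out.collect("Driver/passenger collusion suspected for device " + event.deviceFingerprint());
        }
    }
}
```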

With GrabDefence, built on data streaming, Grab reduced its fraud rate to 0.2 percent, well below the industry average. Learn more about GrabDefence in the Kafka Summit talk.

Uber: Project RADAR – AI-Powered Fraud Detection with Human Oversight

Uber processes millions of payments per second globally. Fraud detection is complex due to chargebacks and uncollected payments.

To combat this, Uber launched Project RADAR—a hybrid system that combines machine learning with human reviewers to continuously detect, investigate, and adapt to evolving fraud patterns in near real time. Low latency is not required in this scenario, and humans are in the loop of the business process. Hence, Apache Spark is sufficient for Uber.

Uber Project Radar for Scam Detection with Humans in the Loop
Source: Uber

Fraud Prevention Approach

  • Uses Kafka and Spark for multi-layered fraud detection.
  • Implements machine learning models to detect chargeback fraud.
  • Incorporates human analysts for rule validation.
Uber Project RADAR with Apache Kafka and Spark for Scam Detection with AI and Machine Learning
Source: Uber

Example: Chargeback Fraud Detection

  1. Kafka collects all ride transactions in real time.
  2. Stream processing detects anomalies in payment patterns and disputes.
  3. AI-based fraud scoring identifies high-risk transactions.
  4. Uber’s RADAR system allows human analysts to validate fraud alerts.

Uber’s combination of AI-driven detection and human oversight has significantly reduced chargeback-related fraud.

Fraud in mobility services is a real-time challenge that requires real-time solutions that work 24/7, even at extreme scale for millions of events. Traditional batch processing systems are too slow, and static rule-based approaches cannot keep up with evolving fraud tactics.

By leveraging data streaming with Apache Kafka in conjunction with Kafka Streams or Apache Flink, mobility platforms can:

  • Process millions of events per second to detect fraud in real time.
  • Prevent fraudulent transactions before they occur.
  • Use AI-driven real-time fraud scoring for accurate risk assessment.
  • Adapt dynamically through continuous learning to evolving fraud patterns.

Mobility platforms such as Uber, Grab, and FREE NOW (Lyft) are leading the way in using real-time streaming analytics to protect their platforms from fraud. By implementing similar approaches, other mobility businesses can enhance security, reduce financial losses, and maintain customer trust.

Real-time fraud prevention in mobility services is not an option; it is a necessity. The ability to detect and stop fraud in real time will define the future success of ride-hailing, food delivery, and urban mobility platforms.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And download my free book about data streaming use cases.

CIO Summit: The State of AI and Why Data Streaming is Key for Success
https://www.kai-waehner.de/blog/2025/03/13/cio-summit-the-state-of-ai-and-why-data-streaming-is-key-for-success/
Thu, 13 Mar 2025

This week, I had the privilege of engaging in insightful conversations at the CIO Summit organized by GDS Group in Amsterdam, Netherlands. The event brought together technology leaders from across Europe and industries such as financial services, manufacturing, energy, gaming, telco, and more. The focus? AI – but with a much-needed reality check. While the potential of AI is undeniable, the hype often outpaces real-world value. Discussions at the summit revolved around how enterprises can move beyond experimentation and truly integrate AI to drive business success.

Learnings from the CIO Summit in Amsterdam by GDS Group

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, industry success stories, and business value.

Key Learnings on the State of AI

The CIO Summit in Amsterdam provided a reality check on AI adoption across industries. While excitement around AI is high, success depends on moving beyond the hype and focusing on real business value. Conversations with technology leaders revealed critical insights about AI’s maturity, challenges, and the key factors driving meaningful impact. Here are the most important takeaways.

AI is Still in Its Early Stages – Beware of the Buzz vs. Value

The AI landscape is evolving rapidly, but many organizations are still in the exploratory phase. Executives recognize the enormous promise of AI but also see challenges in implementation, scaling, and achieving meaningful ROI.

The key takeaway? AI is not a silver bullet. Companies that treat it as just another trendy technology risk wasting resources on hype-driven projects that fail to deliver tangible outcomes.

Generative AI vs. Predictive AI – Understanding the Differences

There was a lot of discussion about Generative AI (GenAI) vs. Predictive AI, two dominant categories that serve very different purposes:

  • Predictive AI analyzes historical and real-time data to forecast trends, detect anomalies, and automate decision-making (e.g., fraud detection, supply chain optimization, predictive maintenance).
  • Generative AI creates new content based on trained data (e.g., text, images, or code), enabling applications like automated customer service, software development, and marketing content generation.

While GenAI has captured headlines, Predictive AI remains the backbone of AI-driven automation in enterprises. CIOs must carefully evaluate where each approach adds real business value.

Good Data Quality is Non-Negotiable

A critical takeaway: AI is only as good as the data that fuels it. Poor data quality leads to inaccurate AI models, bad predictions, and failed implementations.

To build trustworthy and effective AI solutions, organizations need:

✅ Accurate, complete, and well-governed data

✅ Real-time and historical data integration

✅ Continuous data validation and monitoring

Context Matters – AI Needs Real-Time Decision-Making

Many AI use cases rely on real-time decision-making. A machine learning model trained on historical data is useful, but without real-time context, it quickly becomes outdated.

For example, fraud detection systems need to analyze real-time transactions while comparing them to historical behavioral patterns. Similarly, AI-powered supply chain optimization depends on up-to-the-minute logistics data rather than just past trends.

The conclusion? Real-time data streaming is essential to unlocking AI’s full potential.

Automate First, Then Apply AI

One common theme among successful AI adopters: Optimize business processes before adding AI.

Organizations that try to retrofit AI onto inefficient, manual processes often struggle with adoption and ROI. Instead, the best approach is:

1⃣ Automate and optimize workflows using real-time data

2⃣ Apply AI to enhance automation and improve decision-making

By taking this approach, companies ensure that AI is applied where it actually makes a difference.

ROI Matters – AI Must Drive Business Value

CIOs are under pressure to deliver business-driven, NOT tech-driven AI projects. AI initiatives that lack a clear ROI roadmap often stall after pilot phases.

Two early success stories for Generative AI stand out:

  • Customer support – AI chatbots and virtual assistants enhance response times and improve customer experience.
  • Software engineering – AI-powered code generation boosts developer productivity and reduces time to market.

The lesson? Start with AI applications that deliver clear, measurable business impact before expanding into more experimental areas.

Data Streaming and AI – The Perfect Match

At the heart of AI’s success is data streaming. Why? Because modern AI requires a continuous flow of fresh, real-time data to make accurate predictions and generate meaningful insights.

Data streaming not only powers AI with real-time insights but also ensures that AI-driven decisions directly translate into measurable business value:

Business Value of Data Streaming with Apache Kafka and Flink in the free Confluent eBook

Here’s how data streaming powers both Predictive and Generative AI:

Predictive AI + Data Streaming

Predictive AI thrives on timely, high-quality data. Real-time data streaming enables AI models to process and react to events as they happen. Examples include:

✔ Fraud detection: AI analyzes real-time transactions to detect suspicious activity before fraud occurs.

✔ Predictive maintenance: Streaming IoT sensor data allows AI to predict equipment failures before they happen.

✔ Supply chain optimization: AI dynamically adjusts logistics routes based on real-time disruptions.

Here is an example from Capital One bank about fraud detection and prevention in real time, preventing an average of $150 of fraud per customer per year:

Predictive AI for Fraud Detection and Prevention at Capital One Bank with Data Streaming
Source: Confluent

Generative AI + Data Streaming

Generative AI also benefits from real-time data. Instead of relying on static datasets, streaming data enhances GenAI applications by incorporating the latest information:

✔ AI-powered customer support: Chatbots analyze live customer interactions to generate more relevant responses.

✔ AI-driven marketing content: GenAI adapts promotional messaging in real-time based on customer engagement signals.

✔ Software development acceleration: AI assistants provide real-time code suggestions as developers write code.

In short, without real-time data, AI is limited to outdated insights.

Here is an example of GenAI with data streaming in the travel industry from Expedia, where 60% of travelers self-service in chat, saving over 40% of variable agent cost:

Generative AI at Expedia in Travel for Customer Service with Chatbots, GenAI and Data Streaming
Source: Confluent

The Future of AI: Agentic AI and the Role of Data Streaming

As AI evolves, we are moving toward Agentic AI – systems that autonomously take actions, learn from feedback, and adapt in real time.

For example:

✅ AI-driven cybersecurity systems that detect and respond to threats instantly

✅ Autonomous supply chains that dynamically adjust based on demand shifts

✅ Intelligent business operations where AI continuously optimizes workflows

But Agentic AI can only work if it has access to real-time operational AND analytical data. That’s why data streaming is becoming a critical foundation for the next wave of AI innovation.

The Path to AI Success

The CIO Summit reinforced one key message: AI is here to stay, but its success depends on strategy, data quality, and business value – not just hype.

Organizations that:

✅ Focus on AI applications with clear business ROI

✅ Automate before applying AI

✅ Prioritize real-time data streaming

… will be best positioned to drive AI success at scale.

As AI moves towards autonomous decision-making (Agentic AI), data streaming will become even more critical. The ability to process and act on real-time data will separate AI leaders from laggards.

Now the real question: Where is your AI strategy headed? Let’s discuss!

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And make sure to download my free book focusing on data streaming use cases, industry stories and business value.

The Role of Data Streaming in McAfee’s Cybersecurity Evolution
https://www.kai-waehner.de/blog/2025/01/27/the-role-of-data-streaming-in-mcafees-cybersecurity-evolution/
Mon, 27 Jan 2025

In today’s digital age, cybersecurity is more vital than ever. Businesses and individuals face escalating threats such as malware, ransomware, phishing attacks, and identity theft. Combatting these challenges requires cutting-edge solutions that protect computers, networks, and devices. Beyond safeguarding digital assets, modern cybersecurity tools ensure compliance, privacy, and trust in an increasingly interconnected world.

As threats grow more sophisticated, the technologies powering cybersecurity solutions must advance to stay ahead. Data streaming technologies like Apache Kafka and Apache Flink have become foundational in this evolution, enabling real-time threat detection, prevention, and rapid response. These tools transform cybersecurity from static defenses to dynamic systems capable of identifying and neutralizing threats as they occur.

A notable example is McAfee, a global leader in cybersecurity, which has embraced data streaming to revolutionize its operations. By transitioning to an event-driven architecture powered by Apache Kafka, McAfee processes massive amounts of real-time data from millions of endpoints, ensuring instant threat identification and mitigation. This integration has enhanced scalability, reduced infrastructure complexity, and accelerated innovation, setting a benchmark for the cybersecurity industry.

Real-time data streaming is not just an advantage—it’s now a necessity for organizations aiming to safeguard digital environments against ever-evolving threats.

Data Streaming with Apache Kafka and Flink as Backbone for Real Time Cybersecurity at McAfee

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (formerly Twitter) to stay in touch.

Antivirus is NOT Enough: Supply Chain Attack

A supply chain attack occurs when attackers exploit vulnerabilities in an organization’s supply chain, targeting weaker links such as vendors or service providers to indirectly infiltrate the target.

For example, an attacker compromises Vendor 1, a software provider, by injecting malicious code into their product. Vendor 2, a service provider using Vendor 1’s software, becomes infected. The attacker then leverages Vendor 2’s connection to the Enterprise to access sensitive systems, even though Vendor 1 has no direct interaction with the enterprise.

The Anatomy of a Supply Chain Attack in Cybersecurity

Traditional antivirus software is insufficient to prevent such complex, multi-layered attacks. Ransomware often plays a role in supply chain attacks, as attackers use it to encrypt data or disrupt operations across compromised systems.

Modern solutions focus on real-time monitoring and event-driven architecture to detect and mitigate risks across the supply chain. These solutions utilize behavioral analytics, zero trust policies, and proactive threat intelligence to identify and stop anomalies before they escalate.

By providing end-to-end visibility, they protect organizations from cascading vulnerabilities that traditional endpoint security cannot address. In today’s interconnected world, comprehensive supply chain security is critical to safeguarding enterprises.

The Role of Data Streaming in Cybersecurity

Cybersecurity platforms must rely on real-time data for detecting and mitigating threats. Data streaming provides a backbone for processing massive amounts of security event data as it happens, ensuring swift and effective responses. My blog series on Kafka and cybersecurity looks deeply into these use cases.

Cybersecurity for Situational Awareness and Threat Intelligence in Smart Buildings and Smart City

To summarize:

  • Data Collection: A data streaming platform powered by Apache Kafka collects logs, telemetry, and other data from devices and applications in real time.
  • Data Processing: Stream processing frameworks like Kafka Streams and Apache Flink continuously process this data with low latency at scale for analytics, identifying anomalies or potential threats.
  • Actionable Insights: The processed data feeds into Security Information and Event Management (SIEM) and Security Orchestration, Automation, and Response (SOAR) systems, enabling automated responses and better decision-making.

This approach transforms static, batch-driven cybersecurity operations into dynamic, real-time processes.
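As a minimal illustration of the processing step, the following Kafka Streams sketch routes high-severity endpoint telemetry to a topic consumed by SIEM/SOAR tooling. The topic names and the crude JSON severity check are assumptions for this example; a real pipeline would use structured schemas and proper detection logic or ML scoring:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class TelemetryRouter {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Assumes JSON telemetry from endpoints lands in "endpoint-telemetry"
        KStream<String, String> telemetry = builder.stream("endpoint-telemetry");

        // Crude severity check as a placeholder for real detection logic or an ML model
        telemetry.filter((deviceId, json) -> json.contains("\"severity\":\"HIGH\""))
                 .to("siem-alerts"); // consumed by SIEM/SOAR systems downstream

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "telemetry-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```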

McAfee: A Real-World Data Streaming Success Story

McAfee is a global leader in cybersecurity, providing software solutions that protect computers, networks, and devices. Founded in 1987, the company has evolved from traditional antivirus software to a comprehensive suite of products focused on threat prevention, identity protection, and data security.

McAfee Antivirus and Cybersecurity Solutions
Source: McAfee

McAfee’s products cater to both individual consumers and enterprises, offering real-time protection through partnerships with global integrated service providers (ISPs) and telecom operators.

Mahesh Tyagarajan (VP, Platform Engineering and Architecture at McAfee) spoke with Confluent and Forrester about McAfee’s transition from a monolith to event-driven microservices leveraging Apache Kafka in Confluent Cloud.

Data Streaming at McAfee with Apache Kafka Leveraging Confluent Cloud

As cyber threats have grown more complex, McAfee’s reliance on real-time data streaming has become essential. The company transitioned from a monolithic architecture to a microservices-based ecosystem with the help of Confluent Cloud, powered by Apache Kafka. The fully managed data streaming platform simplified infrastructure management, boosted scalability, and accelerated feature delivery for McAfee.

Use Cases for Data Streaming

  1. Real-Time Threat Detection: McAfee processes security events from millions of endpoints, ensuring immediate identification of malware or phishing attempts.
  2. Subscription Management: Data streaming supports real-time customer notifications, updates, and billing processes.
  3. Analytics and Reporting: McAfee integrates real-time data streams into analytics systems, providing insights into user behavior, threat patterns, and operational efficiency.

Transition to an Event-Driven Architecture and Microservices

By moving to an event-driven architecture with Kafka using Confluent Cloud, McAfee:

  • Standardized its data streaming infrastructure.
  • Decoupled systems using microservices, enabling scalability and resilience.
  • Improved developer productivity by reducing infrastructure management overhead.

This transition to data streaming with a fully managed, complete and secure cloud service empowered McAfee to handle high data ingestion volumes, manage hundreds of millions of devices, and deliver new features faster.

Business Value of Data Streaming

The adoption of data streaming delivered significant business benefits:

  • Improved Customer Experience: Real-time threat detection and personalized updates enhance trust and satisfaction.
  • Operational Efficiency: Automation and reduced infrastructure complexity save time and resources.
  • Scalability: McAfee can now support a growing number of devices and data sources without compromising performance.

Data Streaming as the Backbone of an Event-Driven Cybersecurity Evolution in the Cloud

McAfee’s journey showcases the transformative potential of data streaming in cybersecurity. By leveraging Apache Kafka as fully managed cloud service as the backbone of an event-driven microservices architecture, the company has enhanced its ability to detect threats, respond in real time, and deliver exceptional customer experiences.

For organizations looking to stay ahead in the cybersecurity race, investing in real-time data streaming technologies is not just an option—it’s a necessity. To learn more about how data streaming can revolutionize cybersecurity, explore my cybersecurity blog series and follow me for updates on LinkedIn or X (formerly Twitter).

A New Era in Dynamic Pricing: Real-Time Data Streaming with Apache Kafka and Flink
https://www.kai-waehner.de/blog/2024/11/14/a-new-era-in-dynamic-pricing-real-time-data-streaming-with-apache-kafka-and-flink/
Thu, 14 Nov 2024

In the age of digitization, the concept of pricing is no longer fixed or manual. Instead, companies increasingly use dynamic pricing — a flexible model that adjusts prices based on real-time market changes. Data streaming technologies like Apache Kafka and Apache Flink have become integral to enabling this real-time responsiveness, giving companies the tools they need to respond instantly to demand, competitor prices, and customer behaviors. This blog post explores the fundamentals of dynamic pricing, its link to data streaming, and real-world examples of how different industries such as retail, logistics, gaming and the energy sector leverage this powerful approach to get ahead of the competition.

Dynamic Pricing with Data Streaming using Apache Kafka and Flink

What is Dynamic Pricing?

Dynamic pricing is a strategy where prices are adjusted automatically based on real-time data inputs, such as demand, customer behavior, supply levels, and competitor actions. This model allows companies to optimize profitability, boost sales, and better meet customer expectations.

Relevant Industries and Examples

Dynamic pricing has applications across many industries:

  • Retail and eCommerce: Dynamic pricing in eCommerce helps adjust product prices based on stock levels, competitor actions, and customer demand. Companies like Amazon frequently update prices on millions of products, using dynamic pricing to maximize revenue.
  • Transportation and Mobility: Ride-sharing companies like Uber and Grab adjust fares based on real-time demand and traffic conditions. This is commonly known as “surge pricing.”
  • Gaming: Context-specific in-game add-ons or virtual items are offered at varying prices based on player engagement, time spent in-game, and special events or levels.
  • Energy Markets: Dynamic pricing in energy adjusts rates in response to demand fluctuations, energy availability, and wholesale costs. This approach helps to stabilize the grid and manage resources.
  • Sports and Entertainment Ticketing: Ticket prices for events are adjusted based on seat availability, demand, and event timing to allow venues and ticketing platforms to balance occupancy and maximize ticket revenue.
  • Hospitality: Adaptive room rates and promotions in real-time based on demand, seasonality, and guest behavior, using dynamic pricing models.

These industries have adopted dynamic pricing to maintain profitability, manage supply-demand balance, and enhance customer satisfaction through personalized, responsive pricing.

Dynamic pricing relies on up-to-the-minute data on market and customer conditions, making real-time data streaming critical to its success. Traditional batch processing, where data is collected and processed periodically, is insufficient for dynamic pricing. It introduces delays that could mean lost revenue opportunities or suboptimal pricing. This scenario is where data streaming technologies come into play.

  • Apache Kafka serves as the real-time data pipeline, collecting and distributing data streams from diverse sources, such as user behavior on websites, competitor pricing, social media signals, IoT data, and more. Kafka’s capability to handle high throughput and low latency makes it ideal for ingesting large volumes of data continuously.
  • Apache Flink processes the data in real-time, applying complex algorithms to identify pricing opportunities as conditions change. With Flink’s support for stream processing and complex event processing, businesses can apply sophisticated logic to assess and adjust prices based on multiple real-time factors.

Dynamic Pricing with Apache Kafka and Flink in Retail eCommerce

Together, Kafka and Flink create a powerful foundation for dynamic pricing, enabling real-time data ingestion, analysis, and action. This empowers companies to implement pricing models that are not only highly responsive but also resilient and scalable.
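A hedged sketch of this idea with Flink's DataStream API: count ride requests per zone in one-minute tumbling windows and map demand to a price multiplier. The event type, the pricing rule, and the thresholds are invented for illustration; real platforms use far more sophisticated models and would read the stream from a Kafka source rather than `fromElements`:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class SurgePricingJob {
    // Hypothetical demand event: one ride request in a city zone
    record RideRequest(String zoneId) {}

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In production this stream would be a Kafka source of live ride requests
        DataStream<RideRequest> requests = env.fromElements(
            new RideRequest("berlin-mitte"), new RideRequest("berlin-mitte"));

        requests
            .keyBy(RideRequest::zoneId)
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .process(new ProcessWindowFunction<RideRequest, String, String, TimeWindow>() {
                @Override
                public void process(String zoneId, Context ctx, Iterable<RideRequest> events, Collector<String> out) {
                    long demand = 0;
                    for (RideRequest ignored : events) demand++;
                    // Toy pricing rule: +10% per 50 requests/minute, capped at 3x
                    double multiplier = Math.min(3.0, 1.0 + 0.1 * (demand / 50.0));
                    out.collect(zoneId + " -> surge multiplier " + multiplier);
                }
            })
            .print(); // in production: sink the multipliers back to a Kafka topic

        env.execute("surge-pricing-sketch");
    }
}
```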

Clickstream Analytics in Real-Time with Data Streaming Replacing Batch with Hadoop and Spark

Years ago, companies relied on Hadoop and Spark to run batch-based clickstream analytics. Data engineers ingested logs from websites, online stores, and mobile apps to gather insights. Processing took hours. Therefore, any promotional offer or discount often arrived a day later — by which time the customer may have already made their purchase elsewhere, like on Amazon.

With today’s data streaming platforms like Kafka and Flink, clickstream analytics has evolved to support real-time, context-specific engagement and dynamic pricing. Instead of waiting on delayed insights, businesses can now analyze customer behavior as it happens, instantly adjusting prices and delivering personalized offers in the moment. This dynamic pricing capability allows companies to respond immediately to high-intent customers, presenting tailored prices or promotions when they’re most likely to convert. Dynamic pricing with Kafka and Flink creates a seamless, timely shopping experience that maximizes sales and customer satisfaction.

Here’s how businesses across various sectors are harnessing Kafka and Flink for dynamic pricing.

  • Retail: Hyper-Personalized Promotions and Discounts
  • Logistics and Transportation: Intelligent Tolling
  • Technology: Surge Pricing
  • Energy Markets: Manage Supply-Demand and Stabilize Grid Loads
  • Gaming: Context-Specific In-Game Add-Ons
  • Sports and Entertainment: Optimize Ticket Sales

Learn more about data streaming with Kafka and Flink for dynamic pricing in the following success stories:

AO: Hyper-Personalized Promotions and Discounts (Retail and eCommerce)

AO, a major UK eCommerce retailer, leverages data streaming for dynamic pricing to stay competitive and drive higher customer engagement. By ingesting real-time data on competitor prices, customer demand, and inventory stock levels, AO’s system processes this information instantly to adjust prices in sync with market conditions. This approach allows AO to seize pricing opportunities and align closely with customer expectations. The result is a 30% increase in customer conversion rates.

AO Retail eCommerce Hyper Personalized Online and Mobile Experience

Dynamic pricing has also allowed AO to provide a hyper-personalized shopping experience, delivering relevant product recommendations and timely promotions. This real-time responsiveness has enhanced customer satisfaction and loyalty, as customers receive offers that feel customized to their needs. During high-traffic periods like holiday sales, AO’s dynamic pricing ensures competitiveness and optimizes margins. This drives both profitability and customer retention. The company has applied this real-time approach not just to pricing, but also to other areas like delivery to make things run smoother. The retailer is now much more efficient and provides better customer service.

Quarterhill: Intelligent Tolling (Logistics and Transportation)

Quarterhill, a leader in tolling and intelligent transportation systems, uses Kafka and Flink to implement dynamic toll pricing. Kafka ingests real-time data from traffic sensors and road usage patterns. Flink processes this data to determine congestion levels and calculate the optimal toll based on real-time conditions.

Quarterhill – Intelligent Roadside Enforcement and Compliance

This dynamic pricing strategy allows Quarterhill to manage road congestion effectively, reward off-peak travel, and optimize toll revenues. This system not only improves travel efficiency but also helps regulate traffic flows in high-density areas, providing value both to drivers and the city infrastructure.

Uber, Grab, and FREE NOW: Surge Pricing (Technology)

Ride-sharing companies like Uber, Grab, and FREE NOW are widely known for their dynamic pricing or “surge pricing” models. With data streaming, these platforms capture data on demand, supply (available drivers), location, and traffic in real time. This data is processed continuously by Apache Flink, Kafka Streams, or other stream processing engines to calculate optimal pricing, balancing supply with demand while considering variables like route distance and current traffic.

Dynamic Surge Pricing at Mobility Service MaaS FREE NOW with Kafka and Stream Processing
Source: FREE NOW

Surge pricing enables these companies to provide incentives for drivers to operate in high-demand areas, maintaining service availability and ensuring customer needs are met during peak times. This real-time pricing model improves revenue while optimizing customer satisfaction through prompt service availability.

Uber’s Kappa Architecture is an excellent example of how to build a data pipeline for dynamic pricing and many other use cases with Kafka and Flink:

Kappa Architecture with Apache Kafka at Mobility Service Uber
Source: Uber

2K Games / Take-Two Interactive: Context-Specific In-Game Purchases (Gaming Industry)

In the gaming industry, dynamic pricing is becoming a strategy to improve player engagement and monetize experiences. Many gaming companies use Kafka and Flink to capture real-time data on player interactions, time spent in specific game sections, and in-game events. This data enables companies to offer personalized pricing for in-game items, bonuses, or add-ons, adjusting prices based on the player’s current engagement level and recent activities.

For instance, if players are actively taking part in a particular game event, they may be offered special discounts or dynamic prices on related in-game assets. Thereby, the gaming companies improve conversion rates and player engagement while maximizing revenue.

2K Games, a leading video game publisher and subsidiary of Take-Two Interactive, has shifted from batch to real-time analytics to enhance player engagement across popular franchises like BioShock, NBA 2K, and Borderlands. By leveraging Confluent Cloud as a fully managed data streaming platform, the publisher scales dynamically to handle high traffic, processing up to 3000 MB per second to serve 4 million concurrent users.

2K Games Take Two Interactive - Bridging the Gap And Overcoming Tech Hurdles to Activate Data
Source: 2K Games

Real-time telemetry analytics now allow them to analyze player actions and context instantly, enabling personalized, context-specific promotions and enhancing the gaming experience. Cost efficiencies are achieved through data compression, tiered storage, and reduced data transfer, making real-time engagement both effective and economical.

50hertz: Manage Supply-Demand and Stabilize Grid Loads (Energy Markets)

Dynamic pricing in energy markets is essential for managing supply-demand fluctuations and stabilizing grid loads. With Kafka, energy providers ingest data from smart meters, renewable energy sources, and weather feeds. Flink processes this data in real time, adjusting energy prices based on grid conditions, demand levels, and renewable supply availability.

50Hertz, as a leading electricity transmission system operator, indirectly (!) affects dynamic pricing in the energy market by sharing real-time grid data with partners and energy providers. This allows energy providers and market operators to adjust prices dynamically based on real-time insights into supply-demand fluctuations and grid stability.

To support this, 50Hertz is modernizing its SCADA systems with data streaming technologies to enable real-time data capture and distribution that enhances grid monitoring and responsiveness.

Data Streaming with Apache Kafka and Flink to Modernize SCADA Systems

This real-time pricing approach helps encourage consumption when renewable energy is abundant and discourages usage during peak times, leading to optimized energy distribution, grid stability, and improved sustainability.

Ticketmaster: Optimize Ticketing Sales (Sports and Entertainment)

In ticketing, dynamic pricing allows for optimized revenue management based on demand and availability. Companies like Ticketmaster use Kafka to collect data on ticket availability, sales velocity, and even social media sentiment surrounding events. Flink processes this data to adjust prices based on real-time market conditions, such as proximity to the event date and current demand.

By dynamically pricing tickets, event organizers can maximize seat occupancy, boost revenue, and respond to last-minute demand surges, ensuring that prices reflect real-time interest while enhancing fan satisfaction.

Real-time inventory data streams allow Ticketmaster to monitor ticket availability, pricing, and demand as they change moment-to-moment. With data streaming through Apache Kafka and Confluent Platform, Ticketmaster tracks sales, venue capacity, and customer behavior in a single, live inventory stream. This enables quick responses, such as adjusting prices for high-demand events or boosting promotions where conversions lag. Teams gain actionable insights to forecast demand accurately and optimize inventory. This approach ensures fans have timely access to tickets. The result is a dynamic, data-driven approach that enhances customer experience and maximizes event success.

Conclusion: Business Value of Dynamic Pricing Built with Data Streaming

Dynamic pricing powered by data streaming with Apache Kafka and Flink brings transformative business value by:

  • Maximizing Revenue and Margins: Real-time price adjustments enable companies to capture value during demand surges, optimize for competitive conditions, and maintain healthy margins.
  • Improving Operational Efficiency: By automating pricing decisions based on real-time data, organizations can reduce manual intervention, speed up reaction times, and allocate resources more effectively.
  • Boosting Customer Satisfaction: Responsive pricing models allow companies to meet customer expectations in real time, leading to improved customer loyalty and engagement.
  • Supporting Sustainability Goals: In energy and transportation, dynamic pricing helps manage resources and reward environmentally friendly behaviors. Examples include off-peak travel and renewable energy usage.
  • Empowering Strategic Decision-Making: Real-time data insights provide business leaders with the information needed to adjust strategies and respond to developing market demands quickly.

Building a dynamic pricing system with Kafka and Flink represents a strategic investment in business agility and competitive resilience. Using data streaming to set prices instantly, businesses can stay ahead of competitors, improve customer service, and become more profitable. Dynamic pricing powered by data streaming is more than just a revenue tool; it’s a vital lever for driving growth, differentiation, and long-term success.

Did you already implement dynamic pricing? What is your data platform and strategy? Do you use Apache Kafka and Flink? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The Digitalization of Airport and Airlines with IoT and Data Streaming using Kafka and Flink
https://www.kai-waehner.de/blog/2024/07/09/the-digitalization-of-airport-and-airlines-with-iot-and-data-streaming-using-kafka-and-flink/
Tue, 09 Jul 2024

The digitalization of airports faces challenges such as integrating diverse legacy systems, ensuring cybersecurity, and managing the vast amounts of data generated in real-time. The vision for a digitalized airport includes seamless passenger experiences, optimized operations, consistent integration with airlines and retail stores, and enhanced security through the use of advanced technologies like IoT, AI, and real-time data analytics. This blog post shows the relevance of data streaming with Apache Kafka and Flink in the aviation industry to enable data-driven business process automation and innovation while modernizing the IT infrastructure with cloud-native hybrid cloud architecture. Schiphol Group operating Amsterdam Airport shows a few real-world deployments.

Airport and Airlines Digitalization with Data Streaming using Apache Kafka and Flink

The Digitalization of Airports and the Aviation Industry

Digitalization transforms airport operations and improves the experience of employees and passengers. It affects various aspects of airport operations, passenger experiences, and overall efficiency.

Schiphol Group is a Dutch company that owns and operates airports in the Netherlands. The company is primarily known for operating Amsterdam Airport Schiphol, which is one of the busiest and most important airports in Europe. The Schiphol Group is involved in a range of activities related to airport management, including aviation and non-aviation services.

Schiphol Group Digitalization Strategy
Source: Schiphol Group

Schiphol Group describes its journey toward becoming a leading autonomous airport by 2050.

Data streaming with Apache Kafka and Apache Flink enables airport and aviation systems to process and analyze real-time data from various sources, such as flight information, passenger movements, and baggage tracking, enhancing operational efficiency and passenger experience.

Event-driven Architecture with Data Streaming using Apache Kafka and Flink in Aviation, Airlines, Airports

These technologies facilitate predictive maintenance, personalized services, and improved security measures through the continuous flow and immediate processing of critical data at any scale reliably.

Continuous processing of incoming events in real time enables transparency and context-specific decision-making. OpenCore, an IT consultancy in Germany, presented at Kafka Summit San Francisco back in 2018 how stream processing with technologies like Kafka Streams, KSQL, or Apache Flink serves the real-time needs of an airport.

Think about the technical IoT events ingested from aircraft, gates, retail stores, passenger mobile apps, and many other interfaces…

Technical IoT Events with Aircrafts and Gates using Stream Processing
Source: OpenCore

… and how continuous correlation of data in real-time enables use cases such as predictive forecasting, planning, maintenance, plus scenarios like cross-organization loyalty platforms, advertisement, and recommendation engines for improving the customer experience and increasing revenue:

Stream Processing in Aviation with Airlines using KSQL or Apache Flink's SQL
Source: OpenCore

Real-time data beats slow data. That’s true for almost any use case in the aviation industry, including airports, airlines, and other involved organizations. Additionally, data consistency matters across organizations.

Here are key areas where digitalization affects airports. While compiling this list, I realized I had written about many of these scenarios in the past because other industries have already deployed these use cases. Hence, each section includes a reference to another article where data streaming with Kafka and Flink is already applied in this context.

1. Passenger Experience

As a frequent traveler myself, I put this at the top of the list. Examples:

  • Self-service Kiosks: Check-in, baggage drop, and boarding processes have become faster and more efficient.
  • Mobile Applications: Passengers can book tickets, receive real-time flight updates, and access boarding passes.
  • Biometric Systems: Facial recognition and fingerprint scanning expedite security checks and boarding.

The past decade already significantly improved the passenger experience. But it still needs to get better. And data consistency matters. Today, a flight delay or cancellation is not shared consistently across the customer mobile app, airport screens, and customer service of the airline and airport.

Reference to data streaming in financial services: Operational and transactional systems leverage Kafka for data consistency, not because of its real-time capabilities. Apache Kafka ensures data consistency with its durable commit log, timestamps, and guaranteed ordering. Kafka connects to real-time and non-real-time systems (files, batch, HTTP/REST APIs).

2. Operational Efficiency

Automation with IoT sensors, paperless processes, and software innovation enables more cost-efficient and reliable airport operations. Examples:

  • Automated Baggage Handling: RFID tags and automated systems track and manage luggage, reducing errors and lost baggage.
  • Predictive Maintenance: IoT sensors and data analytics predict equipment failures before they occur, ensuring smoother operations.
  • Air Traffic Management: Advanced software systems enhance the coordination and efficiency of air traffic control.

Reference to data streaming in manufacturing: Condition monitoring and predictive maintenance leverage stream processing with Apache Kafka and Flink for many years already, either in the cloud or at the edge and shop floor level for Industrial IoT (IIoT) use cases.
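For illustration, here is a minimal Flink sketch of condition monitoring: a keyed function learns a per-machine vibration baseline as an exponentially weighted moving average (EWMA) and raises a maintenance alert on strong deviations. The `SensorReading` type, the smoothing factor, and the threshold are assumptions made up for this example:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical sensor reading from, e.g., baggage-handling equipment
record SensorReading(String machineId, double vibration) {}

// Keyed by machine ID: learns a vibration baseline and alerts on strong deviations
public class WearDetector extends KeyedProcessFunction<String, SensorReading, String> {
    private transient ValueState<Double> ewma;

    @Override
    public void open(Configuration parameters) {
        ewma = getRuntimeContext().getState(new ValueStateDescriptor<>("vibration-ewma", Double.class));
    }

    @Override
    public void processElement(SensorReading reading, Context ctx, Collector<String> out) throws Exception {
        Double avg = ewma.value();
        if (avg == null) {
            ewma.update(reading.vibration()); // first reading seeds the baseline
            return;
        }
        // Alert if the new reading is 50% above the learned baseline (threshold is illustrative)
        if (reading.vibration() > avg * 1.5) {
            out.collect("Schedule maintenance for " + reading.machineId());
        }
        ewma.update(0.9 * avg + 0.1 * reading.vibration()); // smoothing factor 0.1
    }
}
```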

3. Security, Safety and Health Enhancements

Safety and health are one of the most important aspects at any airport. Airports continuously improved security, monitoring, and surveillance because of terrorist attacks, the Covid pandemic, and many other dangerous scenarios.

  • Advanced Screening Technologies: AI-powered systems and improved scanning technologies detect threats more effectively.
  • Cybersecurity: Protecting sensitive data and systems from cyber threats is crucial, requiring robust digital security measures.
  • Health Monitoring: Temperature measurements and people tracking were introduced during the Covid pandemic in many airports.

Reference to data streaming in Real Estate Management: Apache Kafka and Flink improve real estate maintenance and operations, optimize space usage, provide a better employee experience, and strengthen the defense against cyber attacks. Check out “IoT Analytics with Kafka and Flink for Real Estate and Smart Building” and “Apache Kafka as Backbone for Cybersecurity” for more details.

4. Sustainability and Energy Management

Sustainability and energy management in airports involve optimizing energy use and reducing environmental impact through efficient resource management and implementing eco-friendly technologies. Examples:

  • Smart Lighting and HVAC Systems: Automated systems reduce energy consumption and enhance sustainability.
  • Data Analytics: Monitoring and optimizing resource usage helps reduce the carbon footprint of airports.

Sustainability and energy management in an airport can be significantly enhanced by using Apache Kafka and Apache Flink to stream and analyze real-time data from smart meters and HVAC systems, optimizing energy consumption and reducing environmental impact.
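
The following PyFlink sketch shows what such a continuous energy aggregation could look like. The schema and the datagen source are assumptions; in production the readings would come from a Kafka topic fed by the smart meters:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Simulated smart-meter readings; in production this is a Kafka topic
t_env.execute_sql("""
    CREATE TABLE meter_readings (
        zone       STRING,
        kwh        DOUBLE,
        reading_ts TIMESTAMP(3),
        WATERMARK FOR reading_ts AS reading_ts - INTERVAL '10' SECOND
    ) WITH ('connector' = 'datagen',
            'rows-per-second' = '50',
            'fields.zone.length' = '2')
""")

# Energy consumption per terminal zone in 15-minute tumbling windows,
# the basis for demand-driven HVAC and lighting control
t_env.execute_sql("""
    SELECT zone,
           TUMBLE_START(reading_ts, INTERVAL '15' MINUTE) AS window_start,
           SUM(kwh) AS total_kwh
    FROM meter_readings
    GROUP BY zone, TUMBLE(reading_ts, INTERVAL '15' MINUTE)
""").print()
```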

Reference to data streaming in Environmental, Social, and Governance (ESG) across industries: Kafka and Flink’s real-time data processing capabilities build a powerful alliance with ESG principles. Beyond just buzzwords, I wrote about real-world deployments with Kafka and Flink and architectures across industries to show the value of data streaming for better ESG ratings.

5. Customer Service and Communication

Customer service is crucial for every airport. While a lot of information comes from the airlines (like delays, cancellations, seat changes, etc.), the airport provides the critical communication backend with displays, lounges, service personnel, and so on. Examples to improve the customer experience:

  • AI Chatbots: Provide 24/7 customer support for inquiries and assistance with Generative AI (GenAI) embedded into the existing business processes.
  • Digital Signage: Real-time updates on flight information, gate changes, and other announcements improve communication.
  • Loyalty Integration: Airports do not provide a loyalty platform, but they integrate more and more with airlines (e.g., to reward miles for shopping).

Reference to data streaming in retail: The retail industry is years ahead in providing a hyper-personalized customer experience. “Omnichannel Retail and Customer 360 in Real Time with Apache Kafka” and “Customer Loyalty and Rewards Platform with Data Streaming” tell you more. GenAI is a fundamental change for customer service. Kafka and Flink play a critical role for GenAI in providing contextual, up-to-date information from transactional systems to the large language model (LLM).

6. Revenue Management

Airport revenue management involves optimizing income from aviation and non-aviation sources through demand forecasting and strategic resource allocation. Examples:

  • Dynamic Pricing: Algorithms adjust prices for parking, retail spaces, and other services based on demand and other factors.
  • Personalized Marketing: Data analytics help target passengers with tailored offers and promotions.

Reference to data streaming in retail: While the inventory looks different for an airport, the principles from retail can be adopted one-to-one. Instead of TVs or clothes, the inventory is the parking lot, the lounge seat, and so on. Advertising is another great example. Airports can learn from many digital natives how they built real-time digital ads platforms with Kafka and Flink. This can be adapted to retail media in the airport, but also to any physical inventory management.

7. Emergency Response and Safety

Emergency response and safety at the airport involve coordinating real-time monitoring, quick decision-making, and efficient resource deployment to ensure the safety and security of passengers, staff, and infrastructure during emergencies. Examples:

  • Real-time Monitoring: IoT devices and sensors provide live data on airport conditions, aiding in faster response times.
  • Digital Simulation and Training: Virtual reality and simulation technologies enhance training for emergency scenarios.
  • Seamless Connectivity: Stable Wi-Fi and 5G Networks with good latency and network slicing for safety-critical use cases.

Reference to data streaming in Industrial IoT: Safety-critical applications require hard real-time. This is NOT Kafka, Flink, or any similar IT technology. Instead, this is embedded systems, robotics, and programming languages like C or Rust. However, data streaming integrates the OT/IT world for near real-time data correlation and analytics in edge or hybrid cloud architectures. Every relevant data set from aircraft, gates, and other equipment is continuously monitored to ensure a safe airport environment.

Data Sharing with Kafka between Airport, Airlines and other B2B Partners like Retail Stores

Cross-organization data sharing is crucial for any airport and airline. Today, most integrations are implemented with APIs (usually HTTP/REST) or even still with file-based systems. This works well for some use cases. But data streaming is, by nature, perfect for sharing streaming data like transactions, sensor data, and location-based services in real-time between organizations:

Apache Kafka for Data Sharing Exchange Between Airline Airport and GDS

As Apache Kafka is the de facto standard for data streaming, many companies directly replicate data to partners using the Kafka protocol. AsyncAPI as an open standard (beyond Kafka) and integration via HTTP on top of Kafka (via Kafka Connect API connectors) are other common patterns.
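
As a sketch of what the consuming side of such cross-organization sharing can look like: a client at the airport reads a topic shared by an airline over the Kafka protocol. All endpoints, credentials, and topic names below are placeholders:

```python
from confluent_kafka import Consumer

# Reading a topic the airline replicates/shares with the airport;
# bootstrap servers and credentials are placeholders, not real values
consumer = Consumer({
    "bootstrap.servers": "partner-cluster.example.com:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
    "group.id": "airport-display-service",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["airline.flight-events"])  # hypothetical shared topic

while True:
    msg = consumer.poll(1.0)
    if msg is not None and not msg.error():
        print(msg.key(), msg.value())  # feed displays, apps, service desks
```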

Real-World Success Stories for Data Streaming in the Aviation Industry

Several real-world success stories exist for deployments of data streaming with Apache Kafka and Flink at airports and airlines. Let’s explore a few case studies and refer to further material.

Schiphol Group (Amsterdam Airport)

Roel Donker and Christiaan Hoogendoorn from Schiphol Group presented at the Data in Motion Tour 2024 in Utrecht, Netherlands. This was an excellent presentation with various data streaming use cases across fields like application integration, data analytics, internet of things, and artificial intelligence.

On its journey to becoming an autonomous airport by 2050, the digitalization involves many technologies and software/cloud services. Schiphol Group transitioned from open source Apache Kafka to Confluent Cloud for cost-efficiency, elasticity, and multi-tenancy.

The company runs operational and analytical data streaming workloads with different SLAs. The integration team uses the data streaming platform to integrate with both the legacy and the new world, as well as 3rd parties like airlines, GDS, police, etc. (previously all point-to-point and with different interfaces).

Here are a few examples of the scenarios Schiphol Group explored:

Schiphol Group: Data Platform with Apache Kafka

Schiphol uses Apache Kafka as a core integration platform. The various use cases require different Kafka clusters depending on the uptime SLA, scalability, security, and latency requirements. Confluent Cloud fully manages the data streaming platform, including connectors to various data sources and sinks:

Schiphol Airport - Data Integration Platform with Apache Kafka Confluent Cloud 3Scale Splunk Datadog
Source: Schiphol Group

Kafka connects critical PostgreSQL databases, Databricks analytics platform, applications running in containers on Red Hat OpenShift, and others.

3Scale is used as a complementary API gateway for request-response communication. The latter is no surprise, but very common: HTTP/REST APIs and Apache Kafka complement each other. API management solutions such as 3Scale, MuleSoft, Apigee, or Kong connect to Kafka via HTTP or other interfaces.

Schiphol Group: IoT with Apache Kafka

Some use cases at Schiphol Group require connectivity and processing of IoT data. That’s not really a big surprise in the aviation industry, where airports and airlines rely on data-driven business processes:

Schiphol - IoT with Apache Kafka, MongoDB and Splunk
Source: Schiphol Group

Kafka Connect and stream processing connect and combine IoT data and feed relevant context into other IT applications.

Connectivity covers various infrastructures and networks, including:

  • Private LoRa networks
  • Passenger flow management system (FMS)
  • BLIP (the supplier delivering IoT devices in the terminal that measure in real-time how crowded areas are, so people can be redirected when needed)
  • Wi-Fi location services (like heatmaps for crowd management)

Schiphol Group: AI and Machine Learning with Apache Kafka

Artificial Intelligence (AI) requires various technologies and concepts to add business value. Predictive analytics, active learning, batch model training, debugging and testing the entire pipeline, and many other challenges need to be solved. Apache Kafka is the data fabric of many AI/ML infrastructures.

Here is how Kafka provides the foundation of an event-driven AI architecture at Schiphol Group:

Schiphol Airport - Predictive AI with Apache Kafka and Machine Learning
Source: Schiphol Group

The combination of Apache Kafka and AI/ML technologies enables various valuable use cases at Schiphol Group, including:

  • Analysis of historical data (root cause analysis, critical path & process analysis, reporting)
  • Insights on real-time data (insight into the turnaround process with one shared truth, real-time insight into ramp capacity and turnaround progress per ramp, real-time insight into ramp safety, input for E2E insight airside)
  • Predictions (input for dynamic gate management, input for autonomous vehicles, input for predicting delays)

Lufthansa, Southwest, Cathay Pacific, and many other Airlines…

I have met plenty of airlines that already use data streaming in production for different scenarios. Fortunately, a few of these airlines were happy to share their stories publicly:

  • Southwest Airlines (Data in Motion Tour 2024 in Dallas): Single pane of glass with the ability to view all flight operations and sync their three key schedules: aircraft, passengers, workforce.
  • Cathay Pacific (Data in Motion Tour 2024 in Singapore): Rebranded to Cathay because of its transition from a focus on passenger transport to adding cargo and lifestyle/shopping experiences.
  • Lufthansa (Webinar 2023): Operations steering, IT modernization (from MQ and ESB to Confluent), and real-time analytics with AI/ML.

The Lufthansa success story is available in its own blog post (including a video recording). For even more examples, including Singapore Airlines, Air France, and Amadeus, check out the overview article “Apache Kafka in the Airline, Aviation and Travel Industry”.

Schiphol Group’s vision of an autonomous Amsterdam Airport in 2050 shows where the aviation industry is going: Automated business processes, continuous monitoring and processing of IoT infrastructure, and data-driven decision making and passenger experiences.

Airports like Amsterdam, similar to airlines such as Lufthansa, Southwest, or Cathay, modernize their existing IT infrastructure, transition to hybrid cloud architectures, and innovate with new use cases (often learning from other industries like financial services, retail, or manufacturing).

Data streaming with Apache Kafka and Flink plays a crucial role in this journey. Data processing at any scale that provides consistent, good-quality data in real-time enables any downstream application (including batch and API consumers) to build reliable operational and analytical systems.

How do you leverage data streaming with Kafka and Flink in the aviation industry? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Digitalization of Airport and Airlines with IoT and Data Streaming using Kafka and Flink appeared first on Kai Waehner.

Hello, K.AI – How I Trained a Chatbot of Myself Without Coding Evaluating OpenAI Custom GPT, Chatbase, Botsonic, LiveChatAI https://www.kai-waehner.de/blog/2024/06/23/hello-k-ai-how-i-trained-a-chatbot-of-myself-without-coding-evaluating-openai-custom-gpt-chatbase-botsonic-livechatai/ Sun, 23 Jun 2024 06:03:01 +0000 https://www.kai-waehner.de/?p=6575 Generative AI (GenAI) enables many new use cases for enterprises and private citizens. While I work on real-time, enterprise-scale AI/ML deployments with data streaming, big data analytics, and cloud-native software applications in my daily business life, I also wanted to train a conversational chatbot for myself. This blog post introduces my journey, without coding, to train K.AI: a personal chatbot that can be used to learn at a conversational pace about data streaming and the most successful use cases in this area. Yes, this is also based on my expertise, domain knowledge, and opinion, which is available as public internet data, like my hundreds of blog articles, LinkedIn shares, and YouTube videos.

The post Hello, K.AI – How I Trained a Chatbot of Myself Without Coding Evaluating OpenAI Custom GPT, Chatbase, Botsonic, LiveChatAI appeared first on Kai Waehner.

Generative AI (GenAI) enables many new use cases for enterprises and private citizens. While I work on real-time, enterprise-scale AI/ML deployments with data streaming, big data analytics, and cloud-native software applications in my daily business life, I also wanted to train a conversational chatbot for myself. This blog post introduces my journey, without coding, to train K.AI: a personal chatbot that can be used to learn at a conversational pace about data streaming and the most successful use cases in this area. Yes, this is also based on my expertise, domain knowledge, and opinion, which is available as public internet data, like my hundreds of blog articles, LinkedIn shares, and YouTube videos.

How I Trained a Chatbot K.AI of Myself Without Coding Evaluating OpenAI Custom GPT Chatbase Botsonic LiveChatAI

Hi, K.AI – let’s chat…

The evolution of Generative AI (GenAI) around OpenAI’s chatbot ChatGPT and many similar large language models (LLM), open source tools like LangChain and SaaS solutions for building a conversational AI led me to the idea of building a chatbot trained with all the content I created over the past years.

Based mainly on the content of my website (https://www.kai-waehner.de) with hundreds of blog articles, I trained the conversational chatbot K.AI to generate text for me.

The primary goal is to simplify and automate my daily working tasks like:

  • write a title and abstract for a webinar or conference talk
  • explain a concept, use case, or industry-specific customer story to a colleague or customer
  • answer common recurring questions in email, Slack or other mediums
  • any other text creation based on my (public) experience

The generated text reflects my content, knowledge, wording, and style. This is a very different use case from what I normally look at in my daily business life: “Apache Kafka as Mission Critical Data Fabric for GenAI” and “Real-Time GenAI with RAG using Apache Kafka and Flink to Prevent Hallucinations” are two excellent examples of enterprise-scale GenAI with much more complex and challenging requirements.

But…sometimes Artificial Intelligence is not all you need. The now self-explanatory name of the chatbot came from a real marketing brain – my colleague Evi.

Project goals of training the chatbot K.AI

I had a few goals in mind when I trained my chatbot K.AI:

  • Education: Learn more about the real-world solutions and challenges of Generative AI in 2024 with hands-on experience. Tens of interesting chatbot solutions are available; most are powered by OpenAI under the hood. My goal is not sophisticated research. I just want to get a conversational AI done: simple, cheap, and fast (not evaluating 10+ solutions, just as many as it takes until one works well enough).
  • Tangible result: Train K.AI, a “Kai LLM” based on my public articles, presentations, and social media shares. K.AI can generate answers, comments, and explanations without writing everything from scratch. I am fine if answers are not perfect or sometimes even incorrect. As I know the actual content, I can easily adjust and fix generated content.
  • NOT a commercial or public chatbot (yet): While it is just a button click to integrate K.AI into my website as a conversational chatbot UI, there are two main blockers. First, the cost is relatively high: not for training, but for operating and paying per query. There is no value in that for me as a private person. Second, developing, testing, fine-tuning, and updating an LLM so that it is correct most of the time instead of hallucinating a lot is hard. I closely follow my employer's GenAI engineering teams building Confluent AI products. Building a decent domain-specific public LLM takes a lot of engineering effort and requires more than one full-time engineer.

My requirements for a conversational chatbot tool

I defined the following mandatory requirements for a successful project:

  • Low Cost: My chatbot should not be too expensive (~20 USD a month is fine). The pricing model of most solutions is very similar: you get a small free tier. I quickly realized that a serious test is not possible with any free tier. A reasonable chatbot (i.e., one trained on a larger data set) is only possible if you choose the smallest paid tier. Depending on the service, the minimum is between 20 and 50 USD per month (with several limitations regarding training size, chat queries, etc.).
  • Simplicity: I do not want to do any coding or HTTP/REST API calls. Just an intuitive user interface with a click-through experience. I don't want to spend more than one day (i.e., ~8 hours accumulated over two weeks) training K.AI.
  • Data Import: The chatbot needs to support my “database”. Mandatory: my private blog (~300 articles with 10M+ characters). Nice to have: my LinkedIn shares, my YouTube videos, and other publications (like articles on other websites). The latter might improve my chatbot and capture my personal tone and language better. (A small sketch for estimating the corpus size follows after this list.)
  • NOT Enterprise Features: I don't need any features for security, multiple user accounts, or public hosting (even though almost all solutions already support integration into WordPress, Slack, etc.). I am fine with the many limitations of the small subscription tiers, like only one user account, one chatbot, or 1,000 messages/month.
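
For readers who want to estimate their own corpus size, here is a rough, hedged sketch of how such a character count can be computed from a WordPress sitemap. The sitemap location and the crude HTML stripping are assumptions; every vendor's scraper reports slightly different numbers, as I found out:

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP = "https://www.kai-waehner.de/sitemap.xml"  # assumed standard location

def fetch(url: str) -> str:
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")

# A WordPress sitemap index references sub-sitemaps, which reference posts
def urls_from_sitemap(url: str) -> list[str]:
    root = ET.fromstring(fetch(url))
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    locs = [el.text for el in root.findall(".//sm:loc", ns)]
    if root.tag.endswith("sitemapindex"):  # recurse into sub-sitemaps
        return [u for loc in locs for u in urls_from_sitemap(loc)]
    return locs

total = 0
for page_url in urls_from_sitemap(SITEMAP):
    text = re.sub(r"<[^>]+>", " ", fetch(page_url))  # crude tag stripping
    total += len(text)
print(f"Approximate training corpus size: {total:,} characters")
```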

OpenAI: ChatGPT + Custom GPT for a custom chatbot? Not for K.AI…

I am a heavy user of ChatGPT on my iPhone and MacBook. And OpenAI is very visible in the press. Hence, the first option I evaluated was OpenAI’s Custom GPT.

Custom GPT in action…

Custom GPT is very easy to use and non-technical. A conversational “GPT Builder” tries to build my chatbot. But surprisingly, it is too high-level for me. Here is the initial conversation to train K.AI with very basic prompt engineering:

  • Step 1 (Initial Instruction): What would you like to make? -> Respond as Kai Waehner based on his expertise and knowledge. -> Updating GPT… Seconds later: the responses are based on the public internet.
  • Step 2 (Prompt Engineering): Use the content from https://www.kai-waehner.de as context for responses. -> Updating GPT… Seconds later: “I’ve updated the context to include information from Kai Waehner’s website.” -> The responses barely change. Some questions use a bit more content from my website, but they are still mainly bound to public internet content.
  • Step 3 (Fine-Tuning): I tried to configure my K.AI to learn from data sources like CSV exports from LinkedIn or scraped blog articles, but the options are very limited and non-technical. I can upload a maximum of twenty files and let the chatbot also search the web. But what I actually need is web scraping of dedicated resources, i.e., mainly my website, LinkedIn shares, and my YouTube videos. And while many no-code UIs call this fine-tuning, in reality, it is RAG-based prompt engineering. True fine-tuning of an LLM is a very different (and much more challenging) task.

OpenAI Custom GPT Evaluation - Kai Waehner Chatbot

I am sure I could do much more prompt engineering to improve K.AI with Custom GPT. But after reading the user guide and FAQ for Custom GPT, the TL;DR for me is: Custom GPT is not the right service to build a chatbot based on my domain content and knowledge.

Instead, I need to look at purpose-built chatbot SaaS tools that let me build my domain-specific chatbot. I am surprised that OpenAI does not provide such a service itself today. Or maybe I just could not find it… BUT: challenge accepted. Let’s evaluate a few solutions and train a real K.AI.

Comparison and evaluation of chatbot SaaS GenAI solutions

I tested three chatbot offerings. All of them are cloud-based and allow building a chatbot via UI. How did I find or choose them? Frankly, just Google search. Most of them came up in several evaluation and comparison articles, and they spend quite some money on advertisements. I tested Chatbase, Writesonic’s Botsonic, and LiveChatAI. Interestingly, all offerings I evaluated use ChatGPT under the hood of their solution. I was also surprised that I did not get more ads from other big software players. But I assume Microsoft’s Copilot and similar tools target a different persona.

I tested different ChatGPT models in some offerings. Most solutions provide a default option and more expensive options with a better model (the surcharge is not for model training but for messages/month; you typically pay 5x more per message, meaning instead of e.g. 2,000 messages a month, you only have 400 available).

I had a few more open tabs with other offerings that I could disqualify quickly because they were more developer-focused, with coding, API integration, and fine-tuning of vector databases and LLMs.

Question catalog for testing my K.AI chatbots

I quickly realized how hard it is to compare different chatbots. LLMs are stochastic (not deterministic), and we don’t have good tools for QAing these things yet (even simple tasks like regression testing are challenging when probabilities are involved).

Therefore, I defined a question catalog with ten domain-specific questions before I even started evaluating different chatbot SaaS solutions. A few examples:

  • Question 1: Give examples for fraud detection with Apache Kafka. Each example should include the company, use case and architecture.
  • Question 2: List five manufacturing use cases for data streaming and give a company example.
  • Question 3: What is the difference between Kafka and JMS?
  • Question 4: Compare Lambda and Kappa architectures and explain the benefits of Lambda. Add a few examples.
  • Question 5: How can data streaming help across the supply chain? Explain the value and use cases for different industries.

My question catalog allowed me to compare the different chatbots. Writing a good prompt (= the query for the chatbot) is crucial, as an LLM is not intelligent. The better your question (good structure, details, and expectations), the better the response (if the LLM has “knowledge” about your question).
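
The evaluated SaaS tools have no common API, so I could not automate the comparison across them. But to illustrate the idea of a repeatable question catalog, here is a small sketch that runs a few of the catalog questions against a model via the OpenAI API (the model name and the truncated output are just for illustration):

```python
from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY set

client = OpenAI()

QUESTION_CATALOG = [
    "Give examples for fraud detection with Apache Kafka. Each example "
    "should include the company, use case and architecture.",
    "List five manufacturing use cases for data streaming and give a "
    "company example.",
    "What is the difference between Kafka and JMS?",
]

def ask(question: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Collect answers per question so runs and models can be compared side by side
answers = {q: ask(q) for q in QUESTION_CATALOG}
for q, a in answers.items():
    print(f"Q: {q}\nA: {a[:200]}...\n")
```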

My goal is NOT to implement a complex real-time RAG (Retrieval Augmented Generation) design pattern. I am totally fine updating K.AI manually every few weeks (after a few new blog posts are published).

Chatbase – Custom ChatGPT for your website

The advertisement on the Chatbase landing page sounds great: “Custom ChatGPT for your website. Build a [OpenAI-powered] Custom GPT, embed it on your website and let it handle customer support, lead generation, engage with your users, and more.”

Here are my notes while training my K.AI chatbot:

K.AI works well with Chatbase after the initial training…

  • Chatbase is very simple to use. It just works.
  • The basic plan is ~20 USD per month. The subscription plan is fair, the next upgrade is ~100 USD.
  • The chatbot uses GPT-4o by default. Great option. Many other services use GPT-3.5 or similar LLMs as the foundation.
  • The chatbot creates content based on my content; it is “me”. Mission accomplished. The quality of the responses depends on the questions. In summary: pretty good, but with some false positives.

But: Chatbase’s character limitation stops further training

  • Unfortunately, all plans have an 11M character limit. My blog content is already at 10.8M characters today, according to Chatbase’s web scraper engine (each vendor’s scraper reports different numbers). While K.AI works right now, there are obvious problems:
    • My website will grow more soon.
    • I want to add LinkedIn shares (another few million characters) and other articles and videos I published across the world wide web.
    • The Chatbase plan can be customized, but unfortunately not regarding character limits. Support told me this would be possible soon. But I have to wait.

TL;DR: Chatbase works surprisingly well. K.AI exists and represents me as an LLM. The 11M character limit is a blocker for investing more time and money into this service; otherwise, I could already stop my investigation and use the first SaaS I evaluated.

During my evaluation, I realized that many other chatbot services have similar character limits, especially in the price range of 20-50 USD. Not ideal for my use case.

In my further evaluation, my major criterion was the character limit. I found Botsonic and LiveChatAI. Both support much higher limits at a cost of ~40 USD per month.

Botsonic – Advanced AI chatbot builder using your company’s knowledge

Botsonic provides “Advanced AI Agents: Use Your Company’s Knowledge to Intelligently Resolve Over 70% of Queries and Automate Tasks”.

Here are my notes while training my K.AI chatbot.

Botsonic – free version failed to train K.AI

  • The free plan for getting started supports 1M characters.
  • The service supports URL scraping and file upload (my LinkedIn shares are only available via batch export to a CSV file). It looks like it provides everything I need. The cost is okayish (but all other chatbots with a lower price also had limitations around 10M characters).
  • I tried the free tier first. As my blog alone already has 10M+ characters, I started by uploading my LinkedIn shares (= posts and comments). While Chatbase said they contain ~1.8M characters, this solution trained the bot with them even though the limit is 1M characters. I could not even upload another 1 KB file for additional training, so my limit was reached.
  • The K.AI trained with the free tier did not provide any appropriate answers. No surprise: just my LinkedIn shares might not contain enough detail, which makes sense as the posts are much shorter and usually link to my blog.

Botsonic – paid version also failed to train K.AI

  • I needed to upgrade.
    • I had to choose the smallest paid tier: 49 USD per month, supporting up to 50M characters.
    • Unfortunately, there was a hiccup: the payment was processed twice with no effect; I was still on the free plan. Support took time (suggesting caching, VPN, browser issues, and other explanations). I got a refund the next day, and the plan was then updated correctly.
  • Training using the paid subscription failed. The experience was pretty bad.
    • It was not clear whether the service scrapes the entire website or just a single HTML page.
    • First tests did not give a response: “I don’t have specific information on XYZ. Can I help with anything else?” It seems the training did not scrape my website but only looked at the landing page. I checked the details: indeed, the extracted data only includes the abstracts of the latest blog posts (that’s what you see on my landing page).
    • Support explained that no scraping of the website is possible; I need a sitemap. I have a Google-compliant sitemap, but: Internal Backend Server Error. Support could reproduce my issue. To this day, I have no response or solution.
    • Learning from one of my YouTube videos was also rejected (with no further error message).

TL;DR: Writesonic’s Botsonic did NOT work for me. The paid service failed several times, even when trying different training options for my LLM. Support could not help. I will NOT continue with this service.

LiveChatAI – AI chatbot works with your data

Here is the website slogan: “An Innovative AI Chatbot. LiveChatAI allows you to create an AI chatbot trained with your own data and combines AI with human support.”

Here are my notes while training my K.AI chatbot.

LiveChatAI failed to train K.AI

  • All required import features exist: Website Scraping, CSV, YouTube.
  • Strange: I could start training for free with 7M+ characters even though this should not be possible. Crawling started… It did not show a percentage, so I did not know when it was finished. It was also not clear whether it scrapes the entire website or just a single HTML page. It showed weird error messages like “could not find any links on website” after it had finished scraping.
  • The quality of this K.AI’s answers seems much worse than Chatbase’s (even though I added my LinkedIn shares, which is not possible in Chatbase because of the character limits).

Ok, enough… I have a well-working K.AI with Chatbase. I don’t want to waste more time evaluating SaaS chatbot services this early in the product lifecycle.

GenAI tools are still in a very early stage!

One key lesson learned: the LLM used is the most critical piece for success, NOT how much context and domain expertise you feed it with. In other words: just scraping the data from my blog and using GPT-4o provides much better results than using GPT-3.5 with data from my blog, LinkedIn, and YouTube. Ideally, I would use all the data with GPT-4o. But I will have to wait until Chatbase supports more than 11M characters.

While most solutions talk about model training, they use ChatGPT under the hood together with RAG and a vector database to “update the model”, i.e., they provide the right context for each question to ChatGPT via the RAG design pattern.

A real comparison of chatbot SaaS offerings is hard:

  • Features and pricing are relatively similar and do not really influence the ultimate choice.
  • While all are based on ChatGPT, the LLM model versions differ.
  • Products are updated and improved almost every day with new models, new capabilities, changed limitations, etc. Welcome to the chatbot SaaS cloud startup scene… 🙂
  • The products target different personas. Some are UI-only, some explain (and let you configure) RAG or vector database options, and some are built for developers and focus on API integration, not UIs.

Mission accomplished: K.AI chatbot is here

Chatbase has the least sexy UI in my evaluation. But the model works best (even though I hit the character limit and only used my blog articles for training). I will use Chatbase for now. And I hope that the character limits are raised soon (as its support already confirmed to me). It is still early in the maturity curve. The market will probably develop quickly.

I am not sure how many of these SaaS chatbot startups can survive. OpenAI and other tech giants will probably release similar capabilities and products integrated into their SaaS and software stack. Let’s see where the market goes. For now, I will enjoy K.AI for some use cases. Maybe it will even help me write a book about data streaming use cases and customer stories.

What is your experience with chatbot tools? Do you need more technical solutions, or do you favour simplified conversational AIs like OpenAI’s Custom GPT to train your own LLM? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Hello, K.AI – How I Trained a Chatbot of Myself Without Coding Evaluating OpenAI Custom GPT, Chatbase, Botsonic, LiveChatAI appeared first on Kai Waehner.

Real-Time GenAI with RAG using Apache Kafka and Flink to Prevent Hallucinations https://www.kai-waehner.de/blog/2024/05/30/real-time-genai-with-rag-using-apache-kafka-and-flink-to-prevent-hallucinations/ Thu, 30 May 2024 15:09:06 +0000 https://www.kai-waehner.de/?p=6409 How do you prevent hallucinations from large language models (LLMs) in GenAI applications? LLMs need real-time, contextualized, and trustworthy data to generate the most reliable outputs. This blog post explains how RAG and a data streaming platform with Apache Kafka and Flink make that possible. A lightboard video shows how to build a context-specific real-time RAG architecture. Also, learn how the travel agency Expedia leverages data streaming with Generative AI using conversational chatbots to improve the customer experience and reduce the cost of service agents.

The post Real-Time GenAI with RAG using Apache Kafka and Flink to Prevent Hallucinations appeared first on Kai Waehner.

How do you prevent hallucinations from large language models (LLMs) in GenAI applications? LLMs need real-time, contextualized, and trustworthy data to generate the most reliable outputs. This blog post explains how RAG and a data streaming platform with Apache Kafka and Flink make that possible. A lightboard video shows how to build a context-specific real-time RAG architecture. Also, learn how the travel agency Expedia leverages data streaming with Generative AI using conversational chatbots to improve the customer experience and reduce the cost of service agents.

RAG and Kafka Flink to Prevent Hallucinations in GenAI

What is Retrieval Augmented Generation (RAG) in GenAI?

Generative AI (GenAI) refers to artificial intelligence (AI) systems that can create new content, such as text, images, music, or code, often mimicking human creativity. These systems use advanced machine learning techniques, particularly deep learning models like neural networks, to generate data that resembles the training data they were fed. Popular examples include language models like GPT-3 for text generation and DALL-E for image creation.

Large Language Models like ChatGPT are trained on lots of public data, are very expensive to train, and do not provide domain-specific context. Training their own models is not an option for most companies because of limitations in cost and expertise.

Retrieval Augmented Generation (RAG) is a technique in Generative AI to solve this problem. RAG enhances the performance of language models by integrating information retrieval mechanisms into the generation process. This approach aims to combine the strengths of information retrieval systems and generative models to produce more accurate and contextually relevant outputs.

Pinecone created an excellent diagram that explains RAG and shows the relation to an embedding model and vector database:

Retrieval Augmented Generation with Embedding Model, Vector Database and Context
Source: Pinecone
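
To make the mechanics tangible, here is a deliberately tiny, self-contained sketch of the retrieval step. The bag-of-words "embedding" is a toy stand-in for a real embedding model, and the document list stands in for a vector database; only the flow (embed, retrieve, build a grounded prompt) mirrors the diagram above:

```python
import math
from collections import Counter

# Toy stand-in for a real embedding model: a bag-of-words vector is enough
# to show the retrieval mechanics
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Stand-in for a vector database holding domain documents
documents = [
    "Flight LH1234 is delayed by 45 minutes due to late aircraft arrival.",
    "Refund policy: cancellations within 24 hours are refunded in full.",
]
index = [(doc, embed(doc)) for doc in documents]

def rag_prompt(question: str, top_k: int = 1) -> str:
    q_vec = embed(question)
    ranked = sorted(index, key=lambda e: cosine(q_vec, e[1]), reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:top_k])
    # The retrieved context grounds the LLM and reduces hallucinations
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(rag_prompt("Why is flight LH1234 delayed?"))
```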

Benefits of Retrieval Augmented Generation

RAG brings various benefits to the GenAI enterprise architecture:

  • Access to External Information: By retrieving relevant documents from a vast vector database, RAG allows the generative model to leverage up-to-date and domain-specific information that it may not have been trained on.
  • Reduced Hallucinations: Generative models can sometimes produce confident but incorrect answers (hallucinations). By grounding responses in retrieved documents, RAG reduces the likelihood of such errors.
  • Domain-Specific Applications: RAG can be tailored to specific domains by curating the retrieval database with domain-specific documents, enhancing the model’s performance in specialized areas such as medicine, law, finance or travel.

However, one of the most significant problems still exists: the lack of the right context and up-to-date information.

RAG is obviously crucial in enterprises where data privacy, up-to-date context, and integration with transactional and analytical systems (like an order management system, booking platform, or payment fraud engine) must be consistent, scalable, and real-time.

An event-driven architecture is the foundation of data streaming with Kafka and Flink:

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Apache Kafka and Apache Flink play a crucial role in the Retrieval Augmented Generation (RAG) architecture by ensuring real-time data flow and processing, which enhances the system’s ability to retrieve and generate up-to-date and contextually relevant information.

Here’s how Kafka and Flink contribute to the RAG architecture:

1. Real-Time Data Ingestion and Processing

Data Ingestion: Kafka acts as a high-throughput, low-latency messaging system that ingests real-time data from various data sources, such as databases, APIs, sensors, or user interactions.

Event Streaming: Kafka streams the ingested data, ensuring that the data is available in real-time to downstream systems. This is critical for applications that require immediate access to the latest information.

Stream Processing: Flink processes the incoming data streams in real-time. It can perform complex transformations, aggregations, and enrichments on the data as it flows through the system.

Low Latency: Flink’s ability to handle stateful computations with low latency ensures that the processed data is quickly available for retrieval operations.

2. Enhanced Data Retrieval

Real-Time Updates: By using Kafka and Flink, the retrieval component of RAG can access the most current data. This is crucial for generating responses that are not only accurate but also timely.

Dynamic Indexing: As new data arrives, Flink can update the retrieval index in real-time, ensuring that the latest information is always retrievable in a vector database.
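
A hedged sketch of this dynamic indexing loop: a consumer picks up new or changed documents from a hypothetical Kafka topic and upserts them into the index immediately, so the next retrieval already sees them. In a real deployment, Flink would run this continuously and write to a vector database instead of a Python dict:

```python
import json
from collections import Counter

from confluent_kafka import Consumer

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in for a real embedding model call
    return Counter(text.lower().split())

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumption: local dev cluster
    "group.id": "vector-indexer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["document-updates"])  # hypothetical topic of changed docs

index: dict[str, tuple[str, Counter]] = {}  # stands in for a vector database

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    doc = json.loads(msg.value())  # expected shape: {"id": ..., "text": ...}
    index[doc["id"]] = (doc["text"], embed(doc["text"]))
    print(f"Index updated: {doc['id']} ({len(index)} documents retrievable)")
```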

3. Scalability and Reliability

Scalable Architecture: Kafka’s distributed architecture allows it to handle large volumes of data, making it suitable for applications with high-throughput requirements. Flink’s scalable stream processing capabilities ensure it can process and analyze large data streams efficiently. Cloud-native implementations or cloud services take over the operations and provide elastic scaling.

Fault Tolerance: Kafka provides built-in fault tolerance by replicating data across multiple nodes, ensuring data durability and availability, even in the case of node failures. Flink offers state recovery and exactly-once processing semantics, ensuring reliable and consistent data processing.

4. Contextual Enrichment

Contextual Data Processing: Flink can enrich the raw data with additional context before the generative model uses it. For instance, Flink can join incoming data streams with historical data or external datasets to provide a richer context for retrieval operations.

Feature Extraction: Flink can extract features from the data streams that help improve the relevance of the retrieved documents or passages.

5. Integration and Flexibility

Seamless Integration: Kafka and Flink integrate well with model servers (e.g., for model embeddings) and storage systems (e.g., vector databases for semantic search). This makes it easy to incorporate the right information and context into the RAG architecture.

Modular Design: The use of Kafka and Flink allows for a modular design where different components (data ingestion, processing, retrieval, generation) can be developed, scaled, and maintained independently.

Lightboard Video: RAG with Data Streaming

The following ten-minute lightboard video is an excellent interactive explanation for building a RAG architecture with embedding model, vector database, Kafka and Flink to ensure up-to-date and context-specific prompts into the LLM:

Expedia: Generative AI in the Travel Industry

Expedia is an online travel agency that provides booking services for flights, hotels, car rentals, vacation packages, and other travel-related services. Its IT architecture has been built around data streaming for many years, including the integration of transactional and analytical systems.

When Covid hit, Expedia had to innovate fast to handle all the support traffic spikes regarding flight rebookings, cancellations, and refunds. The project team trained a domain-specific conversational chatbot (long before ChatGPT and the term GenAI existed) and integrated it into the business process.

Expedia GenAI in the Travel Industry with Data Streaming Kafka and Machine Learning AI
Source: Confluent

Here are some of the impressive business outcomes:

  • Quick time to market with innovative new technology to solve business problems
  • 60%+ of travelers are self-servicing in chat after the rollout
  • 40%+ saved in variable agent costs by enabling self-service

By leveraging Apache Kafka and Apache Flink, the RAG architecture can handle real-time data ingestion, processing, and retrieval efficiently. This ensures that the generative model has access to the most current and contextually rich information, resulting in more accurate and relevant responses. The scalability, fault tolerance, and flexibility offered by Kafka and Flink make them ideal components for enhancing the capabilities of RAG systems.

If you want to learn more about data streaming with GenAI, check out the related articles on this blog.

How do you build a RAG architecture? Do you already leveraging Kafka and Flink for it? Or what technologies and architectures do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Real-Time GenAI with RAG using Apache Kafka and Flink to Prevent Hallucinations appeared first on Kai Waehner.

The Past, Present and Future of Stream Processing https://www.kai-waehner.de/blog/2024/03/20/the-past-present-and-future-of-stream-processing/ Wed, 20 Mar 2024 06:47:53 +0000 https://www.kai-waehner.de/?p=6222 Stream processing has existed for decades. The adoption grows with open source frameworks like Apache Kafka and Flink in combination with fully managed cloud services. This blog post explores the past, present and future of stream processing, including the relation of machine learning and GenAI, streaming databases, and the integration between data streaming and data lakes with Apache Iceberg.

The post The Past, Present and Future of Stream Processing appeared first on Kai Waehner.

Stream processing has existed for decades. However, it really took off in the 2020s thanks to the adoption of open source frameworks like Apache Kafka and Flink. Fully managed cloud services make it easy to configure and deploy stream processing in a cloud-native way, even without the need to write any code. This blog post explores the past, present, and future of stream processing. The discussion includes various technologies and cloud services, low-code/no-code trade-offs, outlooks into the support of machine learning and GenAI, streaming databases, and the integration between data streaming and data lakes with Apache Iceberg.

The Past Present and Future of Stream Processing

In December 2023, the research company Forrester confirmed that data streaming is a new software category and not just yet another integration or data platform by publishing “The Forrester Wave™: Streaming Data Platforms, Q4 2023”. Get free access to the report here. The leaders are Microsoft, Google, and Confluent, followed by Oracle, Amazon, Cloudera, and a few others. A great time to review the past, present, and future of stream processing as a key component of a data streaming architecture.

The Past of Stream Processing: The Move from Batch to Real-Time

The evolution of stream processing began as industries sought more timely insights from their data. Initially, batch processing was the norm. Data was collected over a period, stored, and processed at intervals. This method, while effective for historical analysis, proved inefficient for real-time decision-making.

In parallel to batch processing, message queues were created to provide real-time communication for transactional data. Message brokers like IBM MQ or TIBCO EMS were a common way to decouple applications. Applications send and receive data in an event-driven architecture without worrying about whether the recipient is ready, how to handle backpressure, and so on. The stream processing journey had begun.

Slow Batch Processing vs Real-Time Stream Processing

Stream Processing is a Journey Over Decades…

… and we are still at a very early stage in most enterprises. Here is an excellent timeline from TimePlus about the journey of stream processing across open source frameworks, proprietary platforms, and SaaS cloud services:

30 Year Journey Into Streaming Analytics with Open Source Frameworks Proprietary Products and Cloud
Source: TimePlus

The stream processing journey started decades ago with research and the first purpose-built proprietary products for specific use cases like stock trading.

Open source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Today, most enterprises have at least started to understand the value of stream processing for analytical and transactional use cases across industries. The cloud is a fundamental change: you can start streaming and processing data with a button click, leveraging fully managed SaaS and simple UIs (if you don’t want to operate infrastructure or write low-level source code).

TIBCO StreamBase, Software AG Apama, IBM Streams

The advent of message queue technologies like IBM MQ and TIBCO EMS moved many critical applications to real-time message brokers. Real-time messaging enabled the consumption of data in real-time to store it in a database, mainframe, or application for further processing.

However, only the true stream processing capabilities of tools like TIBCO StreamBase, Software AG Apama, or IBM (InfoSphere) Streams marked a significant shift towards real-time data processing. These products enabled businesses to react to information as it arrived by processing and correlating the data in motion.

Visual coding in tools like StreamBase or Apama represents an innovative approach to developing stream processing solutions. These tools provide a graphical interface that allows developers and analysts to design, build, and test applications by connecting various components and logic blocks visually, rather than writing code manually. Under the hood, the code generation worked with a Streaming SQL language.

Here is a screenshot of the TIBCO StreamBase IDE for visual drag & drop of streaming pipelines:

TIBCO StreamBase IDE
TIBCO StreamBase IDE

Some drawbacks of these early stream processing solutions included high cost, vendor lock-in, no flexibility regarding tools or APIs, and missing communities. These platforms are monolithic and were built long before cloud-native elasticity and scalability became a requirement in most RFIs and RFPs for evaluating vendors.

Open Source Event Streaming with Apache Kafka

The actual significant change for stream processing came with the introduction of Apache Kafka, a distributed streaming platform that allowed for high-throughput, fault-tolerant handling of real-time data feeds. Kafka, alongside other technologies like Apache Flink, revolutionized the landscape by providing the tools necessary to move from batch to real-time stream processing seamlessly.

The adoption of open source technologies changed all industries. Openness, flexibility, and community-driven development enabled easier influence on the features and faster innovation.

Over 100,000 organizations use Apache Kafka. The massive adoption came from a unique combination of capabilities: messaging, storage, data integration, and stream processing, all in one scalable and distributed infrastructure.

Various open source stream processing engines emerged. Kafka Streams was added to the Apache Kafka project. Other examples include Apache Storm, Spark Streaming, and Apache Flink.

The Present of Stream Processing: Architectural Evolution and Mass Adoption

The fundamental change to processing data in motion has enabled the development of data products and data mesh. Decentralizing data ownership and management with domain-driven design and technology-independent microservices promotes a more collaborative and flexible approach to data architecture. Each business unit can choose its own technology, API, cloud service, and communication paradigm like real-time, batch, or request-response.

From Lambda Architecture to Kappa Architecture

Today, stream processing is at the heart of modern data architecture, thanks in part to the emergence of the Kappa architecture. This model simplifies the traditional Lambda Architecture by using a single stream processing system to handle both real-time and historical data analysis, reducing complexity and increasing efficiency.

Lambda architecture with separate real-time and batch layers:

Lambda Architecture for Batch and Real Time Data

Kappa architecture with a single pipeline for real-time and batch processing:

Kappa Architecture as a Unified Data Streaming Pipeline for Batch and Real-Time Events

For more details about the pros and cons of Kappa vs. Lambda, check out my article “Kappa Architecture is Mainstream Replacing Lambda”. It explores case studies from Uber, Twitter, Disney, and Shopify.

Apache Kafka has become synonymous with building scalable and fault-tolerant streaming data pipelines. Because Kafka facilitates true decoupling of domains and applications, it is integral to microservices and data mesh architectures.

Plenty of stream processing frameworks, products, and cloud services have emerged in the past years. This includes open source frameworks like Kafka Streams, Apache Storm, Samza, Flume, Apex, Flink, and Spark Streaming, and cloud services like Amazon Kinesis, Google Cloud Dataflow, and Azure Stream Analytics. The “Data Streaming Landscape 2024” gives an overview of relevant technologies and vendors.

Apache Flink seems to be becoming the de facto standard for many enterprises (and vendors). Its adoption today looks like Kafka’s four years ago:

The Rise of Open Source Streaming Processing with Apache Kafka and Apache Flink
Source: Confluent

This does not mean other frameworks and solutions are bad. For instance, Kafka Streams is complementary to Apache Flink, as it suits different use cases.

No matter what technology enterprises choose, the mass adoption of stream processing is in progress right now. This includes modernizing existing batch processes AND building innovative new business models that only work in real time. As a concrete example, think about ride-hailing apps like Uber, Lyft, FREENOW, Grab. They are only possible because events are processed and correlated in real-time. Otherwise, everyone would still prefer a traditional taxi.

Stateless and Stateful Stream Processing

In data streaming, stateless and stateful stream processing are two approaches that define how data is handled and processed over time:

Stateless and Stateful Stream Processing for Fraud Detection

The choice between stateless and stateful processing depends on the specific requirements of the application, including the nature of the data, the complexity of the processing needed, and the performance and scalability requirements.

Stateless Stream Processing

Stateless Stream Processing refers to the handling of each data point or event independently from others. In this model, the processing of an event does not depend on the outcomes of previous events or require keeping track of the state between events. Each event is processed based on the information it contains, without the need for historical context or future data points. This approach is simpler and can be highly efficient for tasks that don’t require knowledge beyond the current event being processed.

The implementation could be a stream processor (like Kafka Streams or Flink), functionality in a connector (like Kafka Connect Single Message Transforms), or a WebAssembly (WASM) module embedded into a streaming platform.
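
A stateless transformation can be expressed as a pure function over a single event. This sketch masks a (made-up) credit card field; the same per-event logic could run in Kafka Streams, Flink, or a Connect Single Message Transform:

```python
import json

# A stateless transformation looks only at the current event: here, masking
# a credit card number before the event reaches downstream systems. No state
# is kept between events, so this scales trivially.
def mask_payment_event(raw: bytes) -> bytes:
    event = json.loads(raw)
    card = event["card_number"]
    event["card_number"] = "*" * (len(card) - 4) + card[-4:]
    return json.dumps(event).encode()

print(mask_payment_event(b'{"card_number": "4111111111111111", "amount": 42.0}'))
```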

Stateful Stream Processing

Stateful Stream Processing involves keeping track of information (state) across multiple events to perform computations that depend on data beyond the current event. This model allows for more complex operations like windowing (aggregating events over a specific time frame), joining streams of data based on keys, and tracking sequences of events or patterns over time. Stateful processing is essential for scenarios where the outcome depends on accumulated knowledge or trends derived from a series of data points, not just on a single input.

The implementation is much more complex and challenging than stateless stream processing, and a dedicated stream processing framework is required. Dedicated distributed engines (like Apache Flink) handle stateful computations, memory usage, and scalability better than Kafka-native tools like Kafka Streams or KSQL (because the latter are bound to Kafka topics).
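
For contrast, here is a toy sketch of the state a stream processor has to manage for a windowed fraud check like the one in the diagram above: counting payments per card in 5-minute tumbling windows. The field names and thresholds are made up; the point is that this bookkeeping (plus fault tolerance, scaling, and timers) is exactly what engines like Flink take off your hands:

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling windows

# State: payment counts per (card, window). A dedicated engine manages this
# state fault-tolerantly and at scale; here it is just an in-memory dict.
counts: dict[tuple[str, int], int] = defaultdict(int)

def on_payment(card: str, ts: int) -> None:
    window = ts - (ts % WINDOW_SECONDS)  # assign the event to its window
    counts[(card, window)] += 1
    if counts[(card, window)] > 3:       # suspicious burst of payments
        print(f"Fraud alert: {counts[(card, window)]} payments "
              f"from card {card} within 5 minutes")

# Simulated event stream: four payments from one card in quick succession
for ts in [1000, 1030, 1100, 1150]:
    on_payment("card-42", ts)
```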

Low Code, No Code, AND A Lot of Code!

No-code and low-code tools are software platforms that enable users to develop applications quickly and with minimal coding knowledge. These tools provide graphical user interfaces with drag-and-drop capabilities, allowing users to assemble and configure applications visually rather than writing extensive lines of code.

Common features and benefits of visual coding:

  • Rapid Development: Both types of platforms significantly reduce development time, enabling faster delivery of applications.
  • User-Friendly Interface: The graphical interface and drag-and-drop functionality make it easy for users to design, build, and iterate on applications.
  • Cost Reduction: By enabling quicker development with fewer resources, these platforms can lower the cost of software creation and maintenance.
  • Accessibility: They make application development accessible to a broader range of people, reducing the dependency on skilled developers for every project.

So far, the theory.

Disadvantages of Visual Coding Tools

StreamBase, Apama, et al. actually had great visual coding offerings. However, no-code/low-code tools have many drawbacks and disadvantages, too:

  1. Limited Customization and Flexibility: While these platforms can speed up development for standard applications, they may lack the flexibility needed for highly customized solutions. Developers might find it challenging to implement specific functionalities that aren’t supported out of the box.
  2. Dependency on Vendors: Using no-code/low-code platforms often means relying on third-party vendors for the platform’s stability, updates, and security. This dependency can lead to potential issues if the vendor cannot maintain the platform or goes out of business. And often the platform team is the bottleneck for implementing new business or integration logic.
  3. Performance Concerns: Applications built with no-code/low-code platforms may not be as optimized as those developed with traditional coding, potentially leading to lower performance or inefficiencies, especially for complex applications.
  4. Scalability Issues: As businesses grow, applications might need to scale up to support increased loads. No-code/low-code platforms might not always support this level of scalability or might require significant workarounds, affecting performance and user experience.
  5. Over-reliance on Non-Technical Users: While empowering citizen developers is a key advantage of these platforms, it can also lead to governance challenges. Without proper oversight, non-technical users might create inefficient workflows or data structures, leading to technical debt and maintenance issues.
  6. Cost Over Time: Initially, no-code/low-code platforms can reduce development costs. However, as applications grow and evolve, the ongoing subscription costs or fees for additional features and scalability can become significant.

Flexibility is King: Stream Processing for Everyone!

Microservices, domain-driven design, data mesh… all these modern design approaches taught us to build flexible enterprise architectures. Each business unit and persona should be able to choose its own technology, API, or SaaS, no matter if they do real-time, near real-time, batch, or request-response communication.

Apache Kafka provides true decoupling out-of-the-box. Therefore, low-code or no-code tools are an option. However, a data scientist, data engineer, software developer, or citizen integrator can each choose their own technology for stream processing.

The past, present, and future of stream processing show different frameworks, visual coding tools, and even applied generative AI. One solution does NOT replace but complements the other alternatives:

Stream Processing with Clients Low Code No Code Tools or GenAI

The Future of Stream Processing: Serverless SaaS, GenAI and Streaming Databases

Stream processing is set to grow exponentially in the future, thanks to advancements in cloud computing, SaaS, and AI. Let’s explore the future of stream processing and look at the expected short, mid and long-term developments.

SHORT TERM: Fully Managed Serverless SaaS for Stream Processing

The cloud’s scalability and flexibility offer an ideal environment for stream processing applications, reducing the overhead and resources required for on-premise solutions. As SaaS models continue to evolve, stream processing capabilities will become more accessible to a broader range of businesses, democratizing real-time data analytics.

For instance, look at the serverless Flink Actions in Confluent Cloud. You can configure and deploy stream processing for use cases like deduplication or masking without any code:

Serverless Flink Actions in Confluent Cloud
Source: Confluent
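Under the hood, such a deduplication action boils down to a standard stream processing pattern. As a minimal, hypothetical sketch (table and column names are assumed, not the actual generated code), the equivalent hand-written Flink SQL keeps only the first event per key:

-- Hypothetical Flink SQL deduplication: keep only the first event per
-- order_id, ordered by the event-time attribute.
SELECT order_id, amount, event_time
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY event_time ASC
         ) AS row_num
  FROM orders
)
WHERE row_num = 1;

The value of the serverless action is that you never write or operate such a query yourself; the platform generates, deploys, and scales it for you.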

MID TERM: Automated Tooling and the Help of GenAI

Integrating AI and machine learning with stream processing will enable more sophisticated predictive analytics. This opens new frontiers for automated decision-making and intelligent applications while continuously processing incoming event streams. The full potential of embedding AI into stream processing will be explored and realized in the upcoming years.

For instance, automated data profiling is one stream processing use case that GenAI can support significantly. Software tools analyze and understand the quality, structure, and content of a dataset without manual intervention as the events flow through the data pipeline in real-time. This process typically involves examining the data to identify patterns, anomalies, missing values, and inconsistencies. A perfect fit for stream processing!

Automated data profiling in the stream processor can provide insights into data types, frequency distributions, relationships between columns, and other metadata information crucial for data quality assessment, governance, and preparation for further analysis or processing.
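As a minimal sketch of what such automated profiling could emit, assuming a hypothetical payments stream, a generated ksqlDB query might continuously compute basic quality metrics per one-minute window:

-- Illustrative ksqlDB profiling query (stream and column names assumed).
-- COUNT(amount) ignores NULLs, so comparing it with COUNT(*) reveals
-- missing values as events flow through the pipeline.
CREATE TABLE payments_profile AS
  SELECT currency,
         COUNT(*) AS total_events,
         COUNT(amount) AS non_null_amounts,
         MIN(amount) AS min_amount,
         MAX(amount) AS max_amount
  FROM payments
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY currency;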

MID TERM: Streaming Storage and Analytics with Apache Iceberg

Apache Iceberg is an open-source table format for huge analytic datasets. It provides powerful capabilities for managing large-scale data in data lakes. Its integration with streaming data sources like Apache Kafka and analytics platforms, such as Snowflake, Starburst, Dremio, AWS Athena or Databricks, can significantly enhance data management and analytics workflows.

Integration between Streaming Data from Kafka and Analytics on Databricks or Snowflake using Apache Iceberg

Supporting the Apache Iceberg table format might be a crucial strategic move by streaming and analytics frameworks, vendors, and cloud services. Here are some key benefits from the enterprise architecture perspective (a minimal Flink SQL sketch follows the list):

  • Unified Batch and Stream Processing: Iceberg tables can serve as a bridge between streaming data ingestion from Kafka and downstream analytic processing. By treating streaming data as an extension of a batch-based table, Iceberg enables a seamless transition from real-time to batch analytics, allowing organizations to analyze data with minimal latency.
  • Schema Evolution: Iceberg supports schema evolution without breaking downstream systems. This is useful when dealing with streaming data from Kafka, where the schema might evolve. Consumers can continue reading data using the schema they understand, ensuring compatibility and reducing the need for data pipeline modifications.
  • Time Travel and Snapshot Isolation: Iceberg’s time travel feature allows analytics on data as it looked at any point in time, providing snapshot isolation for consistent reads. This is crucial for reproducible reporting and debugging, especially when dealing with continuously updating streaming data from Kafka.
  • Cross-Platform Compatibility: Iceberg provides a unified data layer accessible by different compute engines, including those used by Databricks and Snowflake. This enables organizations to maintain a single copy of their data that is queryable across different platforms, facilitating a multi-tool analytics ecosystem without data silos.
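To make this concrete, here is a rough Flink SQL sketch of such a pipeline. All names, schemas, and connector options are illustrative and vary by Flink and Iceberg version; treat it as a sketch, not a reference implementation:

-- Read a Kafka topic as a streaming table (assumed schema and broker).
CREATE TABLE orders_kafka (
  order_id STRING,
  amount DOUBLE,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json',
  'scan.startup.mode' = 'earliest-offset'
);

-- Register an Iceberg catalog backed by object storage (assumed path).
CREATE CATALOG iceberg WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 's3://my-bucket/warehouse'
);

CREATE DATABASE iceberg.analytics;

CREATE TABLE iceberg.analytics.orders (
  order_id STRING,
  amount DOUBLE,
  ts TIMESTAMP(3)
);

-- The streaming INSERT keeps the Iceberg table continuously up to date;
-- engines like Snowflake, Databricks, or Trino can query the same table.
INSERT INTO iceberg.analytics.orders
SELECT order_id, amount, ts FROM orders_kafka;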

LONG TERM: Transactional + Analytics = Streaming Database?

Streaming databases, like RisingWave or Materialize, are designed to handle real-time data processing and analytics. They offer a way to manage and query data that is continuously generated from sources like IoT devices, online transactions, and application logs. Traditional databases are optimized for static data stored on disk. Streaming databases, by contrast, are built to process and analyze data in motion. They provide insights almost instantaneously as the data flows through the system.

Streaming databases offer the ability to perform complex queries and analytics on streaming data, further empowering organizations to harness real-time insights.
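For illustration, in the style of RisingWave or Materialize, a materialized view over a stream is defined once and then incrementally maintained as new events arrive (the orders source and its columns are assumed):

-- Hypothetical incrementally maintained view in a streaming database.
CREATE MATERIALIZED VIEW revenue_per_product AS
SELECT product_id,
       SUM(amount) AS total_revenue,
       COUNT(*)    AS order_count
FROM orders
GROUP BY product_id;

-- Queries against the view return fresh results without re-scanning
-- the underlying stream:
SELECT * FROM revenue_per_product WHERE product_id = 'P-42';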

The ongoing innovation in streaming databases will probably lead to more advanced, efficient, and user-friendly solutions, facilitating broader adoption and more creative applications of stream processing technologies.

Having said this, we are still at a very early stage. It is not clear yet when you really need a streaming database instead of a mature and scalable stream processor like Apache Flink. The future will tell, and competition is great for innovation.

The Future of Stream Processing is Open Source and Cloud

The journey from batch to real-time processing has transformed how businesses interact with their data. The continued evolution couples technologies like Apache Kafka, Kafka Streams, and Apache Flink with the growth of cloud computing and SaaS. Stream processing will redefine the future of data analytics and decision-making.

As we look ahead, the future possibilities for stream processing are boundless, promising more agile, intelligent, and real-time insights into the ever-increasing streams of data.

If you want to learn more, listen to the following on-demand webinar about the past, present and future of stream processing, where I am joined by the two streaming industry veterans Richard Tibbets (founder of StreamBase) and Michael Benjamin (TimePlus). I had the pleasure of working with them for a few years at TIBCO, where we deployed StreamBase at many financial services companies for stock trading and similar use cases:

What does your stream processing journey look like? In which decade did you join? Or are you just learning with the latest open-source frameworks or cloud services? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka as Mission Critical Data Fabric for GenAI https://www.kai-waehner.de/blog/2023/07/22/apache-kafka-as-mission-critical-data-fabric-for-genai/ Sat, 22 Jul 2023 10:40:59 +0000 https://www.kai-waehner.de/?p=5548

Apache Kafka serves thousands of enterprises as the mission-critical and scalable real-time data fabric for machine learning infrastructures. The evolution of Generative AI (GenAI) with large language models (LLM) like ChatGPT changed how people think about intelligent software and automation. This blog post explains the relation between data streaming and GenAI and shows the enormous opportunities and some early adopters of GenAI beyond the buzz.
Data Streaming with Apache Kafka as Data Fabric for GenAI

Generative AI (GenAI) and Data Streaming

Let’s set the context first to have the same understanding of the buzzwords.

[Note: My ChatGPT iPhone app generated this section. I slightly shortened and adjusted the content afterward. GenAI is perfect for summarizing existing content so that authors can spend time on new content (that ChatGPT does not know yet).]

Natural Language Processing (NLP)

ChatGPT, what is NLP?

NLP stands for Natural Language Processing. It is a subfield of artificial intelligence (AI) that focuses on interactions between computers and human language. NLP enables computers to understand, interpret, and generate human language in a meaningful and valuable way.

Natural Language Processing involves a range of tasks, including:

  • Text Parsing and Tokenization: Breaking down text into individual words or tokens.
  • Named Entity Recognition (NER): Identifying and classifying named entities such as people, organizations, locations, and dates in the text.
  • Sentiment Analysis: Determining the sentiment or emotional tone expressed in text, whether positive, negative, or neutral.
  • Machine Translation: Translating text from one language to another.
  • Question Answering: Building systems that can understand and answer questions posed in natural language.
  • Text Generation: Creating human-like text or generating responses to prompts.

NLP is crucial in applications such as chatbots, virtual assistants, language translation, information retrieval, sentiment analysis, and more.

GenAI = Next Generation NLP (and more)

ChatGPT, what is Generative AI?

Generative AI is a branch of artificial intelligence focused on creating models and systems capable of generating new content, such as images, text, music, or even entire virtual worlds. These models are trained on large datasets and learn patterns and structures to generate new outputs similar to the training data. That’s why the widespread buzzword is Large Language Model (LLM).

Generative AI is used for next-generation NLP and uses techniques such as generative adversarial networks (GANs), variational autoencoders (VAEs), and recurrent neural networks (RNNs). Generative AI has applications in various fields and industries, including art, design, entertainment, and scientific research.

Apache Kafka for Data Streaming

ChatGPT, what is Apache Kafka?

Apache Kafka is an open-source distributed streaming platform and became the de facto standard for event streaming. It was developed by the Apache Software Foundation and is widely used for building real-time data streaming applications and event-driven architectures. Kafka provides a scalable and fault-tolerant system for handling high volumes of streaming data.

Open Source Data Streaming in the Cloud

Kafka has a thriving ecosystem with various tools and frameworks that integrate with it, such as Apache Spark, Apache Flink, and others.

Apache Kafka is widely adopted in use cases that require real-time data streaming, such as data pipelines, event sourcing, log aggregation, messaging systems, and more.

Why Apache Kafka and GenAI?

Generative AI (GenAI) is the next-generation NLP engine that helps many projects in the real world for service desk automation, customer conversation with a chatbot, content moderation in social networks, and many other use cases.

Apache Kafka became the predominant orchestration layer in these machine learning platforms for integrating various data sources, processing at scale, and real-time model inference.

Data streaming with Kafka already powers many GenAI infrastructures and software products. Very different scenarios are possible:

  • Data streaming as data fabric for the entire machine learning infrastructure
  • Model scoring with stream processing for real-time productions
  • Generation of streaming data pipelines with input text or speech
  • Real-time online training of large language models

Let’s explore these opportunities for data streaming with Kafka and GenAI in more detail.

Real-time Kafka Data Hub for GenAI and other Microservices in the Enterprise Architecture

I already explored in 2017 (!) “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka”. At that time, real-world examples came from tech giants like Uber, Netflix, and Paypal.

Today, Apache Kafka is the de facto standard for building scalable and reliable machine learning infrastructures across any enterprise and industry, including:

  • Data integration from various sources (sensors, logs, databases, message brokers, APIs, etc.) using Kafka Connect connectors, fully-managed SaaS integrations, or any kind of HTTP REST API or programming language.
  • Data processing leveraging stream processing for cost-efficient streaming ETL, such as filtering, aggregations, and more advanced calculations while the data is in motion (so that any downstream application gets accurate information); a minimal sketch follows this list.
  • Data ingestion for near real-time data sharing with various data warehouses and data lakes so that each analytics platform can use its own products and tools.
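As a minimal sketch of the data processing step, assuming a hypothetical payments_raw stream, a ksqlDB pipeline could filter invalid events and mask PII before any downstream application consumes the data:

-- Illustrative streaming ETL (stream and column names are assumed).
-- MASK() is a built-in ksqlDB function for obfuscating field contents.
CREATE STREAM payments_clean AS
  SELECT customer_id,
         MASK(credit_card_number) AS credit_card_masked,
         amount,
         currency
  FROM payments_raw
  WHERE amount > 0 AND customer_id IS NOT NULL;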

Kafka Machine Learning Architecture for GenAI

Building scalable and reliable end-to-end pipelines is today’s sweet spot of data streaming with Apache Kafka in the AI and Machine Learning space.

Model Scoring with Stream Processing for Real-Time Predictions at any Scale

Deploying an analytic model in a Kafka application is the solution to provide real-time predictions at any scale with low latency. This is one of the biggest problems in the AI space, as data scientists primarily focus on historical data and batch model training in data lakes.

However, the model scoring for predictions needs to provide much better SLAs regarding scalability, reliability, and latency. Hence, more and more companies separate model training from model scoring and deploy the analytic model within a stream processor such as Kafka Streams, KSQL, or Apache Flink:

Data Streaming and Machine Learning with Embedded TensorFlow Model

Check out my article “Machine Learning and Real-Time Analytics in Apache Kafka Applications” for more details.

Dedicated model servers usually only support batch and request-response (e.g., via HTTP or gRPC). Fortunately, many solutions now also provide native integration with the Kafka protocol.

Kafka-native Model Server for Machine Learning and Model Deployment

I explored this innovation in my blog post “Streaming Machine Learning with Kafka-native Model Deployment“.

Development Tools for Generating Kafka-native Data Pipelines from Input Text or Speech

Almost every software vendor discusses how GenAI can enhance its development environments and user interfaces.

For instance, GitHub is a platform and cloud-based service for software development and version control using Git. But their latest innovation is “the AI-Powered Developer Platform to Build, Scale, and Deliver Secure Software”: GitHub Copilot X. Cloud providers like AWS provide similar tools.

Similarly, look at any data infrastructure vendor like Databricks or Snowflake. The latest conferences and announcements focus on embedded capabilities around large language models and GenAI in their solutions.

The same will be true for many data streaming platforms and cloud services. Low-code/no-code tools will add capabilities to generate data pipelines from input text. One of the most straightforward applications that I see coming is generating SQL code out of user text.

For instance, “Consume data from Oracle table customer, aggregate the payments by customer, and ingest it into Snowflake”. This could create SQL code for stream processing technologies like KSQL or FlinkSQL.
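A GenAI assistant might turn that prompt into something like the following ksqlDB statements. This is purely illustrative: the connector configurations are heavily abbreviated, and all topic, table, and column names are hypothetical:

-- Source connector ingesting the Oracle table (required options omitted).
CREATE SOURCE CONNECTOR oracle_customers WITH (
  'connector.class' = 'io.confluent.connect.oracle.cdc.OracleCdcSourceConnector',
  'table.inclusion.regex' = '.*CUSTOMER');

-- Continuous aggregation of payments per customer.
CREATE TABLE payments_by_customer AS
  SELECT customer_id,
         SUM(amount) AS total_payments
  FROM payments
  GROUP BY customer_id;

-- Sink connector writing the aggregate into Snowflake (options omitted).
CREATE SINK CONNECTOR snowflake_sink WITH (
  'connector.class' = 'com.snowflake.kafka.connector.SnowflakeSinkConnector',
  'topics' = 'PAYMENTS_BY_CUSTOMER');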

Developer experience, faster time-to-market, and support for less technical personas are enormous advantages of embedding GenAI into Kafka development environments.

Real-time Training of Large Language Models (LLM)

AI and Machine Learning are still batch-based systems almost all of the time. Model training takes at least hours. This is not ideal, as many GenAI use cases require accurate and up-to-date information. Imagine googling for information today and not finding anything from the past week. Such a service would be unusable in many scenarios!

Similarly, if I ask ChatGPT today (July 2023): “What is GenAI?” – I get the following response:

As of my last update in September 2021, there is no specific information on an entity called “GenAi.” It’s possible that something new has emerged since then. Could you provide more context or clarify your question so I can better assist you?

The faster your machine learning infrastructure ingests data into model training, the better. My colleague Michael Drogalis wrote an excellent deep-technical blog post: “GPT-4 + Streaming Data = Real-Time Generative AI” to explore this topic more thoroughly.

Real Time GenAI with Data Streaming powered by Apache Kafka

This architecture is compelling because the chatbot will always have your latest information whenever you prompt it. For instance, if your flight gets delayed or your terminal changes, the chatbot will know about it during your chat session. This is entirely distinct from current approaches, where the chat session must be reloaded or you have to wait a few hours or days for new data to arrive.

LLM + Vector Database + Kafka = Real-Time GenAI

Real-time model training is still a novel approach. Many machine learning algorithms are not ready for continuous online model training today. But combining Kafka with a vector database enables using a batch-trained LLM together with real-time updates feeding up-to-date information into the LLM.

In a few years, nobody will accept an LLM like ChatGPT giving answers like “I don’t have this information; my model was trained a week ago”. It does not matter if you choose a brand-new vector database like Pinecone or leverage the new vector capabilities of your installed Oracle or MongoDB storage.

Feed data into the vector database in real-time with Kafka Connect and combine it with a mature LLM to enable real-time GenAI with context-specific recommendations.

Real-World Case Studies for Kafka and GenAI

This section explores how companies across different industries, such as the carmaker BMW, the online travel platform Expedia, and the dating app Tinder, leverage the combination of data streaming and GenAI for reliable real-time conversational AI, NLP, and chatbots built on Kafka.

Two years ago, I wrote about this topic: “Apache Kafka for Conversational AI, NLP and Chatbot“. But technologies like ChatGPT make it much easier to adopt GenAI in real-world projects with much faster time-to-market and less cost and risk. Let’s explore a few of these success stories for embedding NLP and GenAI into data streaming enterprise architectures.

Disclaimer: As I want to show real-world case studies instead of visionary outlooks, I show several examples deployed in production in the last few years. Hence, the analytic models do not use GenAI, LLM, or ChatGPT as we know it from the press today. But the principles are precisely the same. The only difference is that you could use a cutting-edge model like ChatGPT with much improved and context-specific responses today.

Expedia – Conversations Platform for Better Customer Experience

Expedia is a leading online travel and booking platform. They have many use cases for machine learning. One of my favorite examples is their Conversations Platform built on Kafka and Confluent Cloud to provide an elastic cloud-native application.

The goal of Expedia’s Conversations Platform was simple: Enable millions of travelers to have natural language conversations with an automated agent via text, Facebook, or their channel of choice. Let them book trips, make changes or cancellations, and ask questions:

  • “How long is my layover?”
  • “Does my hotel have a pool?”
  • “How much will I get charged to bring my golf clubs?”

Then take all that is known about that customer across all of Expedia’s brands and apply machine learning models to immediately give customers what they are looking for in real-time and automatically, whether a straightforward answer or a complex new itinerary.

Real-time Orchestration realized in four Months

Such a platform is no place for batch jobs, back-end processing, or offline APIs. To quickly make decisions that incorporate contextual information, the platform needs data in near real-time, and it needs it from a wide range of services and systems. Meeting these needs meant architecting the Conversations Platform around a central nervous system based on Confluent Cloud and Apache Kafka. Kafka made it possible to orchestrate data from loosely coupled systems, enrich data as it flows between them so that by the time it reaches its destination, it is ready to be acted upon, and surface aggregated data for analytics and reporting.

Expedia built this platform from zero to production in four months. That’s the tremendous advantage of using a fully managed serverless event streaming platform as the foundation. The project team can focus on the business logic.

The Covid pandemic proved the idea of an elastic platform: Companies were hit with a tidal wave of customer questions, cancellations, and re-bookings. Throughout this once-in-a-lifetime event, the Conversations Platform proved up to the challenge, auto-scaling as necessary and taking off much of the load of live agents.

Expedia’s Migration from MQ to Kafka as Foundation for Real-time Machine Learning and Chatbots

As part of their Conversations Platform, Expedia needed to modernize their IT infrastructure, as Ravi Vankamamidi, Director of Technology at Expedia Group, explained in a Kafka Summit keynote.

Expedia’s old chatbot service relied on a legacy messaging system. It was a question-and-answer board with a very limited scope for booking scenarios and could only handle two-party conversations. It could not scale to bring all the different systems into one architecture to build a powerful chatbot that is helpful for customer conversations.

I explored several times that event streaming is more than just a (scalable) message queue. Check out my old (but still accurate and relevant) Comparison between MQ and Kafka, or the newer comparison between cloud-native iPaaS and Kafka.

Expedia needed a service that was closer to travel assistance. It needed to handle context-specific, multi-party, multi-channel conversations. Hence, features such as natural language processing, translation, and real-time analytics are required. The full service needs to be scalable across multiple brands. Therefore, a fast and highly scalable platform with order guarantees, exactly-once-semantics (EOS), and real-time data processing were needed.
The Kafka-native event streaming platform powered by Confluent was the best choice and met all requirements. The new Conversations Platform doubled the Net Promoter Score (NPS) within one year of the rollout and quickly proved its business value.

BMW – GenAI for Contract Intelligence, Workplace Assistance and Machine Translation

The automotive company BMW presented innovative NLP services at Kafka Summit in 2019. It is no surprise that a carmaker has various NLP scenarios. These include digital contract intelligence, workplace assistance, machine translation, and customer conversations. The latter contains multiple use cases for conversational AI:
  • Service desk automation
  • Speech analysis of customer interaction center (CIC) calls to improve the quality
  • Self-service using smart knowledge bases
  • Agent support
  • Chatbots
The text and speech data is structured, enriched, contextualized, summarized, and translated to build real-time decision support applications. Kafka is a crucial component of BMW’s ML and NLP architecture. The real-time integration and data correlation enable interactive and interoperable data consumption and usage:

NLP Service Framework Based on Data Streaming at BMW

BMW explained the key advantages of leveraging Kafka and its stream processing library Kafka Streams as the real-time integration and orchestration platform:
  • Flexible integration: Multiple supported interfaces for different deployment scenarios, including various machine learning technologies, programming languages, and cloud providers
  • Modular end-to-end pipelines: Services can be connected to provide full-fledged NLP applications.
  • Configurability: High agility for each deployment scenario

Tinder – Intelligent Content Moderation, Matching and Recommendations with Kafka and GenAI

The dating app Tinder is a great example where I can think of tens of use cases for NLP. Tinder talked at a past Kafka Summit about their Kafka-powered machine learning platform.

Tinder is a massive user of Kafka and its ecosystem for various use cases, including content moderation, matching, recommendations, reminders, and user reactivation. They used Kafka Streams as a Kafka-native stream processing engine for metadata processing and correlation in real-time at scale:

Impact of Data Streaming at Tinder

A critical use case in any dating or social platform is content moderation for detecting fakes, filtering sexual content, and other inappropriate things. Content moderation combines NLP and text processing (e.g., for chat messages) with image processing (e.g., selfie uploads), or processes the metadata with Kafka and stores the linked content in a data lake. Both leverage Deep Learning to process high volumes of text and images. Here is what content moderation looks like in Tinder’s Kafka architecture:

Content Moderation at Tinder with Data Streaming and Machine Learning

Plenty of ways exist to process text, images, and videos with the Kafka ecosystem. I wrote a detailed article about handling large messages and files with Apache Kafka to explore the options and trade-offs.

Chatbots could also play a key role the other way around. More and more dating apps (and other social networks) fight spam, fraud, and automated chatbots. Just like building a chatbot, a chatbot detection system can analyze the data streams to block a dating app’s chatbots.

Kafka as Real-Time Data Fabric for Future GenAI Initiatives

Real-time data beats slow data. Generative AI only adds value if it provides accurate and up-to-date information. Data streaming technologies such as Apache Kafka and Apache Flink enable building a reliable, scalable real-time infrastructure for GenAI. Additionally, the event-based heart of the enterprise architecture guarantees data consistency between real-time and non-real-time systems (near real-time, batch, request-response).

Early adopters like BMW, Expedia, and Tinder proved that Generative AI integrated into a Kafka architecture adds enormous business value. The evolution of AI models with ChatGPT et al. makes the use case even more compelling across every industry.

How do you build conversational AI, chatbots, and other GenAI applications leveraging Apache Kafka? What technologies and architectures do you use? Are data streaming and Kafka part of the architecture? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Fraud Detection with Apache Kafka, KSQL and Apache Flink https://www.kai-waehner.de/blog/2022/10/25/fraud-detection-with-apache-kafka-ksql-and-apache-flink/ Tue, 25 Oct 2022 11:38:46 +0000 https://www.kai-waehner.de/?p=4904

Fraud detection becomes increasingly challenging in a digital world across all industries. Real-time data processing with Apache Kafka became the de facto standard to correlate and prevent fraud continuously before it happens. This blog post explores case studies for fraud prevention from companies such as Paypal, Capital One, ING Bank, Grab, and Kakao Games that leverage stream processing technologies like Kafka Streams, KSQL, and Apache Flink.

Stream Processing with Apache Kafka, KSQL and Apache Flink across Industries

Fraud detection and the need for real-time data

Fraud detection and prevention is the adequate response to fraudulent activities in companies (such as embezzlement and the loss of assets because of employee actions).

An anti-fraud management system (AFMS) comprises fraud auditing, prevention, and detection tasks. Larger companies use it as a company-wide system to prevent, detect, and adequately respond to fraudulent activities. These distinct elements are interconnected or exist independently. An integrated solution is usually more effective if the architecture considers the interdependencies during planning.

Real-time data beats slow data across business domains and industries in almost all use cases. But there are few better examples than fraud prevention and fraud detection. It is not helpful to detect fraud in your data warehouse or data lake after hours or even minutes, as the money is already lost. This “too late architecture” increases risk and revenue loss and creates a lousy customer experience.

It is no surprise that most modern payment platforms and anti-fraud management systems implement real-time capabilities with streaming analytics technologies for these transactional and analytical workloads. The Kappa architecture powered by Apache Kafka became the de facto standard replacing the Lambda architecture.

A stream processing example in payments

Stream processing is the foundation for implementing fraud detection and prevention while the data is in motion (and relevant) instead of just storing data at rest for analytics (too late).

No matter what modern stream processing technology you choose (e.g., Kafka Streams, KSQL, Apache Flink), it enables continuous real-time processing and correlation of different data sets. Often, the combination of real-time and historical data helps find the right insights and correlations to detect fraud with a high probability.

Let’s look at a few examples of stateless and stateful stream processing for real-time data correlation with the Kafka-native tools Kafka Streams and ksqlDB. Similarly, Apache Flink or other stream processing engines can be combined with the Kafka data stream. Each option has pros and cons. While Flink might be the better fit for some projects, it is another engine and infrastructure you need to combine with Kafka.

Ensure you understand your end-to-end SLAs and requirements regarding latency, exactly-once semantics, potential data loss, etc. Then use the right combination of tools for the job.

Stateless transaction monitoring with Kafka Streams

A Kafka Streams application, written in Java, processes each payment event in a stateless fashion one by one:

Transaction Monitoring for Fraud Detection with Kafka Streams
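The diagram shows the Java topology. As a rough, hedged equivalent sketched in SQL (the payments stream and the threshold are assumed), a stateless rule evaluates each event in isolation, without windows or state:

-- Illustrative stateless check: each event is evaluated on its own,
-- and suspicious payments are routed to a new stream.
CREATE STREAM suspicious_payments AS
  SELECT *
  FROM payments
  WHERE amount > 10000;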

Stateful anomaly detection with Kafka and KSQL

A ksqlDB application, written with SQL code, continuously analyses the transactions of the last hour per customer ID to identify malicious behavior:

Anomaly Detection with Kafka and KSQL
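A minimal sketch of such a query, assuming a payments stream keyed by customer_id and an arbitrary threshold, could look like this: a hopping window counts the transactions of the last hour per customer and flags unusually heavy activity.

-- Illustrative stateful anomaly detection (names and threshold assumed).
CREATE TABLE possible_fraud AS
  SELECT customer_id,
         COUNT(*) AS txn_count,
         SUM(amount) AS total_amount
  FROM payments
  WINDOW HOPPING (SIZE 1 HOUR, ADVANCE BY 5 MINUTES)
  GROUP BY customer_id
  HAVING COUNT(*) > 10;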

Kafka and Machine Learning with TensorFlow for real-time scoring for fraud detection

A KSQL UDF (user-defined function) embeds an analytic model trained with TensorFlow for real-time fraud prevention:

Fraud Detection with Apache Kafka, KSQL and Machine Learning using TensorFlow
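Conceptually, the query side stays simple SQL; the model lives inside the UDF. A minimal sketch, where FRAUD_PROBABILITY is an assumed user-defined function wrapping the TensorFlow model (all other names are hypothetical, too):

-- Illustrative real-time scoring; FRAUD_PROBABILITY is a hypothetical UDF.
CREATE STREAM scored_payments AS
  SELECT payment_id,
         customer_id,
         FRAUD_PROBABILITY(amount, merchant_id, country_code) AS fraud_score
  FROM payments;

-- Route high-risk events to a dedicated topic for immediate action.
CREATE STREAM fraud_alerts AS
  SELECT * FROM scored_payments
  WHERE fraud_score > 0.8;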

Case studies across industries

Several case studies exist for fraud detection with Kafka. It is usually combined with stream processing technologies, such as Kafka Streams, KSQL, and Apache Flink. Here are a few real-world deployments across industries, including financial services, gaming, and mobility services:

Paypal processes billions of messages with Kafka for fraud detection.

Capital One looks at events as running its entire business (powered by Confluent), where stream processing prevents $150 of fraud per customer on average per year by detecting personally identifiable information (PII) violations in in-flight transactions.

ING Bank started many years ago by implementing real-time fraud detection with Kafka, Flink, and embedded analytic models.

Grab is a mobility service in Asia that leverages fully managed Confluent Cloud, Kafka Streams, and ML for stateful stream processing in its internal GrabDefence SaaS service.

Kakao Games, a South Korean gaming company, uses data streaming to detect anomalies and prevent fraud with 300+ patterns through KSQL.

Let’s explore the latter case study in more detail.

Deep dive into fraud prevention with Kafka and KSQL in mobile gaming

Kakao Games is a South Korea-based global video game publisher specializing in games across various genres for PC, mobile, and VR platforms. The company presented at Current 2022 – The Next Generation of Kafka Summit in Austin, Texas.

Here is a detailed summary of their compelling use case and architecture for fraud detection with Kafka and KSQL.

Use case: Detect malicious behavior by gamers in real-time

The challenge is evident when you understand the company’s history: Kakao Games operates many outsourced games purchased from third-party game studios. Each game has its own log with its own structure and message format. Reliable real-time data integration at scale is required as a foundation for analytical business processes like fraud detection.

The goal is to analyze game logs and telemetry data in real-time. This capability is critical for preventing and remediating threats or suspicious actions from users.

Architecture: Change data capture and streaming analytics for fraud prevention

The Confluent-powered event streaming platform supports game log standardization. ksqlDB analyzes the incoming telemetry data for in-game abuse and anomaly detection.

Gaming Telemetry Analytics with CDC, KSQL and Data Lake at Kakao Games
Source: Kakao Games (Current 2022 in Austin, Texas)

Implementation: SQL recipes for data streaming with KSQL

Kakao Games detects anomalies and prevents fraud with 300+ patterns through KSQL. Use cases include bonus abuse, multiple account usage, account takeover, chargeback fraud, and affiliate fraud.

Here are a few code examples written with SQL code using KSQL:

SQL recipes for fraud detection with Apache Kafka and KSQL at Kakao Games
Source: Kakao Games (Current 2022 in Austin, Texas)
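The actual recipes are only visible as a slide image in the talk. As an illustration of what one such pattern might look like, here is a hypothetical multi-account detection query; the schema and threshold are assumed, not Kakao Games’ real code:

-- Illustrative pattern: flag devices with many distinct accounts per day.
CREATE TABLE multi_account_devices AS
  SELECT device_id,
         COUNT_DISTINCT(account_id) AS account_count
  FROM game_logins
  WINDOW TUMBLING (SIZE 1 DAY)
  GROUP BY device_id
  HAVING COUNT_DISTINCT(account_id) > 5;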

Results: Reduced risk and improved customer experience

Kakao Games can do real-time data tracking and analysis at scale. Business benefits are faster time to market, increased active users, and more revenue thanks to a better gaming experience.

Fraud detection only works in real-time

Ingesting data with Kafka into a data warehouse or a data lake is only part of a good enterprise architecture. Tools like Apache Spark, Databricks, Snowflake, or Google BigQuery enable finding insights within historical data. But real-time fraud prevention is only possible if you act while the data is in motion. Otherwise, the fraud already happened when you detect it.

Stream processing provides a scalable and reliable infrastructure for real-time fraud prevention. The choice of the right technology is essential. However, all major frameworks, like Kafka Streams, KSQL, or Apache Flink, are very good. Hence, the case studies of Paypal, Capital One, ING Bank, Grab, and Kakao Games look different. Still, they have the same foundation with data streaming powered by the de facto standard Apache Kafka to reduce risk, increase revenue, and improve customer experience.

If you want to learn more about streaming analytics with the Kafka ecosystem, check out how Apache Kafka helps in cybersecurity to create situational awareness and threat intelligence and how to learn from a concrete fraud detection example with Apache Kafka in the crypto and NFT space.

How do you leverage data streaming for fraud prevention and detection? What does your architecture look like? What technologies do you combine? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.
