Apache Kafka as Mission Critical Data Fabric for GenAI

Apache Kafka serves thousands of enterprises as the mission-critical and scalable real-time data fabric for machine learning infrastructures. The evolution of Generative AI (GenAI) with large language models (LLM) like ChatGPT changed how people think about intelligent software and automation. This blog post explains the relation between data streaming and GenAI and shows the enormous opportunities and some early adopters of GenAI beyond the buzz.
Data Streaming with Apache Kafka as Data Fabric for GenAI

Generative AI (GenAI) and Data Streaming

Let’s set the context first to have the same understanding of the buzzwords.

[Note: My ChatGPT iPhone app generated this section. I slightly shortened and adjusted the content afterward. GenAI is perfect for summarizing existing content so that authors can spend time on new content (that ChatGPT does not know yet).]

Natural Language Processing (NLP)

ChatGPT, what is NLP?

NLP stands for Natural Language Processing. It is a subfield of artificial intelligence (AI) that focuses on interactions between computers and human language. NLP enables computers to understand, interpret, and generate human language in a meaningful and valuable way.

Natural Language Processing involves a range of tasks, including:

  • Text Parsing and Tokenization: Breaking down text into individual words or tokens.
  • Named Entity Recognition (NER): Identifying and classifying named entities such as people, organizations, locations, and dates in the text.
  • Sentiment Analysis: Determining the sentiment or emotional tone expressed in text, whether positive, negative, or neutral.
  • Machine Translation: Translating text from one language to another.
  • Question Answering: Building systems that can understand and answer questions posed in natural language.
  • Text Generation: Creating human-like text or generating responses to prompts.

NLP is crucial in applications such as chatbots, virtual assistants, language translation, information retrieval, sentiment analysis, and more.

GenAI = Next Generation NLP (and more)

ChatGPT, what is Generative AI?

Generative AI is a branch of artificial intelligence focused on creating models and systems capable of generating new content, such as images, text, music, or even entire virtual worlds. These models are trained on large datasets and learn patterns and structures to generate new outputs similar to the training data. That’s why the widespread buzzword is Large Language Model (LLM).

Generative AI powers next-generation NLP and uses techniques such as generative adversarial networks (GANs), variational autoencoders (VAEs), and recurrent neural networks (RNNs). Generative AI has applications in various fields and industries, including art, design, entertainment, and scientific research.

Apache Kafka for Data Streaming

ChatGPT, what is Apache Kafka?

Apache Kafka is an open-source distributed streaming platform and became the de facto standard for event streaming. It was originally developed at LinkedIn, is now maintained by the Apache Software Foundation, and is widely used for building real-time data streaming applications and event-driven architectures. Kafka provides a scalable and fault-tolerant system for handling high volumes of streaming data.

Open Source Data Streaming in the Cloud

Kafka has a thriving ecosystem with various tools and frameworks that integrate with it, such as Apache Spark, Apache Flink, and others.

Apache Kafka is widely adopted in use cases that require real-time data streaming, such as data pipelines, event sourcing, log aggregation, messaging systems, and more.
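
To make this tangible, here is a minimal sketch of producing and consuming events with the confluent-kafka Python client. The broker address, topic, and consumer group names are placeholder assumptions, not taken from any specific project.

```python
from confluent_kafka import Producer, Consumer

# Publish an event (broker address and topic name are placeholders)
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("customer-events", key="customer-42", value='{"action": "booking_created"}')
producer.flush()

# Subscribe to the same topic and process events as they arrive
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "genai-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["customer-events"])
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    print(msg.key(), msg.value())
```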

Why Apache Kafka and GenAI?

Generative AI (GenAI) is the next-generation NLP engine that powers real-world projects such as service desk automation, customer conversations with chatbots, content moderation in social networks, and many other use cases.

Apache Kafka became the predominant orchestration layer in these machine learning platforms for integrating various data sources, processing at scale, and real-time model inference.

Data streaming with Kafka already powers many GenAI infrastructures and software products. Very different scenarios are possible:

  • Data streaming as data fabric for the entire machine learning infrastructure
  • Model scoring with stream processing for real-time predictions
  • Generation of streaming data pipelines with input text or speech
  • Real-time online training of large language models

Let’s explore these opportunities for data streaming with Kafka and GenAI in more detail.

Real-time Kafka Data Hub for GenAI and other Microservices in the Enterprise Architecture

Back in 2017 (!), I already explored “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka“. At that time, real-world examples came from tech giants like Uber, Netflix, and PayPal.

Today, Apache Kafka is the de facto standard for building scalable and reliable machine learning infrastructures across any enterprise and industry, including:

  • Data integration from various sources (sensors, logs, databases, message brokers, APIs, etc.) using Kafka Connect connectors, fully-managed SaaS integrations, or any kind of HTTP REST API or programming language.
  • Data processing leveraging stream processing for cost-efficient streaming ETL, such as filtering, aggregations, and more advanced calculations while the data is in motion (so that any downstream application gets accurate information).
  • Data ingestion for near real-time data sharing with various data warehouses and data lakes so that each analytics platform can use its own products and tools.

Kafka Machine Learning Architecture for GenAI

Building scalable and reliable end-to-end pipelines is today’s sweet spot of data streaming with Apache Kafka in the AI and Machine Learning space.
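
As an illustration of the data integration step above, the following sketch registers a source connector via the Kafka Connect REST API. The connector class, database connection, and table names are hypothetical placeholders; a real setup depends on the installed connector plugins.

```python
import json
import requests

# Hypothetical JDBC source connector: stream a database table into a Kafka topic
# (host, connector class, and connection settings are illustrative placeholders)
connector = {
    "name": "customer-db-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db.internal:5432/crm",
        "table.whitelist": "customers",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "crm.",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```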

Model Scoring with Stream Processing for Real-Time Predictions at any Scale

Deploying an analytic model in a Kafka application is the solution to provide real-time predictions at any scale with low latency. This is one of the biggest problems in the AI space, as data scientists primarily focus on historical data and batch model training in data lakes.

However, the model scoring for predictions needs to provide much better SLAs regarding scalability, reliability, and latency. Hence, more and more companies separate model training from model scoring and deploy the analytic model within a stream processor such as Kafka Streams, KSQL, or Apache Flink:

Data Streaming and Machine Learning with Embedded TensorFlow Model

Check out my article “Machine Learning and Real-Time Analytics in Apache Kafka Applications” for more details.
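
To show the embedded-model pattern in code, here is a minimal consume-score-produce loop. It assumes a hypothetical pre-trained model file and topic names; the same structure applies whether the model is TensorFlow, ONNX, or scikit-learn, and whether the loop runs in Kafka Streams, Flink, or a plain Kafka client as sketched here.

```python
import json
import joblib
from confluent_kafka import Consumer, Producer

# Load the pre-trained model once at startup (file name is a placeholder)
model = joblib.load("fraud_model.joblib")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "scoring-app",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["payments"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Score the event with the embedded model - no extra hop to a model server
    features = [[event["amount"], event["num_items"]]]
    score = float(model.predict_proba(features)[0][1])
    producer.produce(
        "payment-scores",
        key=msg.key(),
        value=json.dumps({"payment_id": event["id"], "fraud_score": score}),
    )
    producer.poll(0)  # serve delivery callbacks
```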

Dedicated model servers usually only support batch and request-response (e.g., via HTTP or gRPC). Fortunately, many solutions now also provide native integration with the Kafka protocol.

Kafka-native Model Server for Machine Learning and Model Deployment

I explored this innovation in my blog post “Streaming Machine Learning with Kafka-native Model Deployment“.

Development Tools for Generating Kafka-native Data Pipelines from Input Text or Speech

Almost every software vendor discusses GenAI to enhance its development environments and user interfaces.

For instance, GitHub is a platform and cloud-based service for software development and version control using Git. But their latest innovation is “the AI-powered developer platform to build, scale, and deliver secure software”: GitHub Copilot X. Cloud providers like AWS provide similar tools.

Similarly, look at any data infrastructure vendor like Databricks or Snowflake. The latest conferences and announcements focus on embedded capabilities around large language models and GenAI in their solutions.

The same will be true for many data streaming platforms and cloud services. Low-code/no-code tools will add capabilities to generate data pipelines from input text. One of the most straightforward applications that I see coming is generating SQL code out of user text.

For instance, “Consume data from Oracle table customer, aggregate the payments by customer, and ingest it into Snowflake”. This could create SQL code for stream processing technologies like KSQL or Flink SQL, as sketched below.
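
As a sketch of what such a generated pipeline could look like, the statement below is a hypothetical ksqlDB aggregation submitted through the ksqlDB REST API. The stream, column, and endpoint names are assumptions for illustration; a complete pipeline would also need the Oracle source and Snowflake sink connectors.

```python
import requests

# A statement a GenAI assistant might generate from the prompt above
# (stream and column names are illustrative assumptions)
ksql = """
    CREATE TABLE payments_by_customer AS
      SELECT customer_id, SUM(amount) AS total_amount
      FROM payments_stream
      GROUP BY customer_id
      EMIT CHANGES;
"""

# Submit the generated statement to a ksqlDB server
resp = requests.post(
    "http://localhost:8088/ksql",
    json={"ksql": ksql, "streamsProperties": {}},
)
resp.raise_for_status()
```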

Improved developer experience, faster time-to-market, and support for less technical personas are enormous advantages of embedding GenAI into Kafka development environments.

Real-time Training of Large Language Models (LLM)

AI and machine learning are still mostly batch-based systems. Model training takes at least hours. This is not ideal, as many GenAI use cases require accurate and up-to-date information. Imagine googling for information today and not finding anything from the past week. Such a service would be unusable in many scenarios!

Similarly, if I ask ChatGPT today (July 2023): “What is GenAI?” – I get the following response:

As of my last update in September 2021, there is no specific information on an entity called “GenAi.” It’s possible that something new has emerged since then. Could you provide more context or clarify your question so I can better assist you?

The faster your machine learning infrastructure ingests data into model training, the better. My colleague Michael Drogalis wrote an excellent deep-technical blog post: “GPT-4 + Streaming Data = Real-Time Generative AI” to explore this topic more thoroughly.

Real Time GenAI with Data Streaming powered by Apache Kafka

This architecture is compelling because the chatbot will always have your latest information whenever you prompt it. For instance, if your flight gets delayed or your terminal changes, the chatbot will know about it during your chat session. This is entirely distinct from current approaches where you must reload the chat session or wait hours or days for new data to arrive.

LLM + Vector Database + Kafka = Real-Time GenAI

Real-time model training is still a novel approach. Many machine learning algorithms are not ready for continuous online model training today. But combining Kafka with a vector database enables using a batch-trained LLM together with real-time updates feeding up-to-date information into the LLM.

In a few years, nobody will accept an LLM like ChatGPT giving you answers like “I don’t have this information; my model was trained a week ago”. It does not matter whether you choose a brand-new vector database like Pinecone or leverage the new vector capabilities of your existing Oracle or MongoDB storage.

Feed data into the vector database in real time with Kafka Connect and combine it with a mature LLM to enable real-time GenAI with context-specific recommendations.
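
A minimal sketch of this pattern follows. The embedding function, vector store client, and LLM call are hypothetical stand-ins (labeled as such in the code); the point is the division of labor: Kafka keeps the vector database current, and the LLM receives fresh context at prompt time.

```python
import json
from confluent_kafka import Consumer

# Hypothetical helpers - swap in your embedding model, vector database client,
# and LLM of choice; these module and function names are assumptions.
from my_embeddings import embed_text      # e.g., a sentence-transformer wrapper
from my_vector_store import VectorStore   # e.g., Pinecone, MongoDB, Oracle vector search
from my_llm import ask_llm                # e.g., a call to a hosted or local LLM

store = VectorStore(index="flight-events")
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "rag-indexer",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["flight-updates"])

def index_events() -> None:
    """Continuously feed fresh events from Kafka into the vector database."""
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        store.upsert(id=event["flight_id"], vector=embed_text(event["text"]), metadata=event)

def answer(question: str) -> str:
    """Retrieve up-to-date context and pass it to the batch-trained LLM."""
    context = store.query(vector=embed_text(question), top_k=3)
    return ask_llm(question=question, context=context)
```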

Real-World Case Studies for Kafka and GenAI

This section explores how companies across different industries, such as the carmaker BMW, the online travel and booking platform Expedia, and the dating app Tinder, leverage the combination of data streaming and GenAI for reliable real-time conversational AI, NLP, and chatbots powered by Kafka.

Two years ago, I wrote about this topic: “Apache Kafka for Conversational AI, NLP and Chatbot“. But technologies like ChatGPT make it much easier to adopt GenAI in real-world projects with much faster time-to-market and less cost and risk. Let’s explore a few of these success stories for embedding NLP and GenAI into data streaming enterprise architectures.

Disclaimer: As I want to show real-world case studies instead of visionary outlooks, I show several examples deployed in production in the last few years. Hence, the analytic models do not use GenAI, LLM, or ChatGPT as we know it from the press today. But the principles are precisely the same. The only difference is that you could use a cutting-edge model like ChatGPT with much improved and context-specific responses today.

Expedia – Conversations Platform for Better Customer Experience

Expedia is a leading online travel and booking platform. They have many use cases for machine learning. One of my favorite examples is their Conversations Platform, built on Kafka and Confluent Cloud to provide an elastic cloud-native application.

The goal of Expedia’s Conversations Platform was simple: Enable millions of travelers to have natural language conversations with an automated agent via text, Facebook, or their channel of choice. Let them book trips, make changes or cancellations, and ask questions:

  • “How long is my layover?”
  • “Does my hotel have a pool?”
  • “How much will I get charged to bring my golf clubs?”

The platform then takes everything known about that customer across all of Expedia’s brands and applies machine learning models to automatically give customers what they are looking for in real time, whether a straightforward answer or a complex new itinerary.

Real-time Orchestration realized in four Months

Such a platform is no place for batch jobs, back-end processing, or offline APIs. To quickly make decisions that incorporate contextual information, the platform needs data in near real-time, and it needs it from a wide range of services and systems. Meeting these needs meant architecting the Conversations Platform around a central nervous system based on Confluent Cloud and Apache Kafka. Kafka made it possible to orchestrate data from loosely coupled systems, enrich data as it flows between them so that by the time it reaches its destination, it is ready to be acted upon, and surface aggregated data for analytics and reporting.

Expedia built this platform from zero to production in four months. That’s the tremendous advantage of using a fully managed serverless event streaming platform as the foundation. The project team can focus on the business logic.

The Covid pandemic proved the value of an elastic platform: Companies were hit with a tidal wave of customer questions, cancellations, and re-bookings. Throughout this once-in-a-lifetime event, the Conversations Platform proved up to the challenge, auto-scaling as necessary and taking much of the load off live agents.

Expedia’s Migration from MQ to Kafka as Foundation for Real-time Machine Learning and Chatbots
As part of their conversations platform, Expedia needed to modernize their IT infrastructure, as Ravi Vankamamidi, Director of Technology at Expedia Group, explained in a Kafka Summit keynote.
Expedia’s legacy chatbot service relied on an outdated messaging system. It was a question-and-answer board with very limited scope for booking scenarios. The service could only handle two-party conversations and could not scale to bring all the different systems into one architecture for a powerful chatbot that helps with customer conversations.

I explored several times that event streaming is more than just a (scalable) message queue. Check out my old (but still accurate and relevant) Comparison between MQ and Kafka, or the newer comparison between cloud-native iPaaS and Kafka.

Expedia needed a service that was closer to a travel assistant. It had to handle context-specific, multi-party, multi-channel conversations. Hence, features such as natural language processing, translation, and real-time analytics were required. The full service needed to scale across multiple brands. Therefore, a fast and highly scalable platform with ordering guarantees, exactly-once semantics (EOS), and real-time data processing was needed.
The Kafka-native event streaming platform powered by Confluent was the best choice and met all requirements. The new Conversations Platform doubled the Net Promoter Score (NPS) within one year of the rollout, quickly proving its business value.
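
Ordering guarantees and exactly-once semantics map to concrete Kafka client settings. The following sketch shows an idempotent, transactional producer configuration; the transactional ID and topic name are placeholders, not Expedia's actual setup.

```python
from confluent_kafka import Producer

# Idempotent, transactional producer: per-partition ordering plus exactly-once
# writes across the transaction (transactional.id is a placeholder)
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,
    "transactional.id": "conversations-orchestrator-1",
})
producer.init_transactions()

producer.begin_transaction()
producer.produce("conversation-events", key="session-123", value='{"intent": "change_booking"}')
producer.commit_transaction()
```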

BMW – GenAI for Contract Intelligence, Workplace Assistance and Machine Translation

The automotive company BMW presented innovative NLP services at Kafka Summit in 2019. It is no surprise that a carmaker has various NLP scenarios. These include digital contract intelligence, workplace assistance, machine translation, and customer conversations. The latter contains multiple use cases for conversational AI:
  • Service desk automation
  • Speech analysis of customer interaction center (CIC) calls to improve the quality
  • Self-service using smart knowledge bases
  • Agent support
  • Chatbots
The text and speech data is structured, enriched, contextualized, summarized, and translated to build real-time decision support applications. Kafka is a crucial component of BMW’s ML and NLP architecture. The real-time integration and data correlation enable interactive and interoperable data consumption and usage:
NLP Service Framework Based on Data Streaming at BMW
BMW explained the key advantages of leveraging Kafka and its stream processing library Kafka Streams as the real-time integration and orchestration platform:
  • Flexible integration: Multiple supported interfaces for different deployment scenarios, including various machine learning technologies, programming languages, and cloud providers
  • Modular end-to-end pipelines: Services can be connected to provide full-fledged NLP applications.
  • Configurability: High agility for each deployment scenario

Tinder – Intelligent Content Moderation, Matching and Recommendations with Kafka and GenAI

The dating app Tinder is a great example where I can think of tens of use cases for NLP. Tinder talked at a past Kafka Summit about their Kafka-powered machine learning platform.

Tinder is a massive user of Kafka and its ecosystem for various use cases, including content moderation, matching, recommendations, reminders, and user reactivation. They used Kafka Streams as a Kafka-native stream processing engine for metadata processing and correlation in real-time at scale:

Impact of Data Streaming at Tinder
A critical use case in any dating or social platform is content moderation for detecting fakes, filtering sexual content, and other inappropriate material. Content moderation combines NLP and text processing (e.g., for chat messages) with image processing (e.g., for selfie uploads), or it processes the metadata with Kafka and stores the linked content in a data lake. Both approaches leverage deep learning to process high volumes of text and images. Here is what content moderation looks like in Tinder’s Kafka architecture:
Content Moderation at Tinder with Data Streaming and Machine Learning
Plenty of ways exist to process text, images, and videos with the Kafka ecosystem. I wrote a detailed article about handling large messages and files with Apache Kafka to explore the options and trade-offs.
Chatbots also play a key role the other way around: More and more dating apps (and other social networks) fight spam, fraud, and automated bots. Similar to building a chatbot, a bot detection system can analyze the data streams to block unwanted chatbots in a dating app.

Kafka as Real-Time Data Fabric for Future GenAI Initiatives

Real-time data beats slow data. Generative AI only adds value if it provides accurate and up-to-date information. Data streaming technologies such as Apache Kafka and Apache Flink enable building a reliable, scalable real-time infrastructure for GenAI. Additionally, the event-based heart of the enterprise architecture guarantees data consistency between real-time and non-real-time systems (near real-time, batch, request-response).

Early adopters like BMW, Expedia, and Tinder proved that Generative AI integrated into a Kafka architecture adds enormous business value. The evolution of AI models with ChatGPT et al. makes the use case even more compelling across every industry.

How do you build conversational AI, chatbots, and other GenAI applications leveraging Apache Kafka? What technologies and architectures do you use? Are data streaming and Kafka part of the architecture? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Top 5 Data Streaming Trends for 2023

Data Streaming is one of the most relevant buzzwords in tech to build scalable real-time applications in the cloud and innovative business models. Do you wonder about my predicted TOP 5 data streaming trends in 2023 to set data in motion? Check out the following presentation and learn what role Apache Kafka plays. Learn about decentralized Data Mesh, cloud-native lakehouse, data sharing, improved user experience, and advanced data governance.

Some followers might notice that this became a series with past posts about the top 5 data streaming trends for 2021 and the top 5 for 2022. Data streaming with Apache Kafka is a journey and evolution to set data in motion. Trends change over time, but the core value of a scalable real-time infrastructure as the central data hub stays.

Top Use Cases and Architectures for Data Streaming with Apache Kafka in 2023

The research and consulting company Gartner defines the top strategic technology trends every year. This time, the trends are more focused on particular niche concepts. On a higher level, it is all about optimizing, scaling, and pioneering. Here is what Gartner expects for 2023:

Gartner Strategic Technology Trends for 2023
Source: Gartner

It is funny (but not surprising): Gartner’s predictions overlap and complement the five trends I focus on for data streaming with Apache Kafka looking forward to 2023. I explore how data streaming enables better time to market with decentralized optimized architectures, cloud-native infrastructure for elastic scale, and pioneering innovative use cases to build valuable data products.

Hence, here you go with the top 5 trends in data streaming for 2023.

The top 5 data streaming trends for 2023

I see the following topics coming up more regularly in conversations with customers, prospects, and the broader Kafka community across the globe:

  1. Cloud-native lakehouses
  2. Decentralized data mesh
  3. Data sharing in real-time
  4. Improved developer and user experience
  5. Advanced data governance and policy enforcement

The following sections describe each trend in more detail. The end of the article contains the complete slide deck. The trends are relevant for various scenarios, no matter if you use open-source Apache Kafka, a commercial platform, or a fully-managed cloud service like Confluent Cloud.

Kafka as data fabric for cloud-native lakehouses

Many data platform vendors pitch the lakehouse vision today. That’s the same story as the data lake in the Hadoop era with a few new nuances. Put all your data into a single data store to save the world and solve every problem and use case:

One data lake or lakehouse for all data

In the last ten years, most enterprises realized this strategy did not work. The data lake is great for reporting and batch analytics, but not the right choice for every problem. Besides technical challenges, new challenges emerged: data governance, compliance issues, data privacy, and so on.

Applying a best-of-breed enterprise architecture for real-time and batch data analytics using the right tool for each job is a much more successful, flexible, and future-ready approach:

Data Streaming with Apache Kafka as Data Fabric for Databases, Data Lake, and Lakehouse Architectures

Data platforms like Databricks, Snowflake, Elastic, MongoDB, BigQuery, etc., have their sweet spots and trade-offs.

Data streaming is increasingly becoming the real-time data fabric between all the different data platforms and other business applications, leveraging the real-time Kappa architecture instead of the much more batch-focused Lambda architecture.

Decentralized data mesh with valuable data products

Focusing on business value by building data products in independent domains with various technologies is key to success in today’s agile world with ever-changing requirements and challenges. Data mesh came to the rescue and emerged as a next-generation design pattern, succeeding service-oriented architectures and microservices.

Vendors make two main proposals for building a data mesh: Data integration with data streaming enables fully decentralized data products. On the other side, data virtualization provides centralized queries:

Data Mesh with Data Streaming using Apache Kafka vs. Data Virtualization

Centralized queries are simple but do not provide a clean architecture and decoupled domains and applications. It might work well to solve a single problem in a project. However, I highly recommend building a decentralized data mesh with data streaming to decouple the applications, especially for strategic enterprise architectures.

Collaboration within and across organizations in real-time

Collaborating within and outside the organization with data sharing using Open APIs, streaming data exchange, and cluster linking enables many innovative business models:

Stream Data Exchange and Sharing with Data Mesh in Motion

The difference between data streaming and a database, data warehouse, or data lake is crucial: those platforms enable data sharing at rest. The data is stored on disk before it is replicated and shared within the organization or with partners. This is not real-time. You cannot connect a real-time consumer to data at rest.

However, real-time data beats slow data. Hence, sharing data in real-time with data streaming platforms like Apache Kafka or Confluent Cloud enables accurate data as soon as a change happens. A consumer can be real-time, near real-time, or batch. A streaming data exchange puts data in motion within the organization or for B2B data sharing and Open API business models.
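
This consumer flexibility is easy to see in code: independent consumer groups read the same shared stream without affecting each other. The sketch below uses placeholder topic and group names; one consumer reacts in real time, the other replays the retained history for batch analytics.

```python
from confluent_kafka import Consumer

# Two independent consumer groups share the same event stream
realtime = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "alerting-service",
    "auto.offset.reset": "latest",    # react to new events only
})
analytics = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "nightly-analytics",
    "auto.offset.reset": "earliest",  # replay the full retained history
})
realtime.subscribe(["shared-orders"])
analytics.subscribe(["shared-orders"])
```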

AsyncAPI spec for Apache Kafka API schemas

AsyncAPI allows developers to define the interfaces of asynchronous APIs. It is protocol agnostic. Features include:

  • Specification of API contracts (= schemas in the data streaming world)
  • Documentation of APIs
  • Code generation for many programming languages
  • Data governance
  • And much more…

Confluent Cloud recently added a feature for generating an AsyncAPI specification for Apache Kafka clusters.

AsyncAPI and Apache Kafka with Confluent Cloud

We don’t know yet where the market is going. Will AsyncAPI become for data streaming what OpenAPI is for request-response APIs? Maybe. I see increasing demand for this specification from customers. Let’s review the status of AsyncAPI in a few quarters or years. But it has the potential.

Improved developer experience with low-code / no-code tools for Apache Kafka

Many analysts and vendors pitch low code/no code tools. Visual coding is nothing new. Very sophisticated, powerful, and easy-to-use solutions exist as IDE or cloud applications. The significant benefit is time-to-market for developing applications and easier maintenance. At least in theory.

These tools support various personas like developers, citizen integrators, and data scientists. At least in theory.

The reality is that:

  • Code is king
  • Development is about evolution
  • Open platforms win

Low code/no code is great for some scenarios and personas. But it is just one option of many. Let’s look at a few alternatives for building Kafka-native applications:

Kafka API vs Streams vs KSQL vs Visual Coding with Stream Designer

These Kafka-native technologies have their trade-offs. For instance, the Confluent Stream Designer is perfect for building streaming ETL pipelines between various data sources and sinks. Just click the pipeline and transformations together. Then deploy the data pipeline into a scalable, reliable, and fully-managed streaming application. The difference to separate tools like Apache NiFi is that the generated code runs in the same streaming platform, i.e., one infrastructure end-to-end. This makes ensuring SLAs and latency requirements much more manageable and the whole data pipeline more cost-efficient.

However, the simpler a tool is, the less flexible it is. It is as simple as that, no matter which product or vendor you look at. This is not just true for Kafka-native tools.

And you are flexible with your tool choice per project or business problem. Add your favorite non-Kafka stream processing engine to the stack, for instance, Apache Flink. Or use a separate iPaaS middleware like Dell Boomi or SnapLogic.

Domain-driven design with dumb pipes and smart endpoints

The real benefit of data streaming is the freedom of choice for your favorite Kafka-native technology, open-source stream processing framework, or cloud-native iPaaS middleware.

Choose the proper library, tool, or SaaS for your project. Data streaming enables a decoupled domain-driven design with dumb pipes and smart endpoints:

Decentralized Data Mesh powered by data streaming and Apache Kafka

Data streaming with Apache Kafka is perfect for domain-driven design (DDD). In contrast, commonly used point-to-point microservice architectures based on HTTP/REST web services or push-based message brokers like RabbitMQ create much stronger dependencies between applications.

Data governance across the data streaming pipeline

An enterprise architecture powered by data streaming enables easy access to data in real-time. Many enterprises leverage Apache Kafka as the central nervous system between all data sources and sinks.

The consequence of being able to access all data easily across business domains is two conflicting pressures on organizations: Unlock the data to enable innovation versus Lock up the data to keep it safe.

Achieving data governance across the end-to-end data streams with data lineage, event tracing, policy enforcement, and time travel to analyze historical events is critical for strategic data streaming in the enterprise architecture. Data governance on top of the streaming platform is required for end-to-end visibility, compliance, and security:

Data governance for streaming data with lineage, catalog, quality, policy management

Policy enforcement with schemas and API contracts

The foundation for data governance is the management of API contracts (so-called schemas in data streaming platforms like Apache Kafka). Solutions like Confluent enforce schemas along the data pipeline, including data producer, server, and consumer:

Data Governance and Policy Enforcement in Apache Kafka with Schema and API Contracts

Additional data governance tools like data lineage, data catalogs, or policy enforcement are built on this foundation. The recommendation for any serious data streaming project is to use schemas from the beginning. They might seem unnecessary for the first pipeline. But subsequent producers and consumers need a trusted environment with enforced policies to establish a decentralized data mesh architecture with independent but connected data products.
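
As a small illustration of schema enforcement on the producer side, here is a hedged sketch using Confluent Schema Registry with an Avro serializer. The schema, topic, and registry URL are placeholder assumptions; serialization fails fast if an event violates the registered contract.

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Avro schema acting as the API contract for the "payments" topic (placeholder fields)
schema_str = """
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "customer_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = AvroSerializer(registry, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})
payment = {"customer_id": "42", "amount": 19.99}
producer.produce(
    "payments",
    value=serializer(payment, SerializationContext("payments", MessageField.VALUE)),
)
producer.flush()
```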

Slides and Video for Data Streaming Use Cases in 2023

Here is the slide deck from my presentation:


And here is the free on-demand video recording:

Video Recording - Top Trends for Data Streaming in 2023

Data streaming goes up in the maturity curve in 2023

It is still an early stage for data streaming in most enterprises. But the discussion goes beyond questions like “when to use Kafka?” or “which cloud service to use?”… In 2023, most enterprises look at more sophisticated challenges around their numerous data streaming projects.

The new trends are often related to each other. A data mesh enables the building of independent data products that focus on business value. Data sharing is a fundamental requirement for a data mesh. New personas access the data stream. Often, citizen developers or data scientists need easy tools to pioneer new projects. The enterprise architecture requires and enforces data governance across the pipeline for security, compliance, and privacy reasons.

Scalability and elasticity need to be there out of the box. Fully-managed data streaming is a brilliant opportunity for getting started in 2023 and moving up in the maturity curve from single projects to a central nervous system of real-time data.

What are your most relevant and exciting trends for data streaming and Apache Kafka in 2023 to set data in motion? What are your strategy and timeline? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.
