Big Data Archives - Kai Waehner
https://www.kai-waehner.de/blog/tag/big-data/

Tesla Energy Platform – The Power of Data Streaming with Apache Kafka
https://www.kai-waehner.de/blog/2025/02/14/tesla-energy-platform-the-power-of-data-streaming-with-apache-kafka/
Fri, 14 Feb 2025 08:17:37 +0000

Tesla’s Virtual Power Plant (VPP) turns thousands of home batteries, solar panels, and energy storage systems into a coordinated, intelligent energy network. By leveraging Apache Kafka for event streaming and WebSockets for real-time IoT connectivity, Tesla enables instant energy redistribution, dynamic grid balancing, and automated market participation. This event-driven architecture ensures millisecond-level decision-making, allowing homeowners to optimize energy usage and utilities to stabilize power grids. Tesla’s approach highlights how real-time data streaming and intelligent automation are reshaping the future of decentralized, resilient, and sustainable energy systems.

Tesla’s Virtual Power Plant (VPP) is revolutionizing the energy sector by connecting home batteries, solar panels, and grid-scale storage into a real-time, intelligent energy network. Powered by Apache Kafka for event streaming and WebSockets for last-mile IoT integration, Tesla’s Energy Platform enables real-time energy trading, grid stabilization, and seamless market participation. By leveraging data streaming and automation, Tesla optimizes battery efficiency, prevents blackouts, and allows homeowners to monetize excess energy—all while making renewable energy more reliable and scalable. This software-driven approach showcases the power of real-time data in building the future of sustainable energy.


Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases across all industries.

What is a Virtual Power Plant?

A Virtual Power Plant (VPP) is a network of decentralized energy resources—such as home batteries, solar panels, and smart grid systems—that function as a single unit. Unlike a traditional power plant that generates electricity from a centralized location, a VPP aggregates power from many small, distributed sources. This allows energy to be dynamically stored and shared, helping to balance supply and demand in real time.

VPPs are crucial in the shift to renewable energy. The traditional power grid was designed around fossil fuel plants that could easily adjust output. Renewable energy sources like solar and wind are intermittent—they don’t generate power on demand. By connecting thousands of batteries and solar panels in homes and businesses, a VPP can smooth out fluctuations in power generation and consumption. This prevents blackouts, reduces energy costs, and enables homes and businesses to participate in energy markets.

How Tesla’s Virtual Power Plant Fits Its Business Model

Tesla is not just an automaker. It is a sustainable energy company. Tesla’s product ecosystem includes electric vehicles, solar panels, home batteries (Powerwall), grid-scale energy storage (Megapack), and energy management software (Autobidder).

The Tesla Virtual Power Plant (VPP) ties all these elements together. Homeowners with Powerwalls store excess solar power during the day and feed it back to the grid when needed. Tesla’s Autobidder software automatically optimizes energy use and market participation, turning home batteries into revenue-generating assets.

For Tesla, the VPP strengthens its energy business, creating a scalable model that maximizes battery efficiency, stabilizes grids, and expands the role of software in energy markets. Tesla is not just selling batteries; it is selling energy intelligence.

Virtual Energy Platform and ESG (Environmental, Social, and Governance) Goals

Tesla’s energy platform is a perfect example of how data streaming and real-time decision-making align with ESG principles:

  • Environmental Impact: VPPs reduce reliance on fossil fuels by making renewable energy more reliable.
  • Social Benefit: By enabling energy independence, VPPs provide power during outages and extreme weather conditions.
  • Governance & Regulation: VPPs allow consumers to participate in energy markets, fostering decentralized energy ownership.

Tesla’s approach is smart grid innovation at scale: real-time data makes the grid more dynamic, efficient, and resilient.

My article “Green Data, Clean Insights: How Apache Kafka and Flink Power ESG Transformations” covers other real-world data streaming deployments in the energy sector like EON.

Tesla’s Energy Platform: A Network of Connected Home Energy Systems

Tesla’s VPP connects thousands of homes with Powerwalls, solar panels, and grid services. These systems work together to provide electricity on demand, reacting to supply fluctuations in real-time.

Key Functions of Tesla’s VPP:

  1. Energy Storage & Redistribution: Batteries store solar energy during the day and discharge at night or during peak demand.
  2. Grid Stabilization: The VPP balances energy supply and demand to prevent outages and fluctuations.
  3. Market Participation: Homeowners can sell excess power back to the grid, monetizing their batteries.
  4. Disaster Resilience: The VPP provides backup power during blackouts, storms, and grid failures.

This requires real-time data processing at massive scale—something traditional batch-based data architectures cannot handle.

Apache Kafka and Real-Time Data Streaming at Tesla

Tesla operates in many domains—automotive, energy, and AI. Across all these areas, Apache Kafka plays a critical role in enabling real-time data movement and stream processing.

In 2018, Tesla already processed trillions of IoT messages with Apache Kafka:

Tesla Automotive Journey from RabbitMQ to Apache Kafka for IoT Events
Source: Tesla

Tesla leverages stream processing to handle trillions of IoT events daily, using Apache Kafka to ingest, process, and analyze data from its vehicle fleet in real time. By implementing efficient data partitioning, fast and slow data lanes, and scalable infrastructure, Tesla optimizes vehicle performance, predicts failures, and enhances manufacturing efficiency.

These strategies demonstrate how real-time data streaming is essential for managing large-scale IoT ecosystems, ensuring low-latency insights while maintaining operational stability. To learn more about these use cases, read Tesla’s blog post “Stream Processing with IoT Data: Challenges, Best Practices, and Techniques”.

The following sections explore Tesla’s innovation for its virtual power plant, as discussed in an excellent presentation at QCon.

Tesla Energy Platform: Architecture of the Virtual Power Plant Powered by Apache Kafka

Tesla’s VPP uses Apache Kafka for:

  1. Telemetry Ingestion: Streaming data from millions of Powerwalls, solar panels, and Megapacks into the cloud.
  2. Command & Control: Sending real-time control commands to batteries and grid services.
  3. Market Participation: Autobidder analyzes real-time data and adjusts energy prices dynamically.

The event-driven architecture allows Tesla to react to energy demand in milliseconds—critical for balancing the grid.
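To make the telemetry ingestion described above more tangible, here is a minimal sketch in Python using the confluent-kafka client. This is not Tesla’s actual code; the broker address, topic name, and event fields are illustrative assumptions.

```python
# Hypothetical example: publish one battery telemetry event to a Kafka topic.
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_telemetry(device_id: str, state_of_charge: float, power_kw: float) -> None:
    event = {
        "device_id": device_id,
        "state_of_charge": state_of_charge,  # battery charge level in percent
        "power_kw": power_kw,                # positive = discharging, negative = charging
        "timestamp_ms": int(time.time() * 1000),
    }
    # Keying by device_id keeps all events of one battery ordered within a partition.
    producer.produce("powerwall.telemetry.raw", key=device_id, value=json.dumps(event))

publish_telemetry("powerwall-42", state_of_charge=81.5, power_kw=-3.2)
producer.flush()
```

Keying events by device ID is a common pattern: it preserves per-device ordering while still scaling the topic across many partitions.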

Tesla’s Energy Platform is the software foundation of the VPP. It integrates OT (Operational Technology), IoT (Internet of Things), and IT (Information Technology) to control distributed energy assets.

Tesla Applications Built on the Energy Platform

Tesla’s Energy Platform powers a suite of applications that optimize energy management, market participation, and grid stability through real-time data streaming and automation.

Autobidder

  • Optimizes energy trading in real time.
  • Automatically bids into energy markets.

I wrote about other data streaming success stories for energy trading with Apache Kafka and Flink, including Uniper, re.alto and Powerledger.

Distributed Virtual Power Plant

  • Aggregates thousands of Powerwalls into a single energy asset.
  • Provides grid stabilization and peak load balancing.

If you are interested in other smart grid infrastructures, check out “Apache Kafka for Smart Grid, Utilities and Energy Production“. The article covers how data streaming realizes IT/OT integration, as well as some hybrid cloud IoT deployments.

Battery Control (Command & Control)

  • Ensures optimal charging and discharging of batteries.
  • Minimizes costs while maximizing energy efficiency.

Market Participation

  • Allows homeowners and businesses to profit from energy markets.
  • Ensures seamless grid integration of Tesla’s energy products.

Key Components of Tesla’s Energy Platform: Apache Kafka, WebSockets, Akka Streams

The combination of data streaming with Apache Kafka and the last-mile IoT integration via WebSockets builds the central nervous system of Tesla’s Energy Platform:

  1. Apache Kafka (Event Streaming):
    • Streams telemetry data from Powerwalls every second.
    • Ensures durability and reliability of data streams.
    • Allows real-time energy aggregation across thousands of homes.
  2. WebSockets (Last-Mile IoT Integration):
    • Provides low-latency bidirectional communication with Powerwalls.
    • Used to send real-time commands to home batteries.
  3. Pub/Sub (Command & Control):
    • Enables distributed energy resource coordination.
    • Ensures resilient messaging between systems.
  4. Business Logic (Applications & Microservices):
    • Tesla’s services are built with Scala and Python.
    • Uses gRPC & HTTP for inter-service communication.
  5. Digital Twins (Real-Time State Management):
    • Digital models of physical assets ensure real-time decision-making.
    • Tesla uses Akka Streams for stateful event processing.
  6. Kubernetes (Cloud Infrastructure):
    • Ensures scalability and resilience of Tesla’s energy microservices.
Tesla Virtual Power Plant Energy Architecture Using Apache Kafka WebSockets and Akka Streams
Source: Tesla
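The interplay of the first two components can be sketched as a simple bridge process that receives device telemetry over a WebSocket connection and forwards every message into Kafka. This is a hedged illustration, not Tesla’s implementation; the WebSocket URL, topic name, and payload format are assumptions.

```python
# Hypothetical WebSocket-to-Kafka bridge for last-mile IoT integration.
import asyncio
import json

import websockets
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

async def bridge(uri: str = "wss://example.com/powerwall/telemetry") -> None:
    async with websockets.connect(uri) as ws:   # low-latency, bidirectional channel
        async for message in ws:                # each message is one telemetry event
            event = json.loads(message)
            producer.produce(
                "powerwall.telemetry.raw",
                key=event["device_id"],
                value=message,
            )
            producer.poll(0)                    # serve delivery callbacks without blocking

if __name__ == "__main__":
    asyncio.run(bridge())
```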

Interesting side note: While most energy companies I have seen rely on Kafka Streams or Apache Flink for stateful event processing, Tesla takes an interesting approach by leveraging Akka Streams (based on Akka’s Actor Model) to manage real-time digital twins of its energy assets. This choice provides fine-grained control over streaming workflows, but unlike Kafka Streams or Flink, Akka lacks widespread community adoption, making it a less common choice for many large-scale energy platforms. Kafka and Flink are a match made in heaven for most data streaming use cases.
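Conceptually, a digital twin is just the continuously updated state of a physical asset, keyed by its ID. The following Python sketch shows the idea with a plain in-memory dictionary; Tesla implements this with Akka Streams and the actor model, so treat the topic name and fields as assumptions for illustration.

```python
# Hypothetical digital-twin example: keep the latest known state per battery.
import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "digital-twin-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["powerwall.telemetry.canonical"])

battery_twins: dict[str, dict] = {}  # device_id -> last known state

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # The "twin" is simply the most recent state per device; a production system
        # adds timers, command handling, and supervision on top of this.
        battery_twins[event["device_id"]] = event
finally:
    consumer.close()
```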

Best Practice: Shift Left Architecture with Data Products for High-Volume IoT Data

Tesla leverages several data processing best practices to improve efficiency and consistency:

  • Canonical Kafka Topics: Data is filtered and structured at the source.
  • Consistent Downstream Services: Every consumer gets clean, structured data.
  • Real-Time Aggregation of Thousands of Batteries: A unique challenge that forms the foundation of the virtual power plant.

This data-first approach ensures Tesla’s energy platform can scale to millions of distributed assets.

Today, many people refer to the Shift Left Architecture when applying these best practices for processing data efficiently and continuously to provide data products in real time with good quality:

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse


In Tesla’s Energy Platform, the data comes from IoT interfaces. WebSockets provide the last-mile integration and feed the events into the data streaming platform for continuous processing before the ingestion into the operational and analytical applications.
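A minimal sketch of this shift-left idea: consume raw IoT events, drop malformed records, keep only the fields downstream consumers need, and publish the result to a canonical topic. Topic names, field names, and the JSON format are assumptions for illustration; Tesla’s actual pipeline differs.

```python
# Hypothetical "shift left" processor: raw topic in, canonical data product out.
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "canonicalizer",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["powerwall.telemetry.raw"])

REQUIRED_FIELDS = ("device_id", "state_of_charge", "power_kw", "timestamp_ms")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        raw = json.loads(msg.value())
    except json.JSONDecodeError:
        continue  # filter bad data at the source instead of downstream
    if not all(field in raw for field in REQUIRED_FIELDS):
        continue
    canonical = {field: raw[field] for field in REQUIRED_FIELDS}
    producer.produce(
        "powerwall.telemetry.canonical",
        key=canonical["device_id"],
        value=json.dumps(canonical),
    )
    producer.poll(0)
```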

Tesla’s Energy Vision: How Streaming Data Will Shape Tomorrow’s Power Grids

Tesla’s Virtual Power Plant is not just about batteries—it’s about software, real-time data, and automation.

Why Data Streaming Matters for Tesla’s Energy Platform:

  1. Scalability: Can handle millions of energy devices.
  2. Resilience: Works even when devices go offline.
  3. Real-Time Decision Making: Adjusts energy distribution within milliseconds.
  4. Market Optimization: Autobidder ensures maximum revenue for battery owners.

Tesla’s VPP is a blueprint for the future of energy—one where real-time data streaming and intelligent software optimize renewable energy. By leveraging Apache Kafka, WebSockets, and stream processing, Tesla is redefining how energy is generated, distributed, and consumed.

This is not just an innovation in power generation—it’s an AI-driven energy revolution.

How do you leverage data streaming in the energy and automotive sector? Follow me on LinkedIn or X (formerly Twitter) to stay in touch and discuss. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter. And make sure to download my free book about data streaming use cases across all industries.

When NOT to Use Apache Kafka? (Lightboard Video)
https://www.kai-waehner.de/blog/2024/03/26/when-not-to-use-apache-kafka-lightboard-video/
Tue, 26 Mar 2024 06:45:11 +0000

Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post contains a lightboard video that gives you a twenty-minute explanation of the DOs and DONTs.

When NOT to Use Apache Kafka?

Disclaimer: This blog post shares a lightboard video to watch an explanation about when NOT to use Apache Kafka. For a much more detailed and technical blog post with various use cases and case studies, check out this blog post from 2022 (which is still valid today – whenever you read it).

What is Apache Kafka, and what is it NOT?

Kafka is often misunderstood. For instance, I still hear way too often that Kafka is a message queue. Part of the reason is that some vendors only pitch it for a specific problem (such as data ingestion into a data lake or data warehouse) to sell their products. So, in short:

Kafka is…

  • a scalable real-time messaging platform to process millions of messages per second.
  • a data streaming platform for massive volumes of big data analytics and small volumes of transactional data processing.
  • a distributed storage layer that provides true decoupling for backpressure handling, support of various communication protocols, and replayability of events with guaranteed ordering.
  • a data integration framework (Kafka Connect) for streaming ETL.
  • a data processing framework (Kafka Streams) for continuous stateless or stateful stream processing.

This combination of characteristics in a single platform makes Kafka unique (and successful).

Kafka is NOT…

  • a proxy for millions of clients (like mobile apps) – but Kafka-native proxies (like REST or MQTT) exist for some use cases.
  • an API Management platform – but these tools are usually complementary and used for the creation, life cycle management, or the monetization of Kafka APIs.
  • a database for complex queries and batch analytics workloads – but good enough for transactional queries and relatively simple aggregations (especially with ksqlDB).
  • an IoT platform with features such as device management  – but direct Kafka-native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the appropriate approach for (some) use cases.
  • a technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different kind of software!

For these reasons, Kafka is complementary, not competitive, to these other technologies. Choose the right tool for the job and combine them!

Lightboard Video: When NOT to use Apache Kafka

The following video explores the key concepts of Apache Kafka. Afterwards, the DOs and DONTs of Kafka show how to complement data streaming with other technologies for analytics, APIs, IoT, and other scenarios.

Data Streaming Vendors and Cloud Services

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations.

Plenty of vendors offer Kafka platforms and cloud services. Many complementary open source stream processing frameworks like Apache Flink and related cloud offerings emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to get market share leveraging the Kafka protocol. Check out the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook to potential new entrants in 2025.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

Apache Kafka is a Data Streaming Platform: Combine it with other Platforms when needed!

Over 150,000 organizations use Apache Kafka today. The Kafka protocol is the de facto standard for many open source frameworks, commercial products and serverless cloud SaaS offerings.

However, Kafka is not an all-rounder for every use case. Many projects combine Kafka with other technologies, such as databases, data lakes, data warehouses, IoT platforms, and so on. Additionally, Apache Flink is becoming the de facto standard for stream processing (but Kafka Streams is not going away and is the better choice for specific use cases).

Where do you (not) use Apache Kafka? What other technologies do you combine Kafka with? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Why Tiered Storage for Apache Kafka is a BIG THING…
https://www.kai-waehner.de/blog/2023/12/05/why-tiered-storage-for-apache-kafka-is-a-big-thing/
Tue, 05 Dec 2023 05:38:36 +0000

Apache Kafka added Tiered Storage to separate compute and storage. The capability enables more scalable, reliable and cost-efficient enterprise architectures. This blog post explores the architecture, use cases, benefits, and a case study for storing Petabytes of data in the Kafka commit log. The end discusses why Tiered Storage does NOT replace other databases and how Apache Iceberg might change future Kafka architectures even more.

Tiered Storage for Apache Kafka - Use Cases Architecture Benefits


If you prefer watching a ten-minute video, check out this summary about the “Evolution of Storage for Apache Kafka covering Tiered Storage, Direct Write to Object Storage and the relation to Open Table Formats such as Apache Iceberg”:

Now, let’s explore why Tiered Storage for Apache Kafka is a BIG THING:

Compute vs. Storage vs. Tiered Storage

Let’s define the terms compute, storage, and tiered storage to have the same understanding when exploring this in the context of the data streaming platform Apache Kafka.

Compute and Storage

Two fundamental components of a computing system are compute and storage. They serve different purposes in information processing.

Compute refers to the processing power and capability of a computer system to perform tasks, execute instructions, and carry out computations. The compute component includes the CPU (Central Processing Unit) and GPU (Graphics Processing Unit).

Storage refers to the components and systems that store and retrieve data over the long term. It is where data is persistently maintained for later use. Storage includes devices such as hard disk drives (HDDs), solid-state drives (SSDs), and other types of non-volatile memory, such as databases that keep data even when the power is turned off.

Tiered Storage

Tiered storage refers to a storage architecture that uses different classes or tiers of storage (e.g., Object Storage on S3) to efficiently manage and store data based on its access patterns, performance requirements, and cost considerations.

The goal of tiered storage is to optimize the use of storage resources, balancing performance and cost, by placing data on the most suitable storage media based on its characteristics and the organization’s policies.

Data placement and movement between these tiers can be automated based on policies and algorithms that analyze usage patterns, access frequency, and other factors. This ensures that the most critical and frequently accessed data lives in high-performance storage, while less critical or infrequently accessed data is moved to lower-cost, lower-performance storage.

Long-Term Storage in Apache Kafka

Apache Kafka is an open-source distributed streaming platform that is used for building real-time data pipelines and streaming applications. Kafka is the established de facto standard for data streaming. The event streaming platform handles large volumes of data, providing a scalable and fault-tolerant architecture.

Applications and data stores use Kafka for ingesting, storing, and processing real-time data streams, making it a fundamental component in building event-driven architectures and systems that require the processing of continuous data flows. Additionally, many use cases leverage Kafka not just for real-time data but to ensure data consistency across real-time, batch, and request-response APIs.

Use Cases for Apache Kafka as Storage System

While most people think about Kafka as a message broker, real-time analytics platform, or big data ingestion system, the distributed commit log with ordering guarantees and timestamps enables plenty of use cases for accessing data long after its creation or replaying historical data.

Use Cases for Replaying Historical Events with Apache Kafka

Here are a few examples of use cases that leverage long-term storage of data in Kafka (a replay sketch follows after the list):

  • New consumer: Deploy a new application / database / data warehouse, data lake and synchronize the state of the business objects.
  • Offloading: Reducing cost significantly by NOT consuming again and again from expensive or non-scalable systems (e.g. mainframe and MIPS)
  • Error-handling: Re-process historical data after fixing an issue in the business logic.
  • Compliance / regulatory processing: Replay historical data to analyze an incident.
  • Query and analyze existing events: Consume data from a notebook for data engineering, analytics, or reporting.
  • Schema changes in analytics platform: Re-process data after updating data contracts.
  • Model training: Batch ingestion into an AI framework to apply a machine learning algorithm
  • Disaster recovery: Operational data stores replay data again from the persistent commit log in the case of a failure.
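As a hedged example of such a replay, the following Python snippet (using the confluent-kafka client) starts consuming a topic from a given point in time, for instance to re-process events after a bug fix or for a compliance investigation. The topic name and timestamp are illustrative assumptions.

```python
# Hypothetical replay: consume a topic from a chosen timestamp onwards.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-incident-analysis",
    "enable.auto.commit": False,
})

topic = "payments"
replay_from_ms = 1701734400000  # 2023-12-05 00:00:00 UTC

# Find the first offset at or after the timestamp for each partition,
# then start consuming from exactly there.
metadata = consumer.list_topics(topic)
partitions = [
    TopicPartition(topic, p, replay_from_ms)
    for p in metadata.topics[topic].partitions
]
offsets = consumer.offsets_for_times(partitions, timeout=10)
consumer.assign(offsets)

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        break
    print(msg.topic(), msg.partition(), msg.offset(), msg.value())  # re-processing logic goes here
```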

Objections for Storing Data Long-Term in Kafka

Storing data long term in Kafka has a few drawbacks. The following arguments are valid concerns:

  • Cost: Storing large volumes of data on attached disks is much more expensive than external storage systems like an object store.
  • Scalability: Operating Kafka brokers with lots of data (say many gigabytes, or even terabytes, and more) is challenging, especially in the case of failures when you need to rebalance partitions.
  • Risk: Downtime or data inconsistencies happen if operations struggle with large volumes or when hardware needs to be migrated.

Therefore, you should NOT store big data sets in Kafka without Tiered Storage! With this in mind, let’s explore how Tiered Storage for Kafka solves these problems.

Introducing Tiered Storage for Apache Kafka

Apache Kafka’s backend is a distributed system running Kafka brokers. Each Kafka broker has processing and storage capabilities.

The applications are producers and consumers of events. Many interfaces communicate with Kafka brokers:

  • An application written in Java, Python, C++, Go, or any other programming language
  • A Kafka Connect source or sink connector connecting to IBM MQ, Spark, Snowflake, or any other data store or SaaS application
  • A stream processor built with Kafka-native Kafka Streams, KSQL, or external infrastructures like Apache Flink
  • Any other endpoint, like an HTTP interface or an out-of-the-box integration of another middleware or data platform

What is Tiered Storage for Kafka?

Tiered storage for Apache Kafka refers to the capability of configuring different storage tiers to optimize the storage infrastructure based on the access patterns and requirements of the data stored in Kafka brokers.

A Kafka cluster stores data in Kafka Topics. These topics can have different characteristics in terms of importance, access frequency, and retention policies.

The concept is like the general idea of tiered storage in storage systems, but it’s adapted to the specific needs of Kafka. Tiered Storage is one critical piece in making the Kafka architecture cloud-native.

Kafka Architecture without Tiered Storage

Kafka applications communicate with logical Kafka Topics to produce messages to or consume messages from partitions:

Apache Kafka Architecture without Tiered Storage

The storage is a disk attached to the broker. This can be HDD or SSD disks on-premise or, for example, EBS volumes in the AWS cloud.

Kafka Architecture with Tiered Storage

Tiered Storage for Kafka does NOT change how applications communicate with Kafka brokers. Tiered Storage is an implementation detail:

Apache Kafka Architecture with Tiered Storage

Besides the disks attached to the broker, Kafka offloads data to an external storage. Most times, this is an object storage like Amazon S3, Azure Blob Storage, Google Cloud Storage, or MinIO for Kubernetes.

Serverless cloud offerings handle the offloading for the operator. Self-managed solutions allow operators to configure hot and cold storage durations for each Kafka Topic.
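To illustrate how this can look for a self-managed topic, here is a hedged sketch that creates a topic with tiered storage enabled via the Kafka Admin API in Python. It assumes the brokers already have remote storage configured (Kafka 3.6+ with KIP-405 or a vendor equivalent); exact configuration names and defaults can differ between distributions, so treat the values as illustrative assumptions.

```python
# Hypothetical topic with one year of total retention but only one day on broker disks.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "powerwall.telemetry.canonical",
    num_partitions=12,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",                  # offload closed segments to object storage
        "retention.ms": str(365 * 24 * 60 * 60 * 1000),   # keep one year overall
        "local.retention.ms": str(24 * 60 * 60 * 1000),   # keep only one day on local disks
    },
)

for name, future in admin.create_topics([topic]).items():
    future.result()  # raises an exception if topic creation failed
    print(f"created topic {name}")
```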

Benefits of Tiered Storage for Apache Kafka

Let’s review the above-discussed objections to storing big data sets long-term in Kafka and how Tiered Storage helps:

  • Reduced cost: Most data is offloaded to an external storage. This reduces the storage cost significantly.
  • Improved scalability: Only data on the disks attached to the Kafka brokers must be rebalanced. As most data is offloaded, rebalancing only takes seconds or minutes, even if the external storage holds petabytes.
  • Reduced risk: Better scalability and separation of compute and storage makes operations much easier and significantly reduces the risk of downtime or data inconsistency.

The Implementation of Tiered Storage in Apache Kafka

Tiered Storage for Apache Kafka is available. However, be aware that different implementations exist with different features, maturity, and support levels.

And open source Apache Kafka only provides the interface for tiered storage. You must choose an open source implementation, build your own integration into an external storage system, or leverage a commercial product or cloud service that embeds tiered storage into its offering.

Keep in mind that the interface alone is not helpful. The implementation needs to be battle-tested and guarantee data consistency across the hot storage on the broker and cold storage in the external storage; even in the case of failure, network issues, etc.

Kafka consumers do not see the implementation details of Kafka’s Tiered Storage. They simply consume as if there were no tiered storage implementation (and still expect the same behavior). There are no API or code changes needed in Kafka client applications. Hence, you can easily migrate an existing deployment to a Kafka cluster leveraging Tiered Storage.

Many people ask about the performance impact of tiered storage for Kafka. The short answer: There is no performance impact for most scenarios. Real-time consumers consume from the memory / page cache as before. And replaying historical data from the event log does not differ much from the local disk or the remote object-store.

Apache Kafka 3.6 Release Makes Tiered Storage Available

When writing this blog post (December 2023), KIP-405: Kafka Tiered Storage is available as early access in Apache Kafka 3.6. This release introduces Tiered Storage to Kafka. This release is only for non-production environments (see the early access notes for more information).

GA of this feature is just a foreseeable matter of time. The bulk of KIP-405 was part of early access in release 3.6. But there are a few additional features that are slated for 3.7. And GA likely comes after that in 3.8+.

KIP-405 Provides a Pluggable Storage API for Tiering

KIP-405 separates compute and storage in the Kafka broker and introduces pluggable storage tiering natively into Kafka, bringing a seamless storage extension to remote object stores with minimal operational changes.

Apache Kafka’s LocalTieredStorage default implementation is a local file-based RemoteStorageManager. LocalTieredStorage facilitates the simulation of remote storage behavior in a controlled and isolated environment during testing. This is not meant for production use cases! Enterprises need to write their own implementation, embed an open-source alternative, or trust a software vendor or cloud service.

How Confluent, Uber, and Others use Tiered Storage

KIP-405 is only available in preview with Kafka 3.6. But some proprietary implementations have already existed for years in production. This also helped to define the KIP with lessons learned from running Kafka in production with tiered storage under the hood.

Implementation details of tiered storage for Kafka vary, and there may be different approaches or tools available to achieve this, depending on the specific Kafka distribution or storage infrastructure being used. Organizations might also use external systems or cloud storage solutions to implement tiered storage strategies for Kafka.

Confluent pioneered tiered storage for Kafka and has provided the capability for several years already. It is available for the self-managed Confluent Platform and the fully managed Confluent Cloud in AWS, Azure, and GCP. Confluent chose the S3 interface to implement storage support for the cloud providers (AWS, Azure, GCP) and several on-premise solutions like PureStorage Flash Blade, Nutanix Objects, Netapp Object Storage, Dell EMC ECS, Hitachi Content Platform Object Storage, or MinIO for Kubernetes.

Uber, which led the implementation of KIP-405 in open source Apache Kafka, runs its tiered storage against HDFS. Confluent and AWS contributed to refactoring, best practices and performance / integration testing. Satish Duggana, tech lead for Data and Streaming Infrastructure at Uber, presented the details of their implementation and deployment in a talk at Current 2023.

Other vendors like AWS with MSK and Aiven are adopting KIP-405 and provide their own tiered storage implementations these days.

Case Study: KOR Financial stores 160 Petabytes in Kafka for Regulatory Reporting

KOR is a cloud-native family of global trade repositories and regulatory reporting services that has adopted Confluent Cloud and a data streaming architecture to improve compliance processes.

Regulatory reporting is obviously a perfect use case for Tiered Storage in Kafka to replay historical data. As the Kafka log provides guaranteed ordering and timestamps, there is no need for another database or data lake besides Kafka.

Daan Gerits, Chief Data Officer, KOR Financial, explains at Diginomica: “At KOR Financial, we have a very specific problem that we are trying to solve, which is collecting trading information for regulators. And we decided to do it in a totally different way to the way that most people are doing it. Where others would be using data storage or big data technologies, we decided to go all in on Kafka. We are building our system to store 160 petabytes in Confluent Cloud and then work on top of that. We don’t have any other database. So it’s a long retention use case.”

Kafka is NOT a Database (Replacement)

Apache Kafka is a database. It provides ACID guarantees. Hundreds of companies deploy Kafka for mission-critical deployments, including transactional workloads. However, most times, Kafka is NOT competitive to other databases.

Can Apache Kafka Replace a Database like Oracle Hadoop NoSQL MongoDB Elastic MySQL et al

Kafka is an event streaming platform for messaging, storage, processing and integration at scale in real-time with zero downtime and zero data loss. Almost all deployments connect Kafka to database sources and sinks for data integration, decoupling and data consistency, where the heart of the cloud-native enterprise architecture is real-time, scalable and reliable.

Apache Kafka is complementary to database, data warehouse, data lake and lakehouse architectures. I wrote a blog series about use cases and architectures for data streaming with other storage platforms.

The Future: Apache Iceberg for Kafka?

The adoption of Tiered Storage for Apache Kafka is just getting started. Many teams will store (some) data longer in Kafka to offload data from expensive systems or replay historical data without needing another database.

However, most analytics platforms do NOT use the Kafka protocol to consume and query data. The trend across most data platforms goes towards Apache Iceberg as a standardized abstraction layer for storing and querying (non-real-time) data in an object store or other storage.

Apache Iceberg is an open-source table format and processing framework for big data. It aims to provide the best of both worlds: the performance of a traditional table format with the flexibility of a schema-on-read approach. Iceberg solves challenges in managing large-scale and evolving data sets in distributed storage environments.

Apache Iceberg supports popular data processing frameworks, such as Apache Spark, Apache Flink, Apache Hive, Presto, and more. With Kafka’s Tiered Storage and especially the S3 support by some vendors, I can see how this can be an entire game changer for storing and processing events in real-time with the Kafka protocol or with other analytics engines and databases in near-real-time or batch.

The future will show us. For now, let’s be excited about how Tiered Storage for Kafka is the next big thing around data streaming.

Tiered Storage makes Kafka more Scalable, Cost-Efficient and Reliable

Tiered Storage for Apache Kafka makes event-driven architectures more scalable, cost-efficient and reliable. It enables new use cases that required another database or data lake in the past.

However, Kafka’s goal is still NOT to replace other data and analytics platforms. Design patterns like microservices and data mesh enable a true decoupling of applications and data stores. Kafka provides this decoupling. With tiered storage in mind for various use cases such as offloading, new consumers, or error-handling, you can consider new approaches for your cloud-native enterprise architecture.

Are you excited about Tiered Storage for Apache Kafka? How will you use it? Or do you already use an existing implementation, like Confluent Cloud? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
https://www.kai-waehner.de/blog/2022/07/15/data-warehouse-data-lake-modernization-from-legacy-on-premise-to-cloud-native-saas-with-data-streaming/
Fri, 15 Jul 2022 06:03:28 +0000

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.

Data Warehouse and Data Lake Modernization with Data Streaming

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Data warehouse modernization: From legacy on-premise to cloud-native infrastructure

Many people talk about data warehouse modernization when they move to a cloud-native data warehouse. But what does data warehouse modernization mean? Why do people move away from their on-premise data warehouse? What are the benefits?

Many projects I have seen in the wild went through the following steps:

  1. Select a cloud-native data warehouse
  2. Get data into the new data warehouse
  3. [Optional] Migrate from the old to the new data warehouse

Let’s explore these steps in more detail and understand the technology and architecture options.

1. Selection of a cloud-native data warehouse

Many years ago, cloud computing was a game-changer for operating infrastructure. AWS innovated by providing not just EC2 virtual machines but also storage, like AWS S3 as a service.

Cloud-native data warehouse offerings are built on the same fundamental change. Cloud providers brought their analytics cloud services, such as AWS Redshift, Azure Synapse, or GCP BigQuery. Independent vendors rolled out a cloud-native data warehouse or data lake SaaS such as Snowflake, Databricks, and many more. While each solution has its trade-offs, a few general characteristics are true for most of them:

  • Cloud-native: A modern data warehouse is elastic, scales for small up to extreme workloads, and automates most business processes around development, operations, and monitoring.
  • Fully managed: The vendor takes over the operations burden. This includes scaling, failover handling, upgrades, and performance tuning. Some offerings are truly serverless, while many services require capacity planning and manual or automated scaling up and down.
  • Consumption-based pricing: Pay-as-you-go enables getting started in minutes and scaling costs with broader software usage. Most enterprise deployments allow commitment to getting price discounts.
  • Data sharing: Replicating data sets across regions and environments is a common feature to offer data locality, privacy, lower latency, and regulatory compliance.
  • Multi-cloud and hybrid deployments: While cloud providers usually only offer the 1st party service on their cloud infrastructure, 3rd party vendors provide a multi-cloud strategy. Some vendors even offer hybrid environments, including on-premise and edge deployments.

Plenty of comparisons exist in the community, plus analyst research from Gartner, Forrester, et al. Looking at vendor information and trying out the various cloud products using free credits is crucial, too. Finding the right cloud-native data warehouse is its own challenge and not covered in this blog post.

2. Data streaming as (near) real-time and hybrid integration layer

Data ingestion into data warehouses and data lakes was already covered in part two of this blog series. The more real-time, the better for most business applications. Near real-time ingestion is possible with specific tools (like AWS Kinesis or Kafka) or as part of the data fabric (the streaming data hub where a tool like Kafka plays a bigger role than just data ingestion).

The often more challenging part is data integration. Most data warehouse and data lake pipelines require ETL to ingest data. As the next-generation analytics platform is crucial for making the right business decisions, the data ingestion and integration platform must also be cloud-native! Tools like Kafka provide the reliable and scalable integration layer to get all required data into the data warehouse.

Integration of legacy on-premise data into the cloud-native data warehouse

In a greenfield project, the project team is lucky. Data sources run in the same cloud, using open and modern APIs, and scale as well as the cloud-native data warehouse.

Unfortunately, the reality is brownfield almost always, even if all applications run in public cloud infrastructure. Therefore, the integration and replication of data from legacy and on-premise applications is a general requirement.

Data is typically consumed from legacy databases, data warehouses, applications, and other proprietary platforms. The replication into the cloud data warehouse usually needs to be near real-time and reliable.

A data streaming platform like Kafka is perfect for replicating data across data centers, regions, and environments because of its elastic scalability and true decoupling capabilities. Kafka enables connectivity to modern AND legacy systems via connectors, proprietary APIs, programming languages, or open REST interfaces:

Accelerate modernization from on-prem to AWS with Kafka and Confluent Cluster Linking

A common scenario in such a brownfield project is the clear separation of concerns and true decoupling between legacy on-premise and modern cloud workloads. Here, Kafka is deployed on-premise to connect to legacy applications.

Tools like MirrorMaker, Replicator, or Confluent Cluster Linking replicate events in real-time into the Kafka cluster in the cloud. The Kafka brokers provide access to the incoming events. Downstream consumers read the data into the data sinks at their own pace; real-time, near real-time, batch, or request-response via API. Streaming ETL is possible at any site – where it makes the most sense from a business or security perspective and is the most cost-efficient.

Example: Confluent Cloud + Snowflake = Cloud-native Data Warehouse Modernization

Here is a concrete example of data warehouse modernization using cloud-native data streaming and data warehousing with Confluent Cloud and Snowflake:

Cloud-native Data Warehouse Modernization with Apache Kafka Confluent Snowflake

For modernizing the data warehouse, data is ingested from various legacy and modern data sources using different communication paradigms, APIs, and integration strategies. The data is transmitted in motion from data sources via Kafka (and optional preprocessing) into the Snowflake data warehouse. The whole end-to-end pipeline is scalable, reliable, and fully managed, including the connectivity and ingestion between the Kafka and Snowflake clusters.

However, there is more to the integration and ingestion layer: The data streaming platform stores the data for true decoupling and slow downstream applications; not every consumer is or can be real-time. Most enterprise architectures do not ingest data into a single data warehouse or data lake or lakehouse. The reality is that different downstream applications need access to the same information; even though vendors of data warehouses and data lakes tell you differently, of course 🙂

By consuming events from the streaming data hub, each application domain decides for itself whether it (see the consumer sketch after this list):

  • processes events within Kafka with stream processing tools like Kafka Streams or ksqlDB
  • builds own downstream applications with its code and technologies (like Java, .NET, Golang, Python)
  • integrates with 3rd party business applications like Salesforce or SAP
  • ingests the raw or preprocessed and curated data from Kafka into the sink system (like a data warehouse or data lake)
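As a minimal sketch of the last option in the list above, a domain team could run its own small Python consumer that reads curated events and bulk-loads them into its sink system at its own pace. The topic name, consumer group, and sink function are assumptions for illustration.

```python
# Hypothetical domain-owned consumer: read curated events and load them in batches.
import json

from confluent_kafka import Consumer

def write_to_warehouse(rows: list[dict]) -> None:
    """Placeholder for a bulk insert into the domain's data warehouse or data lake."""
    print(f"loaded {len(rows)} rows")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "marketing-analytics",  # each domain uses its own consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.curated"])

batch: list[dict] = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    batch.append(json.loads(msg.value()))
    if len(batch) >= 1000:  # the consumer decides its own pace: near real-time or batch
        write_to_warehouse(batch)
        batch.clear()
```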

3. Data warehouse modernization and migration from legacy to cloud-native

An often misunderstood concept is the buzz around data warehouse modernization: Companies rarely take the data of the existing on-premise data warehouse or data lake, write a few ETL jobs, and put the data into a cloud-native data warehouse for the sake of doing it.

If you think about a one-time lift-and-shift from an on-premise data warehouse to the cloud, then a traditional ETL tool or a replication cloud service might be the easiest. However, usually, data warehouse modernization is more than that! 

What is data warehouse modernization?

A data warehouse modernization can mean many things, including replacing and migrating the existing data warehouse, building a new cloud-native data warehouse from scratch, or optimizing a legacy ETL pipeline of a cloud-native data warehouse.

In all these cases, data warehouse modernization requires business justification, for instance:

  • Application issues in the legacy data warehouse, such as too slow data processing with legacy batch workloads, result in wrong or conflicting information for the business people.
  • Scalability issues in the on-premise data warehouse as the data volume grows too much.
  • Cost issues because the legacy data warehouse does not offer reasonable pricing with pay-as-you-go models.
  • Connectivity issues as legacy data warehouses were not built with an open API and data sharing in mind. Cloud-native data warehouses run on cost-efficient and scalable object storage, separate storage from computing, and allow data consumption and sharing. (but keep in mind that Reverse ETL is often an anti-pattern!)
  • A strategic move to the cloud with all infrastructure. The analytics platform is no exception if all old and new applications go to the cloud.

Cloud-native applications usually come with innovation, i.e., new business processes, data formats, and data-driven decision-making. From a data warehouse perspective, the best modernization is to start from scratch. Consume data directly from the existing data sources, ETL it, and do business intelligence on top of the new data structures.

I have seen many more projects where customers use change data capture (CDC) from Oracle databases (i.e., the leading core system) instead of trying to replicate data from the legacy data warehouse (i.e., the analytics platform) as scalability, cost, and later shutdown of legacy infrastructure benefits from this approach.

Data warehouse migration: Continuous vs. cut-over

The project is usually a cut-over when you need to do a real modernization (i.e., migration) from a legacy data warehouse to a cloud-native one. This way, the first project phase integrates the legacy data sources with the new data warehouse. The old and new data warehouse platforms operate in parallel, so that old and new business processes go on. After some time (months or years later), when the business is ready to move, the old data warehouse will be shut down after legacy applications are either migrated to the new data warehouse or replaced with new applications:

Data Warehouse Offloading Integration and Replacement with Data Streaming 

My article “Mainframe Integration, Offloading and Replacement with Apache Kafka” illustrates this offloading and long-term migration process. Just scroll to the section “Mainframe Offloading and Replacement in the Next 5 Years” in that post and replace the term ‘mainframe’ with ‘legacy data warehouse’ in your mind.

A migration and cut-over is its own project and may or may not include the legacy data warehouse. Data lake modernization (e.g., from a self- or partially managed Cloudera cluster running on-premise in the data center to a fully managed Databricks or Snowflake cloud infrastructure) follows the same principles. And mixing the data warehouse (reporting) and data lake (big data analytics) into a single infrastructure does not change this either.

Data warehouse modernization is NOT a big bang and NOT a single tool approach!

Most data warehouse modernization projects are ongoing efforts over a long period. You must select a cloud-native data warehouse, get data into the new data warehouse from various sources, and optionally migrate away from legacy on-premise infrastructure.

Data streaming for data ingestion, business applications, or data sharing in real-time should always be a separate component in the enterprise architecture. It has very different requirements regarding SLAs, uptime, throughput, latency, etc. Putting all real-time and analytical workloads into the same cluster makes little sense from a cost, risk, or business value perspective. The idea of a modern data flow and building a data mesh is the separation of concerns with domain-driven design and focusing on data products (using different, independent APIs, technologies, and cloud services).

For more details, browse other posts of this blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

What cloud-native data warehouse(s) do you use? How does data streaming fit into your journey? Did you integrate or replace your legacy on-premise data warehouse(s); or start from greenfield in the cloud? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
https://www.kai-waehner.de/blog/2022/06/27/data-warehouse-vs-data-lake-vs-data-streaming-friends-enemies-frenemies/
Mon, 27 Jun 2022 12:35:22 +0000

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 1: Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?

Data Warehouse vs Data Lake vs Data Streaming Comparison

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using a data warehouse, data lake, and data streaming together:

  1. THIS POST: Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

I also created a short ten-minute video that explains the differences between data streaming and a lakehouse (and why they are complementary):

The value of data: Transactional vs. analytical workloads

The last decade offered many articles, blogs, and presentations about data becoming the new oil. Today, nobody questions that data-driven business processes change the world and enable innovation across industries.

Data-driven business processes require both real-time data processing and batch processing. Think about the following flow of events across applications, domains, and organizations:

A stream of events - the business process for transactions and analytics

An event is business information or technical information. Events happen all the time. A business process in the real world requires the correlation of various events.

How critical is an event?

The criticality of an event defines the outcome. Potential impacts can be increased revenue, reduced risk, reduced cost, or improved customer experience.

  • Business transaction: Ideally, zero downtime and zero data loss. Example: Payments need to be processed exactly once.
  • Critical analytics: Ideally, zero downtime. Data loss of a single sensor event might be okay. Alerting on the aggregation of events is more critical. Example: Continuous monitoring of IoT sensor data and a (predictive) machine failure alert.
  • Non-critical analytics: Downtime and data loss are not good but do not kill the whole business. It is an accident, but not a disaster. Example: Reporting and business intelligence to forecast demand.

When to process an event?

Real-time usually means end-to-end processing within milliseconds or seconds. If you don’t need real-time decisions, batch processing (i.e., after minutes, hours, days) or on-demand (i.e., request-reply) is sufficient.

  • Business transactions are often real-time: A transaction like a payment usually requires real-time processing (e.g., before the customer leaves the store; before you ship the item; before you leave the ride-hailing car).
  • Critical analytics is usually real-time: Critical analytics very often requires real-time processing (e.g., detecting fraud before it happens; predicting a machine failure before it breaks; upselling to a customer before they leave the store).
  • Non-critical analytics is usually not real-time: Finding insights in historical data is usually done in a batch process using paradigms like complex SQL queries, map-reduce, or complex algorithms (e.g., reporting; model training with machine learning algorithms; forecasting).

With these basics about processing events, let’s understand why storing all events in a single, central data lake is not the solution to all problems.

Flexibility through decentralization and best-of-breed

The traditional data warehouse or data lake approach is to ingest all data from all sources into a central storage system for centralized data ownership. With current big data and cloud technologies, the sky (and your budget) is the limit.

However, architectural concepts like domain-driven design, microservices, and data mesh show that decentralized ownership is the right choice for modern enterprise architecture:

Decentralization of the Data Warehouse and Data Lake using Data Streaming

No worries. The data warehouse and the data lake are not dead but more relevant than ever in a data-driven world. Both make sense for many use cases. Even within a single one of these domains, larger organizations don’t use just one data warehouse or data lake. Selecting the right tool for the job (in your domain or business unit) is the best way to solve the business problem.

There are good reasons people are pleased with Databricks for batch ETL, machine learning, and now even data warehousing, but still prefer a lightweight cloud SQL database like AWS RDS (fully managed PostgreSQL) for some use cases.

There are good reasons why happy Splunk users also ingest some data into Elasticsearch. And why Cribl is gaining more and more traction in this space, too.

There are good reasons some projects leverage Apache Kafka as a database. Storing data long-term in Kafka only makes sense for specific use cases (like compacted topics, key/value queries, or streaming analytics). Kafka does not replace other databases or data lakes.

Choose the right tool for the job with decentralized data ownership!

With this in mind, let’s explore the use cases and added value of a modern data warehouse (and how it relates to data lakes and the new buzz lakehouse).

Data warehouse: Reporting and business intelligence with data at rest

A data warehouse (DWH) provides capabilities for reporting and data analysis. It is considered a core component of business intelligence.

Use cases for data at rest

No matter whether you use a product called a data warehouse, a data lake, or a lakehouse, the data is stored at rest for further processing:

  • Reporting and Business Intelligence: Fast and flexible availability of reports, statistics, and key figures, for example, to identify correlations between market and service offering
  • Data Engineering: Integration of data from differently structured and distributed data sets to enable the identification of hidden relationships between data
  • Big Data Analytics and AI / Machine Learning: Global view of the source data and thus overarching evaluations to find unknown insights that improve business processes and their interrelationships

Some readers might say: Only the first is a use case for a data warehouse, and the other two are for a data lake or lakehouse! It all depends on the definition.

Data warehouse architecture

DWHs are central repositories of integrated data from disparate sources. They store historical data in one storage system. The data is stored at rest, i.e., saved for later analysis and processing. Business users analyze the data to find insights.

Data Warehouse Architecture

The data is uploaded from operational systems, such as IoT data, ERP, CRM, and many other applications. Data cleansing and data quality assurance are crucial pieces in the DWH pipeline. Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) are the two main approaches to building a data warehouse system. Data marts help to focus on a single subject or line of business within the data warehouse ecosystem.

The relation of the data warehouse to data lake and lakehouse

The focus of a data warehouse is reporting and business intelligence using structured data. By contrast, the data lake is a synonym for storing and processing raw big data. In the past, data lakes were built with technologies like Hadoop, HDFS, and Hive. Today, the data warehouse and the data lake have merged into a single solution: a cloud-native DWH supports big data, and similarly, a cloud-native data lake supports business intelligence with traditional tools.

Databricks: The evolution from the data lake to the data warehouse

That’s true for almost all vendors. For example, look at the history of one of the leading big data vendors: Databricks, known for being THE Apache Spark company. The company started as a commercial vendor behind Apache Spark, a big data batch processing platform. The platform was enhanced with (some) real-time workloads using micro-batching. A few milestones later, Databricks is an entirely different company today, focusing on cloud, data analytics, and data warehousing. Databricks’ strategy changed from:

  • open source to the cloud
  • self-managed software to fully managed serverless offerings
  • a focus on Apache Spark to AI / Machine Learning, later adding data warehousing capabilities
  • a single product to a vast product portfolio around data analytics, including standardized data formats (“Delta Lake”), governance, ETL tooling (Delta Live Tables), and more.

Vendors like Databricks and AWS also coined a new buzzword for this merger of the data lake, data warehouse, business intelligence, and real-time capabilities: the Lakehouse.

The Lakehouse (sometimes called Data Lakehouse) is nothing new. It combines characteristics of separate platforms. I wrote an article about building a cloud-native serverless lakehouse on AWS using Kafka in conjunction with AWS analytics platforms.

Snowflake: The evolution from the data warehouse to the data lake

Snowflake came from the other direction. It was the first genuinely cloud-native data warehouse available on all major clouds. Today, Snowflake provides many more capabilities beyond the traditional business intelligence spectrum. For instance, data and software engineers have features to interact with Snowflake’s data lake through other technologies and APIs. A data engineer might require a Python interface to analyze historical data, while a software engineer prefers real-time data ingestion and analysis at any scale.

No matter if you build a data warehouse, data lake, or lakehouse: The crucial point is understanding the difference between data in motion and data at rest to find the right enterprise architecture and components for your solution. The following sections explore why a good data warehouse architecture requires both and how they complement each other very well.

Transactional real-time workloads should not run within the data warehouse or data lake! Separation of concerns is critical due to different uptime SLAs, regulatory and compliance laws, and latency requirements.

Data streaming: Supplementing the modern data warehouse with data in motion

Let’s clarify: Data streaming is NOT the same as data ingestion! You can use a data streaming technology like Apache Kafka for data ingestion into a data warehouse or data lake. Most companies do this. Fine and valuable.

BUT: A data streaming platform like Apache Kafka is so much more than just an ingestion layer. Hence, it differs significantly from ingestion engines like AWS Kinesis, Google Pub/Sub, and similar tools.

Data streaming is NOT the same as data ingestion

Data streaming provides messaging, persistence, integration, and processing capabilities. High scalability for millions of messages per second, high availability including backward compatibility and rolling upgrades for mission-critical workloads, and cloud-native operations are some of the built-in features.

The de facto standard for data streaming is Apache Kafka. Therefore, I mainly use Kafka for data streaming architectures and use cases.
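
To make the difference concrete, here is a minimal, hedged Kafka Streams sketch (the topic names and the filter condition are illustrative assumptions, not from a specific production system). The same platform that ingests the payment events also filters and routes them in real time, which a pure ingestion pipe cannot do:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PaymentRouter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Consume payment events, flag suspicious ones, and route them to a
        // dedicated topic: messaging, persistence, and processing in one platform.
        KStream<String, String> payments = builder.stream("payments");
        payments.filter((key, value) -> value.contains("FRAUD_SUSPECTED"))
                .to("fraud-alerts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```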

Transactional and analytics use cases for data streaming with Apache Kafka

The number of different use cases for data streaming is almost endless. Please remember that data streaming is NOT just a message queue for data ingestion! While data ingestion into a data lake was the first prominent use case, ingestion-only scenarios represent less than 5% of actual Kafka deployments. Business applications, streaming ETL middleware, real-time analytics, and edge/hybrid scenarios are some of the other examples:

Use Cases for Data Streaming with Apache Kafka by Business Value

The persistence layer of Kafka enables decentralized microservice architectures for agile and truly decoupled applications.

Keep in mind that Apache Kafka supports transactional and analytics workloads. Both usually have very different uptime, latency, and data loss SLAs. Check out this post and slide deck to learn more about data streaming use cases across industries powered by Apache Kafka.

The next post of this blog series explores why data streaming with tools like Apache Kafka is the prevalent technology for data ingestion into data warehouses and data lakes.

Don’t (try to) use the data warehouse or data lake for data streaming

This blog post explored the difference between data at rest and data in motion:

  • A data warehouse is excellent for reporting and business intelligence.
  • A data lake is perfect for big data analytics and AI / Machine Learning.
  • Data streaming enables real-time use cases.
  • A decentralized, flexible enterprise architecture is required to build a modern data stack around microservices and data mesh.

None of the technologies is a silver bullet. Choose the right tool for the problem. Monolithic architectures do not solve today’s business problems. Storing all data only at rest does not help with the demand for real-time use cases.

The Kappa architecture is a modern approach for real-time and batch workloads that avoids the much more complex infrastructure of the Lambda architecture. Data streaming complements the data warehouse and the data lake. The connectivity between these systems is available out-of-the-box if you choose the right vendors (which are often strategic partners, not competitors, as some people think).

For more details, browse to other posts of this blog series:

  1. THIS POST: Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

How do you combine data warehouse and data streaming today? Is Kafka just your ingestion layer into the data lake? Do you already leverage data streaming for additional real-time use cases? Or is Kafka already the strategic component in the enterprise architecture for decoupled microservices and a data mesh? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies? appeared first on Kai Waehner.

]]>
Analytics vs. Transactions in Data Streaming with Apache Kafka https://www.kai-waehner.de/blog/2022/03/09/analytics-vs-transactions-api-data-streaming-with-apache-kafka/ Wed, 09 Mar 2022 14:01:10 +0000 https://www.kai-waehner.de/?p=4337 Workloads for analytics and transactions have very different characteristics and requirements. Many people think that Apache Kafka is not built for transactions and should only be used for big data analytics. This blog post explores when and how to use Kafka in resilient, mission-critical architectures and when to use the built-in Transaction API.

The post Analytics vs. Transactions in Data Streaming with Apache Kafka appeared first on Kai Waehner.

]]>
Workloads for analytics and transactions have very different characteristics and requirements. The use cases differ significantly. SLAs are very different, too. Many people think that Apache Kafka is not built for transactions and should only be used for big data analytics. This blog post explores when and how to use Kafka in resilient, mission-critical architectures and when to use the built-in Transaction API.

Apache Kafka Transactions API vs Big Data Lake and Batch Analytics

Analytical and transactional workloads

Let’s begin by defining the terms. The YouTube channel ‘Databases Demystified’ has a great episode: Analytical vs. Transactional. I use and enhance its explanation in the following subsections.

Some people refer to this as an “OLTP vs. OLAP” discussion:
  • In OLTP (online transaction processing), information systems typically facilitate and manage transaction-oriented applications.
  • In OLAP (online analytical processing), information systems generally execute much more complex queries, in a smaller volume, for the purpose of business intelligence or reporting rather than to process transactions.

There are some overlaps in some use cases and products. Hence, I use the more generic terms “transactions” and “analytics” in this blog post.

Analytical workloads

Analytical workloads have the following characteristics:

  • Processing large amounts of information for creating aggregates
  • Read-only queries and (usually) batch-write data loads
  • Supporting complex queries with multiple steps of data processing, join conditions, and filtering
  • Highly variable ad hoc queries, many of which may only be run once, ever
  • Not mission-critical, meaning downtime or data loss is not good, but in most cases not a disaster for the core business

Analytics solutions

Analytics solutions exist on-premises and in all major clouds. The tools differ regarding their capabilities and sweet spots. Examples include:

  • Redshift (Amazon Web Services)
  • BigQuery (Google Cloud)
  • Snowflake
  • Hive / HDFS / Spark
  • And many more!

Transactional workloads

Transactional workloads have unique characteristics and SLAs compared to analytical workloads:

  • Manipulating one object at a time (often across different systems)
  • Create, Read, Update, and Delete (CRUD) operations, inserting data one object at a time or updating existing data (often across different systems)
  • Precisely managing state with guarantees about what has or hasn’t been written to disk
  • Supporting many operations per second in real-time with high throughput
  • Mission-critical SLAs for uptime, availability, and latency of the end-to-end data communication

Transactional solutions

Transactional solutions include applications, databases, messaging systems, and integration middleware:

  • IBM Mainframe (including CICS, IMS, DB2)
  • TIBCO EMS
  • PostgreSQL
  • Oracle Database
  • MongoDB
  • And many more!

Often, a transactional workload has to guarantee ACID principles (i.e., all or nothing writes to different applications and technologies).

A mix of transactional and analytical workloads

Many solutions support a mix of transactional and analytical workloads.

For instance, many enterprises store transactional data in MongoDB but also process complex queries for analytics use cases in the same database. MongoDB started as a document-based NoSQL database. In the meantime, it has become a general-purpose database platform that also supports other query forms, such as graph and tree traversal:

MongoDB Database Query Capabilities

Hence, focus on the business problem first. Then, you can decide if your existing infrastructure can solve the problem or if you need yet another one. But there is no silver bullet. A vendor-independent best-of-breed approach works best in most enterprise architectures I see in success stories from the field.

Data at Rest vs. Data in Motion

Batch vs. real-time data processing is an important discussion you should have in every project. Statements like “batch processing is for analytics, real-time processing is for transactions” are not always correct. Real-time beats slow data in almost all use cases from a business value perspective. Nevertheless, batch processing is the better approach for some specific use cases.

Analytics platforms for batch processing

Data at Rest means storing data in a database, data warehouse, or data lake. In many use cases, the data is processed too late, even if a real-time streaming component (like Kafka) ingests it. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at Rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) require this approach… If you do it right! Data at Rest can be used for transactional workloads, too!

Apache Kafka for real-time data streaming

The Kafka API is the de facto standard API for Data in Motion, like Amazon S3 is for object storage. Why is Kafka so successful? Real-time beats slow data in most use cases across industries.

The same cloud-native approach is required for event streaming as for the modern data lake. Building a scalable and cost-efficient infrastructure is crucial for the success of a project. Event streaming and data lake technologies are complementary, not competitive.

I will not explore the reasons and use cases for the success of Kafka in this post. Instead, check out my overview about Kafka use cases across industries for more details. Or read some of my vertical-specific blog posts.

In short, most added value comes from processing Data in Motion while it is relevant instead of storing Data at Rest and processing it later (or too late). Many analytical and transactional workloads use Kafka for this reason.

Apache Kafka for analytics

Even in 2022, many people think about Kafka as a data ingestion layer into data stores. This is still a critical use case. Enterprises use Kafka as the ingestion layer for different analytics platforms:

  • Batch reporting and dashboards
  • Interactive queries (using Tableau, Qlik, and similar tools)
  • Data preparation for batch calculations, model training, and other analytics
  • Connectivity into different data warehouses, data lakes, and other data sinks using a best of breed approach

But Kafka is much more than a messaging and ingestion layer. Here are a few examples of using Kafka for analytics (often together with other analytics tools to solve a specific problem):

  • Data integration for various source systems using Kafka Connect and pre-built connectors (including real-time, near real-time, batch, web service, file, and proprietary interfaces)
  • Decoupling and backpressure handling as the sink systems are often not ready for vast volumes of real-time data. Domain-driven design (DDD) for true decoupling is a crucial differentiator of Kafka compared to other middleware and message queues.
  • Data processing at scale in real-time filters, transforms, generalizes, or aggregates incoming data sets before ingesting them into sink systems (see the sketch after this list).
  • Real-time analytics applied within the Kafka application. Many analytics platforms were designed for near real-time or batch workloads, but not for resilient model scoring with low latency, especially at scale. An example could be an analytic model trained with batch machine learning algorithms in a data lake with Spark MLlib or TensorFlow and then deployed into a Kafka Streams or ksqlDB application.
  • Replay of historical events in cases such as onboarding a new consumer application, error handling, compliance or regulatory processing, or schema changes in an analytics platform. This becomes especially relevant if Tiered Storage is used under the hood of Kafka for cost-efficient and scalable long-term storage.
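
Picking up the data-processing point from the list, here is a hedged Kafka Streams sketch (topic names, data types, and the one-minute window are illustrative assumptions, not from a specific deployment; API names per recent Kafka versions). Raw sensor events are pre-aggregated in real time so that only a fraction of the volume reaches the sink system:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class SensorAggregation {

    // Builds a topology that counts raw sensor events per device in
    // one-minute windows before the result reaches the (slower) sink system.
    static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> sensorReadings =
                builder.stream("iot-sensors", Consumed.with(Serdes.String(), Serdes.String()));

        sensorReadings.groupByKey()
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
                .count()
                .toStream()
                .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), count))
                // Only the pre-aggregated result is ingested into the sink,
                // e.g., via a Kafka Connect sink connector on this topic.
                .to("sensor-aggregates", Produced.with(Serdes.String(), Serdes.Long()));

        return builder.build();
    }
}
```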

Analytics example with Confluent Cloud and AWS services

Here is an illustration of an AWS architecture combining Confluent and its ecosystem (connectors, stream processing capabilities, and schema management) with several first-party AWS cloud services:

Real-time analytics with Kafka Confluent Cloud and AWS

As you can see, Kafka is an excellent tool for analytical workloads. It is not a silver bullet, but it serves the appropriate parts of the overall data management architecture. I have another blog post that explores the relationship between Kafka and other serverless analytics platforms.

However, Kafka is NOT just used for analytical workloads!

Apache Kafka for transactions

Around 60 to 70% of use cases and deployments I see at customers across the globe leverage the Kafka ecosystem for transactional workloads. Enterprises use Kafka for:

  • core banking platforms
  • fraud detection
  • global replication of order and inventory information
  • integration with business-critical platforms like CRM, ERP, MES, and many other transactional systems
  • supply chain management
  • customer communication like point-of-sale integration or context-specific upselling
  • and many other use cases where every single event counts.

Kafka is a distributed, fault-tolerant system that is resilient by nature (if you deploy and operate it correctly). Zero downtime and zero data loss can be guaranteed, just as with your favorite database, mainframe, or other core platforms.

Elastic scalability and rolling upgrades allow building a flexible and reliable data streaming infrastructure for transactional workloads to guarantee business continuity. The architect can even stretch a cluster across regions to ensure zero data loss and zero downtime even in case of a disaster where a data center is completely down. The post “Global Kafka Deployments” explores the different deployment options and their trade-offs in more detail.

Kafka Transactions API example

And even better: Kafka’s Transaction API, i.e., Exactly-Once Semantics (EOS), has been available since Kafka 0.11 (which went GA a long time ago). EOS makes building transactional workloads even easier, as you no longer need to handle duplicates.

Kafka now supports atomic writes across multiple partitions through the transactions API. This allows a producer to send a batch of messages to multiple partitions. Either all messages in the batch are eventually visible to any consumer, or none are ever visible to consumers. Here is an example:

Transaction API in Apache Kafka
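
As a hedged code sketch with the standard Java producer client (topic names and the transactional ID are illustrative assumptions, and error handling is simplified), an atomic write across two topics looks roughly like this:

```java
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// A transactional ID enables idempotence and atomic multi-partition writes.
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payment-processor-1");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    producer.initTransactions();
    try {
        producer.beginTransaction();
        // Both writes become visible to consumers atomically, or not at all.
        producer.send(new ProducerRecord<>("payments", "order-42", "PAYMENT_COMPLETED"));
        producer.send(new ProducerRecord<>("audit-log", "order-42", "payment processed"));
        producer.commitTransaction();
    } catch (KafkaException e) {
        // None of the writes of the aborted transaction become visible.
        producer.abortTransaction();
    }
}
```

Note that consumers only see the committed records if they set isolation.level=read_committed.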

Kafka provides a built-in transactions API. And the performance impact (that many people are worried about) is minimal. Here is a simple rule of thumb: If you care about exactly-once semantics, simply activate it! If you run into performance issues, you can still fine-tune your application or disable the feature. But most projects are fine with the minimal performance trade-off compared to the enormous benefit of handling transactional behavior out-of-the-box.

Nevertheless, to be clear: You don’t need to use Kafka’s Transaction API to build mission-critical, transactional workloads.

SAGA design pattern for transactional data in Kafka without transactions

The Kafka Transactions API is optional. As discussed above, Kafka is resilient without transactions, though eliminating duplicates is your task in that case. Exactly-once semantics solve this problem out-of-the-box across all Kafka components. Kafka Connect, Kafka Streams, ksqlDB, and different clients like Java, C++, .NET, and Go support EOS.
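
In Kafka Streams, for instance, enabling exactly-once processing is a single configuration property. Here is a minimal sketch (the exact constant name depends on the Kafka version in use):

```java
// Kafka Streams: end-to-end exactly-once processing, covering state stores
// and output topics, with a single setting.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
```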

However, I am also not saying that you should always use the Kafka Transaction API or that it solves every transactional problem. Keep in mind that scalable distributed systems require different design patterns than a traditional “Oracle to IBM MQ transaction”.

Some business transactions span multiple services. Hence, you need a mechanism to implement transactions that span services. A familiar design pattern and implementation for such a transactional workload is the SAGA pattern with a stateful orchestration application.
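
To illustrate the idea, here is a minimal, hedged sketch of a Kafka Streams orchestrator (topics, event names, and saga states are hypothetical placeholders, and real implementations like the one below are far more involved). Each saga instance is tracked in a fault-tolerant state store, and a failed step triggers a compensation command instead of a distributed rollback:

```java
StreamsBuilder builder = new StreamsBuilder();

// All saga-relevant events, keyed by order ID (hypothetical event names).
KStream<String, String> events = builder.stream("saga-events");

// Track the state of each saga instance in a fault-tolerant state store.
KTable<String, String> sagaState = events.groupByKey().aggregate(
        () -> "STARTED",
        (orderId, event, state) -> switch (event) {
            case "PAYMENT_COMPLETED"  -> "PAID";
            case "SHIPMENT_CONFIRMED" -> "COMPLETED";
            case "PAYMENT_FAILED"     -> "COMPENSATING";
            default                   -> state;
        });

// When a step fails, emit a compensation command for the affected services
// instead of rolling back a distributed two-phase commit.
sagaState.toStream()
        .filter((orderId, state) -> "COMPENSATING".equals(state))
        .to("cancel-order-commands");
```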

Swisscom’s Custodigit is an excellent example of such an implementation leveraging Kafka Streams. It is a modern banking platform for digital assets and cryptocurrencies that provides crucial features and guarantees for seriously regulated crypto investments – more details in my blog post about Blockchain, Crypto, NFTs, and Kafka.

And yes, there are always trade-offs between the Kafka Transaction API with exactly-once semantics, stateful orchestration in a separate application, and two-phase-commit transactions as used by Oracle DB and IBM MQ. Choose the right tool to define your appropriate enterprise architecture!

Kafka with other data stores and streaming engines

Most enterprises use Kafka as the central scalable real-time data hub. Hence, use cases include analytical and transactional workloads.

Most Kafka projects I see today also leverage Kafka Connect for data integration, Kafka Streams/ksqlDB for continuous data processing, and Schema Registry for data governance.

Thus, with Kafka, one (distributed and scalable) infrastructure enables messaging, storage, integration, and data processing. But of course, most Kafka clusters connect to other applications (like SAP or Salesforce) and data management systems (like MongoDB, Snowflake, Databricks, et al.) for analytics:

Real-time decision making for claim processing and fraud detection in insurance with Apache Kafka

In a dedicated blog post, I explored in detail why Kafka is a database for some specific use cases but will NOT replace other databases and data lakes.

In addition to Kafka-native stream processing engines like Kafka Streams or ksqlDB, other streaming analytics frameworks like Apache Flink or Spark Streaming can easily be connected for transactional or analytical workloads. Just keep in mind that transactional workloads in particular get harder end-to-end with every additional system and infrastructure you add to the enterprise architecture.

Kappa architecture for analytics AND transactions with Kafka as the data hub

Real-time data beats slow data. That’s true for almost every use case. Yet, enterprise architects build new infrastructures with the Lambda architecture that includes a separate batch layer for analytics and a real-time layer for transactional workloads.

A single real-time pipeline, called the Kappa architecture, is the better fit. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter show the benefits of Kappa and how batch processing still fits into this discussion without a separate Lambda batch layer. In its dedicated post, learn how a Kappa architecture can revolutionize how you build analytical and transactional workloads with the same scalable real-time data hub powered by Kafka.

How do you leverage data streaming for analytical or transactional workloads? Do you use exactly-once semantics to ease the implementation of transactions? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Analytics vs. Transactions in Data Streaming with Apache Kafka appeared first on Kai Waehner.

]]>
Apache Kafka Landscape for Automotive and Manufacturing https://www.kai-waehner.de/blog/2022/01/12/apache-kafka-landscape-for-automotive-and-manufacturing/ Wed, 12 Jan 2022 12:07:20 +0000 https://www.kai-waehner.de/?p=4124 Apache Kafka is the central nervous system of many applications in various areas related to the automotive and manufacturing industry for processing analytical and transactional data in motion across edge, hybrid, and multi-cloud deployments. This article explores the event streaming landscape for automotive including connected vehicles, smart manufacturing, supply chain optimization, aftersales, mobility services, and innovative new business models.

The post Apache Kafka Landscape for Automotive and Manufacturing appeared first on Kai Waehner.

]]>
Before the Covid pandemic, I had the pleasure of visiting “Motor City” Detroit in November 2019. I met with several automotive companies, suppliers, startups, and cloud providers to discuss use cases and architectures around Apache Kafka. A lot has happened. Since then, I have also met several OEMs and suppliers in Europe and Asia. As I finally go back to Detroit this January 2022 to meet customers again, I thought it would be a good time to update the status quo of event streaming and Apache Kafka in the automotive and manufacturing industry.

Today, in 2022, Apache Kafka is the central nervous system of many applications in various areas related to the automotive and manufacturing industry for processing analytical and transactional data in motion across edge, hybrid, and multi-cloud deployments. This article explores the automotive event streaming landscape, including connected vehicles, smart manufacturing, supply chain optimization, aftersales, mobility services, and innovative new business models.

Automotive and Manufacturing Landscape for Apache Kafka

The Event Streaming Landscape for Automotive and Manufacturing

Every business domain leverages Event Streaming with Apache Kafka in the automotive and manufacturing industry. Data in motion helps everywhere. The infrastructure and deployment differ depending on the use case and requirements. I have seen everything at carmakers and manufacturers across the globe:

  • Cloud-first strategy with all new business applications in the public cloud deployed and connected across regions and even continents
  • Hybrid integration scenarios between legacy applications in the data center and modern cloud-native services in the public cloud
  • Edge computing in a smart factory for low latency, cost-efficient data processing, and cybersecurity
  • Embedded Kafka brokers in machines and vehicles at the disconnected edge

This spread of use cases is impressive. The following diagram depicts a high-level overview:

Automotive and Manufacturing Landscape for Apache Kafka with Edge and Hybrid Cloud

The following sections describe the automotive and manufacturing landscape for event streaming in more detail:

  • Manufacturing 4.0
  • Supply Chain Optimization
  • Mobility Services
  • New Business Models

If you are mainly interested in real-world Kafka deployments with examples from BMW, Porsche, Audi, Tesla, and other OEMs, check out the article “Real-World Deployments of Kafka in the Automotive Industry“.

If you want to understand why Kafka makes such a difference in automotive and manufacturing, check out the article “Apache Kafka in the Automotive Industry“. This article explores the business motivation for these game-changing concepts of data in motion for the digitalization of the automotive industry.

Before you start reading the below section, I want to clearly emphasize that Kafka is not the silver bullet for every problem. “When NOT to use Apache Kafka?” digs deep into this discussion.

I keep the following sections relatively short to give a high-level overview. Each section contains links to more deep-dive articles about the topics.

Manufacturing 4.0

Industrial IoT (IIoT), also known as Industry 4.0, changes how the shop floor and production lines produce goods. Automation, process efficiency, and a much better Overall Equipment Effectiveness (OEE) enable cost reduction and flexibility in the production process:

Manufacturing and Industrial IoT with Apache Kafka

Smart Factory

A smart factory is not necessarily a newly built building like a Tesla Gigafactory. Many enterprises install smart technology like networked sensors for temperature or vibration measurements into old factories. Improving the Overall Equipment Effectiveness (OEE) is the primary goal of most use cases. Many scenarios leverage Kafka for continuously processing sensor and telemetry data in motion.

Legacy Modernization with Open APIs and Hybrid Cloud

Factories exist for decades after they are built. Digitalization and the modernization of legacy technologies are some of the biggest challenges in IIoT projects. Such an initiative usually includes several tasks.

Continuous Data-driven Engineering and Product Development

Last but not least, an opportunity many people underestimate: Continuous data streaming with Kafka enables new possibilities in software engineering and product development for IoT and automotive projects.

For instance, developing and deploying the “big loop” for machine learning of advanced driver-assistance systems (ADAS) or self-driving functions based on sensor data from the fleet is a new way of software engineering. Tesla’s Kafka-based data platform is a fantastic example. A related use case in engineering is the ingestion of sensor data during and after test drives.

Supply Chain Optimization

Supply chain processes and solutions are very complex. The Covid pandemic showed how only flexible enterprises could survive, stay profitable, and provide a great customer experience, even in disastrous external events.

Here are the top 5 critical challenges of supply chains:

  • Time Frames are Shorter
  • Rapid Change
  • Zoo of Technologies and Products
  • Historical Models are No Longer Viable
  • Lack of visibility

Only real-time data streaming and correlation solve these supply chain challenges end-to-end across regions and companies:

Supply Chain Optimization in Automotive at the Edge and in the Cloud with Apache Kafka

In a dedicated blog post, I covered Supply Chain Management (SCM) optimization with Apache Kafka. Check it out to learn about real-world supply chain use cases from Bosch, BMW, Walmart, and other companies.

Intra-logistics and Global Distribution Networks

Logistics and supply chains within a factory, distribution center, or store require real-time data integration and processing to move goods efficiently and provide a great customer experience. Batch processes or manual interaction by human workers cannot implement these use cases. Examples include track & trace and fleet management, covered next.

Track & Trace and Fleet Management

Real-time logistics is a game-changer for fleet management and track & trace use cases.

  • Commercial motor vehicles such as cars, vans, trucks, specialist vehicles (such as mobile construction machinery), forklifts, and trailers
  • Private vehicles used for work (the ‘grey fleet’)
  • Aviation machinery such as aircraft (planes and helicopters)
  • Ships
  • Rail cars
  • Non-powered assets such as generators, tanks, gearboxes

None of the following aspects is new. The difference is that event streaming allows these tasks to be executed continuously in real-time to act on new information in motion:

  • Visualization
  • Location-based services
  • Routing and navigation
  • Estimated time of arrival
  • Alerting
  • Proactive recalculation
  • Monitoring of the assets and mechanical components of a vehicle

Most companies have a cloud-first strategy for building such a platform. However, some cases require edge computing, either via local 5G infrastructure for low-latency use cases or via embedded Kafka brokers for disconnected data collection and analytics within the vehicles.

Streaming Data Exchange for B2B Collaboration with Partners

Real-time data is not just relevant within a single company. OEMs and Tier 1 and Tier 2 suppliers benefit from data streams in the same way. The same is true for car dealerships, end customers, and any other consumers of the data. Hence, a clear trend in the market is the emergence of a Kafka-based streaming data exchange across companies to build a data mesh.

I have often seen this situation in the past: The OEM leverages event streaming. The Tier 1 supplier leverages event streaming. The ERP solution in use is built on Kafka, too. All leverage the capabilities of scalable real-time data streaming. It makes little sense to integrate with partners and software vendors via web service APIs such as SOAP or HTTP/REST. Instead, a streaming interface is the natural choice for handing streaming data over to partners.

The following example from the automotive industry shows how independent stakeholders (= domains in different enterprises) use a cross-company streaming data exchange:

Streaming Data Exchange with Data Mesh in Motion using Apache Kafka and Cluster Linking

Mobility Services

Every OEM, supplier, or innovative startup in the automotive space thinks about providing a mobility service either on top of the goods they sell or as an independent service.

Most mobility services used today in mobile apps, for business or privately, are only possible because of a scalable real-time backbone powered by event streaming:

Mobility Services and Connected Cars with Event Streaming and Apache Kafka

The possibilities for mobility services are endless. A few examples that are mainstream today already:

  • Omnichannel retail and aftersales to buy additional car features online, for instance, more power, a seat heater, up-to-date navigation, or self-driving software (okay, the latter is not mainstream yet, but Tesla shows where it is going)
  • Connected Cars for ride-hailing, scooter rental, taxi services, food delivery
  • 3rd party integration for adding services that a company does not want to build by themselves

Today’s most successful and widely adopted mobility services are independent of a specific carmaker or supplier.

Examples of prominent Kafka-powered consumer mobility services are Uber and Lyft in the US, Grab in Asia, and FREENOW in Europe. HERE Technologies is an excellent example of a B2B mobility service providing mapping information so that companies can build new applications or improve existing ones on top of it.

A good starting point to learn more is my blog post about Apache Kafka and MQTT for mobility services and transportation.

New Business Models

The access to real-time data enables companies to build entirely new business models on top of their existing products:

New Automotive Business Models enabled by Event Streaming with Apache Kafka

A few examples:

  • Next-generation car rental with an excellent customer experience, context-specific coupons, a loyalty platform, and car rental fleets combined with other services from the carmaker.
  • Reinventing car insurance based on real-time driving information about each driver, building driver-specific pricing from real-time analysis of driver behavior instead of legacy approaches using statistical models with attributes like driver age, number of past accidents, etc.
  • Acting as a data provider for monetization, enabling other companies to build new business models with your car data, for instance, working with a government to build a smart city traffic system or with a mobility service startup to analyze and correlate car data across OEMs.

This evolution is just the beginning of the usage of streaming data. I have seen many customers build a first streaming pipeline for one use case. However, new business divisions will leverage the data for innovation once the platform is in place.

The Data is in Motion in Automotive and Manufacturing

The landscape for Apache Kafka in the automotive and manufacturing industry shows that Apache Kafka is the central nervous system of many applications in various areas for processing analytical and transactional data in motion.

This article explored use cases such as connected vehicles, smart manufacturing, supply chain optimization, aftersales, mobility services, and innovative new business models. The possibilities for data in motion are almost endless. The automotive and manufacturing industry is still in the very early stages of leveraging data in motion.

Where do you use Apache Kafka and its ecosystem in the automotive and manufacturing industry? Do you deploy in the public cloud, in your data center, or at the edge outside a data center? What other technologies do you combine with Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka Landscape for Automotive and Manufacturing appeared first on Kai Waehner.

]]>
Apache Kafka for Conversational AI, NLP and Chatbot https://www.kai-waehner.de/blog/2021/12/08/apache-kafka-for-conversational-ai-nlp-chatbot/ Wed, 08 Dec 2021 00:48:54 +0000 https://www.kai-waehner.de/?p=3958 Natural Language Processing (NLP) helps many projects in the real world for service desk automation, customer conversation with a chatbot, content moderation in social networks, and many other use cases. Learn how event streaming with Apache Kafka is used in conjunction with Machine Learning platforms at the carmaker BMW, the online travel and booking portal Expedia, and the dating app Tinder for reliable real-time conversational AI, NLP, and chatbots.

The post Apache Kafka for Conversational AI, NLP and Chatbot appeared first on Kai Waehner.

]]>
Natural Language Processing (NLP) helps many projects in the real world for service desk automation, customer conversation with a chatbot, content moderation in social networks, and many other use cases. Apache Kafka became the predominant orchestration layer in these machine learning platforms for integrating various data sources, processing at scale, and real-time model inference. This article shows how companies across different industries, such as the carmaker BMW, the online travel and booking portal Expedia, and the dating app Tinder, leverage the combination of event streaming with machine learning for reliable real-time conversational AI, NLP, and chatbots.

Conversational AI NLP and Chatbot with Apache Kafka

Natural Language Processing (NLP)

Natural language processing is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, mainly how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of text, documents, and speech, including the contextual nuances of the language within them. The technology can then accurately extract information and insights from the text and categorize and organize the text itself.

Use cases for natural language processing frequently involve speech recognition, natural language understanding, translation, and natural language generation.

Like most other ML concepts, NLP can use an ocean of different algorithms. NLP took off in the 2010s when representation learning and deep neural network-style machine learning methods became widespread. Modern ML frameworks such as TensorFlow, in conjunction with the elastic compute power of the public cloud, enabled NLP usage for companies of any size.

Conversational AI has become mainstream because of Deep Learning

A chatbot is a software application used to conduct an online chat conversation via text or text-to-speech instead of direct contact with a live human agent. Designed to convincingly simulate how a human behaves as a conversational partner, chatbot systems typically require continuous tuning and testing, and many in production remain unable to converse adequately. Chatbots are used in dialog systems for various purposes, including customer service, request routing, and information gathering.

discover.bot published an excellent article explaining how a chatbot works behind the scenes using NLP. It uses a combination of Natural Language Understanding (NLU) and Natural Language Generation (NLG):

Chatbot Architecture with NLU and NLG

Like any machine learning/deep learning application, a chatbot requires model training (= teaching a chatbot) and model scoring (= applying the chatbot in a dialog with a human). Therefore, building a chatbot is a machine learning problem with related tools, APIs, and cloud services. How does event streaming with Kafka fit into this story?

Machine Learning / NLP and Kafka – An Impedance Mismatch

I wrote about Machine Learning and Kafka a lot in the past. TL;DR: Machine Learning requires data integration and processing at scale, and model predictions often require reliable and robust real-time applications. That’s where Kafka and its ecosystem fit into the story:

Kafka Machine Learning Architecture for Java Python Kafka Connect

For more details:

These articles describe different workarounds to solve the impedance mismatch. A few examples:

  • Some teams let the data science teams deploy Python in Docker containers in production and integrate it with other applications and programming platforms like Java, Go, or .NET / C++.
  • A few projects use Faust as a Kafka-native streaming Python library (with several limitations compared to Kafka Streams or ksqlDB).
  • Embedding NLP models (trained with any machine learning framework using any programming language, including Python) into a native Kafka application for robust model scoring is a well-known option.
  • Last but not least, many model servers added Kafka-native streaming interfaces directly using the Kafka protocol as an alternative to RPC communication via HTTP or gRPC.

Let’s now look at a few real-world examples for machine learning, NLP, chatbots, and the Kafka ecosystem in companies such as BMW, Expedia, and Tinder.

BMW – Kafka as Orchestration Layer for NLP and Chatbots

The automotive company BMW presented innovative NLP services at Kafka Summit as early as 2019. It is no surprise that a carmaker has various NLP scenarios. These include digital contract intelligence, a workplace assistant, machine translation, and customer conversations. The latter covers multiple use cases for conversational AI:

  • Service desk automation
  • Speech analysis of customer interaction center (CIC) calls to improve the quality
  • Self-service using smart knowledge bases
  • Agent support
  • Chatbots

The text and speech data is structured, enriched, contextualized, summarized, and translated to build real-time decision support applications. Kafka is a crucial component of BMW’s ML and NLP architecture. The real-time integration and data correlation enable interactive and interoperable data consumption and usage:

NLP Service Framework Based on Kafka at BMW

BMW explained the key advantages of leveraging Kafka and its stream processing library Kafka Streams as the real-time integration and orchestration platform:

  • Flexible integration: Multiple supported interfaces for different deployment scenarios, including various machine learning technologies, programming languages, and cloud providers
  • Modular end-to-end pipelines: Services can be connected to provide full-fledged NLP applications.
  • Configurability: High agility for each deployment scenario

Expedia – Conversations Platform powered by Cloud-native Kafka

Expedia is a leading online travel and booking platform. They have many use cases for machine learning. One of my favorite examples is their Conversations Platform built on Kafka and Confluent Cloud to provide an elastic cloud-native application.

The goal of Expedia’s Conversations Platform was simple: Enable millions of travelers to have natural language conversations with an automated agent via text, Facebook, or their channel of choice. Let them book trips, make changes or cancellations, and ask questions:

  • “How long is my layover?”
  • “Does my hotel have a pool?”
  • “How much will I get charged if I want to bring my golf clubs?”

Then take all that is known about that customer across all of Expedia’s brands and apply machine learning models to give customers what they are looking for immediately in real-time and automatically, whether a straightforward answer or a complex new itinerary.

Real-time Orchestration realized in four Months

Such a platform is no place for batch jobs, back-end processing, or offline APIs. To quickly make decisions that incorporate contextual information, the platform needs data in near real-time, and it needs it from a wide range of services and systems. Meeting these needs meant architecting the Conversations Platform around a central nervous system based on Confluent Cloud and Apache Kafka. Kafka made it possible to orchestrate data from loosely coupled systems, enrich data as it flows between them so that by the time it reaches its destination, it is ready to be acted upon, and surface aggregated data for analytics and reporting.

Expedia built this platform from zero to production in four months. That’s the tremendous advantage of using a fully managed serverless event streaming platform as the foundation. The project team can focus on the business logic.

The Covid pandemic proved the idea of an elastic platform: Companies were hit with a tidal wave of customer questions, cancellations, and re-bookings. Throughout this once-in-a-lifetime event, the Conversations Platform proved up to the challenge, auto-scaling as necessary and taking off much of the load of live agents.

Expedia’s Migration from MQ to Kafka as Foundation for Real-time Machine Learning and Chatbots

As part of their Conversations Platform, Expedia needed to modernize their IT infrastructure, as Ravi Vankamamidi, Director of Technology at Expedia Group, explained in a Kafka Summit keynote.

Expedia’s legacy chatbot service relied on an outdated messaging system. It was a question-and-answer board with a very limited scope for booking scenarios, and it could only handle two-party conversations. It could not scale to bring all the different systems into one architecture to build a powerful chatbot that is genuinely helpful in customer conversations.

I explored several times that event streaming is more than just a (scalable) message queue. Check out my old (but still accurate and relevant) Comparison between MQ and Kafka, or the newer comparison between cloud-native iPaaS and Kafka.

Expedia needed a service that was closer to travel assistance. It needed to handle context-specific, multi-party, multi-channel conversations. Hence, features such as natural language processing, translation, and real-time analytics were required. The full service needed to be scalable across multiple brands. Therefore, a fast and highly scalable platform with ordering guarantees, exactly-once semantics (EOS), and real-time data processing was needed.

The Kafka-native event streaming platform powered by Confluent was the best choice and met all requirements. One year after the rollout, the new Conversations Platform had doubled the Net Promoter Score (NPS), quickly proving its business value.

Tinder – Content Moderation with Kafka Streams and ML

The dating app Tinder is a great example where I can think of tens of use cases for NLP. Tinder talked at a past Kafka Summit about their Kafka-powered machine learning platform.

Tinder is a massive user of Kafka and its ecosystem for various use cases, including content moderation, matching, recommendations, reminders, and user reactivation. They used Kafka Streams as a Kafka-native stream processing engine for metadata processing and correlation in real-time at scale:

Impact of Apache Kafka at Tinder

A critical use case in any dating or social platform is content moderation for detecting fakes, filtering sexual content, and other inappropriate things. Content moderation combines NLP and text processing (e.g., for chat messages) with image processing (e.g., selfie uploads) or processes the metadata with Kafka and stores the linked content in a data lake. Both leverage Deep Learning to process high volumes of text and images. Here is what content moderation looks like in Tinder’s Kafka architecture:
Content Moderation at Tinder with Kafka and Machine Learning

Plenty of ways exist to process text, images, and videos with the Kafka ecosystem. I wrote a detailed article about handling large messages and files with Apache Kafka to explore the options and trade-offs.

Chatbots can also play a key role the other way round: more and more dating apps (and other social networks) fight spam, fraud, and automated chatbots. Similar to building a chatbot, a chatbot detection system can analyze the data streams to block bots within a dating app.

Let’s now explore how a developer can build a Kafka-native NLP application.

Telegram Bot API – ML in a Streaming App

Many project teams build their own chatbot or other NLP services. Unfortunately, this is a considerable effort and often not cost-efficient. A simpler and more cost-efficient alternative is integrating an NLP or chatbot API as a service. My colleague Robin Moffatt wrote a great post about building a Telegram-powered chatbot with Kafka and ksqlDB, where the chatbot API is integrated into the real-time Kafka application. This way, as in the Expedia example above, the NLP application integrates in real time, in a truly decoupled fashion, with other applications in the enterprise architecture.

Telegram Chatbot integrated with ksqlDB

This example uses Telegram, a messaging platform similar in concept to WhatsApp and Facebook Messenger. A nice Bot API offers integration via a REST API. While the example uses Telegram, the same approach works just fine with a bot on your platform of choice (Slack, etc.) or within a standalone application that wants to look up state being populated and maintained from a stream of events in Kafka.

Reddit Text Processing – NLP with Streaming Model Predictions

The drawback of the above Telegram example is the REST API integration for the chatbot. Remote procedure calls (RPC) are still predominant in machine learning. However, RPC is an anti-pattern in the event streaming world, as it creates challenges concerning robustness, latency, scalability, and error handling. RPC integration is fine for many use cases, but given the alternatives, there is often no need for RPC calls.

Kafka-native streaming machine learning is an alternative for model predictions. The two deployment options are embedding an analytic model into the Kafka application or using a model server that supports event streaming in addition to RPC (HTTP/gRPC) calls. I wrote a detailed article with the pros and cons of both approaches for model predictions using Kafka applications.
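
As a hedged sketch of the first option, embedding looks like an ordinary stream transformation. Note that the Model class with its load() and predict() methods is a hypothetical placeholder for whatever API your ML framework provides:

```java
// Load the pre-trained model once, locally, when the application starts.
// "Model" is a hypothetical stand-in for your ML framework's API.
Model model = Model.load("/models/text-classifier");

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> texts = builder.stream("incoming-texts");

// Score every event in-process: no RPC hop and no separate model-serving
// cluster on the critical path.
texts.mapValues(text -> model.predict(text))
     .to("text-predictions");
```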

The following shows an example of a Kafka-native integration between a model server and other applications using the Kafka protocol:

Kafka-native Machine Learning Model Server Seldon

The Seldon model server is an example that already supports the Kafka interface. The Seldon team demoed how they train and deploy a machine learning model leveraging a scalable stream processing architecture for an automated text prediction use case. They use Scikit-learn (sklearn) and spaCy to train an ML model from the Reddit Content Moderation dataset and deploy that model using Seldon Core for real-time processing of text data from Kafka streams:

Seldon Model Server with Kafka Support using Python scikit-learn and SpaCy

Kafka-native NLP to build the next Conversational AI and Chatbot

Apache Kafka became the de facto standard for event streaming. One pillar of Kafka use cases is ML platforms, covering various NLP-related concepts such as conversational AI, chatbots, and speech translation for improving and automating service desks, content moderation, and plenty of other use cases.

A Kafka-based orchestration and integration layer provides true decoupling in a scalable real-time platform. The benefits for ML platforms include back-pressure handling, pre-processing, and aggregations at any scale in real time. Another benefit is the capability to connect different communication paradigms and technologies. The examples from BMW, Expedia, and Tinder showed what a Kafka-based NLP infrastructure can look like.

How do you build conversational AI, chatbots, and other NLP applications? What technologies and architectures do you use? Are event streaming and Kafka part of the architecture? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka for Conversational AI, NLP and Chatbot appeared first on Kai Waehner.

]]>
Apache Kafka for Supply Chain Management (SCM) Optimization https://www.kai-waehner.de/blog/2020/12/15/apache-kafka-supply-chain-management-scm-optimization-scor-six-sigma-real-time/ Tue, 15 Dec 2020 09:43:46 +0000 https://www.kai-waehner.de/?p=2872 Supply Chain optimization leveraging Event Streaming with Apache Kafka. See real-world use cases and architectures from Walmart, BMW, Porsche, and other enterprises to improve the Supply Chain Management (SCM) processes. Automation, robustness, flexibility, real-time, decoupling, data integration, and hybrid deployments...

Supply Chain Management (SCM) involves planning and coordinating all the people, processes, and technology involved in creating value for a company. This covers cross-cutting processes such as purchasing/procurement, logistics, operations/manufacturing, and others. Automation, robustness, flexibility, real-time capabilities, and hybrid deployment (edge + cloud) are key for future success, no matter what industry. This blog explores how Apache Kafka helps optimize a supply chain by providing decoupled microservices, data integration, real-time analytics, and more…

The following topics are covered:

  • Definition of supply chain management
  • Challenges of current supply chains
  • Event streaming to optimize the supply chain
  • Use cases and real-world enterprise examples for Apache Kafka deployments

Supply Chain Optimization with Apache Kafka and SCM

Supply Chain Management (SCM)

Supply chain management (SCM) covers the management of the flow of goods and services. It involves the movement and storage of raw materials, work-in-process inventory, and finished goods, as well as end-to-end order fulfillment from the point of origin to the point of consumption. Interconnected, interrelated, or interlinked networks, channels, and node businesses combine to provide the products and services required by end customers in a supply chain.

SCM is often defined as the design, planning, execution, control, and monitoring of supply chain activities to create net value, build a competitive infrastructure, leverage worldwide logistics, synchronize supply with demand, and measure performance globally.

Supply chain management is the broad range of activities required to plan, control, and execute a product’s flow from materials to production to distribution in the most economical way possible. SCM encompasses the integrated planning and execution of processes required to optimize the flow of materials, information, and capital in functions that broadly include demand planning, sourcing, production, inventory management, and logistics.

SCOR (Supply-Chain Operations Reference Model)

There are a variety of supply-chain models, which address both the upstream and downstream elements of SCM. The SCOR (Supply-Chain Operations Reference) model was developed by an industry consortium and the non-profit Supply Chain Council (now part of APICS). SCOR became the cross-industry de facto standard defining the scope of supply-chain management.

SCOR measures total supply-chain performance. It is a process reference model for supply-chain management, spanning from the supplier’s supplier to the customer’s customer. It includes delivery and order fulfillment performance, production flexibility, warranty and returns processing costs, inventory and asset turns, and other factors in evaluating a supply chain’s overall effective performance.

Here is an example of the SCOR framework levels:

SCOR - Supply-Chain Operations Reference Model

Challenges within the Evolving Supply Chain Processes in a Digital Era

The above definition of SCM and the related SCOR model shows how complex supply chain processes and solutions are. Here are the top five challenges of supply chains:

  • Time Frames are Shorter
  • Rapid Change
  • Historical Models are No Longer Viable
  • Lack of Visibility
  • Zoo of Technologies and Products

Let’s explore the challenges in more detail…

Challenge 1: Time Frames are Shorter

Time Frames are Shorter

Challenge 2: Rapid Change

Challenges Of Rapid Change

Challenge 3: Historical Models are No Longer Viable

Historical Models are no Longer Viable

Challenge 4: Lack of Visibility

Lack of plan visibility leads to inventory and resource utilization imbalances. Imbalances mean waste (overproduction) and uncaptured revenue (underproduction).

Challenge 5: Zoo of Technologies and Products

A zoo of supply chain technologies and products needs to be integrated and modernized. Here are a few examples:

Supply Chain Management Software - Zoo of Products including SCM MES CRM PLM WMS LMS

Check out my blog post about “integration alternatives and connectors for Apache Kafka and SAP standard software” to explore what such an integration environment typically looks like.

A more detailed explanation of these supply chain challenges (and the related solutions) is available in the Confluent webinar recording I did together with experts from Expero: Supply Chain Optimization with Event Streaming and the Apache Kafka Ecosystem.

Consequences of the Supply Chain Challenges

The consequences of all these challenges are horrible for an enterprise supply chain:

  • Missed orders
  • Lost revenue
  • Expediting fees
  • Contract penalties
  • Frustrated customers

So let’s talk about how event streaming with Apache Kafka can help to fix these problems.

Why Apache Kafka for Supply Chain Optimization?

Solving the requirements described above usually requires several of the characteristics and features that Kafka and its ecosystem provide within a single technology and infrastructure:

  • Real-time messaging (at scale, mission-critical)
  • Global Kafka (edge, data center, multi-cloud)
  • Cloud-native (open, flexible, elastic)
  • Data integration (legacy + modern protocols, applications, communication paradigms)
  • Data correlation (real-time + historical data, omni-channel)
  • Real decoupling (not just messaging, but also infinite storage + replayability of events; see the replay sketch after this list)
  • Real-time monitoring
  • Transactional data and analytics data (MES, ERP, CRM, SCM, …)
  • Applied machine learning (model training and scoring)
  • Cybersecurity
  • Complementary to legacy and cutting-edge technology (Mainframe, PLCs, 3D printing, augmented reality, …)
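The “real decoupling” point deserves a concrete illustration. Because Kafka stores events durably, a brand-new consumer can replay the complete history of a topic, for example to bootstrap a new analytics service, without touching any existing application. A minimal sketch with the kafka-python client; the topic name and broker address are assumptions:

```python
# Minimal sketch of event replayability: a fresh consumer group reads the
# full retained history of a topic while existing consumers are unaffected.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "supply-chain-events",
    bootstrap_servers="localhost:9092",
    group_id="new-analytics-service",  # a fresh group id ...
    auto_offset_reset="earliest",      # ... starts at the oldest retained event
)
for event in consumer:
    print(event.offset, event.value)   # re-process the complete event history
```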

Use Cases for Apache Kafka in the Supply Chain

The supply chain is obviously a huge topic, and plenty of different use cases leverage Apache Kafka. Here are just a few examples to give a feeling for the breadth of possibilities:

Use Cases for Event Streaming with Apache Kafka in the Supply Chain

Examples for Supply Chain Optimization with Apache Kafka Across Industries

This section explores very different use cases at enterprises across industries, from carmakers (Audi, BMW, Porsche) and retailers (Walmart) to food machinery manufacturing (Baader). All content comes directly from the public talks and blog posts that their employees created and published. Exciting to see how many different problems event streaming can solve!

Manufacturing of Food Machinery @ Baader

BAADER is a worldwide manufacturer of innovative machinery for the food processing industry. They run an IoT-based and data-driven food value chain on Confluent Cloud.

The Kafka-based infrastructure provides a single source of truth across the factories and regions of the food value chain. Business-critical operations are available 24/7 for tracking, calculations, alerts, etc.

Food Supply Chain at Baader with Apache Kafka and Confluent Cloud

Integrated Sales, Manufacturing, Connected Vehicles and Charging Stations @ Porsche

Kafka provides real decoupling between applications. Hence, Kafka became the de facto standard for microservices and Domain-Driven Design (DDD). It allows building independent and loosely coupled, yet scalable, highly available, and reliable applications.

That’s exactly what Porsche describes for its usage of Apache Kafka across its supply chain:

“The recent rise of data streaming has opened new possibilities for real-time analytics. At Porsche, data streaming technologies are increasingly applied across a range of contexts, including warranty and sales, manufacturing and supply chain, connected vehicles, and charging stations,” writes Sridhar Mamella (Platform Manager Data Streaming at Porsche).

The following picture shows the event hub which Heiko Scholtes from PorscheDev explored in one of their blog posts:

Kafka as Decoupled Event Backbone for Microservices at Porsche

This architecture was published as early as mid-2017. Hence, Porsche has been using Kafka in its projects for a long time. That’s a pretty common pattern for Kafka: build one pipeline, then let more and more consumers use the data. Some consume in real time or near real time, others via batch processes or request-response interfaces.

The Confluent Podcast also features the story around Porsche’s event streaming platform Streamzilla, built on top of Kafka. Check out this podcast from December 2020 to hear directly from Porsche.

Real-Time Inventory System @ Walmart

A real-time inventory is a key piece of a modern supply chain. Many companies even require it to stay competitive and to provide a good customer experience. Business models such as “order online, pick up in the store” are impossible without real-time inventory and supply chain.
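Conceptually, a real-time inventory is a stateful stream processing job over order and shipment events. The following is a deliberately simplified sketch of the idea, not any retailer’s actual implementation; the topic names, event schema, and broker address are assumptions:

```python
# Simplified sketch: maintain a stock level per item from order and shipment
# events and trigger replenishment when stock runs out.
import json
from collections import defaultdict
from kafka import KafkaConsumer

stock = defaultdict(int)  # item id -> units on hand

consumer = KafkaConsumer(
    "shipments", "orders",  # assumed topic names
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for event in consumer:
    item, qty = event.value["item"], event.value["quantity"]
    if event.topic == "shipments":
        stock[item] += qty       # goods arriving at the store
    else:
        stock[item] -= qty       # goods sold online or offline
    if stock[item] <= 0:
        print(f"reorder {item}")  # trigger replenishment downstream
```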

Walmart is a great example. They leverage Apache Kafka as the heart of their supply chain:


Real-Time Inventory System at Walmart with Apache Kafka


Let’s quote Suman Pattnaik, Big Data Architect @ Walmart:

“Retail shopping experiences have evolved to include multiple channels, both online and offline, and have added to a unique set of challenges in this digital era. Having an up to date snapshot of inventory position on every item is an essential aspect to deal with these challenges. We at Walmart have solved this at scale by designing an event-streaming-based, real-time inventory system leveraging Apache Kafka… Like any supply chain network, our infrastructure involved a plethora of event sources with all different types of data”.

The real-time infrastructure around Apache Kafka spans the whole supply chain, including distribution centers, stores, vendors, and customers:

Walmart Real Time Inventory Management with Partners and Applications

Please find out more details about Walmart’s Kafka usage in their fantastic Kafka Summit talk.

Supply Chain Purchasing using Deep Learning @ BMW

BMW leverages real-time Natural Language Processing (NLP) in various use cases. For example, digital contract intelligence automates the processing and analysis of legal documents.

BMW built an industry-ready NLP service framework based on Kafka for smart information extraction and search, automated risk assessment, plausibility checks and negotiation support:

NLP Service Framework Based on Kafka at BMW

Check out the details in BMW’s Kafka Summit talk about their use cases for Kafka and Deep Learning / NLP.

Connected Cars for Aftersales and Customer 360 @ Audi

Audi has built a connected car infrastructure with Apache Kafka. Their Kafka Summit keynote explored the use cases and architecture:

Audi Connected Car Infrastructure for Aftersales with Apache Kafka

Use cases include:

  • Real-Time Data Analysis
  • Swarm Intelligence
  • Collaboration with Partners
  • Predictive AI

Depending on how you define the term and buzzword “Digital Twin”, this is a perfect example: all sensor data from the connected cars is processed in real time and stored for historical analysis and reporting.

Track&Trace for Construction Management @ Bosch

The global supplier Bosch has another impressive use case for a “Digital Twin” leveraging Apache Kafka and Confluent Cloud: construction site management analyzing sensors, machines, and workers. Use cases include collaborative planning, inventory and asset management, and tracking, managing, and locating tools and equipment anytime and anywhere:

Construction Management and Digital Twin at Bosch with Apache Kafka and Confluent Cloud

If you are interested in more details about building a digital twin with the Apache Kafka ecosystem, check out more material here: “Apache Kafka for Building a Digital Twin IoT Infrastructure“.

Kafka and Blockchain for Supply Chain Management

If there is one use case where blockchain really makes sense, then it is supply chain management. Blockchain provides features to support cross-company interaction securely. However, blockchain is also very complex and immature. I have not seen many projects where the added value is bigger than the added cost and risk. Often, Kafka is “good enough”. But let’s be clear: Kafka is NOT a Blockchain:

Apache Kafka is NOT a Blockchain

Having said this, please be aware:

  • Many blockchain products are not really a blockchain, but just a distributed ledger.
  • Many projects don’t require all the features of a blockchain.
  • Tamper-proof storage on disk and end-to-end payload encryption (often applied at the field/attribute level) are not part of Kafka but can be added with some nice add-ons.
  • Cross-company integration with non-trusted parties is the only real scenario where a blockchain is needed and adds value.

Hence, make sure to define all requirements and then evaluate if you need Kafka, a blockchain, or a combination of both:

Apache Kafka vs. Blockchain for Supply Chain Management SCM

If you want to learn more about the relation between Apache Kafka and blockchain projects, check out this material: “Apache Kafka as Part of a Blockchain Project and its Relation to Frameworks like Hyperledger and Ethereum“.

Slides and Video Recording

Here are the slides and video recording exploring the optimization of supply chains with the Apache Kafka ecosystem in more detail:

Slide Deck

Video Recording

What are your experiences with Supply Chain Management architectures, applications, and optimization? Did you already implement a more automated, scalable, real-time supply chain? Which approach works best for you? What is your strategy? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka for Supply Chain Management (SCM) Optimization appeared first on Kai Waehner.

Streaming Machine Learning with Kafka-native Model Deployment https://www.kai-waehner.de/blog/2020/10/27/streaming-machine-learning-kafka-native-model-server-deployment-rpc-embedded-streams/ Tue, 27 Oct 2020 15:07:50 +0000 https://www.kai-waehner.de/?p=2779 Apache Kafka became the de facto standard for event streaming across the globe and industries. Machine Learning (ML)…

Apache Kafka became the de facto standard for event streaming across the globe and industries. Machine Learning (ML) includes model training on historical data and model deployment for scoring and predictions. While training is mostly batch, scoring usually requires real-time capabilities at scale with high reliability. Apache Kafka plays a key role in modern machine learning infrastructures. The next-generation architecture leverages a Kafka-native streaming model server instead of RPC (HTTP/gRPC) calls:

Kafka-native Model Server for Machine Learning and Model Deployment

This blog post explores the architectures and trade-offs between three options for model deployment with Kafka: embedding the model into the Kafka application, a model server with RPC, and a model server with Kafka-native communication.

Kafka and Machine Learning Architecture

Model deployment is usually completely separated from model training (both from the process and the technology perspective). Model training is often executed in elastic cloud infrastructure or data lakes such as Hadoop or Spark. Model scoring (i.e., doing the predictions) is usually a mission-critical workload with different uptime SLAs and latency requirements.

Learn more about this architecture and the relation to modern ML approaches such as Hybrid ML architectures or AutoML in the blog post “Using Apache Kafka to Drive Cutting-Edge Machine Learning“.

There are two alternatives for model deployment in Kafka infrastructures: the model can either be embedded into the Kafka application, or it can be deployed into a separate model server. The blog post “Model Server and Model Deployment Options in a Kafka Infrastructure” covers the use cases and architectures in detail and explores some code examples.

The following explores the trade-offs of both options and introduces a third one: a Kafka-native streaming model server.

RPC between Kafka App and Model Server

The analytic model is deployed into a model server. The streaming Kafka application does a request-response call to send the input data to the model and to receive the prediction in return:

Kafka Machine Learning with Model Server and HTTP RPC

Almost every ML product or framework provides a model server. This includes open-source frameworks such as TensorFlow with its model server TF Serving, but also proprietary tools such as SAS, AutoML vendors such as DataRobot, and cloud ML services from all major cloud providers, such as AWS, Azure, and GCP.

Pros and Cons of a Model Server with RPC

Trade-offs of using Kafka in conjunction with an RPC-based model server via HTTP/gRPC:

Pros:

  • Simple integration with existing technologies and organizational processes
  • Easiest to understand if you come from a non-streaming world
  • Separation of concerns (e.g., Python model + Java streaming app)
  • Later migration to real streaming is also possible
  • Model management built-in for different models, versioning, and A/B testing
  • Model monitoring built-in

Cons:

  • Tight coupling of the availability, scalability, and latency/throughput between application and model server
  • Limited scalability and robustness

Example: TensorFlow Serving with gRPC and Kafka Streams

An example of this approach is available on Github: “TensorFlow Serving + gRPC + Java + Kafka Streams“. In this case, the TensorFlow model is exported and deployed into a TensorFlow Serving infrastructure.
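The linked example uses Java, Kafka Streams, and gRPC. To illustrate the request-response shape of this pattern, here is a simplified Python sketch against TensorFlow Serving’s REST endpoint instead; the model name, topic names, and broker address are assumptions:

```python
# Sketch of the RPC pattern: consume an event, make a blocking call to a
# TensorFlow Serving REST endpoint, and produce the prediction to Kafka.
import json
import requests
from kafka import KafkaConsumer, KafkaProducer

TF_SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"

consumer = KafkaConsumer("model-input", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

for record in consumer:
    # Blocking remote call: couples latency and availability to the model server
    response = requests.post(TF_SERVING_URL, json={"instances": [record.value]})
    prediction = response.json()["predictions"][0]
    producer.send("model-output", prediction)
```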

Embedded Model into Kafka Application

An analytic model is just a binary, i.e., a file stored in memory or persistent storage.

The data type differs and depends on the ML framework. But it does not matter if it is Java (e.g., with H2O), Protobuf (e.g., with TensorFlow), or proprietary (with many commercial ML tools).

Any model can be embedded into a Kafka application (if the ML solution provides programmatic APIs – but almost every tool does):

Kafka Machine Learning with Embedded TensorFlow Model

The Kafka application embedding the model can either be a Kafka-native stream processing engine such as Kafka Streams or ksqlDB, or a “regular” Kafka application using any Kafka client for Java, Scala, Python, Go, C, C++, etc.
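As a minimal sketch of the embedded approach, the following Python consumer loads a serialized scikit-learn model once and scores every event locally; the model file, topic names, and message schema are assumptions for illustration:

```python
# Sketch of the embedded approach: load the model once, score each event
# locally inside the consumer loop - no remote call to a model server.
import json
import joblib
from kafka import KafkaConsumer, KafkaProducer

model = joblib.load("fraud_model.joblib")  # e.g., a trained scikit-learn pipeline

consumer = KafkaConsumer("transactions", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

for record in consumer:
    score = model.predict([record.value["features"]])[0]  # local inference
    producer.send("scored-transactions", {**record.value, "score": int(score)})
```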

Pros and Cons of Embedding an Analytic Model into a Kafka Application

Trade-offs of embedding analytic models into a Kafka application:

Pros:

  • Best latency, as local inference replaces a remote call
  • No coupling of the availability, scalability, and latency/throughput of your Kafka Streams application to an external model server
  • Offline inference (devices, edge processing, etc.)
  • No side-effects (e.g., in case of failure); everything is covered by Kafka processing (e.g., exactly-once semantics)

Cons:

  • No built-in model management and monitoring

Example: Kafka Python Application with Embedded TensorFlow Model

A robust and scalable example of the embedded model approach is presented in Github project “Streaming Machine Learning with Kafka, MQTT, and TensorFlow for 100000 Connected Cars“. This demo uses Python for both model training and model deployment in separate, scalable containers in a Kubernetes infrastructure.

Several other (simpler) demos to try out this approach are available here: “Machine Learning + Kafka Streams Examples”. The examples use TensorFlow, H2O.ai, and DeepLearning4J (DL4J) in conjunction with Kafka Streams and ksqlDB.

Kafka-native Streaming Model Server

A Kafka-native streaming model server combines some characteristics from both approaches discussed above. It enables the separation of concerns by providing a model server with all the expected features. But the model server does not use RPC communication via HTTP/gRPC, avoiding all the drawbacks this creates for a streaming architecture. Instead, the model server communicates with the client application via the native Kafka protocol and Kafka topics:

Kafka-native Machine Learning Model Server Seldon
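From the client application’s perspective, the interaction then consists purely of Kafka topics: produce the input event, consume the prediction. A simplified sketch of that client side; the topic names and broker address are assumptions, and the exact topic naming depends on the model server:

```python
# Sketch of the client side of a Kafka-native model server: no direct RPC
# connection, only producing to and consuming from the server's topics.
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("model-input", {"text": "free $$$ click here"})  # fire and forget
producer.flush()

consumer = KafkaConsumer("model-output", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
for prediction in consumer:  # predictions arrive asynchronously via the topic
    print(prediction.value)
```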

Pros and Cons of a Kafka-native Streaming Model Server

Trade-offs of a Kafka-native streaming model server:

Pros:

  • Good latency via Kafka-native streaming communication
  • Separation of concerns (e.g., Python model + Java streaming app)
  • Model management built-in for different models, versioning, and A/B testing
  • Model monitoring built-in

Cons:

  • Some coupling of the availability, scalability, and latency/throughput of your Kafka Streams application
  • Some side-effects (e.g., in case of failure), but most potential issues are covered by Kafka processing (e.g., decoupling and persistence via Kafka topics)
  • Scalability and robustness of the model server are not necessarily Kafka-like (because the underlying implementation is often not Kafka-native yet)

A Kafka-native streaming model server provides many advantages of a streaming architecture and the features of a model server. Just be aware that a Kafka-native interface does NOT mean that the model server itself is implemented with Kafka under the hood. Hence, test your scalability, robustness, and latency requirements to decide if an embedded model might be a better approach.

Example: Streaming Model Deployment with Kafka and Seldon

Seldon is a robust and scalable open-source model server. It allows you to manage, serve, and scale models in any language or framework on Kubernetes.

In mid-2020, Seldon added a key differentiator compared to many other model servers on the market: support for Kafka. Hence, Seldon combines the advantages of a separate model server with the streaming paradigm of Kafka:

Seldon Model Server with Kafka Support using Python scikit-learn and SpaCy

The Jupyter notebook demonstrates this example using scikit-learn, the NLP framework spaCy, Seldon, and Kafka-native integration with producer and consumer applications. Check out the blog post “Real-Time Machine Learning at Scale using SpaCy, Kafka & Seldon Core” for more details.

All Model Deployment Options have Trade-Offs in Streaming Machine Learning Architectures

This blog post covered three alternatives for model deployment in a Kafka infrastructure: a model server with RPC, embedding models into the Kafka application, and a Kafka-native model server. All three have their trade-offs. Know them, and evaluate the right choice for your project. The good news is that it is pretty straightforward to change from one approach to another.

UPDATE May 2021: In the meantime, Dataiku also provides a native Kafka interface (including support for Schema Registry). Great to see different model servers adopting this architecture.

If you want to learn more details about Kafka-native model deployment, check out the following video recording and slide deck from Kafka Summit:

Event-Driven Model Serving - Stream Processing vs RPC with Kafka and TensorFlow

The talk does not cover the “streaming model server” approach (because no model server provided a Kafka-native interface in 2019). But you can still learn a lot about the different architectures and best practices.

If you want to learn more about “Streaming Machine Learning with Kafka – without another Data Lake“, check out the following video recording and slide deck. It explores a simplified architecture and the advantages of Tiered Storage for Kafka:

Apache Kafka Tiered Storage and TensorFlow for Streaming Machine Learning without a Data Lake

What are your experiences with Kafka and model deployment? What are your use cases? Which approach works best for you? What is your strategy? Let’s connect on LinkedIn and discuss it!

The post Streaming Machine Learning with Kafka-native Model Deployment appeared first on Kai Waehner.
