Shift Left Architecture for AI and Analytics with Confluent and Databricks
https://www.kai-waehner.de/blog/2025/05/09/shift-left-architecture-for-ai-and-analytics-with-confluent-and-databricks/
Kai Waehner, Fri, 09 May 2025 06:03:07 +0000

Confluent and Databricks enable a modern data architecture that unifies real-time streaming and lakehouse analytics. By combining shift-left principles with the structured layers of the Medallion Architecture, teams can improve data quality, reduce pipeline complexity, and accelerate insights for both operational and analytical workloads. Technologies like Apache Kafka, Flink, and Delta Lake form the backbone of scalable, AI-ready pipelines across cloud and hybrid environments.

Modern enterprise architectures are evolving. Traditional batch data pipelines and centralized processing models are being replaced by more flexible, real-time systems. One of the driving concepts behind this change is the Shift Left approach. This blog compares Databricks’ Medallion Architecture with a Shift Left Architecture popularized by Confluent. It explains where each concept fits best—and how they can work together to create a more complete, flexible, and scalable architecture.

Shift Left Architecture with Confluent Data Streaming and Databricks Lakehouse Medallion

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures.

Future articles will cover how these platforms affect data use in businesses. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. You can also download my free book about data streaming use cases, which includes more details about the shift left architecture with data streaming and lakehouses.

Medallion Architecture: Structured, Proven, but Not Always Optimal

The Medallion Architecture, popularized by Databricks, is a well-known design pattern for organizing and processing data within a lakehouse. It provides structure, modularity, and clarity across the data lifecycle by breaking pipelines into three logical layers:

  • Bronze: Ingest raw data in its original format (often semi-structured or unstructured)
  • Silver: Clean, normalize, and enrich the data for usability
  • Gold: Aggregate and transform the data for reporting, dashboards, and machine learning
Databricks Medallion Architecture for Lakehouse ETL
Source: Databricks

This layered approach is valuable for teams looking to establish governed and scalable data pipelines. It supports incremental refinement of data and enables multiple consumers to work from well-defined stages.
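To make the three layers concrete, here is a minimal sketch in plain Python. It is not Databricks code: the order records, the field names, and the `to_silver` / `to_gold` helpers are assumptions for illustration, and a real lakehouse would run these steps as Spark jobs over Delta tables.

```python
import json
from collections import defaultdict

# Hypothetical raw events as they might land in a Bronze table (unparsed strings).
bronze = [
    '{"order_id": "1", "country": "de", "amount": "19.90"}',
    '{"order_id": "2", "country": "DE", "amount": "5.10"}',
    '{"order_id": "3", "country": "us", "amount": "oops"}',  # malformed amount
]

def to_silver(raw_rows):
    """Clean and normalize: parse JSON, fix types, skip rows that cannot be repaired."""
    silver_rows = []
    for raw in raw_rows:
        row = json.loads(raw)
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine this row instead of dropping it
        silver_rows.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].upper(),
            "amount": amount,
        })
    return silver_rows

def to_gold(silver_rows):
    """Aggregate for reporting: revenue per country, rounded to cents."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["country"]] += row["amount"]
    return {country: round(total, 2) for country, total in revenue.items()}

silver = to_silver(bronze)
gold = to_gold(silver)
```

Each layer reads the output of the previous one, which is why every consumer can pick the stage that matches its needs.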

Challenges of the Medallion Architecture

The Medallion Architecture also introduces challenges:

  • Pipeline delays: Moving data from Bronze to Gold can take minutes or longer—too slow for operational needs
  • Infrastructure overhead: Each stage typically requires its own compute and storage footprint
  • Redundant processing: Data transformations are often repeated across layers
  • Limited operational use: Data is primarily at rest in object storage; using it for real-time operational systems often requires inefficient reverse ETL pipelines.

For use cases that demand real-time responsiveness and/or critical SLAs—such as fraud detection, personalized recommendations, or IoT alerting—this traditional batch-first model may fall short. In such cases, an event-driven streaming-first architecture, powered by a data streaming platform like Confluent, enables faster, more cost-efficient pipelines by performing validation, enrichment, and even model inference before data reaches the lakehouse.

Importantly, this data streaming approach doesn’t replace the Medallion pattern—it complements it. It allows you to “shift left” critical logic, reducing duplication and latency while still feeding trusted, structured data into Delta Lake or other downstream systems for broader analytics and governance.

In other words, shifting data processing left (i.e., before it hits a data lake or Lakehouse) is especially valuable when the data needs to serve multiple downstream systems—operational and analytical alike—because it avoids duplication, reduces latency, and ensures consistent, high-quality data is available wherever it’s needed.

Shift Left Architecture: Process Earlier, Share Faster

In a Shift Left Architecture, data processing happens earlier—closer to the source, both physically and logically. This often means:

  • Transforming and validating data as it streams in
  • Enriching and filtering in real time
  • Sharing clean, usable data quickly across teams AND different technologies/applications

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

This is especially useful for:

  • Reducing time to insight
  • Improving data quality at the source
  • Creating reusable, consistent data products
  • Operational workloads with critical SLAs

How Confluent Enables Shift Left with Databricks

In a Shift Left setup, Apache Kafka provides scalable, low-latency, and truly decoupled ingestion of data across operational and analytical systems, forming the backbone for unified data pipelines.

Schema Registry and data governance policies enforce consistent, validated data across all streams, ensuring high-quality, secure, and compliant data delivery from the very beginning.

Apache Flink enables early data processing — closer to where data is produced. This reduces complexity downstream, improves data quality, and allows real-time decisions and analytics.

Shift Left Architecture with Confluent Databricks and Delta Lake

Data Quality Governance via Data Contracts and Schema Validation

Flink can enforce data contracts by validating incoming records against predefined schemas (e.g., using JSON Schema, Apache Avro or Protobuf with Schema Registry). This ensures structurally valid data continues through the pipeline. In cases where schema violations occur, records can be automatically routed to a Dead Letter Queue (DLQ) for inspection.
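The validation and DLQ routing described here can be sketched in a few lines of plain Python. This is an illustration only, not the Schema Registry or Flink API: the `PAYMENT_CONTRACT` dictionary, the field names, and the `violations` / `route` helpers are all hypothetical.

```python
# Hypothetical data contract: required field name -> expected Python type.
PAYMENT_CONTRACT = {"payment_id": str, "amount": float, "currency": str}

def violations(record, contract):
    """Return a list of human-readable contract violations (empty if valid)."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems

def route(records, contract):
    """Split records into a valid stream and a dead letter queue."""
    valid, dlq = [], []
    for record in records:
        problems = violations(record, contract)
        if problems:
            # Keep the original record plus the reasons, so it can be inspected later.
            dlq.append({"record": record, "errors": problems})
        else:
            valid.append(record)
    return valid, dlq

events = [
    {"payment_id": "p-1", "amount": 12.5, "currency": "EUR"},
    {"payment_id": "p-2", "amount": "12.5", "currency": "EUR"},  # amount is a string
    {"payment_id": "p-3", "currency": "USD"},                    # amount is missing
]
valid, dlq = route(events, PAYMENT_CONTRACT)
```

Only structurally valid records continue downstream. The rejected ones carry their error messages with them, which makes the DLQ useful for debugging the producer.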

Confluent Schema Registry for good Data Quality, Policy Enforcement and Governance using Apache Kafka

Additionally, data contracts can enforce policy-based rules at the schema level—such as field-level encryption, masking of sensitive data (PII), type coercion, or enrichment defaults. These controls help maintain compliance and reduce risk before data reaches regulated or shared environments.
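As a rough sketch of such policy rules, the following Python snippet masks a sensitive field and fills in a default value. The `POLICY` structure and the `apply_policy` helper are assumptions made for this example and do not reflect how Confluent expresses data contract rules. Note that the masking here is a truncated hash (pseudonymization), not field-level encryption.

```python
import hashlib

# Hypothetical policy attached to a contract: fields to mask and defaults to fill.
POLICY = {
    "mask": ["email"],
    "defaults": {"country": "UNKNOWN"},
}

def apply_policy(record, policy):
    """Return a copy of the record with sensitive fields masked and defaults applied."""
    out = dict(record)
    for field in policy["mask"]:
        if field in out:
            digest = hashlib.sha256(out[field].encode("utf-8")).hexdigest()
            # Truncated SHA-256 digest as a stable pseudonym (hashing, not encryption).
            out[field] = digest[:12]
    for field, default in policy["defaults"].items():
        out.setdefault(field, default)
    return out

masked = apply_policy({"user_id": 7, "email": "jane@example.com"}, POLICY)
```

Because the same input always produces the same pseudonym, downstream consumers can still join or count by the masked field without seeing the original value.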

Flink can perform the following tasks before data ever lands in a data lake or warehouse:

Filtering and Routing

Events can be filtered based on business rules and routed to the appropriate downstream system or Kafka topic. This allows different consumers to subscribe only to relevant data, optimizing both performance and cost.
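A minimal sketch of rule-based filtering and routing, written in plain Python rather than Flink: the topic names, the 1000 threshold, and the `route_event` / `dispatch` helpers are hypothetical, and a real job would produce to Kafka topics instead of filling a dictionary.

```python
from collections import defaultdict

def route_event(event):
    """Return the target topic for an event, or None to drop it."""
    if event.get("test"):
        return None  # filter out synthetic test traffic
    if event["type"] == "order" and event["amount"] >= 1000:
        return "orders.high-value"
    if event["type"] == "order":
        return "orders.standard"
    return "events.other"

def dispatch(events):
    """Group events by target topic; a real job would produce to Kafka instead."""
    topics = defaultdict(list)
    for event in events:
        topic = route_event(event)
        if topic is not None:
            topics[topic].append(event)
    return dict(topics)

routed = dispatch([
    {"type": "order", "amount": 2500},
    {"type": "order", "amount": 40},
    {"type": "order", "amount": 99, "test": True},
    {"type": "click", "amount": 0},
])
```

A fraud team could subscribe only to `orders.high-value`, while a clickstream consumer never sees order traffic at all.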

Metric Calculation

Use Flink to compute rolling aggregates (e.g., counts, sums, averages, percentiles) over windows of data in motion. This is useful for business metrics, anomaly detection, or feeding real-time dashboards—without waiting for batch jobs.
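The idea of windowed aggregation can be illustrated with a simple tumbling (fixed, non-overlapping) window in plain Python. This sketch assumes events arrive as `(timestamp_in_seconds, value)` pairs and computes the result over a finished list. Flink does the same incrementally on unbounded streams and also handles late and out-of-order events, which this example ignores.

```python
from collections import defaultdict

def tumbling_window_stats(events, window_seconds):
    """Group (timestamp, value) pairs into fixed windows and compute count, sum, average."""
    windows = defaultdict(list)
    for timestamp, value in events:
        # Integer division assigns each event to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start].append(value)
    return {
        start: {"count": len(vals), "sum": sum(vals), "avg": sum(vals) / len(vals)}
        for start, vals in sorted(windows.items())
    }

events = [(0, 10), (15, 20), (59, 30), (60, 5), (125, 7)]
stats = tumbling_window_stats(events, window_seconds=60)
```

With a 60-second window, the first three events fall into the window starting at 0, the fourth into the window starting at 60, and the last into the window starting at 120.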

Real-Time Joins and Enrichment

Flink supports both stream-stream and stream-table joins. This enables real-time enrichment of incoming events with contextual information from reference data (e.g., user profiles, product catalogs, pricing tables), often sourced from Kafka topics, databases, or external APIs.
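A stream-table join boils down to looking up each incoming event in the current state of a reference table. The sketch below shows that in plain Python. The product catalog, the field names, and the `enrich` helper are hypothetical, and the table is a static dictionary here, whereas Flink keeps such a table up to date from a changelog.

```python
def enrich(orders, product_table):
    """Yield each order joined with the current product row (a lookup-style join)."""
    for order in orders:
        product = product_table.get(order["product_id"])
        if product is None:
            # Left-join behavior: keep the order, leave the enrichment fields empty.
            yield {**order, "product_name": None, "unit_price": None}
        else:
            yield {**order, "product_name": product["name"], "unit_price": product["price"]}

products = {"sku-1": {"name": "Espresso beans", "price": 12.0}}
orders = [
    {"order_id": 1, "product_id": "sku-1", "quantity": 2},
    {"order_id": 2, "product_id": "sku-9", "quantity": 1},  # unknown product
]
enriched = list(enrich(orders, products))
```

Whether unmatched events are kept, dropped, or sent to a DLQ is a design decision. This sketch keeps them so that no order is silently lost.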

Streaming ETL with Apache Flink SQL

By shifting this logic to the beginning of the pipeline, teams can reduce duplication, avoid unnecessary storage and compute costs in downstream systems, and ensure that data products are clean, policy-compliant, and ready for both operational and analytical use—as soon as they are created.

Example: A financial application might use Flink to calculate running balances, detect anomalies, and enrich records with reference data before pushing to Databricks for reporting and training analytic models.
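The financial example can be sketched as a small stateful function in Python. The account names, the amounts, and the fixed threshold are assumptions for illustration. A production job would keep the balances in Flink's fault-tolerant keyed state and would likely use a more robust anomaly rule than a single threshold.

```python
from collections import defaultdict

def process_transactions(transactions, threshold):
    """Yield each transaction with the account's running balance and an anomaly flag."""
    balances = defaultdict(float)  # keyed state, one balance per account
    for tx in transactions:
        balances[tx["account"]] += tx["amount"]
        yield {
            **tx,
            "balance": balances[tx["account"]],
            # Naive rule: flag any single movement larger than the threshold.
            "anomaly": abs(tx["amount"]) > threshold,
        }

txs = [
    {"account": "A", "amount": 100.0},
    {"account": "B", "amount": 50.0},
    {"account": "A", "amount": -30.0},
    {"account": "A", "amount": -5000.0},
]
results = list(process_transactions(txs, threshold=1000.0))
```

The enriched records already carry the balance and the flag when they reach the lakehouse, so reporting and model training do not need to recompute them.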

In addition to enhancing data quality and reducing time-to-insight in the lakehouse, this approach also makes data products immediately usable for operational workloads and downstream applications—without building separate pipelines.

Learn more about stateless and stateful stream processing in real-time architectures using Apache Flink in this in-depth blog post.

Combining Shift Left with Medallion Architecture

These architectures are not mutually exclusive. Shift Left is about processing data earlier. Medallion is about organizing data once it arrives.

You can use Shift Left principles to:

  • Pre-process operational data before it enters the Bronze layer
  • Ensure clean, validated data enters Silver with minimal transformation needed
  • Reduce the need for redundant processing steps between layers

Confluent’s Tableflow bridges the two worlds. It converts Kafka streams into Delta tables, integrating cleanly with the Medallion model while supporting real-time flows.

Shift Left with Delta Lake, Iceberg, and Tableflow

Confluent Tableflow makes it easy to publish Kafka streams into Delta Lake or Apache Iceberg formats. These can be discovered and queried inside Databricks via Unity Catalog.

This integration:

  • Simplifies integration, governance and discovery
  • Enables live updates to AI features and dashboards
  • Removes the need to manage Spark streaming jobs

This is a natural bridge between a data streaming platform and the lakehouse.

Confluent Tableflow to Unify Operational and Analytical Workloads with Apache Iceberg and Delta Lake
Source: Confluent

AI Use Cases for Shift Left with Confluent and Databricks

The Shift Left model benefits both predictive and generative AI:

  • Model training: Real-time data pipelines can stream features to Delta Lake
  • Model inference: In some cases, predictions can happen in Confluent (via Flink) and be pushed back to operational systems instantly
  • Agentic AI: Real-time event-driven architectures are well suited for next-gen, stateful agents

Databricks supports model training and hosting via MosaicML. Confluent can integrate with these models, or run lightweight inference directly from the stream processing application.

Data Warehouse Use Cases for Shift Left with Confluent and Databricks

  • Batch reporting: Continue using Databricks for traditional BI
  • Real-time analytics: Flink or real-time OLAP engines (e.g., Apache Pinot, Apache Druid) may be a better fit for sub-second insights
  • Hybrid: Push raw events into Databricks for historical analysis and use Flink for immediate feedback

Where you do the data processing depends on the use case.

Architecture Benefits Beyond Technology

Shift Left also brings architectural benefits:

  • Cost Reduction: Processing early can lower storage and compute usage
  • Faster Time to Market: Data becomes usable earlier in the pipeline
  • Reusability: Processed streams can be reused and consumed by multiple technologies/applications (not just Databricks teams)
  • Compliance and Governance: Validated data with lineage can be shared with confidence

These are important for strategic enterprise data architectures.

Bringing in New Types of Data

Shift Left with a data streaming platform supports a wider range of data sources:

  • Operational databases (like Oracle, DB2, SQL Server, Postgres, MongoDB)
  • ERP systems (SAP et al.)
  • Mainframes and other legacy technologies
  • IoT interfaces (MQTT, OPC-UA, proprietary IIoT gateway, etc.)
  • SaaS platforms (Salesforce, ServiceNow, and so on)
  • Any other system that does not directly fit into the “table-driven analytics perspective” of a Lakehouse

With Confluent, these interfaces can be connected in real time, enriched at the edge or in transit, and delivered to analytics platforms like Databricks.

This expands the scope of what’s possible with AI and analytics.

Shift Left Using ONLY Databricks

A Shift Left Architecture built only with Databricks is possible, too. A Databricks consultant took my Shift Left slide and adapted it accordingly:

Shift Left Architecture with Databricks and Delta Lake

Relying solely on Databricks for a “Shift Left Architecture” can work if all workloads stay (or should stay) within the platform, but it’s a poor fit for many real-world scenarios.

Databricks focuses on ELT, not true ETL, and lacks native support for operational workloads like APIs, low-latency apps, or transactional systems. This forces teams to rely on reverse ETL tools – a clear anti-pattern in the enterprise architecture – just to get data where it’s actually needed. The result: added complexity, latency, and tight coupling.

The Shift Left Architecture is valuable, but in most cases it requires a modular approach, where streaming, operational, and analytical components work together — not a monolithic platform.

That said, shift left principles still apply within Databricks. Processing data as early as possible improves data quality, reduces overall compute cost, and minimizes downstream data engineering effort. For teams that operate fully inside the Databricks ecosystem, shifting left remains a powerful strategy to simplify pipelines and accelerate insight.

Meesho: Scaling a Real-Time Commerce Platform with Confluent and Databricks

Many high-growth digital platforms adopt a shift-left approach out of necessity—not as a buzzword, but to reduce latency, improve data quality, and scale efficiently by processing data closer to the source.

Meesho, one of India’s largest online marketplaces, relies on Confluent and Databricks to power its hyper-growth business model focused on real-time e-commerce. As the company scaled rapidly, supporting millions of small businesses and entrepreneurs, the need for a resilient, scalable, and low-latency data architecture became critical.

To handle massive volumes of operational events — from inventory updates to order management and customer interactions — Meesho turned to Confluent Cloud. By adopting a fully managed data streaming platform using Apache Kafka, Meesho ensures real-time event delivery, improved reliability, and faster application development. Kafka serves as the central nervous system for their event-driven architecture, connecting multiple services and enabling instant, context-driven customer experiences across mobile and web platforms.

Alongside their data streaming architecture, Meesho migrated from Amazon Redshift to Databricks to build a next-generation analytics platform. Databricks’ lakehouse architecture empowers Meesho to unify operational data from Kafka with batch data from other sources, enabling near real-time analytics at scale. This migration not only improved performance and scalability but also significantly reduced costs and operational overhead.

With Confluent managing real-time event processing and ingestion, and Databricks providing powerful, scalable analytics, Meesho is able to:

  • Deliver real-time personalized experiences to customers
  • Optimize operational workflows based on live data
  • Enable faster, data-driven decision-making across business teams

By combining real-time data streaming with advanced lakehouse analytics, Meesho has built a flexible, future-ready data infrastructure to support its mission of democratizing online commerce for millions across India.

Shift Left: Reducing Complexity, Increasing Value for the Lakehouse (and other Operational Systems)

Shift Left is not about replacing Databricks. It’s about preparing better data earlier in the pipeline—closer to the source—and reducing end-to-end complexity.

  • Use Confluent for real-time ingestion, enrichment, and transformation
  • Use Databricks for advanced analytics, reporting, and machine learning
  • Use Tableflow and Delta Lake to govern and route high-quality data to the right consumers

This architecture not only improves data quality for the lakehouse, but also enables the same real-time data products to be reused across multiple downstream systems—including operational, transactional, and AI-powered applications.

The result: increased agility, lower costs, and scalable innovation across the business.

Confluent Data Streaming Platform vs. Databricks Data Intelligence Platform for Data Integration and Processing
https://www.kai-waehner.de/blog/2025/05/05/confluent-data-streaming-platform-vs-databricks-data-intelligence-platform-for-data-integration-and-processing/
Kai Waehner, Mon, 05 May 2025 03:47:21 +0000

This blog explores how Confluent and Databricks address data integration and processing in modern architectures. Confluent provides real-time, event-driven pipelines connecting operational systems, APIs, and batch sources with consistent, governed data flows. Databricks specializes in large-scale batch processing, data enrichment, and AI model development. Together, they offer a unified approach that bridges operational and analytical workloads. Key topics include ingestion patterns, the role of Tableflow, the shift-left architecture for earlier data validation, and real-world examples like Uniper’s energy trading platform powered by Confluent and Databricks.

Many organizations use both Confluent and Databricks. While these platforms serve different primary goals—real-time data streaming vs. analytical processing—there are areas where they overlap. This blog explores how the Confluent Data Streaming Platform (DSP) and the Databricks Data Intelligence Platform handle data integration and processing. It explains their different roles, where they intersect, and when one might be a better fit than the other.

Confluent and Databricks for Data Integration and Stream Processing


Data Integration and Processing: Shared Space, Different Strengths

Confluent is focused on continuous, event-based data movement and processing. It connects to hundreds of real-time and non-real-time data sources and targets. It enables low-latency stream processing using Apache Kafka and Flink, forming the backbone of an event-driven architecture. Databricks, on the other hand, combines data warehousing, analytics, and machine learning on a unified, scalable architecture.

Confluent: Event-Driven Integration Platform

Confluent is increasingly used as modern operational middleware, replacing traditional message queues (MQ) and enterprise service buses (ESB) in many enterprise architectures.

Thanks to its event-driven foundation, it supports not just real-time event streaming but also integration with request/response APIs and batch-based interfaces. This flexibility allows enterprises to standardize on the Kafka protocol as the data hub—bridging asynchronous event streams, synchronous APIs, and legacy systems. The immutable event store and true decoupling of producers and consumers help maintain data consistency across the entire pipeline, regardless of whether data flows in real-time, in scheduled batches or via API calls.

Batch Processing vs Event-Driven Architecture with Continuous Data Streaming

Databricks: Batch-Driven Analytics and AI Platform

Databricks excels in batch processing and traditional ELT workloads. It is optimized for storing data first and then transforming it within its platform, but it’s not built as a real-time ETL tool for directly connecting to operational systems or handling complex, upstream data mappings.

Databricks enables data transformations at scale, supporting complex joins, aggregations, and data quality checks over large historical datasets. Its Medallion Architecture (Bronze, Silver, Gold layers) provides a structured approach to incrementally refine and enrich raw data for analytics and reporting. The engine is tightly integrated with Delta Lake and Unity Catalog, ensuring governed and high-performance access to curated datasets for data science, BI, and machine learning.

For most use cases, the right choice is simple.

  • Confluent is ideal for building real-time pipelines and unifying operational systems.
  • Databricks is optimized for batch analytics, warehousing, and AI development.

Together, Confluent and Databricks cover both sides of the modern data architecture—streaming and batch, operational and analytical. And Confluent’s Tableflow and a shift-left architecture enable native integration with earlier data validation, simplified pipelines, and faster access to AI-ready data.

Data Ingestion Capabilities

Databricks recently introduced LakeFlow Connect and acquired Arcion to strengthen its capabilities around Change Data Capture (CDC) and data ingestion into Delta Lake. These are good steps toward improving integration, particularly for analytical use cases.

However, Confluent is the industry leader in operational data integration, serving as modern middleware for connecting mainframes, ERP systems, IoT devices, APIs, and edge environments. Many enterprises have already standardized on Confluent to move and process operational data in real time with high reliability and low latency.

Introducing yet another tool—especially for ETL and ingestion—creates unnecessary complexity. It risks a return to Lambda-style architectures, where separate pipelines must be built and maintained for real-time and batch use cases. This increases engineering overhead, inflates cost, and slows time to market.

Lambda Architecture - Separate ETL Pipelines for Real Time and Batch Processing

In contrast, Confluent supports a Kappa architecture model: a single, unified event-driven data streaming pipeline that powers both operational and analytical workloads. This eliminates duplication, simplifies the data flow, and enables consistent, trusted data delivery from source to sink.
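The core of the Kappa idea is one shared log that several consumers read at their own pace. Here is a minimal in-memory sketch in Python. The `EventLog` and `Consumer` classes are hypothetical stand-ins for a Kafka topic and its consumer groups, not the Kafka client API.

```python
class EventLog:
    """A tiny append-only log; stands in for a Kafka topic in this sketch."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset):
        return self._events[offset:]

class Consumer:
    """Each consumer tracks its own offset, so consumers never block each other."""
    def __init__(self, log, handler):
        self.log, self.handler, self.offset = log, handler, 0

    def poll(self):
        batch = self.log.read_from(self.offset)
        for event in batch:
            self.handler(event)
        self.offset += len(batch)
        return len(batch)

log = EventLog()
alerts, warehouse = [], []
# Operational consumer: only cares about out-of-stock events.
operational = Consumer(log, lambda e: alerts.append(e) if e["stock"] == 0 else None)
# Analytical consumer: keeps every event for later analysis.
analytical = Consumer(log, warehouse.append)

log.append({"sku": "a", "stock": 3})
log.append({"sku": "b", "stock": 0})
operational.poll()  # reacts right away
log.append({"sku": "c", "stock": 8})
analytical.poll()   # catches up later, in one larger batch
```

Both consumers read the same events from the same pipeline. The only difference is when they read and what they do with the data, which is exactly what removes the need for a second, batch-only pipeline.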

Kappa Architecture - Single Data Integration Pipeline for Real Time and Batch Processing

Confluent for Data Ingestion into Databricks

Confluent’s integration capabilities provide:

  • 100+ enterprise-grade connectors, including SAP, Salesforce, and mainframe systems
  • Native CDC support for Oracle, SQL Server, PostgreSQL, MongoDB, Salesforce, and more
  • Flexible integration via Kafka Clients for any relevant programming language, REST/HTTP, MQTT, JDBC, and other APIs
  • Support for operational sinks (not just analytics platforms)
  • Built-in governance, durability, and replayability

A good example: Confluent’s Oracle CDC Connector uses Oracle’s XStream API and delivers “GoldenGate-level performance”, with guaranteed ordering, high throughput, and minimal latency. This enables real-time delivery of operational data into Kafka, Flink, and downstream systems like Databricks.

Bottom line: Confluent offers the most mature, scalable, and flexible ingestion capabilities into Databricks—especially for real-time operational data. For enterprises already using Confluent as the central nervous system of their architecture, adding another ETL layer specifically for the lakehouse integration with weaker coverage and SLAs only slows progress and increases cost.

Stick with a unified approach—fewer moving parts, faster implementation, and end-to-end consistency.

Real-Time vs. Batch: When to Use Each

Batch ETL is well understood. It works fine when data does not need to be processed immediately—e.g., for end-of-day reports, monthly audits, or historical analysis.

Streaming ETL is best when data must be processed in motion. This enables real-time dashboards, live alerts, or AI features based on the latest information.

Confluent DSP is purpose-built for streaming ETL. Kafka and Flink allow filtering, transformation, enrichment, and routing in real time.

Databricks supports batch ELT natively. Delta Live Tables offers a managed way to build data pipelines on top of Spark and lets you declaratively define how data should be transformed and processed using SQL or Python. Spark Structured Streaming, on the other hand, can handle streaming data in near real time, but it still requires persistent clusters and infrastructure management.

If you’re already invested in Spark, Structured Streaming or Delta Live Tables might be sufficient. But if you’re starting fresh, or looking to simplify your architecture, Confluent’s Tableflow provides a more streamlined, Kafka-native alternative. Tableflow represents Kafka streams as Delta Lake tables. No cluster management. No offset handling. Just discoverable, governed data in Databricks Unity Catalog.

Real-Time and Batch: A Perfect Match at Walmart for Replenishment Forecasting in the Supply Chain

Walmart demonstrates how real-time and batch processing can work together to optimize a large-scale, high-stakes supply chain.

At the heart of this architecture is Apache Kafka, powering Walmart’s real-time inventory management and replenishment system.

Kafka serves as the central data hub, continuously streaming inventory updates, sales transactions, and supply chain events across Walmart’s physical stores and digital channels. This enables real-time replenishment to ensure product availability and timely fulfillment for millions of online and in-store customers.

Batch processing plays an equally important role. Apache Spark processes historical sales, seasonality trends, and external factors in micro-batches to feed forecasting models. These forecasts are used to generate accurate daily order plans across Walmart’s vast store network.

Replenishment Supply Chain Logistics at Walmart Retail with Apache Kafka and Spark
Source: Walmart

This hybrid architecture brings significant operational and business value:

  • Kafka provides not just low latency, but true decoupling between systems, enabling seamless integration across real-time streams, batch pipelines, and request-response APIs—ensuring consistent, reliable data flow across all environments
  • Spark delivers scalable, high-performance analytics to refine predictions and improve long-term planning
  • The result: reduced cycle times, better accuracy, increased scalability and elasticity, improved resiliency, and substantial cost savings

Walmart’s supply chain is just one of many use cases where Kafka powers real-time business processes, decisioning and workflow orchestration at global scale—proof that combining streaming and batch is key to modern data infrastructure.

Apache Flink supports both streaming and batch processing within the same engine. This enables teams to build unified pipelines that handle real-time events and batch-style computations without switching tools or architectures. In Flink, batch is treated as a special case of streaming—where a bounded stream (or a complete window of events) can be processed once all data has arrived.

This approach simplifies operations by avoiding the need for parallel pipelines or separate orchestration layers. It aligns with the principles of the shift-left architecture, allowing earlier processing, validation, and enrichment—closer to the data source. As a result, pipelines are more maintainable, scalable, and responsive.
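Treating batch as a bounded stream means the same processing logic can serve both modes. The following Python sketch illustrates the principle with generators. It is not Flink code, and the `running_total` function and the endless `sensor` source are made up for this example.

```python
import itertools

def running_total(events):
    """Streaming logic: emit the running sum after every event."""
    total = 0
    for value in events:
        total += value
        yield total

# Bounded input (batch): consume everything and keep only the final result.
batch_result = list(running_total([3, 4, 5]))[-1]

def sensor():
    """Hypothetical endless source that emits 1, 2, 3, ..."""
    for n in itertools.count(1):
        yield n

# Unbounded input (stream): the same function, read incrementally.
first_three = list(itertools.islice(running_total(sensor()), 3))
```

The function does not know whether its input ever ends. Only the caller decides whether to wait for the final result or to act on every intermediate one.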

That said, batch processing is not going away—nor should it. For many use cases, batch remains the most practical solution. Examples include:

  • Daily financial reconciliations
  • End-of-day retail reporting
  • Weekly churn model training
  • Monthly compliance and audit jobs

In these cases, latency is not critical, and workloads often involve large volumes of historical data or complex joins across datasets.

This is where Databricks excels—especially with its Delta Lake and Medallion architecture, which structures raw, refined, and curated data layers for high-performance analytics, BI, and AI/ML training.

In summary, Flink offers the flexibility to consolidate streaming and batch pipelines, making it ideal for unified data processing. But when batch is the right choice—especially at scale or with complex transformations—Databricks remains a best-in-class platform. The two technologies are not mutually exclusive. They are complementary parts of a modern data stack.

Streaming CDC and Lakehouse Analytics

Streaming CDC is a key integration pattern. It captures changes from operational databases and pushes them into analytics platforms. But CDC isn’t limited to databases. CDC is just as important for business applications like Salesforce, where capturing customer updates in real time enables faster, more responsive analytics and downstream actions.

Confluent is well suited for this. Kafka Connect and Flink can continuously stream changes. These change events are sent to Databricks as Delta tables using Tableflow. Streaming CDC ensures:

  • Data consistency across operational and analytical workloads leveraging a single data pipeline
  • Reduced ETL / ELT lag
  • Near real-time updates to BI dashboards
  • Timely training of AI/ML models

Streaming CDC also avoids data duplication, reduces latency, and minimizes storage costs.
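What happens on the receiving side of a CDC stream can be sketched as applying a sequence of change events to a keyed table. The event layout below (`op`, `key`, `after`) is a simplified assumption for illustration and does not match the exact format of any specific connector.

```python
def apply_changes(table, change_events):
    """Apply insert/update/delete events, keyed by primary key, in arrival order."""
    for event in change_events:
        key = event["key"]
        if event["op"] in ("insert", "update"):
            table[key] = event["after"]  # upsert the latest row image
        elif event["op"] == "delete":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {event['op']}")
    return table

changes = [
    {"op": "insert", "key": 1, "after": {"name": "Ada", "tier": "basic"}},
    {"op": "insert", "key": 2, "after": {"name": "Bob", "tier": "basic"}},
    {"op": "update", "key": 1, "after": {"name": "Ada", "tier": "gold"}},
    {"op": "delete", "key": 2, "after": None},
]
customers = apply_changes({}, changes)
```

Because events are applied in order per key, the analytical table converges to the same state as the source database without a full reload.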

Reverse ETL: An (Anti) Pattern to Avoid with Confluent and Databricks

Some architectures push data from data lakes or warehouses back into operational systems using reverse ETL. While this may appear to bridge the analytical and operational worlds, it often leads to increased latency, duplicate logic, and fragile point-to-point workflows. These tools typically reprocess data that was already transformed once, leading to inefficiencies, governance issues, and unclear data lineage.

Reverse ETL is an architectural anti-pattern. It violates the principles of an event-driven system. Rather than reacting to events as they happen, reverse ETL introduces delays and additional moving parts—pushing stale insights back into systems that expect real-time updates.

Data at Rest and Reverse ETL

With the upcoming bidirectional integration of Tableflow with Delta Lake, these issues can be avoided entirely. Insights generated in Databricks—from analytics, machine learning, or rule-based engines—can be pushed directly back into Kafka topics.

This approach removes the need for reverse ETL tools, reduces system complexity, and ensures that both operational and analytical layers operate on a shared, governed, and timely data foundation.

It also brings lineage, schema enforcement, and observability into both directions of data flow—streamlining feedback loops and enabling true event-driven decisioning across the enterprise.

In short: Don’t pull data back into operational systems after the fact. Push insights forward at the speed of events.

Multi-Cloud and Hybrid Integration with an Event-Driven Architecture

Confluent is designed for distributed data movement across environments in real-time for operational and analytical use cases:

  • On-prem, cloud, and edge
  • Multi-region and multi-cloud
  • Support for SaaS, BYOC, and private networking

Features like Cluster Linking and Schema Registry ensure consistent replication and governance across environments.

Databricks runs only in the cloud. It supports hybrid access and partner integrations. But the platform is not built for event-driven data distribution across hybrid environments.

In a hybrid architecture, Confluent acts as the bridge. It moves operational data securely and reliably. Then, Databricks can consume it for analytics and AI use cases. Here is an example architecture for industrial IoT use cases:

Data Streaming and Lakehouse with Confluent and Databricks for Hybrid Cloud and Industrial IoT

Uniper: Real-Time Energy Trading with Confluent and Databricks

Uniper, a leading international energy company, leverages Confluent and Databricks to modernize its energy trading operations.

Uniper - The beating of energy

I covered the value of data streaming with Apache Kafka and Flink for energy trading in a dedicated blog post already.

Confluent Cloud with Apache Kafka and Apache Flink provides a scalable real-time data streaming foundation for Uniper, enabling efficient ingestion and processing of market data, IoT sensor inputs, and operational events. This setup supports the full trading lifecycle, improving decision-making, risk management, and operational agility.

Apache Kafka and Flink integrated into the Uniper IT landscape

Within its Azure environment, Uniper uses Databricks to empower business users to rapidly build trading decision-support tools and advanced analytics applications. By combining a self-service data platform with scalable processing power, Uniper significantly reduces the lead time for developing data apps—from weeks to just minutes.

To deliver real-time insights to its teams, Uniper also leverages Plotly’s Dash Enterprise, creating interactive dashboards that consolidate live data from Databricks, Kafka, Snowflake, and various databases. This end-to-end integration enables dynamic, collaborative workflows, giving analysts and traders fast, actionable insights that drive smarter, faster trading strategies.

By combining real-time data streaming, advanced analytics, and intuitive visualization, Uniper has built a resilient, flexible data architecture that meets the demands of today’s fast-moving energy markets.

From Ingestion to Insight: Modern Data Integration and Processing for AI with Confluent and Databricks

While both platforms can handle integration and processing, their roles are different:

  • Use Confluent when you need real-time ingestion and processing of operational and analytical workloads, or data delivery across systems and clouds.
  • Use Databricks for AI workloads, analytics and data warehousing.

When used together, Confluent and Databricks form a complete data integration and processing pipeline for AI and analytics:

  1. Confluent ingests and processes operational data in real time.
  2. Tableflow pushes this data into Delta Lake in a discoverable, secure format.
  3. Databricks performs analytics and model development.
  4. Tableflow (bidirectional) pushes insights or AI models back into Kafka for use in operational systems.

This is the foundation for modern data and AI architectures—real-time pipelines feeding intelligent applications.
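
The four steps above can be sketched with in-memory stand-ins. This is only an illustration of the data flow: plain lists and dicts replace Kafka topics and Delta tables, and every function name, field name, and the `high_value` segment is a hypothetical choice, not part of Confluent's or Databricks' APIs.

```python
from collections import defaultdict


def process_stream(raw_events):
    """Step 1: validate and normalize events while they are in motion."""
    for event in raw_events:
        if event.get("customer_id") and event.get("amount", 0) > 0:
            yield {"customer_id": event["customer_id"],
                   "amount": float(event["amount"])}


def materialize(events):
    """Step 2: land curated events as table rows (a list stands in for a table)."""
    return list(events)


def analyze(rows):
    """Step 3: a trivial stand-in for analytics: total spend per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


def publish_insights(totals, threshold):
    """Step 4: turn analytical results into events for operational consumers."""
    return [{"customer_id": customer, "segment": "high_value"}
            for customer, total in sorted(totals.items())
            if total >= threshold]
```

Chaining the functions, `publish_insights(analyze(materialize(process_stream(raw))), 100)` yields events that an operational system could consume like any other stream.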

Stay tuned for deep dives into how these platforms are shaping the future of data-driven enterprises. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

The post Confluent Data Streaming Platform vs. Databricks Data Intelligence Platform for Data Integration and Processing appeared first on Kai Waehner.

Shift Left Architecture at Siemens: Real-Time Innovation in Manufacturing and Logistics with Data Streaming https://www.kai-waehner.de/blog/2025/04/11/shift-left-architecture-at-siemens-real-time-innovation-in-manufacturing-and-logistics-with-data-streaming/ Fri, 11 Apr 2025 12:32:50 +0000 https://www.kai-waehner.de/?p=7475 Industrial enterprises face increasing pressure to move faster, automate more, and adapt to constant change—without compromising reliability. Siemens Digital Industries addresses this challenge by combining real-time data streaming, modular design, and Shift Left principles to modernize manufacturing and logistics. This blog outlines how technologies like Apache Kafka, Apache Flink, and Confluent Cloud support scalable, event-driven architectures. A real-world example from Siemens’ Modular Intralogistics Platform illustrates how this approach improves data quality, system responsiveness, and operational agility.

The post Shift Left Architecture at Siemens: Real-Time Innovation in Manufacturing and Logistics with Data Streaming appeared first on Kai Waehner.

Industrial enterprises are under pressure to modernize. They need to move faster, automate more, and adapt to constant change—without sacrificing reliability or control. Siemens Digital Industries is meeting this challenge head-on by combining software, edge computing, and cloud-native technologies into a new architecture. This blog explores how Siemens is using data streaming, modular design, and Shift Left thinking to enable real-time decision-making, improve data quality, and unlock scalable, reusable data products across manufacturing and logistics operations. A real-world example for industrial IoT, intralogistics and shop floor manufacturing illustrates the architecture and highlights the business value behind this transformation.

Shift Left Architecture at Siemens with Stream Processing using Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including customer stories across all industries.

The Data Streaming Use Case Show: Episode #1 – Manufacturing and Automotive

These Siemens success stories are part of The Data Streaming Use Case Show, a new industry webinar series hosted by me.

In the first episode, we focus on the manufacturing and automotive industries. It features:

  • Experts from Siemens Digital Industries and Siemens Healthineers
  • The founder of IoT Use Case, a content and community platform focused on real-world industrial IoT applications
  • Deep insights into how industrial companies combine OT, IT, cloud, and data streaming with the shift left architecture

The Data Streaming Industry Use Case Show by Confluent with Host Kai Waehner

The series explores real-world solutions across industries, showing how leaders turn data into action through open architectures and real-time platforms.

Siemens Digital Industries: Company and Vision

Siemens Digital Industries is the technology and software arm of Siemens AG, focused on advancing industrial automation and digitalization. It empowers manufacturers and machine builders to become more agile, efficient, and resilient through intelligent software and integrated systems.

Its business model bridges the physical and digital worlds—combining operational technology (OT) with modern information technology (IT). From programmable logic controllers to industrial IoT, Siemens delivers end-to-end solutions across industries.

Today, the company is transforming itself into a software- and cloud-driven organization, focusing strongly on edge computing, real-time analytics, and data streaming as key enablers of modern manufacturing.

With edge and cloud working in harmony, Siemens helps industrial enterprises break up monoliths and develop toward modular, flexible architectures. These software-driven approaches make plants and factories more adaptive, intelligent, and autonomous.

Data Streaming at Industrial Companies

In industrial settings, data is continuously generated by machines, production systems, robots, and logistics processes. But traditional batch-oriented IT systems are not designed to handle this in real time.

To make smarter, faster decisions, companies need to process data as it is generated. That’s where data streaming comes in.

Apache Kafka and Apache Flink enable event-driven architectures. These allow industrial data to flow in real time, from edge to cloud, across hybrid environments.

Event-driven Architecture with Data Streaming using Kafka and Flink in Industrial IoT and Manufacturing

Check out my other blogs about use cases and architecture for manufacturing and Industrial IoT powered by data streaming.

Edge and Hybrid Cloud as a Standard

Modern industrial use cases are increasingly hybrid by design. Machines and controllers produce data at the edge. Decisions must be made close to the source. However, cloud platforms offer powerful compute and AI capabilities.

Industrial IoT Data Streaming Everywhere Edge Hybrid Cloud with Apache Kafka and Flink

Siemens leverages edge devices to capture and preprocess data on-site. Data streaming with Confluent provides Siemens a real-time backbone for integrating this data with cloud-based systems, including Snowflake, SAP, Salesforce, and others.

This hybrid architecture supports low latency, high availability, and full control over data processing and analytics workflows.

The Shift Left Architecture for Industrial IoT

In many industrial architectures, Kafka has traditionally been used to ingest data into analytics platforms like Snowflake or Databricks. Processing, transformation, and enrichment happened late in the data pipeline.

ETL and ELT Data Integration to Data Lake Warehouse Lakehouse in Batch

But Siemens is shifting that model.

The Shift Left Architecture moves processing closer to the source, directly into the streaming layer. Instead of waiting to transform data in a data warehouse, Siemens now applies stream processing in real time, using Confluent Cloud and Kafka topics.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

This shift enables faster decision-making, better data quality, and broader reuse of high-quality data across both analytical and operational systems.
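
A minimal sketch of the idea, assuming a hypothetical temperature reading with `machine_id`, `temperature`, and `unit` fields (none of these come from Siemens): the reading is validated and normalized once in the streaming layer, and every downstream consumer receives the same curated record instead of cleaning raw data on its own.

```python
def curate(reading):
    """Validate and normalize one raw sensor reading; return None if unusable."""
    value = reading.get("temperature")
    if value is None or "machine_id" not in reading:
        return None
    if reading.get("unit") == "F":
        value = (value - 32) * 5 / 9          # normalize Fahrenheit to Celsius
    return {"machine_id": reading["machine_id"],
            "temperature_c": round(value, 1)}


def shift_left(raw_readings, consumers):
    """Curate each reading once, then hand the same record to every consumer."""
    for reading in raw_readings:
        record = curate(reading)
        if record is not None:
            for consume in consumers:
                consume(record)
```

Here `consumers` is any list of callables, for example `[warehouse.append, dashboard.append]`. In a real deployment the curation step would run as a stream processing job and the consumers would read from a Kafka topic.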

For a deeper look at how Shift Left is transforming industrial architectures, read the full article about the Shift Left Architecture with Data Streaming.

Siemens Data Streaming Success Story: Modular Intralogistics Platform

A key example of this new architecture is Siemens’ Modular Intralogistics Platform, used in manufacturing plants for material handling and supply chain optimization. I explored the shift left architecture in our data streaming use case show with Stefan Baer, Senior Key Expert – Data Streaming at Siemens IT.

Traditionally, intralogistics systems were tightly coupled, with rigid integrations between:

  • Enterprise Resource Planning (ERP): Order management, master data
  • Manufacturing Operations Management (MOM): Production scheduling, quality, maintenance
  • Warehouse Execution System (EWM): Inventory, picking, warehouse automation
  • Execution Management System (eMS): Transport control, automated guided vehicle (AGV) orchestration, conveyor logic

The new approach breaks this down into packaged business capabilities, each one modular, orchestrated, and connected through Confluent Cloud.

Key benefits:

  • Real-time orchestration of logistics operations
  • Automated material delivery—no manual reordering required
  • ERP and MOM systems integrated flexibly via Kafka
  • High adaptability through modular components
  • GenAI used for package station load optimization

Stream processing with Apache Flink transforms events in motion. For example, when a production order changes or material shortages occur, the system reacts instantly—adjusting delivery routes, triggering alerts, or rebalancing station loads using AI.
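
The reaction logic can be pictured as a small rule function over events. The sketch below is plain Python rather than Flink, and the event types, field names, and action names are invented for illustration. It only shows the shape of the logic: keep a little state, evaluate each event as it arrives, and emit follow-up actions.

```python
def react(event, stock):
    """Map one logistics event to follow-up actions (illustrative rules only)."""
    actions = []
    if event["type"] == "production_order_changed":
        actions.append(("reroute_delivery", event["order_id"]))
    elif event["type"] == "material_consumed":
        material = event["material"]
        # Update running stock, then check it against the reorder point.
        stock[material] = stock.get(material, 0) - event["quantity"]
        if stock[material] < event["reorder_point"]:
            actions.append(("trigger_replenishment", material))
    return actions
```

The `stock` dictionary plays the role that managed keyed state would play in a stream processor: it survives between events so that a shortage is detected at the moment it occurs.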

Architecture: Data Products + Shift Left

At the heart of the solution is a combination of data products and stream processing:

  • Kafka Topics serve as real-time interfaces and persistency layer between business domains.
  • Confluent Cloud hosts the event streaming infrastructure as a fully-managed service with low latency, elasticity, and critical SLAs.
  • Stream processing with serverless Flink logic enriches and transforms data in motion.
  • Snowflake receives curated, ready-to-use data for analytics.
  • Other operational and analytical downstream consumers—such as GenAI modules or shop floor dashboards—access the same consistent data in real time.

Siemens Digital Industries - Modular Intralogistics Platform
Source: Siemens Digital Industries

This reuse of data products ensures consistent semantics, reduces duplication, and simplifies governance.

By processing data earlier in the pipeline, Siemens improves both data quality and system responsiveness. This model replaces brittle, point-to-point integrations with a more sustainable, scalable platform architecture.

Siemens Shift Left Architecture and Data Products with Data Streaming using Apache Kafka and Flink
Source: Siemens Digital Industries

Business Value of Data Streaming and Shift Left at Siemens Digital Industries

The combination of real-time data streaming, modular data products, and Shift Left design principles unlocks significant value:

  • Faster response to dynamic events in production and logistics
  • Improved operational resilience and agility
  • Higher quality data for both analytics and AI
  • Reuse across multiple consumers (analytics, operations, automation)
  • Lower integration costs and easier scaling

This approach is not just technically superior—it supports measurable business outcomes like shorter lead times, lower stock levels, and increased manufacturing throughput.

Siemens Healthineers: Shift Left with IoT, Data Streaming, AI/ML, Confluent and Snowflake in Manufacturing and Healthcare

In a recent blog post, I explored how Siemens Healthineers uses Apache Kafka and Flink to transform both manufacturing and healthcare with a wide range of data streaming use cases. From predictive maintenance to real-time logistics, their approach is a textbook example of how to modernize complex environments with an event-driven architecture and data streaming, even if they don’t explicitly label it “shift left.”

Siemens Healthineers Data Cloud Technology Stack with Apache Kafka and Snowflake
Source: Siemens Healthineers

Their architecture enables proactive decision-making by pushing real-time insights and automation earlier in the process. Examples include telemetry streaming from medical devices, machine integration with SAP and KUKA robots, and logistics event streaming from SAP for faster packaging and delivery. Each use case shows how real-time data—combined with cloud-native platforms like Confluent and Snowflake—improves efficiency, reliability, and responsiveness.

Just like the intralogistics example from Siemens Digital Industries, Healthineers applies shift-left thinking by enabling teams to act on data sooner, reduce latency, and prevent costly delays. This approach enhances not only operational workflows but also outcomes that matter, like patient care and regulatory compliance.

This is shift left in action: embedding intelligence and quality controls early, where they have the greatest impact.

Rethinking Industrial Data Architectures with Data Streaming and Shift Left Architecture

Siemens Digital Industries is demonstrating what’s possible when you rethink the data architecture beyond just analytics in a data lake.

With data streaming leveraging Confluent Cloud, data products for modular software, and a Shift Left approach, Siemens is transforming traditional factories into intelligent, event-driven operations. A data streaming platform based on Apache Kafka is no longer just an ingestion layer. It is a central nervous system for real-time processing and decision-making.

This is not about chasing trends. It’s about building resilient, scalable, and future-proof industrial systems. And it’s just the beginning.

To learn more, watch the on-demand industry use case show with Siemens Digital Industries and Siemens Healthineers or connect with us to explore what data streaming can do for your organization.

Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. And download my free book about data streaming use cases.

Online Model Training and Model Drift in Machine Learning with Apache Kafka and Flink https://www.kai-waehner.de/blog/2025/02/23/online-model-training-and-model-drift-in-machine-learning-with-apache-kafka-and-flink/ Sun, 23 Feb 2025 05:08:20 +0000 https://www.kai-waehner.de/?p=4971 The rise of real-time AI and machine learning is reshaping the competitive landscape. Traditional batch-trained models struggle with model drift, leading to inaccurate predictions and missed opportunities. Platforms like Apache Kafka and Apache Flink enable continuous model training and real-time inference, ensuring up-to-date, high-accuracy predictions.

This blog explores TikTok’s groundbreaking AI architecture, its use of data streaming for real-time recommendations, and how businesses can leverage Kafka and Flink to modernize their ML pipelines. I also examine how data streaming complements platforms like Databricks, Snowflake, and Microsoft Fabric to create scalable, adaptive AI systems.

The post Online Model Training and Model Drift in Machine Learning with Apache Kafka and Flink appeared first on Kai Waehner.

The landscape of artificial intelligence (AI) and machine learning (ML) is transforming rapidly. Online model training and model drift management are becoming essential for businesses that want to maintain a competitive edge. Data streaming with Apache Kafka and Apache Flink plays a crucial role in this evolution, enabling real-time updates and seamless integration into modern data infrastructures. This blog explores the challenges of model drift, investigates TikTok’s groundbreaking architecture, and highlights the business value and complementary nature of data streaming with other platforms.

Online Model Training and Model Drift in Machine Learning with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

Understanding Model Drift: The Achilles’ Heel of Static Models

Real-time model inference with a data streaming platform using Apache Kafka and Flink is a powerful solution for delivering fast and accurate predictions, as detailed in my model inference blog post, but it’s not enough to sustain long-term model accuracy.

Machine learning models degrade in accuracy over time due to shifts in data or concepts—a phenomenon known as model drift.

Model Drift in AI Machine Learning Over Time without Real Time Data Streaming

This can take several forms:

  1. Concept Drift: Changing relationships between input and output variables, such as shifting user behavior.
  2. Data Drift: Variations in data distribution, e.g., demographic shifts.
  3. Upstream Data Changes: Pipeline modifications, e.g., new logging formats or unavailable sources.

Unchecked, model drift leads to poor predictions and missed opportunities. Addressing it requires continuous updates, which online machine learning enables through data streaming platforms like Kafka and Flink.
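
Data drift can be monitored with a simple distribution comparison. The sketch below uses the Population Stability Index (PSI), a common metric chosen here for illustration rather than something prescribed by Kafka or Flink. It compares a reference sample of one numeric feature (for example, the training data) with a recent sample from the stream. The bin count and the small floor for empty bins are assumptions.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # fall back to 1.0 if all values are equal

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A PSI of zero means the two samples fall into the bins identically, and larger values indicate a stronger shift. A frequently quoted rule of thumb treats values above roughly 0.2 as worth investigating, but the right threshold depends on the feature and the model.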

TikTok’s recommendation system, detailed in ByteDance’s whitepaper, uses a real-time machine learning architecture powered by data streaming technologies like Kafka and Flink to deliver personalized content at scale. It integrates user behavior data, dynamic feature processing, and online model updates to drive user engagement and platform efficiency.

What Are ByteDance and TikTok?

ByteDance, TikTok’s parent company, is a Chinese technology giant renowned for its innovative use of AI and real-time ML. TikTok, its most famous product, has redefined user engagement through hyper-personalized video recommendations. TikTok employs real-time online machine learning, ensuring recommendations are dynamic, accurate, and engaging.

Why TikTok Outshines Competitors

While other social video platforms also leverage advanced machine learning for recommendations, TikTok’s architecture distinguishes itself by prioritizing real-time adaptability and hyper-personalization, ensuring it can respond to user behavior faster and more effectively than its competitors.

  • User Engagement: TikTok’s recommendation engine adapts in real-time, delivering hyper-relevant content that increases user retention.
  • Scalability: Unlike many platforms relying on periodic retraining, TikTok continuously updates its models, handling massive data streams with ease.
  • Speed: Real-time processing reduces latency in adapting to user behavior, a stark contrast to Facebook or YouTube’s delayed batch processes.

TikTok’s real-time recommendation system is built on a robust streaming data architecture:

ByteDance TikTok Real Time AI ML Recommender System powered by Apache Kafka and Flink
Source: ByteDance

Data Ingestion:

  • User interactions like views, likes, and shares are streamed in real-time via Kafka.
  • Kafka ensures reliable collection and distribution of high-velocity event data.

Feature Engineering:

  • Flink processes raw data streams, performing real-time feature extraction and enrichment.
  • Techniques like point-in-time lookups prevent training-inference skew, ensuring the same features are used in both phases.
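
A point-in-time lookup returns the feature value that was valid at the moment of the event, never a later one. The following is a minimal in-memory sketch of that idea using binary search. The `FeatureHistory` class is hypothetical and assumes that values are recorded in timestamp order. A production system would keep this state in a stream processor or a feature store.

```python
from bisect import bisect_right


class FeatureHistory:
    """Timestamped values for one feature of one entity, kept in time order."""

    def __init__(self):
        self._times = []
        self._values = []

    def record(self, ts, value):
        # Assumes records arrive in non-decreasing timestamp order.
        self._times.append(ts)
        self._values.append(value)

    def as_of(self, ts):
        """Return the latest value with timestamp <= ts, or None if none exists."""
        idx = bisect_right(self._times, ts)
        return self._values[idx - 1] if idx else None
```

When a training example is built for an interaction at time `t`, calling `as_of(t)` yields the same value the serving path would have seen at `t`, which is what keeps training and inference features aligned.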

Online Model Training:

  • Lightweight models are continuously updated with fresh data.
  • This approach mitigates model drift, ensuring predictions stay relevant and accurate.
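
To illustrate what "continuously updated" means, here is a small online learner that takes one gradient step per incoming event. Logistic regression is used as a stand-in. The source does not describe TikTok's actual models at this level, and the class name, learning rate, and feature layout are assumptions.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one event at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        """One stochastic gradient step on the log loss for a single event."""
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error
```

Each interaction event (features `x`, label `y` such as "watched to the end") is passed to `learn_one` as it arrives, so the model follows the data instead of waiting for the next batch retraining run.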

Real-Time Inference:

  • Updated models are deployed immediately to serve predictions.
  • TikTok’s architecture ensures latency is minimal, with recommendations delivered almost instantly.

This dynamic infrastructure has made TikTok a leader in real-time AI, setting a benchmark for others.

Apache Kafka and Flink are indispensable for organizations embracing online ML.

Data Streaming Ecosystem for AI Machine Learning with Apache Kafka and Flink

Data streaming addresses key challenges:

  • Training-Inference Data Skew: By streaming real-time features into models, Flink ensures consistency in model training and inference data.
  • Multi-Model Governance: Kafka and Flink integrate data with small models for enrichment and large models for complex decision-making, ensuring governance and modularity.
  • Scalability and Efficiency: Data streaming pipelines handle massive volumes with low latency, enabling real-time decision-making.

Complementing Other Data Platforms: Streaming Meets Analytics

Data streaming complements platforms like Databricks, Snowflake, and Microsoft Fabric, creating a seamless ecosystem for AI/ML workflows:

  • Databricks: While Databricks excels in large-scale batch processing and AI model training, Kafka adds real-time data ingestion and pre-processing capabilities.
  • Snowflake: Zero-ETL integration with Kafka and Flink allows for real-time analytics alongside Snowflake’s strong data warehousing and AI features.
  • Microsoft Fabric: Fabric’s AI-powered analytics gain agility from Kafka’s event-driven architecture, ensuring near-instant data availability.

Shift Left Architecture with Apache Iceberg as Open Table Format for Data Sharing

The Shift Left Architecture emphasizes moving from traditional batch processing and lakehouse-centric approaches to real-time data products, empowering businesses to act on data faster and with greater agility. Learn more about this transformative approach in my Shift Left Architecture blog post.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Meanwhile, Apache Iceberg, an open table format for lakehouses and streaming, ensures seamless data sharing across real-time and batch workflows by providing a unified view of data. Dive deeper into its capabilities in my Apache Iceberg blog post.

The Shift Left Architecture for Modern Data Architectures

This complementary relationship enables businesses to leverage best-in-class tools without trade-offs, providing both real-time and batch capabilities. Learn more in my comparison blog series “Data Streaming with Kafka and Flink vs. Snowflake” and “Microsoft Fabric and Apache Kafka”.

The adoption of real-time ML with Kafka and Flink drives tangible business outcomes:

  1. Enhanced User Engagement: Personalized recommendations lead to improved customer retention.
  2. Faster Time to Market: Real-time data pipelines reduce the lead time for deploying ML solutions.
  3. Improved ROI: Real-time adaptability ensures models deliver consistent business value.
  4. Freedom of Choice: Kafka acts as the backbone, enabling seamless integration with diverse tools and platforms.

This translates to a flexible, scalable, and high-performing ML infrastructure capable of handling evolving business demands.

Online machine learning with Apache Kafka and Flink is the future of adaptive, real-time AI. TikTok’s success story is a testament to the power of dynamic AI/ML systems in driving engagement and staying competitive. By complementing platforms like Snowflake, Databricks, and Microsoft Fabric, data streaming enables a holistic, future-proof data strategy.

Organizations must embrace these technologies to unlock faster time to market, unparalleled user experiences, and sustained business growth.

Let’s connect on LinkedIn and discuss how to implement these ideas in your organization. Stay informed about new developments by subscribing to my newsletter. And make sure to download my free book about data streaming use cases.

Tesla Energy Platform – The Power of Data Streaming with Apache Kafka https://www.kai-waehner.de/blog/2025/02/14/tesla-energy-platform-the-power-of-data-streaming-with-apache-kafka/ Fri, 14 Feb 2025 08:17:37 +0000 https://www.kai-waehner.de/?p=7340 Tesla’s Virtual Power Plant (VPP) turns thousands of home batteries, solar panels, and energy storage systems into a coordinated, intelligent energy network. By leveraging Apache Kafka for event streaming and WebSockets for real-time IoT connectivity, Tesla enables instant energy redistribution, dynamic grid balancing, and automated market participation. This event-driven architecture ensures millisecond-level decision-making, allowing homeowners to optimize energy usage and utilities to stabilize power grids. Tesla’s approach highlights how real-time data streaming and intelligent automation are reshaping the future of decentralized, resilient, and sustainable energy systems.

The post Tesla Energy Platform – The Power of Data Streaming with Apache Kafka appeared first on Kai Waehner.

Tesla’s Virtual Power Plant (VPP) is revolutionizing the energy sector by connecting home batteries, solar panels, and grid-scale storage into a real-time, intelligent energy network. Powered by Apache Kafka for event streaming and WebSockets for last-mile IoT integration, Tesla’s Energy Platform enables real-time energy trading, grid stabilization, and seamless market participation. By leveraging data streaming and automation, Tesla optimizes battery efficiency, prevents blackouts, and allows homeowners to monetize excess energy—all while making renewable energy more reliable and scalable. This software-driven approach showcases the power of real-time data in building the future of sustainable energy.

Tesla Energy Platform - The Power of Data Streaming with Apache Kafka

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases across all industries.

What is a Virtual Power Plant?

A Virtual Power Plant (VPP) is a network of decentralized energy resources—such as home batteries, solar panels, and smart grid systems—that function as a single unit. Unlike a traditional power plant that generates electricity from a centralized location, a VPP aggregates power from many small, distributed sources. This allows energy to be dynamically stored and shared, helping to balance supply and demand in real time.

VPPs are crucial in the shift to renewable energy. The traditional power grid was designed around fossil fuel plants that could easily adjust output. Renewable energy sources like solar and wind are intermittent—they don’t generate power on demand. By connecting thousands of batteries and solar panels in homes and businesses, a VPP can smooth out fluctuations in power generation and consumption. This prevents blackouts, reduces energy costs, and enables homes and businesses to participate in energy markets.

How Tesla’s Virtual Power Plant Fits Its Business Model

Tesla is not just an automaker. It is a sustainable energy company. Tesla’s product ecosystem includes electric vehicles, solar panels, home batteries (Powerwall), grid-scale energy storage (Megapack), and energy management software (Autobidder).

The Tesla Virtual Power Plant (VPP) ties all these elements together. Homeowners with Powerwalls store excess solar power during the day and feed it back to the grid when needed. Tesla’s Autobidder software automatically optimizes energy use and market participation, turning home batteries into revenue-generating assets.

For Tesla, the VPP strengthens its energy business, creating a scalable model that maximizes battery efficiency, stabilizes grids, and expands the role of software in energy markets. Tesla is not just selling batteries; it is selling energy intelligence.

Virtual Energy Platform and ESG (Environmental, Social, and Governance) Goals

Tesla’s energy platform is a perfect example of how data streaming and real-time decision-making align with ESG principles:

  • Environmental Impact: VPPs reduce reliance on fossil fuels by making renewable energy more reliable.
  • Social Benefit: By enabling energy independence, VPPs provide power during outages and extreme weather conditions.
  • Governance & Regulation: VPPs allow consumers to participate in energy markets, fostering decentralized energy ownership.

Tesla’s approach is smart grid innovation at scale: real-time data makes the grid more dynamic, efficient, and resilient.

My article “Green Data, Clean Insights: How Apache Kafka and Flink Power ESG Transformations” covers other real-world data streaming deployments in the energy sector, such as E.ON.

Tesla’s Energy Platform: A Network of Connected Home Energy Systems

Tesla’s VPP connects thousands of homes with Powerwalls, solar panels, and grid services. These systems work together to provide electricity on demand, reacting to supply fluctuations in real-time.

Key Functions of Tesla’s VPP:

  1. Energy Storage & Redistribution: Batteries store solar energy during the day and discharge at night or during peak demand.
  2. Grid Stabilization: The VPP balances energy supply and demand to prevent outages and fluctuations.
  3. Market Participation: Homeowners can sell excess power back to the grid, monetizing their batteries.
  4. Disaster Resilience: The VPP provides backup power during blackouts, storms, and grid failures.

This requires real-time data processing at massive scale—something traditional batch-based data architectures cannot handle.

Apache Kafka and Real-Time Data Streaming at Tesla

Tesla operates in many domains—automotive, energy, and AI. Across all these areas, Apache Kafka plays a critical role in enabling real-time data movement and stream processing.

In 2018, Tesla already processed trillions of IoT messages with Apache Kafka:

Tesla Automotive Journey from RabbitMQ to Apache Kafka for IoT Events
Source: Tesla

Tesla leverages stream processing to handle trillions of IoT events daily, using Apache Kafka to ingest, process, and analyze data from its vehicle fleet in real time. By implementing efficient data partitioning, fast and slow data lanes, and scalable infrastructure, Tesla optimizes vehicle performance, predicts failures, and enhances manufacturing efficiency.

These strategies demonstrate how real-time data streaming is essential for managing large-scale IoT ecosystems, ensuring low-latency insights while maintaining operational stability. To learn more about these use cases, read Tesla’s blog post “Stream Processing with IoT Data: Challenges, Best Practices, and Techniques”.

The following sections explore Tesla’s innovation for its virtual power plant, as discussed in an excellent presentation at QCon.

Tesla Energy Platform: Architecture of the Virtual Power Plant Powered by Apache Kafka

Tesla’s VPP uses Apache Kafka for:

  1. Telemetry Ingestion: Streaming data from millions of Powerwalls, solar panels, and Megapacks into the cloud.
  2. Command & Control: Sending real-time control commands to batteries and grid services.
  3. Market Participation: Autobidder analyzes real-time data and adjusts energy prices dynamically.

The event-driven architecture allows Tesla to react to energy demand in milliseconds—critical for balancing the grid.

Tesla’s Energy Platform is the software foundation of the VPP. It integrates OT (Operational Technology), IoT (Internet of Things), and IT (Information Technology) to control distributed energy assets.

Tesla Applications Built on the Energy Platform

Tesla’s Energy Platform powers a suite of applications that optimize energy management, market participation, and grid stability through real-time data streaming and automation.

Autobidder

  • Optimizes energy trading in real time.
  • Automatically bids into energy markets.

I wrote about other data streaming success stories for energy trading with Apache Kafka and Flink, including Uniper, re.alto and Powerledger.

Distributed Virtual Power Plant

  • Aggregates thousands of Powerwalls into a single energy asset.
  • Provides grid stabilization and peak load balancing.

If you are interested in other smart grid infrastructures, check out “Apache Kafka for Smart Grid, Utilities and Energy Production”. The article covers how data streaming realizes IT/OT integration, along with some hybrid cloud IoT deployments.

Battery Control (Command & Control)

  • Ensures optimal charging and discharging of batteries.
  • Minimizes costs while maximizing energy efficiency.

Market Participation

  • Allows homeowners and businesses to profit from energy markets.
  • Ensures seamless grid integration of Tesla’s energy products.

Key Components of Tesla’s Energy Platform: Apache Kafka, WebSockets, Akka Streams

The combination of data streaming with Apache Kafka and the last-mile IoT integration via WebSockets builds the central nervous system of Tesla’s Energy Platform:

  1. Apache Kafka (Event Streaming):
    • Streams telemetry data from Powerwalls every second.
    • Ensures durability and reliability of data streams.
    • Allows real-time energy aggregation across thousands of homes.
  2. WebSockets (Last-Mile IoT Integration):
    • Provides low-latency bidirectional communication with Powerwalls.
    • Used to send real-time commands to home batteries.
  3. Pub/Sub (Command & Control):
    • Enables distributed energy resource coordination.
    • Ensures resilient messaging between systems.
  4. Business Logic (Applications & Microservices):
    • Tesla’s services are built with Scala and Python.
    • Uses gRPC & HTTP for inter-service communication.
  5. Digital Twins (Real-Time State Management):
    • Digital models of physical assets ensure real-time decision-making.
    • Tesla uses Akka Streams for stateful event processing.
  6. Kubernetes (Cloud Infrastructure):
    • Ensures scalability and resilience of Tesla’s energy microservices.
Tesla Virtual Power Plant Energy Architecture Using Apache Kafka WebSockets and Akka Streams
Source: Tesla
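To make the digital twin idea more concrete, here is a minimal Python sketch of a per-device state model that is kept current by a telemetry stream. It is not Tesla's implementation (which, per the list above, uses Akka Streams); the `Telemetry` fields, the `BatteryTwin` class, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Telemetry:
    """Hypothetical telemetry event; field names are illustrative only."""
    device_id: str
    timestamp: int      # seconds since epoch
    charge_kwh: float   # energy currently stored
    power_kw: float     # positive = discharging, negative = charging


@dataclass
class BatteryTwin:
    """In-memory model of one physical battery, updated event by event."""
    device_id: str
    last_seen: int = 0
    charge_kwh: float = 0.0
    power_kw: float = 0.0

    def apply(self, event: Telemetry) -> None:
        # Ignore late or duplicate events so the twin never moves backwards.
        if event.timestamp <= self.last_seen:
            return
        self.last_seen = event.timestamp
        self.charge_kwh = event.charge_kwh
        self.power_kw = event.power_kw


def update_twins(twins: Dict[str, BatteryTwin], events: Iterable[Telemetry]) -> None:
    """Route each event to its twin, creating the twin on first contact."""
    for event in events:
        twin = twins.setdefault(event.device_id, BatteryTwin(event.device_id))
        twin.apply(event)


twins: Dict[str, BatteryTwin] = {}
update_twins(twins, [
    Telemetry("pw-1", 100, 9.5, -2.0),
    Telemetry("pw-2", 100, 4.0, 1.5),
    Telemetry("pw-1", 101, 9.6, -2.0),
    Telemetry("pw-1", 99, 1.0, 0.0),   # out-of-order event, ignored
])
```

A production system would also need durable state, recovery after restarts, and back-pressure handling, which this sketch leaves out.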

Interesting side note: While most energy companies I have seen rely on Kafka Streams or Apache Flink for stateful event processing, Tesla takes an interesting approach by leveraging Akka Streams (based on Akka’s Actor Model) to manage real-time digital twins of its energy assets. This choice provides fine-grained control over streaming workflows, but unlike Kafka Streams or Flink, Akka lacks widespread community adoption, making it a less common choice for many large-scale energy platforms. Kafka and Flink are a match made in heaven for most data streaming use cases.

Best Practice: Shift Left Architecture with Data Products for High-Volume IoT Data

Tesla leverages several data processing best practices to improve efficiency and consistency:

  • Canonical Kafka Topics: Data is filtered and structured at the source.
  • Consistent Downstream Services: Every consumer gets clean, structured data.
  • Real-Time Aggregation of Thousands of Batteries: A unique challenge that forms the foundation of the virtual power plant.

This data-first approach ensures Tesla’s energy platform can scale to millions of distributed assets.
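The real-time aggregation of many batteries into one asset can be illustrated with a small Python sketch. It assumes a simplified model in which each device reports its current power and the fleet total is the sum of the latest readings; the `FleetAggregator` class and device IDs are hypothetical and not taken from Tesla's platform.

```python
from typing import Dict


class FleetAggregator:
    """Maintains a running fleet-wide total from per-device readings."""

    def __init__(self) -> None:
        self._latest_kw: Dict[str, float] = {}
        self.total_kw: float = 0.0

    def on_reading(self, device_id: str, power_kw: float) -> float:
        # Swap the device's previous contribution for the new one, so the
        # total is updated in O(1) per event instead of re-summing the fleet.
        previous = self._latest_kw.get(device_id, 0.0)
        self._latest_kw[device_id] = power_kw
        self.total_kw += power_kw - previous
        return self.total_kw


fleet = FleetAggregator()
fleet.on_reading("pw-1", 2.0)
fleet.on_reading("pw-2", 3.0)
fleet.on_reading("pw-1", 1.0)   # pw-1 drops from 2.0 kW to 1.0 kW
```

The incremental update is the point of the design: each event adjusts the total directly, which is what lets this kind of aggregation keep up with a continuous stream.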

Today, many people refer to the Shift Left Architecture when applying these best practices for processing data efficiently and continuously to provide data products in real time and with good quality:

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

 

In Tesla’s Energy Platform, the data comes from IoT interfaces. WebSockets provide the last-mile integration and feed the events into the data streaming platform for continuous processing before the ingestion into the operational and analytical applications.

Tesla’s Energy Vision: How Streaming Data Will Shape Tomorrow’s Power Grids

Tesla’s Virtual Power Plant is not just about batteries—it’s about software, real-time data, and automation.

Why Data Streaming Matters for Tesla’s Energy Platform:

  1. Scalability: Can handle millions of energy devices.
  2. Resilience: Works even when devices go offline.
  3. Real-Time Decision Making: Adjusts energy distribution within milliseconds.
  4. Market Optimization: Autobidder ensures maximum revenue for battery owners.

Tesla’s VPP is a blueprint for the future of energy—one where real-time data streaming and intelligent software optimize renewable energy. By leveraging Apache Kafka, WebSockets, and stream processing, Tesla is redefining how energy is generated, distributed, and consumed.

This is not just an innovation in power generation—it’s an AI-driven energy revolution.

How do you leverage data streaming in the energy and automotive sectors? Follow me on LinkedIn or X (formerly Twitter) to stay in touch and discuss. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter. And make sure to download my free book about data streaming use cases across all industries.

The post Tesla Energy Platform – The Power of Data Streaming with Apache Kafka appeared first on Kai Waehner.

Apache Flink: Overkill for Simple, Stateless Stream Processing and ETL? https://www.kai-waehner.de/blog/2025/01/14/apache-flink-overkill-for-simple-stateless-stream-processing/ Tue, 14 Jan 2025 07:48:04 +0000 https://www.kai-waehner.de/?p=7210 Discover when Apache Flink is the right tool for your stream processing needs. Explore its role in stateful and stateless processing, the advantages of serverless Flink SaaS solutions like Confluent Cloud, and how it supports advanced analytics and real-time data integration together with Apache Kafka. Dive into the trade-offs, deployment options, and strategies for leveraging Flink effectively across cloud, on-premise, and edge environments, and when to use Kafka Streams or Single Message Transforms (SMT) within Kafka Connect for ETL instead of Flink.

The post Apache Flink: Overkill for Simple, Stateless Stream Processing and ETL? appeared first on Kai Waehner.

When discussing stream processing engines, Apache Flink often takes center stage for its advanced capabilities in stateful stream processing and real-time data analytics. However, a common question arises: is Flink too heavyweight for simple, stateless stream processing and ETL tasks? The short answer for open-source Flink is often yes. But the story evolves significantly when looking at SaaS Flink products such as Confluent Cloud’s Flink offering, with its serverless architecture, multi-tenancy, consumption-based pricing, and no-code/low-code capabilities like Flink Actions. This post explores the considerations and trade-offs to help you decide when Flink is the right tool for your data streaming needs, and when Kafka Streams or Single Message Transforms (SMT) within Kafka Connect are the better choice.

Apache Flink - Overkill for Simple Stateless Stream Processing

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch.

The Nature of Stateless Stream Processing

Stateless stream processing, as the name implies, processes each event independently, with no reliance on prior events or context. This simplicity lends itself to use cases such as filtering, transformations, and simple ETL operations. Stateless tasks are:

  • Efficient: They don’t require state management, reducing overhead.
  • Scalable: Easily parallelized since there is no dependency between events.
  • Minimalistic: Often achievable with simpler, lightweight frameworks like Kafka Streams or Kafka Connect’s Single Message Transforms (SMT).

For example, filtering transactions above a certain amount or transforming event formats for downstream systems are classic stateless tasks that demand minimal computational complexity.
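As a rough illustration of such a stateless task, the following Python sketch filters transactions above a threshold and reshapes them for a downstream system. The threshold, the field names, and the target format are hypothetical; the point is only that every event is handled on its own.

```python
from typing import Dict, Iterable, Iterator

THRESHOLD = 1000.0  # hypothetical cut-off amount


def large_transactions(events: Iterable[Dict]) -> Iterator[Dict]:
    """Filter and reshape each event independently; no state is kept."""
    for event in events:
        if event["amount"] > THRESHOLD:
            # Transform into a (hypothetical) downstream format.
            yield {"txId": event["id"], "amountCents": round(event["amount"] * 100)}


incoming = [
    {"id": "t1", "amount": 250.0},
    {"id": "t2", "amount": 1500.0},
    {"id": "t3", "amount": 1000.0},   # not strictly above the threshold
]
result = list(large_transactions(incoming))
```

Because no event depends on another, the same function could run on any number of parallel workers without coordination.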

In these scenarios, deploying a robust and feature-rich framework like open-source Apache Flink might seem excessive. Flink’s rich API and state management features are unnecessary for such straightforward use cases. Instead, tools with smaller footprints, and simpler deployment models, such as Kafka Streams, often suffice.

Apache Flink is a powerhouse. It’s designed for advanced analytics, stateful processing, and complex event patterns. But this sophistication of the open source framework comes with complexity:

  1. Operational Overhead: Setting up and maintaining Flink in an open-source environment can require significant infrastructure and expertise.
  2. Resource Intensity: Flink’s distributed architecture and stateful processing capabilities are resource-hungry, often overkill for tasks that don’t require stateful processing.
  3. Complexity in Development: The Flink API is robust but comes with a steeper learning curve. The combination with Kafka (or another streaming engine) requires understanding two frameworks. In contrast, Kafka Streams is Kafka-native, offering a single, unified framework for stream processing, which can reduce complexity for basic tasks.

For organizations that need to perform straightforward stateless operations, investing in the full Flink stack can feel like using a sledgehammer to crack a nut. Having said this, FlinkSQL simplifies development for certain personas, providing a more accessible interface beyond just Java and Python.

The conversation shifts dramatically with Serverless Flink Cloud offerings, such as Confluent Cloud, which address many of the challenges associated with running open-source Flink. Let’s unpack how Serverless Flink makes a more attractive choice, even for simpler use cases.

1. Serverless Architecture

With a Serverless stream processing service, Flink operates on a fully serverless model, eliminating the need for heavy infrastructure management. This means:

  • No Setup Hassles: Developers focus purely on application logic, not cluster provisioning or tuning.
  • Elastic Scaling: Resources automatically scale with the workload, ensuring efficient handling of varying traffic patterns without manual intervention. One of the biggest challenges of self-managing Flink is over-provisioning resources to handle anticipated peak loads. Elastic Scaling mitigates this inefficiency.

2. Multi-Tenancy

Multi-tenant design allows multiple applications, teams or organizations to share the same infrastructure securely. This reduces operational costs and complexity compared to managing isolated clusters for each workload.

3. Consumption-Based Pricing

One of the key barriers to adopting Flink for simple tasks is cost. A truly Serverless Flink offering mitigates this with a pay-as-you-go pricing model:

  • You only pay for the resources you use, making it cost-effective for both lightweight and high-throughput workloads.
  • It aligns with the scalability of stateless stream processing, where workloads may spike temporarily and then taper off.
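A small back-of-the-envelope calculation shows why this matters for spiky workloads. The numbers below are made up purely for illustration (they are not Confluent prices), and the sketch assumes the same unit price in both models, which is rarely true in practice.

```python
from typing import List


def provisioned_cost(peak_units: float, hours: float, price_per_unit_hour: float) -> float:
    """Capacity sized for the peak is paid for during every hour."""
    return peak_units * hours * price_per_unit_hour


def consumption_cost(hourly_usage: List[float], price_per_unit_hour: float) -> float:
    """Only the units actually used in each hour are billed."""
    return sum(hourly_usage) * price_per_unit_hour


# Hypothetical day: 22 quiet hours at 1 unit, 2 peak hours at 10 units.
usage = [1.0] * 22 + [10.0] * 2
PRICE = 0.5  # made-up price per unit-hour

fixed = provisioned_cost(peak_units=10.0, hours=24, price_per_unit_hour=PRICE)
metered = consumption_cost(usage, PRICE)
```

In this toy scenario the provisioned setup pays for 240 unit-hours while the metered one pays for 42. The gap narrows as the workload gets steadier.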

4. Bridging the Gap with No-Code and Low-Code Solutions

The rise of citizen integrators and the demand for low-code/no-code solutions have reshaped how organizations approach data streaming. Less-technical users, such as business analysts or operational teams, often face challenges when trying to engage with technical platforms designed for developers.

Low-code/no-code tools address this by providing intuitive interfaces that allow users to build, deploy, and monitor pipelines without deep programming knowledge. These solutions empower business users to take charge of simple workflows and integrations, significantly reducing time-to-value while minimizing the reliance on technical teams.

For example, capabilities like Flink Actions in Confluent Cloud offer a user-friendly approach to deploying stream processing pipelines without coding. By simplifying the process and making it accessible to non-technical stakeholders, these tools enhance collaboration and ensure faster outcomes without compromising performance or scalability. For instance, you can apply ETL functions such as transformation, deduplication, or field masking:

Confluent Cloud - Apache Flink Action UI for No Code Low Code Streaming ETL Integration
Source: Confluent
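For readers who want to see what such ETL functions do under the hood, here is a minimal Python sketch of field masking and deduplication. It only illustrates the logic and says nothing about how Flink Actions are implemented; the function names and sample records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set


def mask_field(event: Dict, field: str) -> Dict:
    """Return a copy of the event with one field replaced by a fixed mask."""
    masked = dict(event)
    if field in masked:
        masked[field] = "****"
    return masked


def deduplicate(events: Iterable[Dict], key: str) -> Iterator[Dict]:
    """Drop events whose key was already seen. Note: this step keeps state."""
    seen: Set[str] = set()
    for event in events:
        if event[key] in seen:
            continue
        seen.add(event[key])
        yield event


orders = [
    {"order_id": "o1", "card": "4111-1111"},
    {"order_id": "o1", "card": "4111-1111"},   # duplicate delivery
    {"order_id": "o2", "card": "5500-0000"},
]
clean = [mask_field(e, "card") for e in deduplicate(orders, "order_id")]
```

Masking is stateless, while deduplication has to remember the keys it has seen. In a long-running stream that memory would need a bound, for example a time window, which this sketch does not implement.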

Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Products

When choosing between SaaS and PaaS for data streaming, it’s essential to understand the key differences.

SaaS solutions, like Confluent Cloud, offer a fully managed, serverless experience with automatic scaling, low operational overhead, and pay-as-you-go pricing.

In contrast, PaaS requires users to manage infrastructure, configure scaling policies, and handle more operational complexity.

While many products are marketed as “serverless,” not all truly abstract infrastructure or eliminate idle costs—so scrutinize claims carefully.

SaaS is ideal for teams focused on rapid deployment and simplicity, while PaaS suits those needing deep customization and control. Ultimately, SaaS ensures scalability and ease of use, making it a compelling choice for most modern streaming needs. Always dive into the technical details to ensure the platform aligns with your goals. Don’t trust the marketing slogans of the vendors!

Stateless vs. Stateful Stream Processing: Blurring the Lines

Even if your current use case is stateless, it’s worth considering the potential for future needs. Stateless pipelines often evolve into more complex systems as businesses grow, requiring features like:

  • State Management: For event correlation and pattern detection.
  • Windows and Aggregations: To derive insights from time-series data.
  • Joins: To enrich data streams with contextual information.
  • Integrating Multiple Data Sources: To seamlessly combine information from various streams for a comprehensive and cohesive analysis.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless and Stateful Stream Processing with Kafka Streams and Apache Flink

With a SaaS Flink service such as Confluent Cloud, you can start small with stateless tasks and seamlessly scale into stateful operations as needed, leveraging Flink’s full capabilities without a complete overhaul.
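To show where state enters the picture, the following Python sketch computes a tumbling-window sum per key. It is a simplified, in-memory illustration with a hypothetical window size and event layout; a stream processor like Flink adds event-time handling, fault-tolerant state, and distribution on top of this basic idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # hypothetical window size


def tumbling_sum(events: Iterable[Tuple[int, str, float]]) -> Dict[Tuple[int, str], float]:
    """Sum values per (window start, key). The dict is the operator's state."""
    totals: Dict[Tuple[int, str], float] = defaultdict(float)
    for timestamp, key, value in events:
        # Align the timestamp to the start of its fixed-size window.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        totals[(window_start, key)] += value
    return dict(totals)


readings = [
    (0, "sensor-a", 1.0),
    (59, "sensor-a", 2.0),
    (60, "sensor-a", 4.0),   # falls into the next window
    (30, "sensor-b", 8.0),
]
windows = tumbling_sum(readings)
```

Unlike a stateless filter, the result for one event depends on the events that came before it in the same window, which is why this kind of processing needs managed state.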

While Flink may feel like overkill for simple, stateless tasks in its open-source form, its potential is unmatched in these scenarios:

  • Enterprise Workloads: Scalable, reliable, and fault-tolerant systems for mission-critical applications.
  • Data Integration and Preparation (Streaming ETL): Flink enables preprocessing, cleansing, and enriching data at the streaming layer, ensuring high-quality data reaches downstream systems like data lakes and warehouses.
  • Complex Event Processing (CEP): Detecting patterns across events in real time.
  • Advanced Analytics: Stateful stream processing for aggregations, joins, and windowed computations.
  • AI/ML Integration: Incorporating machine learning models for real-time inference, enabling intelligent decision-making directly within data streams.

Stateless stream processing is often achieved using lightweight tools like Kafka Streams or Single Message Transforms (SMTs) within Kafka Connect. SMTs enable inline transformations, such as normalization, enrichment, or filtering, as events pass through the integration framework. This functionality is available in Kafka Connect (provided by Confluent, IBM/Red Hat, Amazon MSK and others) and tools like Benthos for Redpanda. SMTs are particularly useful for quick adjustments and filtering data before it reaches the Kafka cluster, optimizing resource usage and data flow.

While Kafka Streams and Kafka Connect’s SMTs handle many stateless workloads effectively, Apache Flink offers significant advantages for all types of workloads—whether simple or complex, stateless or stateful.

Stream processing in Flink enables true decoupling within the enterprise architecture (as it is not bound to the Kafka cluster like Kafka Streams and Kafka Connect). The benefits are separation of concerns with a domain-driven design (DDD), and improved data governance. And Flink provides interfaces for Java, Python and SQL. Something for (almost) everyone. This makes Flink ideal for ensuring clean, modular architectures and easier scalability.

Stream Processing and ETL with Apache Kafka Streams Connect SMT and Flink

By processing events from diverse sources and preparing them for downstream consumption, Flink supports both lightweight and comprehensive workflows while aligning with domain boundaries and governance requirements. This brings us to the shift left architecture.

The Shift Left Architecture

No matter what specific use cases you have in mind: The Shift Left Architecture brings data processing upstream with real-time stream processing, transforming raw data into high-quality data products early in the pipeline.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Apache Flink plays a key role as part of a complete data streaming platform by enabling advanced streaming ETL, data curation, and on-the-fly transformations, ensuring consistent, reliable, and ready-to-use data for both operational and analytical workloads, while reducing costs and accelerating time-to-market.

Shift Left Architecture with Apache Kafka Flink and Iceberg

The decision to use Flink boils down to your use case, expertise, and growth trajectory:

  • For basic stateless tasks, consider lightweight options like Kafka Streams or SMTs within Kafka Connect unless you’re already invested in a SaaS such as Confluent Cloud where Flink is also the appropriate choice for simple ETL processes.
  • For evolving workloads or scenarios requiring scalability and advanced analytics, a Flink SaaS offers unparalleled flexibility and ease of use.
  • For on-premise or edge deployments, Flink’s flexibility makes it an excellent choice for environments where data processing must occur locally due to latency, security, or compliance requirements.

Understanding the deployment environment (cloud, on-premise, or edge) and the capabilities of the Flink product is crucial to choosing the right streaming technology. Flink’s adaptability ensures it can serve diverse needs across these contexts.

Kafka Streams is another excellent, Kafka-native stream processing alternative. Most importantly for this discussion, Kafka Streams is “just” a lightweight Java library, not a server infrastructure like Flink. Hence, it brings different trade-offs with it. I wrote a dedicated article about the trade-offs between Apache Flink and Kafka Streams for stream processing.

In its open-source form, Flink can seem excessive for simple, stateless tasks. However, a serverless Flink SaaS like Confluent Cloud changes the equation. Multi-tenancy and pay-as-you-go pricing make it suitable for a wider range of use cases, from basic ETL to advanced analytics. Serverless features like Confluent’s Flink Actions further reduce complexity, allowing non-technical users to harness the power of stream processing without coding.

Whether you’re just beginning your journey into stream processing or scaling up for enterprise-grade applications, Flink—as part of a complete data streaming platform such as Confluent Cloud—is a future-proof investment that adapts to your needs.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation.

The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming https://www.kai-waehner.de/blog/2024/06/15/the-shift-left-architecture-from-batch-and-lakehouse-to-real-time-data-products-with-data-streaming/ Sat, 15 Jun 2024 06:12:44 +0000 https://www.kai-waehner.de/?p=6480 Data integration is a hard challenge in every enterprise. Batch processing and Reverse ETL are common practices in a data warehouse, data lake or lakehouse. Data inconsistency, high compute cost, and stale information are the consequences. This blog post introduces a new design pattern to solve these problems: The Shift Left Architecture enables a data mesh with real-time data products to unify transactional and analytical workloads with Apache Kafka, Flink and Iceberg. Consistent information is handled with streaming processing or ingested into Snowflake, Databricks, Google BigQuery, or any other analytics / AI platform to increase flexibility, reduce cost and enable a data-driven company culture with faster time-to-market building innovative software applications.

The post The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming appeared first on Kai Waehner.

Data integration is a hard challenge in every enterprise. Batch processing and Reverse ETL are common practices in a data warehouse, data lake or lakehouse. Data inconsistency, high compute cost, and stale information are the consequences. This blog post introduces a new design pattern to solve these problems: The Shift Left Architecture enables a data mesh with real-time data products to unify transactional and analytical workloads with Apache Kafka, Flink and Iceberg. Consistent information is handled with streaming processing or ingested into Snowflake, Databricks, Google BigQuery, or any other analytics / AI platform to increase flexibility, reduce cost and enable a data-driven company culture with faster time-to-market building innovative software applications.

The Shift Left Architecture

Data Products – The Foundation of a Data Mesh

A data product is a crucial concept in the context of a data mesh that represents a shift from traditional centralized data management to a decentralized approach.

McKinsey finds that “when companies instead manage data like a consumer product—be it digital or physical—they can realize near-term value from their data investments and pave the way for quickly getting more value tomorrow. Creating reusable data products and patterns for piecing together data technologies enables companies to derive value from data today and tomorrow”:

McKinsey - Why Handle Data as a Product

According to McKinsey, the benefits of the data product approach can be significant:

  • New business use cases can be delivered as much as 90 percent faster.
  • The total cost of ownership, including technology, development, and maintenance, can decline by 30 percent.
  • The risk and data-governance burden can be reduced.

Data Product from a Technical Perspective

Here’s what a data product entails in a data mesh from a technical perspective:

  1. Decentralized Ownership: Each data product is owned by a specific domain team. Applications are truly decoupled.
  2. Sourced from Operational and Analytical Systems: Data products include information from any data source, including the most critical systems and analytics/reporting platforms.
  3. Self-contained and Discoverable: A data product includes not only the raw data but also the associated metadata, documentation, and APIs.
  4. Standardized Interfaces: Data products adhere to standardized interfaces and protocols, ensuring that they can be easily accessed and utilized by other data products and consumers within the data mesh.
  5. Data Quality: Most use cases benefit from real-time data. A data product ensures data consistency across real-time and batch applications.
  6. Value-Driven: The creation and maintenance of data products are driven by business value.

In essence, a data product in a data mesh framework transforms data into a managed, high-quality asset that is easily accessible and usable across an organization, fostering a more agile and scalable data ecosystem.

Anti-Pattern: Batch Processing and Reverse ETL

The “Modern” Data Stack leverages traditional ETL tools or data streaming for ingestion into a data lake, data warehouse or lakehouse. The consequence is a spaghetti architecture with various integration tools for batch and real-time workloads mixing analytical and operational technologies:

Data at Rest and Reverse ETL

Reverse ETL is required to get information out of the data lake into operational applications and other analytical tools. As I have written previously, the combination of data lakes and Reverse ETL is an anti-pattern for the enterprise architecture, largely due to the economic and organizational inefficiencies Reverse ETL creates. Event-driven data products enable a much simpler and more cost-efficient architecture.

One key reason for the need for batch processing and Reverse ETL patterns is the common use of the Lambda architecture: a data processing architecture that handles real-time and batch processing separately using different layers. This still widely exists in enterprise architectures, not just for big data use cases like Hadoop/Spark and Kafka, but also for the integration with transactional systems like file-based legacy monoliths or Oracle databases.

In contrast, the Kappa Architecture handles both real-time and batch processing using a single technology stack. Learn more about “Kappa replacing Lambda Architecture” in its own article. TL;DR: The Kappa architecture is possible by bringing even legacy technologies into an event-driven architecture using a data streaming platform. Change Data Capture (CDC) is one of the most common helpers for this.
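The core idea of Kappa can be sketched in a few lines of Python: one piece of processing logic serves both a full replay of the log (the "batch" case) and event-by-event consumption (the "real-time" case). The change-event layout and the `apply_changes` function are hypothetical and only loosely modelled on what a CDC tool emits.

```python
from typing import Dict, Iterable, Optional, Tuple

# A change event: (operation, primary key, new row or None for deletes).
Change = Tuple[str, str, Optional[Dict]]


def apply_changes(table: Dict[str, Dict], changes: Iterable[Change]) -> Dict[str, Dict]:
    """Apply CDC-style change events to a materialized view, in log order."""
    for op, key, row in changes:
        if op in ("insert", "update") and row is not None:
            table[key] = row
        elif op == "delete":
            table.pop(key, None)
    return table


log = [
    ("insert", "c1", {"name": "Ada", "tier": "basic"}),
    ("insert", "c2", {"name": "Bob", "tier": "basic"}),
    ("update", "c1", {"name": "Ada", "tier": "gold"}),
    ("delete", "c2", None),
]

# "Batch": rebuild the view by replaying the whole log from the start.
replayed = apply_changes({}, log)

# "Real-time": the same function consumes events one at a time as they arrive.
live: Dict[str, Dict] = {}
for change in log:
    apply_changes(live, [change])
```

Both paths end in the same state because they share the same code and the same ordered log, which is the consistency argument behind replacing two Lambda layers with one stack.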

Traditional ELT in the Data Lake, Data Warehouse, Lakehouse

It seems like nobody does data warehousing anymore today. Everyone talks about a lakehouse merging data warehouse and data lake. Whatever term you use or prefer… The integration process these days looks like the following:

ETL and ELT Data Integration to Data Lake Warehouse Lakehouse in Batch

Just ingesting all the raw data into a data warehouse / data lake / lakehouse has several challenges:

  • Slower Updates: The longer the data pipeline and the more tools are used, the slower the update of the data product.
  • Longer Time-to-Market: Development efforts are repeated because each business unit needs to do the same or similar processing steps again instead of consuming from a curated data product.
  • Increased Cost: The cash cow of analytics platforms is compute, not storage. The more your business units use dbt, the better for the analytics SaaS provider.
  • Repeating Efforts: Most enterprises have multiple analytics platforms, including different data warehouses, data lakes, and AI platforms. ELT means doing the same processing again, again, and again.
  • Data Inconsistency: Reverse ETL, Zero ETL, and other integration patterns make sure that your analytical and especially operational applications see inconsistent information. You cannot connect a real-time consumer or mobile app API to a batch layer and expect consistent results.

Data Integration, Zero ETL and Reverse ETL with Kafka, Snowflake, Databricks, BigQuery, etc.

These disadvantages are real! I have not met a single customer in the past months who disagreed and told me these challenges do not exist. To learn more, check out my blog series about data streaming with Apache Kafka and analytics with Snowflake:

  1. Snowflake Integration Patterns: Zero ETL and Reverse ETL vs. Apache Kafka
  2. Snowflake Data Integration Options for Apache Kafka (including Iceberg)
  3. Apache Kafka + Flink + Snowflake: Cost Efficient Analytics and Data Governance

The blog series can be applied to any other analytics engine. It is a worthwhile read, no matter if you use Snowflake, Databricks, Google BigQuery, or a combination of several analytics and AI platforms.

The solution for this data mess, which creates data inconsistency, outdated information, and ever-growing cost, is the Shift Left Architecture.

Shift Left to Data Streaming for Operational AND Analytical Data Products

The Shift Left Architecture enables consistent information from reliable, scalable data products, reduces the compute cost, and allows much faster time-to-market for operational and analytical applications with any kind of technology (Java, Python, iPaaS, Lakehouse, SaaS, “you-name-it”) and communication paradigm (real-time, batch, request-response API):

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Shifting the data processing to the data streaming platform enables:

  • Capture and stream data continuously when the event is created
  • Create data contracts for downstream compatibility and promotion of trust with any application or analytics / AI platform
  • Continuously cleanse, curate and quality check data upstream with data contracts and policy enforcement
  • Shape data into multiple contexts on-the-fly to maximize reusability (and still allow downstream consumers to choose between raw and curated data products)
  • Build trustworthy data products that are instantly valuable, reusable and consistent for any transactional and analytical consumer (no matter if consumed in real-time or later via batch or request-response API)
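
The idea of enforcing a data contract upstream can be illustrated with a minimal sketch in plain Python. It is only an illustration of the principle, not how Kafka, Flink, or Confluent's Schema Registry implement it: the `DataContract` class, the `curate` function, the order fields, and the policy names are all hypothetical and chosen for this example. Events that satisfy the contract become part of the curated data product; everything else is routed to a dead letter list together with the reason.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class DataContract:
    """Hypothetical contract: required field types plus named policy rules."""
    fields: dict[str, type]
    policies: dict[str, Callable[[dict[str, Any]], bool]] = field(default_factory=dict)

    def violations(self, event: dict[str, Any]) -> list[str]:
        problems = []
        # Structural check: every required field exists with the expected type.
        for name, expected in self.fields.items():
            if name not in event:
                problems.append(f"missing field: {name}")
            elif not isinstance(event[name], expected):
                problems.append(f"wrong type: {name}")
        # Policy check: only evaluated when the structure is valid.
        if not problems:
            for rule_name, rule in self.policies.items():
                if not rule(event):
                    problems.append(f"policy failed: {rule_name}")
        return problems


def curate(events, contract):
    """Split events into a curated list and a dead letter list (with reasons)."""
    curated, dead_letter = [], []
    for event in events:
        problems = contract.violations(event)
        if problems:
            dead_letter.append((event, problems))
        else:
            curated.append(event)
    return curated, dead_letter


# Hypothetical contract for an "orders" data product.
ORDER_CONTRACT = DataContract(
    fields={"order_id": str, "amount": float, "currency": str},
    policies={
        "amount_positive": lambda e: e["amount"] > 0,
        "currency_is_iso_code": lambda e: len(e["currency"]) == 3 and e["currency"].isupper(),
    },
)

raw_events = [
    {"order_id": "o-1", "amount": 19.99, "currency": "EUR"},
    {"order_id": "o-2", "amount": -5.0, "currency": "EUR"},  # violates a policy
    {"order_id": "o-3", "currency": "USD"},                  # violates the schema
]

curated, dead_letter = curate(raw_events, ORDER_CONTRACT)
```

Because the check runs once, right after the event is created, every downstream consumer reads the same curated events instead of repeating the validation in its own platform.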

While shifting some workloads to the left, it is crucial to understand that developers, data engineers, and data scientists can usually still use their favourite interface, such as SQL or a programming language like Java or Python.

Data Streaming is the foundation of the Shift Left Architecture to enable reliable, scalable real-time data products with good data quality. The following architecture shows how Apache Kafka and Flink connect any data source, curate data sets (aka stream processing / Streaming ETL) and share the processed events with any operational or analytical data sink:

Shift Left Architecture with Apache Kafka Flink and Iceberg

The architecture shows an Apache Iceberg table as an alternative consumer. Apache Iceberg is an open table format designed for managing large-scale datasets in a highly performant and reliable way, providing ACID transactions, schema evolution, and partitioning capabilities. It optimizes data storage and query performance, making it ideal for data lakes and complex analytical workflows. Iceberg is evolving into the de facto standard with support from most major vendors in the cloud and data management space, including AWS, Azure, GCP, Snowflake, Confluent, and many more coming (like Databricks after its acquisition of Tabular).

From the data streaming perspective, the Iceberg table is just a button click away from the Kafka Topic and its Schema (using Confluent’s Tableflow; I am sure other vendors will follow soon with their own solutions). The big advantage of Iceberg is that data needs to be stored only once (typically in a cost-efficient and scalable object store like Amazon S3). Each downstream application can consume the data with its own technology without any need for additional coding or connectors. This includes data lakehouses like Snowflake or Databricks AND data streaming engines like Apache Flink.
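
The "store once, consume many times" principle can be sketched in a few lines of plain Python. This is a conceptual illustration only: an in-memory list stands in for the Kafka Topic or Iceberg table, no real Kafka or Iceberg API is used, and the `DataProductLog` class and the consumer names are hypothetical. One consumer reads each event as it arrives, the other reads the same data later in one batch, and both see identical records.

```python
class DataProductLog:
    """In-memory stand-in for a topic or table: written once, read many times."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next position to read

    def append(self, record):
        self._records.append(record)

    def poll(self, consumer, max_records=None):
        """Return unread records for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        end = len(self._records)
        if max_records is not None:
            end = min(start + max_records, end)
        batch = self._records[start:end]
        self._offsets[consumer] = end
        return batch


log = DataProductLog()

# Hypothetical real-time consumer: reads every event right after it is written.
realtime_seen = []
for i in range(5):
    log.append({"event_id": i})
    realtime_seen.extend(log.poll("fraud-detection"))

# Hypothetical batch consumer: reads the very same data later, in one go.
nightly_batch = log.poll("lakehouse-report")
```

Each consumer keeps its own read position, so a slow batch job never blocks the real-time application, and neither needs its own copy of the data.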

Video: Shift Left Architecture

I summarized the above architectures and examples for the Shift Left Architecture in a short ten-minute video if you prefer listening to content:

Apache Iceberg – The New De Facto Standard for Lakehouse Table Format?

Apache Iceberg is such a huge topic and a real game changer for enterprise architectures, end users and cloud vendors. I will write another dedicated blog, including interesting topics such as:

  • Confluent’s product strategy to embed Iceberg tables into its data streaming platform
  • Snowflake’s open source Iceberg catalog project Polaris
  • Databricks’ acquisition of Tabular (the company founded by the original creators of Apache Iceberg) and the relation to Delta Lake and open sourcing its Unity Catalog
  • The (expected) future of table format standardization, catalog wars, and additional solutions like Apache Hudi or Apache XTable for omni-directional interoperability across lakehouse table formats.

Stay tuned and subscribe to my newsletter to receive new articles.

Business Value of the Shift Left Architecture

Apache Kafka is the de facto standard for data streaming and for building a Kappa Architecture. The Data Streaming Landscape shows various open source technologies and cloud vendors. Data Streaming is a new software category. Forrester published “The Forrester Wave™: Streaming Data Platforms, Q4 2023”. The leaders are Microsoft, Google and Confluent, followed by Oracle, Amazon, Cloudera, and a few others.

Building data products more left in the enterprise architecture with a data streaming platform and technologies such as Kafka and Flink creates huge business value:

  • Cost Reduction: Reducing the compute cost in one or even multiple data platforms (data lake, data warehouse, lakehouse, AI platform, etc.).
  • Less Development Effort: Streaming ETL, data curation and data quality control are executed instantly (and only once) right after the event is created.
  • Faster Time to Market: Focus on new business logic instead of doing repeated ETL jobs.
  • Flexibility: Best of breed approach for choosing the best and/or most cost-efficient technology per use case.
  • Innovation: Business units can choose any programming language, tool or SaaS to do real-time or batch consumption from data products to try and fail or scale fast.

Unifying transactional and analytical workloads is finally possible, enabling good data quality, faster time to market for innovation, and reduced cost across the entire data pipeline. Data consistency matters across all applications and databases. A Kafka Topic with a data contract (= Schema with policies) brings data consistency out of the box!

What does your data architecture look like today? Does the Shift Left Architecture make sense to you? What is your strategy to get there? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming appeared first on Kai Waehner.

]]>