Data Lake Archives - Kai Waehner

The Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming)

Kai Waehner — Tue, 01 Apr 2025 07:20:23 +0000

Batch processing has long been the default approach for moving and transforming data in enterprise systems. It works on fixed schedules, processes data in large chunks, and often relies on complex chains of jobs that run overnight. While this was acceptable in the past, today’s digital businesses operate in real time—and can’t afford to wait hours for fresh insights. Delays, errors, and inconsistencies caused by batch workflows lead to poor decisions, missed opportunities, and growing operational costs. In this post, we’ll look at common issues with batch processing and show why data streaming is the modern alternative for fast, reliable, and scalable data infrastructure.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. You can also download my free book about data streaming architectures and use cases to understand the benefits over batch processing.

The Issues of Batch Processing

While batch processing has powered data pipelines for decades, it introduces a range of problems that make it increasingly unfit for today’s real-time, scalable, and reliable data needs.

Adi Polak @ Current 2024 (Austin, USA)

Adi Polak’s keynote about the issues of batch processing at Current in Austin, USA, inspired me to explore each point with a concrete example and to show how data streaming with technologies such as Apache Kafka and Flink helps.

Real-time Data Streaming Beats Slow Data and Batch Processing

Across industries, companies are modernizing their data infrastructure to react faster, reduce complexity, and deliver better outcomes. Whether it’s fraud detection in banking, personalized recommendations in retail, or vehicle telemetry in mobility services—real-time data has become essential.

Let’s look at why batch processing falls short in today’s world, and how real-time data streaming changes the game. Each problem outlined below is grounded in real-world challenges seen across industries—from finance and manufacturing to retail and energy.

Corrupted Data and Null Values

Example: A bank’s end-of-day batch job fails because one transaction record has a corrupt timestamp.

In batch systems, a single bad record can poison the entire job. Often, that issue is only discovered hours later when reports are wrong or missing. In real-time streaming systems, data contracts are enforced on the fly, so bad data can be rejected or rerouted instantly without affecting valid records.
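
To make the idea concrete, here is a minimal sketch of per-record validation with a dead-letter list. It is plain Python, not Kafka or Flink code; the function name route_records, the record fields, and the contract (ISO 8601 timestamp, numeric amount) are assumptions made up for this illustration.

```python
from datetime import datetime


def route_records(records):
    """Split records into valid ones and dead letters, one record at a time."""
    valid, dead_letter = [], []
    for record in records:
        try:
            # Hypothetical contract: timestamp parses as ISO 8601, amount is numeric.
            datetime.fromisoformat(record["timestamp"])
            float(record["amount"])
            valid.append(record)
        except (KeyError, TypeError, ValueError) as error:
            # The bad record is set aside with the reason; processing continues.
            dead_letter.append({"record": record, "error": repr(error)})
    return valid, dead_letter


transactions = [
    {"id": "t1", "timestamp": "2025-04-01T07:20:23", "amount": "19.99"},
    {"id": "t2", "timestamp": "not-a-timestamp", "amount": "5.00"},
    {"id": "t3", "timestamp": "2025-04-01T07:21:00", "amount": "42.00"},
]
valid, dead_letter = route_records(transactions)
print([r["id"] for r in valid])                  # ['t1', 't3']
print([d["record"]["id"] for d in dead_letter])  # ['t2']
```

The corrupt timestamp in t2 ends up in the dead-letter list while t1 and t3 flow through. In a streaming platform, the dead-letter list would typically be a separate topic that someone can inspect and reprocess later.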

Thousands of Batch Jobs and Complexity

Example: A large logistics company runs 2,000+ daily batch jobs just to sync inventory and delivery status across regions.

Over time, batch pipelines become deeply entangled and hard to manage. Real-time pipelines are typically simpler and more modular, allowing teams to scale, test, and deploy independently.

Missing Data and Manual Backfilling

Example: A retailer’s point of sale (POS) system goes offline for several hours—sales data is missing from the batch and needs to be manually backfilled.

Batch systems struggle with late-arriving data. Real-time pipelines with built-in buffering and replay capabilities handle delays gracefully, without human intervention.

Data Inconsistencies and Data Copies

Example: A manufacturer reports conflicting production numbers from different analytics systems fed by separate batch jobs.

In batch architectures, multiple data copies lead to discrepancies. A data streaming platform provides a central source of truth via shared topics and schemas to ensure data consistency across real-time, batch and request-response applications.

Exactly-Once Not Guaranteed

Example: A telecom provider reruns a failed billing batch job and accidentally double-charges thousands of customers.

Without exactly-once guarantees, batch retries risk duplication. Real-time data streaming platforms support exactly-once semantics, so a retry after a failure does not apply the same record twice.
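
One common way to get an exactly-once effect is idempotent processing: every charge carries a unique ID, and a charge that was already applied is skipped on redelivery. The following sketch shows that idea in plain Python. It is not Kafka’s transactional API, and the class BillingLedger and its fields are hypothetical names for illustration.

```python
class BillingLedger:
    """Applies each charge at most once, keyed by a unique charge ID."""

    def __init__(self):
        self.balances = {}        # customer -> total charged in cents
        self.applied_ids = set()  # charge IDs that were already applied

    def apply(self, charge):
        if charge["charge_id"] in self.applied_ids:
            return False          # duplicate delivery or rerun: skip
        customer = charge["customer"]
        self.balances[customer] = self.balances.get(customer, 0) + charge["cents"]
        self.applied_ids.add(charge["charge_id"])
        return True


charges = [
    {"charge_id": "c-1", "customer": "alice", "cents": 1500},
    {"charge_id": "c-2", "customer": "bob", "cents": 2000},
]
ledger = BillingLedger()
for charge in charges:
    ledger.apply(charge)
# Simulate a retry that redelivers the same charges after a failure.
for charge in charges:
    ledger.apply(charge)
print(ledger.balances)  # {'alice': 1500, 'bob': 2000}
```

In a real system, the balance update and the record of the applied ID have to be committed atomically; otherwise a crash between the two steps brings the duplicate back.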

Invalid and Incompatible Schemas

Example: An insurance company adds a new field to customer records, breaking downstream batch jobs that weren’t updated.

Batch systems often have poor schema enforcement. Real-time streaming with a schema registry and data contracts validates data at write time, catching errors early.

Compliance Challenges

Example: A user requests data deletion under GDPR. The data exists in dozens of batch outputs stored across systems.

Data subject requests are nearly impossible to fulfill accurately when data is copied across batch systems. In an event-driven architecture with data streaming, data is processed once, tracked with lineage, and deleted centrally.

Duplicated Data and Small Files

Example: A healthcare provider reruns a batch ETL job after a crash, resulting in duplicate patient records and thousands of tiny files in their data lake.

Data streaming prevents over-processing and file bloat by handling data continuously and appending to optimized storage formats.

High Latency and Outdated Information

Example: A rideshare platform calculates driver incentives daily, based on data that’s already 24 hours old.

By the time decisions are made, they’re irrelevant. Data streaming enables near-instant insights, powering real-time analytics, alerts, and user experiences.

Brittle Pipelines and Manual Fixes

Example: A retailer breaks their holiday sales reporting pipeline due to one minor schema change upstream.

Batch pipelines are fragile and tightly coupled. Real-time systems, with schema evolution support and observability, are more resilient and easier to debug.

Logically and Semantically Invalid Data

Example: A supermarket receives transactions with negative quantities—unnoticed until batch reconciliation fails.

Real-time systems allow inline validation and enrichment, preventing bad data from entering downstream systems.

Exhausted Deduplication and Inaccurate Results

Example: A news app batch-processes user clicks but fails to deduplicate properly, inflating ad metrics.

Deduplication across batch windows is error prone. Data streaming supports sophisticated, stateful deduplication logic in stream processing engines like Kafka Streams or Apache Flink.
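
As a rough illustration of stateful deduplication, the sketch below keeps the IDs it has seen in state and expires them after a time-to-live, so state stays bounded. It is plain Python rather than Kafka Streams or Flink code, it assumes events arrive in timestamp order, and the names deduplicate, click_id, and ts are made up for this example.

```python
def deduplicate(events, ttl_seconds):
    """Yield events whose ID is not in state; state entries expire after ttl_seconds."""
    seen = {}  # click ID -> timestamp when the ID was first seen
    for event in events:
        now = event["ts"]
        # Expire state older than the TTL so memory stays bounded.
        for key in [k for k, first_ts in seen.items() if now - first_ts > ttl_seconds]:
            del seen[key]
        if event["click_id"] in seen:
            continue  # duplicate within the TTL: drop it
        seen[event["click_id"]] = now
        yield event


clicks = [
    {"click_id": "a", "ts": 0},
    {"click_id": "a", "ts": 5},    # duplicate within the TTL, dropped
    {"click_id": "b", "ts": 8},
    {"click_id": "a", "ts": 120},  # state for "a" has expired, treated as new
]
unique = list(deduplicate(clicks, ttl_seconds=60))
print([(c["click_id"], c["ts"]) for c in unique])  # [('a', 0), ('b', 8), ('a', 120)]
```

The TTL is the trade-off: a longer TTL catches duplicates that arrive further apart but needs more state. A stream processing engine manages this state for you, including fault tolerance.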

Schema Evolution Compatibility Issues

Example: A SaaS company adds optional metadata to an event—but their batch pipeline breaks because downstream systems weren’t ready.

In data streaming, you evolve schemas safely with backward and forward compatibility—ensuring changes don’t break consumers.
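
Here is a deliberately simplified sketch of a backward compatibility check, loosely modeled on the rule used for Avro-style schemas: a reader with the new schema can decode old records if every added field has a default and shared fields keep their type. The schema representation (a dict of field specs) and the function name are assumptions for illustration, not a schema registry API.

```python
def is_backward_compatible(old_schema, new_schema):
    """Can a reader using new_schema decode records written with old_schema?"""
    for name, spec in new_schema.items():
        if name in old_schema:
            if old_schema[name]["type"] != spec["type"]:
                return False  # type of an existing field changed
        elif "default" not in spec:
            return False      # new field without a default value
    return True               # removed fields are simply ignored by the reader


v1 = {"event_id": {"type": "string"}, "user": {"type": "string"}}
v2_ok = {**v1, "metadata": {"type": "string", "default": ""}}
v2_bad = {**v1, "metadata": {"type": "string"}}

print(is_backward_compatible(v1, v2_ok))   # True
print(is_backward_compatible(v1, v2_bad))  # False
```

The optional metadata field from the SaaS example passes the check only when it comes with a default. Real schema registries apply more detailed rules (type promotion, unions, transitive checks) at write time, before a producer can publish with the new schema.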

Similar Yet Different Datasets

Example: Two teams at a FinTech startup build separate batch jobs for “transactions”, producing similar but subtly different datasets.

Data streaming architectures encourage shared schemas and centralized topics, reducing redundant logic and fragmentation.

Inaccurate Data

Example: A manufacturer bases production forecasts on batch-aggregated sensor data—too late to respond to real-time issues.

Batch introduces delay, distortion, and disconnect. Data streaming delivers accurate, granular, and current data for timely decision-making.

Data Streaming Is the New Standard to Avoid Batch Processing

The limitations of batch processing are no longer acceptable in a digital-first world. From inconsistent data and operational fragility to compliance risk and customer dissatisfaction—batch can’t keep up.

Data streaming isn’t just faster—it’s cleaner, smarter, and more sustainable.

Apache Kafka and Apache Flink make it possible to build a modern, real-time architecture that scales with your business, reduces complexity, and delivers immediate value.

Ready to Modernize?

If you’re exploring the shift from batch to real-time, check out my free book:

The Ultimate Guide to Data Streaming

It’s packed with use cases, architecture patterns, and success stories across industries—designed to help you become a data streaming champion.

Let’s leave batch in the past—and move forward with streaming.

And connect with me on LinkedIn to discuss data streaming! Or join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post The Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming) appeared first on Kai Waehner.

How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.)

Kai Waehner — Sat, 12 Oct 2024 06:58:00 +0000

In today’s data-driven world, understanding data at rest versus data in motion is crucial for businesses. Data streaming frameworks like Apache Kafka and Apache Flink enable real-time data processing, offering quick insights and seamless system integration. They are ideal for applications that require immediate responses and handle transactional workloads. Meanwhile, lakehouses like Snowflake, Databricks, and Microsoft Fabric excel in long-term data storage and detailed analysis, perfect for reports and AI training. By leveraging both data streaming and lakehouse systems, businesses can effectively meet both short-term and long-term data needs. This blog post explores how these technologies complement each other in enterprise architecture.

This is part two of a blog series about Microsoft Fabric and its relation to other data platforms on the Azure cloud:

What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks
How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.)
When to Choose Apache Kafka vs. Azure Event Hubs vs. Confluent Cloud for a Microsoft Fabric Lakehouse

Subscribe to my newsletter to get an email about a new blog post every few weeks.

Data at Rest (Lakehouse) vs. Data in Motion (Data Streaming)

Data streaming technologies like Apache Kafka and Apache Flink enable continuous data processing while the data is in motion in an event-driven architecture. Data streaming enables immediate insights and seamless integration of data across systems. Kafka provides a robust real-time messaging and persistence platform, while Flink excels in low-latency stream processing, making them ideal for dynamic, stateful applications. A data streaming platform supports operational/transactional and analytical use cases.

Data lakes and data warehouses store data at rest before processing the data. The platforms are optimized for batch processing and long-term analytics, including AI/ML use cases such as model training. Some components provide near real-time capabilities, e.g. data ingestion or dashboards. Data lakes offer scalable, flexible storage for raw data, and data warehouses provide structured, high-performance environments for business intelligence and reporting, complementing the real-time capabilities of streaming technologies. Most leading data platforms provide a unified combination of data lake and data warehouse called lakehouse. Lakehouses are almost exclusively used for analytical workloads as they typically lack the SLAs and tight latency required for operational/transactional use cases.

Data streaming and lakehouses are complementary, with some overlaps but different sweet spots.

Let’s explore why data streaming and a lakehouse like Microsoft Fabric are complementary (with a few overlaps). I explained in the first blog of this series what Microsoft Fabric is. To understand the differences, it is important to understand what a data streaming platform really is.

Data Streaming in an Event-driven Architecture with Apache Kafka and Flink

There is a lot of confusion in the market. For instance, some folks still compare Apache Kafka to a message broker like RabbitMQ or IBM MQ. I mainly focus on Apache Kafka and Apache Flink as these are the de facto standards for data streaming across industries. Before talking about technologies and solutions, we need to start with the concept of an event-driven architecture as the foundation of data streaming.

Event-driven Architecture for Operational and Analytical Workloads

In today’s digital world, getting real-time data quickly is more important than ever. Traditional methods that process data in batches or via request-response APIs often cannot keep up when you need immediate insights.

Event-driven architecture offers a different approach by focusing on handling events – like transactions or user actions – as they happen. One of the key benefits of an event-driven architecture is its ability to decouple systems, meaning that different parts of a system can work independently. This makes it easier to scale and adapt to changes. An event-driven architecture excels in handling both operational and analytical workloads.

For operational tasks, the event-driven architecture enables real-time data processing, automating processes, enhancing customer experiences, and boosting efficiency. In e-commerce, for example, an event-driven system can instantly update inventory, trigger marketing campaigns, and detect fraud.

On the analytical side, the event-driven architecture allows organizations to derive insights from data as it flows, enabling real-time analytics and trend identification without the delays of batch processing. This is invaluable in sectors like finance and healthcare, where timely insights are crucial.

Building an event-driven architecture with data streaming technologies like Apache Kafka and Apache Flink enhances its potential. These platforms provide the infrastructure for high-throughput, low-latency data streams, enabling scalable and resilient event-driven systems.

Apache Kafka – The De Facto Standard for Event-driven Messaging and Integration

Apache Kafka has become the go-to platform for event-driven messaging and integration, transforming how organizations manage data in motion. Originally developed at LinkedIn and later open-sourced, Kafka is a distributed streaming platform adept at handling real-time data feeds. By now, over 150,000 organizations use Kafka.

Kafka’s architecture is based on a distributed commit log, ensuring data durability and consistency. It decouples data producers and consumers, allowing for flexible and scalable data architectures. Producers publish data to topics, and consumers subscribe independently, facilitating system evolution.
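
The following toy model illustrates the commit log idea: records are appended once, and every consumer group keeps its own read position. It is a single-process sketch in plain Python, not the Kafka client API; TopicLog, poll, and seek are names chosen for this illustration, and partitions, replication, and persistence are left out.

```python
class TopicLog:
    """Append-only log; each consumer group tracks its own read offset."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer group -> next offset to read

    def publish(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the appended record

    def poll(self, group, max_records=10):
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        self.offsets[group] = offset  # rewind (or skip ahead) for this group only


log = TopicLog()
for order in ["order-1", "order-2", "order-3"]:
    log.publish(order)

print(log.poll("billing"))       # ['order-1', 'order-2', 'order-3']
print(log.poll("analytics", 2))  # ['order-1', 'order-2']
log.seek("billing", 0)           # replay history for the billing group
print(log.poll("billing"))       # ['order-1', 'order-2', 'order-3']
```

Because reading does not remove records, the billing group can replay from offset 0 without affecting the analytics group. This is the main difference to a classic message queue, where a consumed message is gone.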

Beyond messaging, Kafka serves as a robust integration platform, connecting diverse systems and enabling seamless data flow. Its ecosystem of connectors allows integration with databases, cloud services, and legacy systems. This helps organizations modernize their data infrastructure step by step.

Kafka’s stream processing capabilities, through Kafka Streams and integration with Apache Flink, further enhance the value of the streaming data pipelines. Kafka Streams allows real-time data processing within Kafka to enable complex transformations and enrichments, driving data-driven innovation.

Apache Flink – The De Facto Standard for Stream Processing

Apache Flink stands out as the leading framework for stream processing. It offers a versatile platform for streaming ETL and stateful business applications. Flink processes data streams with low latency and high throughput, suitable for diverse use cases.

Flink provides a unified programming model for batch and stream processing that allows developers to use the same API for real-time transactional or analytical batch tasks. This flexibility is a significant advantage as it enables varied data processing without separate tools.

A key feature of Flink is its stateful stream processing. This is crucial for maintaining state across events in real-time applications. Flink’s state management ensures accurate processing in complex scenarios. In contrast to many other stream processing solutions, Flink can do stateful processing even at an extreme scale (i.e., with a throughput of gigabytes per second).

Flink’s event time processing capabilities handle out-of-order or delayed events and ensure consistent results. Developers can define windows and triggers based on event timestamps, accommodating late-arriving data.
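
To show what event-time windows with out-of-order data look like, here is a simplified sketch in plain Python. It is not Flink’s API, and its watermark logic is a simplification: the watermark trails the highest timestamp seen by a fixed bound, a window is emitted once the watermark reaches its end, and anything arriving after that goes to a separate late list. The function name and parameters are assumptions for this example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per tumbling event-time window; return (results, late events)."""
    open_windows = defaultdict(int)  # window start -> event count
    results, late = [], []
    watermark = float("-inf")
    for ts, value in events:
        window_start = ts - ts % window_size
        if window_start + window_size <= watermark:
            late.append((ts, value))  # window was already emitted
        else:
            open_windows[window_start] += 1
        watermark = max(watermark, ts - max_out_of_orderness)
        # Emit every open window whose end is at or before the watermark.
        for start in sorted(w for w in open_windows if w + window_size <= watermark):
            results.append((start, open_windows.pop(start)))
    # End of input: flush the windows that are still open.
    for start in sorted(open_windows):
        results.append((start, open_windows[start]))
    return results, late


# (event timestamp, payload); timestamps 8 and 9 arrive out of order.
events = [(1, "a"), (4, "b"), (12, "c"), (8, "d"), (25, "e"), (9, "f")]
results, late = tumbling_window_counts(events, window_size=10, max_out_of_orderness=5)
print(results)  # [(0, 3), (10, 1), (20, 1)]
print(late)     # [(9, 'f')]
```

The event with timestamp 8 arrives after 12 but still lands in the first window, because the watermark has not passed 10 yet. The event with timestamp 9 arrives after the window was emitted and is reported as late instead of silently changing a published result.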

Apache Flink supports multiple programming languages, including Java, Python, and SQL, offering developers the flexibility to use their preferred language for building stream processing applications. This is a key differentiator from other stream processing engines, such as Kafka Streams or KSQL.

Kafka and Flink are a “Match Made in Heaven” for Data Streaming

The integration of Flink with Apache Kafka enhances its capabilities. Kafka serves as a reliable data source for Flink to enable seamless real-time data ingestion and processing. With Kafka’s persistent commit log, you can travel back in time and replay historical data in guaranteed order (per partition) for analytical use cases. This combination supports high-volume, low-latency data pipelines, unlocking transactional real-time scenarios and batch analytics.

In summary, Apache Flink’s robust stream processing, combined with Apache Kafka, offers a powerful solution for organizations seeking to leverage real-time data. Whether for operational tasks, real-time analytics, or complex event processing (CEP), Flink provides the necessary flexibility and performance for data-driven innovation.

Microsoft Fabric and Data Streaming (Kafka, Flink): Complementary Forces, NOT Competitors

In the growing landscape of data management, it’s crucial to understand the complementary roles of Microsoft Fabric and data streaming technologies. While some may perceive these technologies as competitors, they actually serve distinct yet interconnected purposes that enhance an organization’s data strategy. And keep in mind that Microsoft Fabric is not just an offering for Azure cloud. Hybrid edge scenarios in the IoT space are perfect for Microsoft Fabric and data streaming together.

Microsoft Fabric’s Streaming Ingestion: A Common Feature Among Lakehouses

Microsoft Fabric, like other modern lakehouse platforms such as Snowflake and Databricks, offers streaming ingestion capabilities. This feature is essential for handling near real-time data flows. It allows organizations to capture and process data as it arrives in the lakehouse. However, it’s important to distinguish between operational and analytical workloads when considering the role of streaming ingestion.

Operational workloads benefit from the immediacy of streaming data, enabling real-time decision-making and process automation. In contrast, analytical workloads often require data to be stored at rest for in-depth analysis and reporting. Microsoft Fabric’s architecture focuses on streaming ingestion into robust storage solutions for analytical purposes.

The Fabric Real Time Intelligence Hub: Understanding Its Capabilities and Limitations

The integration of streaming ingestion into Microsoft Fabric is part of its Real Time Intelligence Hub, which aims to provide a comprehensive platform for managing real-time data. However, beyond the marketing and buzz around Fabric Real Time Intelligence Hub, it’s important to note that it doesn’t operate in true real-time.

Instead, Fabric’s “Real Time Intelligence Hub” uses Spark Streaming jobs to manipulate data, which can introduce some latency. And the infrastructure is not meant for critical SLAs that are required by operational / transactional systems. Additionally, the ingestion process is throttled when using Power BI and other batch analytics tools via an API gateway with a Kafka client.

Microsoft is good at introducing new names for Fabric products or features that are actually just a rebranding of existing services. If you come across new terms such as “eventhouse” or “event streams feature in the Microsoft Fabric Real-Time Intelligence”, make sure to evaluate whether this is really a new component or just some Fabric marketing.

Therefore, despite some overlapping with a data streaming platform, the collaboration between Microsoft Fabric and data streaming vendors like Confluent (Kafka, Flink) underscores the complementary nature of these platforms. By leveraging the strengths of both, organizations can build a robust data infrastructure that supports real-time operations and comprehensive analytics.

In conclusion, Microsoft Fabric and data streaming technologies such as Kafka and Flink are not competitors but complementary tools that, when used together, can significantly enhance an organization’s ability to manage and analyze data. By understanding the distinct roles each plays, businesses can create a more agile and responsive data strategy that meets both operational and analytical needs.

Enterprise Architecture with Data Streaming like Kafka / Flink and Lakehouse(s) like Microsoft Fabric

In the modern enterprise architecture landscape, data streaming and lakehouse platforms are pivotal in creating a robust and flexible data ecosystem. Data streaming technologies enable continuous data ingestion and processing for operational and analytical use cases.

Lakehouse platforms, like Microsoft Fabric, Snowflake and Databricks, provide a unified architecture that combines the best of data lakes and data warehouses, offering scalable storage and advanced analytics capabilities.

Together, these technologies empower businesses to handle both operational and analytical workloads efficiently, breaking down data silos and fostering a data-driven culture. By integrating data streaming with lakehouse architectures, enterprises can achieve seamless data flow and comprehensive insights across their operations.

Reverse ETL is an Anti-Pattern

Reverse ETL is the process of moving data from a data store at rest back into operational systems. It is often considered an anti-pattern in modern data architecture. This approach can lead to data inconsistencies, increased complexity, and higher maintenance costs, as it essentially reverses the natural flow of data in motion. Do NOT store data in Microsoft Fabric Lakehouse just to reverse it later into other operational systems!

Instead of relying on reverse ETL, organizations should focus on building real-time data pipelines that enable direct integration between data sources and operational systems. By leveraging an event-driven architecture and data streaming technologies, businesses can ensure that data is consistently updated and available where it’s needed most. This approach not only simplifies data management, but also enhances the accuracy and timeliness of insights.

Apache Iceberg as the De Facto Standard for an Open Table Format – Store Once, Analyze with any Tool

Apache Iceberg has emerged as the de facto standard for an open table format. It offers an opportunity to store data once in an object store like Amazon S3 and analyze it with various tools. With its ability to handle large-scale datasets and support ACID transactions, Iceberg provides a reliable and efficient way to manage data in a lakehouse environment.

Organizations can use their preferred analytics and processing tools without being locked into a specific vendor. This flexibility is crucial for businesses looking to maximize their data investments and adapt to changing technological landscapes. By adopting Apache Iceberg together with data streaming, enterprises can ensure data consistency and accessibility across all business units to drive better data quality, insights, and decision-making.

Shift Left Architecture to Support Operational and Analytical Workloads with an Event-driven Architecture

Traditionally, many organizations use data streaming with Kafka as a dumb pipeline to ingest all raw data into a data lake. The consequences are high compute cost for multiple (re-)processing of the raw data, inconsistencies across business units, and slow time to market for new applications.

The Shift Left Architecture is a forward-thinking approach that integrates operational and analytical workloads within an event-driven architecture. By shifting data processing closer to the source, this architecture enables real-time data ingestion and analysis, improving data quality for lakehouse ingestion, reducing latency and improving responsiveness.

Event-driven architectures, powered by technologies like Apache Kafka and Flink, facilitate the seamless flow and processing of data across systems. Shift Left ensures that both operational and analytical needs are met. This approach not only enhances the agility of data-driven applications, but also supports continuous improvement and innovation. By adopting a Shift Left Architecture, organizations can streamline their data processes, improve efficiency, and gain a competitive edge in the market.

Example: Confluent (Data Streaming) + Microsoft Fabric (Lakehouse) + Snowflake (Another Lakehouse)

An example of integrating data streaming and lakehouse technologies is the combination of Confluent, Microsoft Fabric, and Snowflake.

Confluent, built on Apache Kafka and Flink, provides a robust platform for real-time data streaming, enabling organizations to integrate with operational and analytical workloads.

Microsoft Fabric and Snowflake, both lakehouse platforms, offer scalable storage and advanced analytics capabilities that allow businesses to perform in-depth analysis and reporting on historical data, near real-time analytics, and AI model training.

Apache Iceberg enables storing data once and connects any analytical engine to the data, including lakehouses such as Microsoft Fabric or Snowflake, and unified batch and streaming frameworks such as Apache Flink. Iceberg improves the overall data quality for data sharing, reduces storage cost and enables a much faster rollout of new analytical applications.

By leveraging Confluent for data streaming and integrating it with Microsoft Fabric and Snowflake, organizations can create a comprehensive data architecture that supports both real-time operations and long-term analytics. This synergy not only enhances data accessibility and consistency but also empowers businesses to make data-driven decisions with confidence.

Microsoft Fabric Lakehouse + Data Streaming (Kafka, Flink) = Match Made in Heaven

In conclusion, the synergy between Microsoft Fabric and data streaming technologies like Apache Kafka and Apache Flink creates a powerful combination for modern data management. While Microsoft Fabric excels in providing robust analytics and storage capabilities, data streaming platforms offer real-time data processing and integration, ensuring that businesses can respond swiftly to operational demands.

By leveraging both technologies together, organizations can build a comprehensive data architecture that supports both immediate and long-term needs, enhancing their ability to make informed, data-driven decisions. This complementary relationship not only breaks down data silos, but also fosters a more agile and responsive data strategy. As businesses continue to navigate the complexities of data management, understanding and using the strengths of both Microsoft Fabric and data streaming from vendors like Confluent will be key to achieving a competitive edge.

The Shift Left Architecture, when paired with Apache Iceberg’s open table format, simplifies the integration of data streaming with one or more lakehouses. This combination enhances data quality for all data consumers and significantly reduces overall storage costs.

In part three of this blog series, I will dig deeper into the data streaming alternatives: when to choose open source frameworks such as Apache Kafka and Flink, a leading data streaming platform such as Confluent, or a native Azure service like Event Hubs. Spoiler: The trade-offs are huge. Do a proper evaluation BEFORE choosing your data streaming solution.

How do you see the combination of a lakehouse like Microsoft Fabric with data streaming? Do you already use both together? And what is your strategy for other data lakes and data warehouses you already have in your enterprise architecture, such as Databricks or Snowflake? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.) appeared first on Kai Waehner.

What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks

Kai Waehner — Fri, 04 Oct 2024 05:16:31 +0000

If you ask your favorite large language model, Microsoft Fabric appears to be the ultimate solution for any data challenge you can imagine. That’s also the impression many people get from Microsoft’s sales teams. But is it really the silver bullet it’s made out to be? This article takes a closer look. The first part explores the glossy marketing and sales definition of the platform. The second part looks at the layers and deconstructs it from a more practical perspective. By doing so, the third part uncovers what Microsoft Fabric is truly built for, and how it fits into the wider data landscape, especially in comparison to other major players in the data analytics market like Databricks and Snowflake.

This is part one of a blog series about Microsoft Fabric and its relation to other data platforms on the Azure cloud:

What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Compares (or Competes) with Snowflake and Databricks
How Microsoft Fabric Complements Data Streaming (Apache Kafka, Flink, et al.)
When to Choose Apache Kafka vs. Azure Event Hubs vs. Confluent Cloud for a Microsoft Fabric Lakehouse

Subscribe to my newsletter to get an email about a new blog post every few weeks.

What is Microsoft Fabric?

If you listen to Microsoft’s sales and marketing, then Microsoft Fabric is a silver bullet for every use case. Let’s take a two-step approach. Look at the sales and marketing definition. Then deconstruct it a bit from a more realistic point of view…

GenAI Definition (= Sales and Marketing)

If you ask your favourite large language model or search engine “What is Microsoft Fabric”, it tells you something like the following (based on sales and marketing content):

“Microsoft Fabric is an end-to-end analytics platform designed to integrate various data services and enable businesses to manage, analyze, and act on their data seamlessly. It was launched as part of Microsoft’s data ecosystem and builds upon key features from platforms like Power BI, Azure Synapse Analytics, and Azure Data Factory.

Here are some key aspects of Microsoft Fabric:

Unified Platform: It combines data engineering, data science, data warehousing, and real-time analytics into a single platform. This helps businesses eliminate the need to use multiple services for data management and analysis.
Lakehouse Architecture: Fabric is designed around the lakehouse concept, which merges the best of data lakes and data warehouses. It allows for both structured and unstructured data to be stored and processed together.
Tightly Integrated with Microsoft 365 and Azure: Microsoft Fabric connects seamlessly with other Microsoft services like Microsoft 365, Power BI, and Azure Machine Learning, enabling better collaboration, reporting, and AI-driven insights.
Low-code/No-code Experience: The platform provides intuitive tools for data analysts, developers, and business users, allowing non-technical users to work with data through drag-and-drop interfaces, while also enabling more complex scenarios for advanced users.
AI and Machine Learning Integration: Microsoft Fabric incorporates AI tools, making it easier for businesses to build predictive models and automate data-driven decisions.
End-to-End Security and Governance: The platform supports robust security measures and compliance requirements, offering features like data encryption, role-based access control, and regulatory compliance support.
Real-time Data Processing: With support for real-time analytics, Fabric enables organizations to derive insights from live data streams, improving decision-making speed and accuracy.

Microsoft Fabric is designed to streamline how businesses use data, combining the power of analytics with cloud-scale capabilities.”

Wow. Just wow. Microsoft Fabric seems to be everything you ever need for your data challenges.

Microsoft Developer has an excellent 45-minute presentation about OneLake and Microsoft Fabric with a few more technical details. This video is also the source of the screenshots below.

Well, let’s dig deeper. What is Microsoft Fabric really? Let’s deconstruct it a bit…

Microsoft Fabric is a Data Analytics Platform ( = NOT for Operational / Transactional Workloads)

Microsoft Fabric is part of Microsoft’s data analytics portfolio. That’s already the first alarm signal when you consider building operational workloads. This is not a criticism, but important to understand!

Microsoft Fabric is NOT a platform for transactional workloads like payments, fraud detection, order management or ERP integration. You should not build an operational application like an Azure Serverless Function or a self-managed Spring Boot container for Fabric.

Furthermore, within the data analytics layer, the foundation of Microsoft Fabric is (only) an optimized storage layer. And this storage layer called OneLake is a SaaS offering, i.e., the storage is part of the Microsoft tenant. Contrary to many other data lakes and lakehouses like Databricks, you do not control or own the storage.

While the conversation is usually around cloud analytics, Microsoft Fabric is a unified analytics platform that integrates with Azure Cloud but is sold independently. This allows organizations to deploy it in various environments, edge, and hybrid setups. For instance, Microsoft sells Fabric for hybrid IoT projects where data needs to be processed both locally and in the cloud.

OneLake – Cloud-based Storage Layer on Top of Azure Data Lake Storage (ADLS)

Microsoft OneLake is a unified, cloud-based data lake that acts as the central storage layer within Microsoft Fabric:

Microsoft OneLake is built on top of Azure Data Lake Storage (ADLS), using its scalable and secure data storage capabilities for long-term data retention. OneLake inherits ADLS’s features like hierarchical namespaces and advanced security, while adding a unified data lake experience across multiple clouds and deep integration with Microsoft’s analytics and data tools through Microsoft Fabric.

Source: Microsoft

The message is obvious: Store all data in OneLake and connect your favourite compute engines, such as Microsoft Fabric, Azure Databricks and Snowflake. Open Table Formats like Delta Lake and Apache Iceberg allow simple integration without the need to copy data again.

Microsoft Fabric Connects to Many Existing Azure Services

On top of the storage layer OneLake, Microsoft Fabric connects to plenty of different existing Microsoft Azure services, including Power BI, Data Explorer, various Synapse services, and so on. This explains why Microsoft Fabric can magically provide every capability you are looking for a few months after the initial announcement.

Source: Microsoft

Here are a few integrations of Azure services into the unified storage of Microsoft Fabric:

Power BI: A critical component of Microsoft Fabric, enabling data visualization and business intelligence. It allows users to create interactive dashboards and reports directly from data stored in the lakehouse, providing real-time insights with minimal data movement.
Azure Data Explorer: Used for analyzing large volumes of streaming and historical data. Microsoft Fabric connects to Data Explorer, allowing users to perform fast, complex, real-time queries on structured and semi-structured data.
Azure Synapse Analytics: Fabric integrates Synapse’s data engineering capabilities, allowing users to prepare, transform, and orchestrate data pipelines. It provides a unified workspace to manage end-to-end data engineering workflows, reducing the need for complex data movement.
Synapse Data Warehousing: Fabric connects with Synapse’s data warehousing services, making it easy to run massively parallel processing (MPP) queries for large-scale analytics on structured data.
Synapse Spark Pools: Fabric integrates with Apache Spark in Synapse, supporting big data processing, AI, and machine learning workloads. Users can leverage Spark’s distributed computing power within Fabric for data transformation, advanced analytics, and machine learning.
Azure Machine Learning (AML): Enables data scientists to build, train, and deploy machine learning models on data stored within the Fabric lakehouse. Users can perform machine learning experiments, automate ML model training, and deploy models within a unified data platform.
Azure Data Factory: Used for data ingestion, ETL (extract, transform, load), and data orchestration. Fabric connects with Azure Data Factory, making it easy to create data pipelines that move and transform data from a wide variety of sources, including on-premises databases, cloud storage, and third-party systems.
Azure Purview: Provides a unified data catalog, allowing users to discover, classify, and govern data assets across the Fabric ecosystem. It also provides compliance and auditing capabilities.
Azure Event Hubs and Stream Analytics: Real-time data processing and analytics. Event Hubs enables streaming data ingestion from sources like IoT devices, applications, and logs, while Stream Analytics allows for real-time data querying and analysis.

Expect more Azure services to be integrated with Microsoft Fabric in the coming months to provide a “complete lakehouse experience”. Also expect more fancy marketing brands, such as the new “Real Time Intelligence Hub” that is built by connecting / re-using existing Microsoft Azure services.

So, what is the main idea behind building this lakehouse product and brand within Microsoft’s huge cloud portfolio?

Microsoft Fabric is a Lakehouse Competing with Snowflake and Databricks

A lakehouse is a data architecture that combines the features of data lakes and data warehouses, allowing for both structured and unstructured data to be stored and processed together. It provides the scalability and flexibility of a data lake with the data management, governance, and performance features of a data warehouse. This unified approach enables real-time analytics and machine learning on diverse types of data, reducing the need for separate infrastructures.

Most analytical data vendors transition to a full-blown lakehouse. While Databricks moved from the data lake foundation powered by Apache Spark into the lakehouse, Snowflake comes from the data warehouse approach but has incorporated a lot of lakehouse features over time (even though Snowflake calls it a more general “data cloud”).

Microsoft Fabric competes with platforms like Databricks and Snowflake in the realm of data analytics, data engineering, and data warehousing by providing an integrated, cloud-native solution for data management and analytics.

Microsoft Fabric positions itself as a more holistic and integrated platform, offering a unified solution for businesses that need to handle everything from data ingestion to real-time analytics and AI. Its Microsoft ecosystem integration is a key competitive advantage.

There are also trade-offs. For instance:

Microsoft Fabric is only available on Azure cloud
Not a mature product yet
Starting a much more competitive approach with strategic partners like Databricks

The support of open table formats like Delta Lake and Apache Iceberg is great. But this is coming in all lakehouses because of market pressure. Not because the data cloud vendors like Databricks, Snowflake and now Microsoft with Fabric have a new business model. All of these vendors still want to collect all the data, store it forever, and put (their own!) compute services on top.

Microsoft Fabric is Azure’s Future Lakehouse

Microsoft Fabric’s integration with many Azure services allows it to offer a broad range of capabilities – from data ingestion, storage, and transformation to real-time analytics, machine learning, and governance. This interconnected ecosystem explains how Fabric can quickly meet diverse enterprise needs by leveraging Microsoft’s existing suite of powerful tools, providing a comprehensive data platform with minimal friction and seamless workflows.

In the end, Microsoft Fabric is a new lakehouse built on top of the optimized cloud storage OneLake. It directly competes with other lakehouses and data clouds such as Databricks and Snowflake to become the leading unified solution for all the things analytics. The future will show where this competition goes. Snowflake and Databricks have a very strong product and customer base already. They will not give up to Microsoft Fabric voluntarily.

Microsoft Fabric includes integrations with Azure Event Hubs (based on the Kafka protocol) and is building a brand around real-time intelligence. In the next article of this blog series, I will explore how this new lakehouse on Azure cloud competes or overlaps with data streaming technologies such as Apache Kafka, Flink, et al. Primer: Data Streaming and Microsoft Fabric are mostly complementary and have very different sweet spots.

How do you see the future of Microsoft Fabric? Do you already use it? What is the plan in the future, also keeping in mind that you likely already have other lakehouses in your enterprise architecture? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks appeared first on Kai Waehner.

Apache Iceberg – The Open Table Format for Lakehouse AND Data Streaming

Kai Waehner — Sat, 13 Jul 2024 12:01:01 +0000

Every data-driven organization has operational and analytical workloads. A best of breed approach emerges with various data platforms, including data streaming, data lake, data warehouse and lakehouse solutions and cloud services. An open table format framework like Apache Iceberg is essential in the enterprise architecture to ensure reliable data management and sharing, seamless schema evolution, efficient handling of large-scale datasets and cost-efficient storage while providing strong support for ACID transactions and time travel queries. This blog post explores market trends, adoption of table format frameworks like Iceberg, Hudi, Paimon, Delta Lake, XTable, and the product strategy of leading vendors of data platforms such as Snowflake, Databricks (Apache Spark), Confluent (Apache Kafka / Flink), Amazon Athena and Google BigQuery.

What is an Open Table Format for a Data Platform?

An open table format helps in maintaining data integrity, optimizing query performance, and ensuring a clear understanding of the data stored within the platform.

The open table format for data platforms typically includes a well-defined structure with specific components that ensure data is organized, accessible, and easily queryable. A typical table format contains a table name, column names, data types, primary and foreign keys, indexes, and constraints.

This is not a new concept. Your favourite decades-old database, like Oracle, IBM DB2 (even on the mainframe) or PostgreSQL, uses the same principles. However, the requirements and challenges changed a bit for cloud data warehouses, data lakes, and lakehouses regarding scalability, performance and query capabilities.

Benefits of a “Lakehouse Table Format” like Apache Iceberg

Every part of an organization becomes data-driven. The consequence is extensive data sets, data sharing with data products across business units, and new requirements for processing data in near real-time.

Apache Iceberg provides many benefits for the enterprise architecture:

Single Storage: Data is stored once (coming from various data sources), which reduces cost and complexity
Interoperability: Access without integration efforts from any analytical engine
All Data: Unify operational and analytical workloads (transactional systems, big data logs/IoT/clickstream, mobile APIs, 3rd party B2B interfaces, etc.)
Vendor Independence: Work with any favorite analytics engine (no matter if it is near real-time, batch or API-based)

Apache Hudi and Delta Lake provide the same characteristics. However, Delta Lake is mainly driven by Databricks as a single vendor.

Table Format AND Catalog Interface

It is important to understand that discussions about Apache Iceberg or similar table format frameworks include two concepts: Table Format AND Catalog Interface! As an end user of the technology, you need both!

The Apache Iceberg project implements the format but provides only a specification (not an implementation) for the catalog:

The table format defines how data is organized, stored, and managed within a table.
The catalog interface manages the metadata for tables and provides an abstraction layer for accessing tables in a data lake.
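To make the split between the two concepts more tangible, here is a minimal Python sketch. It is a toy model and not the Iceberg API: the names `TableMetadata`, `InMemoryCatalog` and `commit` are assumptions made up for this illustration. The table format side describes the schema and the data files of a table. The catalog side only maps a table name to the current metadata and swaps that pointer atomically.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TableMetadata:
    """Table format side (hypothetical): schema and data files of one table version."""
    version: int
    schema: Tuple[Tuple[str, str], ...]  # (column name, type)
    data_files: Tuple[str, ...]


class InMemoryCatalog:
    """Catalog side (hypothetical): maps a table name to its current metadata."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableMetadata] = {}

    def load(self, name: str) -> TableMetadata:
        return self._tables[name]

    def commit(self, name: str, expected_version: int, new: TableMetadata) -> bool:
        """Swap the pointer only if no other writer committed in between."""
        current = self._tables.get(name)
        current_version = current.version if current else 0
        if current_version != expected_version:
            return False  # a concurrent writer won; the caller has to retry
        self._tables[name] = new
        return True


catalog = InMemoryCatalog()
v1 = TableMetadata(1, (("order_id", "long"), ("amount", "double")), ("f1.parquet",))
first = catalog.commit("sales.orders", expected_version=0, new=v1)

v2 = TableMetadata(2, v1.schema, v1.data_files + ("f2.parquet",))
second = catalog.commit("sales.orders", expected_version=1, new=v2)

# A writer that still believes version 1 is current gets rejected.
stale = catalog.commit("sales.orders", expected_version=1, new=v2)
```

The point of the sketch: any engine that can talk to the catalog and read the metadata can work with the same table, which is why you need both pieces.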

The Apache Iceberg documentation explores the concepts in much more detail, based on this diagram:

Source: Apache Iceberg

Organizations use various implementations for Iceberg’s catalog interface. Each integrates with different metadata stores and services. Key implementations include:

Hadoop Catalog: Uses the Hadoop Distributed File System (HDFS) or other compatible file systems to store metadata. Suitable for environments already using Hadoop.
Hive Catalog: Integrates with Apache Hive Metastore to manage table metadata. Ideal for users leveraging Hive for their metadata management.
AWS Glue Catalog: Uses AWS Glue Data Catalog for metadata storage. Designed for users operating within the AWS ecosystem.
REST Catalog: Provides a RESTful interface for catalog operations via HTTP. Enables integration with custom or third-party metadata services.
Nessie Catalog: Uses Project Nessie, which provides a Git-like experience for managing data.

The momentum and growing adoption of Apache Iceberg motivates many data platform vendors to implement their own Iceberg catalog. I discuss a few strategies in the section below about data platform and cloud vendor strategies, including Snowflake’s Polaris, Databricks’ Unity, and Confluent’s Tableflow.

First-Class Iceberg Support vs. Iceberg Connector

Please note that supporting Apache Iceberg (or Hudi/Delta Lake) means much more than just providing a connector and integration with the table format via API. Vendors and cloud services differentiate by advanced features like automatic mapping between data formats, critical SLAs, travel back in time, intuitive user interfaces, and so on.

Let’s look at an example: Integration between Apache Kafka and Iceberg. Various Kafka Connect connectors were already implemented. However, here are the benefits of using a first-class integration with Iceberg (e.g., Confluent’s Tableflow) compared to just using a Kafka Connect connector:

No connector config
No consumption through connector
Built-in maintenance (compaction, garbage collection, snapshot management)
Automatic schema evolution
External catalog service synchronization
Simpler operations (in a fully-managed SaaS solution, it is serverless, with no need for any scale or operations by the end-user)

Similar benefits apply to other data platforms and potential first-class integration compared to providing simple connectors.
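To illustrate why built-in maintenance matters: continuous streaming writes tend to produce many small files, and compaction rewrites them into fewer, larger ones. The following Python sketch shows only the planning step under simplified assumptions. The function `plan_compaction` and its greedy grouping are hypothetical and do not represent how any specific product implements compaction.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """Group small files into rewrite batches of at most target_mb each."""
    batches: List[List[int]] = []
    current: List[int] = []
    for size in file_sizes_mb:
        if size >= target_mb:
            continue  # already large enough, leave the file untouched
        if current and sum(current) + size > target_mb:
            batches.append(current)
            current = []
        current.append(size)
    if current:
        batches.append(current)
    # Rewriting a single file gains nothing, so keep only real merges.
    return [batch for batch in batches if len(batch) > 1]


# Seven files written by a streaming job; only the 100 MB file is big enough.
plan = plan_compaction([10, 20, 30, 100, 5, 40, 15], target_mb=64)
```

With a plain connector, somebody has to schedule and operate jobs like this. With a first-class integration, the platform takes care of it.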

Open Table Format for a Data Lake / Lakehouse using Apache Iceberg, Apache Hudi, Delta Lake

The general goal of table format frameworks such as Apache Iceberg, Apache Hudi, and Delta Lake is to enhance the functionality and reliability of data lakes by addressing common challenges associated with managing large-scale data. These frameworks help to:

Improve Data Management
- Facilitate easier handling of data ingestion, storage, and retrieval in data lakes.
- Enable efficient data organization and storage, supporting better performance and scalability.
Ensure Data Consistency
- Provide mechanisms for ACID transactions, ensuring that data remains consistent and reliable even during concurrent read and write operations.
- Support snapshot isolation, allowing users to view a consistent state of data at any point in time.
Support Schema Evolution
- Allow for changes in data schema (such as adding, renaming, or removing columns) without disrupting existing data or requiring complex migrations.
Optimize Query Performance
- Implement advanced indexing and partitioning strategies to improve the speed and efficiency of data queries.
- Enable efficient metadata management to handle large datasets and complex queries effectively.
Enhance Data Governance
- Provide tools for better tracking and managing data lineage, versioning, and auditing, which are crucial for maintaining data quality and compliance.

By addressing these goals, table format frameworks like Apache Iceberg, Apache Hudi, and Delta Lake help organizations build more robust, scalable, and reliable data lakes and lakehouses. Data engineers, data scientists and business analysts leverage analytics, AI/ML or reporting/visualization tools on top of the table format to manage and analyze large volumes of data.
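Snapshot isolation and time travel are easier to grasp with a small example. The following Python sketch is a toy model, not the API of Iceberg, Hudi or Delta Lake: the class `SnapshotTable` and its methods are assumptions for illustration. Every commit publishes a new immutable snapshot, so a reader that pinned an older snapshot keeps seeing a consistent state.

```python
from typing import List, Tuple


class SnapshotTable:
    """Toy append-only table: every commit creates a new immutable snapshot."""

    def __init__(self) -> None:
        # Snapshot 0 is the empty table.
        self._snapshots: List[Tuple[dict, ...]] = [()]

    def commit(self, rows: List[dict]) -> int:
        """Publish all rows at once and return the new snapshot id."""
        latest = self._snapshots[-1]
        self._snapshots.append(latest + tuple(rows))
        return len(self._snapshots) - 1

    def current_snapshot_id(self) -> int:
        return len(self._snapshots) - 1

    def scan(self, snapshot_id: int) -> Tuple[dict, ...]:
        """Read a fixed snapshot; later commits never change the result."""
        return self._snapshots[snapshot_id]


table = SnapshotTable()
s1 = table.commit([{"id": 1}, {"id": 2}])

reader_view = table.current_snapshot_id()  # a long-running query pins snapshot 1
s2 = table.commit([{"id": 3}])             # a concurrent writer adds data

pinned = table.scan(reader_view)                  # unaffected by the later commit
latest = table.scan(table.current_snapshot_id())  # sees all committed rows
time_travel = table.scan(s1)                      # query the table as of snapshot 1
```

Real table formats track snapshots in metadata files instead of copying rows, but the reader-facing behavior follows the same idea.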

Comparison of Apache Iceberg, Hudi, Paimon, Delta Lake?

I won’t do a comparison of the different table format frameworks Apache Iceberg, Apache Hudi, Apache Paimon and Delta Lake here. Many experts wrote about this already. Each table format framework has unique strengths and benefits. But updates are required every month because of the fast evolution and innovation, adding new improvements and capabilities within these frameworks.

Here is a summary of what I see in various blog posts about the four alternatives:

Apache Iceberg: Excels in schema and partition evolution, efficient metadata management, and broad compatibility with various data processing engines.
Apache Hudi: Best suited for real-time data ingestion and upserts, with strong change data capture capabilities and data versioning.
Apache Paimon: A lake format that enables building a real-time lakehouse architecture with Flink and Spark for both streaming and batch operations.
Delta Lake: Provides robust ACID transactions, schema enforcement, and time travel features, making it ideal for maintaining data quality and integrity.

A key decision point might be that Delta Lake is not driven by a broad community like Iceberg and Hudi, but mainly by Databricks as a single vendor behind it.

Apache XTable as Interoperable Cross-Table Framework supporting Iceberg, Hudi and Delta Lake

Users have lots of choices. Apache XTable is yet another framework, an incubating project under the Apache open source license, that provides seamless interoperability across Apache Hudi, Delta Lake, and Apache Iceberg tables.

Apache XTable:

provides cross-table omnidirectional interoperability between lakehouse table formats.
is NOT a new or separate format. Instead, it provides abstractions and tools for the translation of lakehouse table format metadata.
was formerly known as OneTable.

Maybe Apache XTable is the answer to provide options for specific data platforms and cloud vendors but still provide simple integration and interoperability.

But be careful: A wrapper on top of different technologies is not a silver bullet. We saw this years ago when Apache Beam emerged. Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data ingestion and data processing workflows. It supports a variety of stream processing engines, such as Flink, Spark, and Samza. The primary driver behind Apache Beam is Google, which uses it to allow the migration of workflows into Google Cloud Dataflow. However, the limitations are huge, as such a wrapper needs to settle on the least common denominator of supported features. And the key benefit of most frameworks is the twenty percent that does not fit into such a wrapper. For these reasons, Kafka Streams, for instance, intentionally does not support Apache Beam because it would have required too many design limitations.
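The least common denominator problem can be expressed in a few lines of Python. The engine names and feature sets below are invented for illustration and do not describe any real framework: a wrapper can only expose what all wrapped engines share, and everything else is lost.

```python
# Hypothetical feature sets of three wrapped engines (illustrative only).
engine_features = {
    "engine_a": {"windowing", "stateful_processing", "exactly_once", "interactive_queries"},
    "engine_b": {"windowing", "stateful_processing", "exactly_once", "sql_api"},
    "engine_c": {"windowing", "stateful_processing", "micro_batching"},
}

# The wrapper can only offer what every engine supports.
common = set.intersection(*engine_features.values())

# Per engine: the differentiating features that do not fit into the wrapper.
lost = {name: sorted(features - common) for name, features in engine_features.items()}
```

In this toy data set, the features in `lost` are exactly the ones that made each engine worth choosing in the first place.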

Market Adoption of Table Format Frameworks

First of all, we are still in the early stages. We are still at the innovation trigger in terms of the Gartner Hype Cycle, coming to the peak of inflated expectations. Most organizations are still evaluating, but not adopting these table formats in production across the organization yet.

Flashback: Container Wars – Kubernetes vs. Mesosphere vs. Cloud Foundry

The debate around Apache Iceberg reminds me of the container wars a few years ago. The term “Container Wars” refers to the competition and rivalry among different containerization technologies and platforms in the realm of software development and IT infrastructure.

The three competing technologies were Kubernetes, Mesosphere and Cloud Foundry. Here is where it went:

Cloud Foundry and Mesosphere were early. Kubernetes still won the battle. Why? I never understood all the technical details and differences. In the end, if the three frameworks are pretty similar, it is all about community adoption, the right timing of feature releases, good marketing, luck, and a few other factors. But it is good for the software industry to have one leading open source framework to build solutions and business models on, instead of three competing ones.

Present: Table Format Wars – Apache Iceberg vs. Hudi vs. Delta Lake

Obviously, Google Trends is no statistical evidence or sophisticated research. But I used it a lot in the past as an intuitive, simple, free tool to analyze market trends. Therefore, I also use this tool to see if Google searches overlap with my personal experience of the market adoption of Apache Iceberg, Hudi and Delta Lake (Apache XTable is still too small to be added):

We obviously see a pattern similar to the one the container wars showed a few years ago. I have no idea where this is going, and whether one technology wins or the frameworks differentiate enough to prove that there is no silver bullet. The future will show us.

My personal opinion? I think Apache Iceberg will win the race. Why? I cannot argue with any technical reasons. I just see many customers across all industries talk about it more and more. And more and more vendors start supporting it. But we will see. I actually do NOT care who wins. However, similarly to the container wars, I think it is good to have a single standard and vendors differentiating with features around it, like it is with Kubernetes.

But with this in mind, let’s explore the current strategy of the leading data platforms and cloud providers regarding table format support in their platforms and cloud services.

Data Platform and Cloud Vendor Strategies for Apache Iceberg

I won’t do any speculation in this section. The evolution of the table format frameworks moves quickly. And vendor strategies change quickly. Please refer to the vendor’s websites for the latest information. But here is a status quo about the data platform and cloud vendor strategies regarding the support and integration of Apache Iceberg.

Snowflake:
- Has supported Apache Iceberg for quite some time already
- Adding better integrations and new features regularly
- Internal and external storage options (with trade-offs) like Snowflake’s storage or Amazon S3
- Announced Polaris, an open source catalog implementation for Iceberg, with commitment to support community-driven, vendor-agnostic bi-directional integration
Databricks:
- Focuses on Delta Lake as the table format and (now open sourced) Unity as catalog
- Acquired Tabular, the leading company behind Apache Iceberg
- Unclear future strategy of supporting open Iceberg interface (in both directions) or only to feed data into its lakehouse platform and technologies like Delta Lake and Unity Catalog
Confluent:
- Embeds Apache Iceberg as a first-class citizen into its data streaming platform (the product is called Tableflow)
- Converts a Kafka Topic and related schema metadata (i.e., data contract) into an Iceberg table
- Bi-directional integration between operational and analytical workloads
- Analytics with embedded serverless Flink and its unified batch and streaming API or data sharing with 3rd party analytics engines like Snowflake, Databricks or Amazon Athena
More Data Platforms and Open Source Analytics Engines:
- The list of technologies and cloud services supporting Iceberg grows every month
- A few examples: Apache Spark, Apache Flink, ClickHouse, Dremio, Starburst using Trino (formerly PrestoSQL), Cloudera using Impala, Imply using Apache Druid, Fivetran
Cloud Service Providers (AWS, Azure, GCP, Alibaba):
- Different strategies and integrations, but all cloud providers increase Iceberg support across their services these days, for instance:
- Object Storage: Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS)
- Catalogs: Cloud-specific like AWS Glue Catalog or vendor agnostic like Project Nessie or Hive Catalog
- Analytics: Amazon Athena, Azure Synapse Analytics, Microsoft Fabric, Google BigQuery
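The idea of converting a Kafka topic and its schema into an Iceberg table, as mentioned for Confluent above, can be sketched as follows. This is a simplified assumption of what such a mapping might look like, not how Tableflow or any other product is implemented: the function `topic_to_table`, the naming rule, and the restriction to primitive Avro types are all hypothetical.

```python
from typing import Dict, List

# Primitive Avro types and their Iceberg counterparts (complex types omitted).
AVRO_TO_ICEBERG: Dict[str, str] = {
    "boolean": "boolean",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "string",
    "bytes": "binary",
}


def topic_to_table(topic: str, avro_schema: dict) -> dict:
    """Derive a table definition from a topic name and its value schema."""
    columns: List[dict] = []
    # Iceberg identifies columns by unique field ids; here they are simply counted up.
    for field_id, avro_field in enumerate(avro_schema["fields"], start=1):
        columns.append({
            "id": field_id,
            "name": avro_field["name"],
            "type": AVRO_TO_ICEBERG[avro_field["type"]],
        })
    # Hypothetical naming rule: replace characters that are awkward in table names.
    table_name = topic.replace(".", "_").replace("-", "_")
    return {"table": table_name, "columns": columns}


order_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "customer", "type": "string"},
        {"name": "payload", "type": "bytes"},
    ],
}

table_def = topic_to_table("orders.v1", order_schema)
```

The data contract of the topic becomes the schema of the table, so operational and analytical consumers work with the same definition of the data.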

The Shift Left Architecture with Kafka, Flink and Iceberg to Unify Operational and Analytical Workloads

The Shift Left Architecture moves data processing closer to the data source, leveraging real-time data streaming technologies like Apache Kafka and Flink to process data in motion directly after it is ingested. This approach reduces latency and improves data consistency and data quality.

Unlike ETL and ELT, which involve batch processing with the data stored at rest, the Shift Left Architecture enables real-time data capture and transformation. It aligns with the Zero ETL concept by making data immediately usable. But in contrast to Zero ETL, shifting data processing to the left side of the enterprise architecture avoids a complex, hard-to-maintain spaghetti architecture with many point-to-point connections.

The Shift Left Architecture also reduces the need for Reverse ETL by ensuring data is actionable in real-time for both operational and analytical systems. Overall, this architecture enhances data freshness, reduces costs, and speeds up the time-to-market for data-driven applications. Learn more about this concept in my blog post about “The Shift Left Architecture”.
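A minimal Python sketch of the principle, under simplified assumptions: in a real deployment the events would flow through Kafka topics and the processing would run in a stream processor such as Flink. The functions `clean` and `shift_left` and the in-memory consumers are hypothetical stand-ins. The data is validated and transformed once, close to the source, and every downstream consumer receives the same curated record.

```python
from typing import Callable, Dict, Iterable, List, Optional


def clean(event: Dict) -> Optional[Dict]:
    """Validate and normalize a raw event once, close to the source."""
    amount = event.get("amount")
    if amount is None or amount < 0:
        return None  # drop invalid events before they reach any consumer
    return {
        "order_id": event["order_id"],
        "amount": float(amount),
        "currency": event.get("currency", "USD").upper(),
    }


def shift_left(raw_events: Iterable[Dict], consumers: List[Callable[[Dict], None]]) -> None:
    """Process each event once and hand the curated result to all consumers."""
    for event in raw_events:
        curated = clean(event)
        if curated is None:
            continue
        for consume in consumers:
            consume(curated)


operational_app: List[Dict] = []  # stand-in for a real-time application
analytics_table: List[Dict] = []  # stand-in for a lakehouse table

raw = [
    {"order_id": 1, "amount": 19.9, "currency": "eur"},
    {"order_id": 2, "amount": -5},  # invalid, filtered out once for everybody
    {"order_id": 3, "amount": 7},
]
shift_left(raw, [operational_app.append, analytics_table.append])
```

Because the cleaning logic runs once on the left side, the operational and the analytical consumer cannot drift apart, and no Reverse ETL job is needed to bring curated data back.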

An open table format and catalog introduce enormous benefits into the enterprise architecture: interoperability, freedom of choice of the analytics engines, faster time-to-market and reduced cost.

Apache Iceberg seems to become the de facto standard across vendors and cloud providers. However, it is still at an early stage, and competing and wrapper technologies like Apache Hudi, Apache Paimon, Delta Lake and Apache XTable are trying to gain momentum, too.

Iceberg and other open table formats are not just a huge win for the single storage and integration with multiple analytics / data / AI/ML platforms such as Snowflake, Databricks, Google BigQuery, et al., but also for the unification of operational and analytical workloads using data streaming with technologies such as Apache Kafka and Flink. The Shift Left Architecture is a significant benefit to reduce efforts, improve data quality and consistency, and enable real-time instead of batch applications and insights.

Finally, if you still wonder what the differences are between data streaming and lakehouses (and how they complement each other), check out this ten-minute video:

What is your table format strategy? Which technologies and cloud services do you connect? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Iceberg – The Open Table Format for Lakehouse AND Data Streaming appeared first on Kai Waehner.

The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming

Kai Waehner — Sat, 15 Jun 2024 06:12:44 +0000

Data integration is a hard challenge in every enterprise. Batch processing and Reverse ETL are common practices in a data warehouse, data lake or lakehouse. Data inconsistency, high compute cost, and stale information are the consequences. This blog post introduces a new design pattern to solve these problems: The Shift Left Architecture enables a data mesh with real-time data products to unify transactional and analytical workloads with Apache Kafka, Flink and Iceberg. Consistent information is handled with stream processing or ingested into Snowflake, Databricks, Google BigQuery, or any other analytics / AI platform to increase flexibility, reduce cost and enable a data-driven company culture with faster time-to-market building innovative software applications.

Data Products – The Foundation of a Data Mesh

A data product is a crucial concept in the context of a data mesh that represents a shift from traditional centralized data management to a decentralized approach.

McKinsey finds that “when companies instead manage data like a consumer product—be it digital or physical—they can realize near-term value from their data investments and pave the way for quickly getting more value tomorrow. Creating reusable data products and patterns for piecing together data technologies enables companies to derive value from data today and tomorrow”:

According to McKinsey, the benefits of the data product approach can be significant:

New business use cases can be delivered as much as 90 percent faster.
The total cost of ownership, including technology, development, and maintenance, can decline by 30 percent.
The risk and data-governance burden can be reduced.

Data Product from a Technical Perspective

Here’s what a data product entails in a data mesh from a technical perspective:

Decentralized Ownership: Each data product is owned by a specific domain team. Applications are truly decoupled.
Sourced from Operational and Analytical Systems: Data products include information from any data source, including the most critical systems and analytics/reporting platforms.
Self-contained and Discoverable: A data product includes not only the raw data but also the associated metadata, documentation, and APIs.
Standardized Interfaces: Data products adhere to standardized interfaces and protocols, ensuring that they can be easily accessed and utilized by other data products and consumers within the data mesh.
Data Quality: Most use cases benefit from real-time data. A data product ensures data consistency across real-time and batch applications.
Value-Driven: The creation and maintenance of data products are driven by business value.

In essence, a data product in a data mesh framework transforms data into a managed, high-quality asset that is easily accessible and usable across an organization, fostering a more agile and scalable data ecosystem.
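The characteristics listed above can be captured in a small descriptor. The following Python sketch is hypothetical: the class `DataProduct`, its fields, and the simple type check are assumptions for illustration, not a standard or a product API. It shows how ownership, documentation, interfaces and a data contract travel together with the data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProduct:
    """Hypothetical descriptor of a data product in a data mesh."""
    name: str
    owner_domain: str          # decentralized ownership
    description: str           # documentation for discoverability
    schema: Dict[str, type]    # data contract: field name -> expected type
    interfaces: List[str] = field(default_factory=list)  # standardized access paths

    def violations(self, record: dict) -> List[str]:
        """Return the contract violations of a record (empty list means valid)."""
        problems = []
        for name, expected in self.schema.items():
            if name not in record:
                problems.append(f"missing field: {name}")
            elif not isinstance(record[name], expected):
                problems.append(f"wrong type for {name}")
        return problems


orders = DataProduct(
    name="orders",
    owner_domain="sales",
    description="Curated order events, one record per confirmed order.",
    schema={"order_id": int, "amount": float},
    interfaces=["kafka-topic", "iceberg-table"],
)

ok = orders.violations({"order_id": 1, "amount": 9.5})
bad = orders.violations({"order_id": "1"})
```

Enforcing the contract at the producing domain team is what makes the product trustworthy for every consumer, no matter if it reads in real time or in batch.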

Anti-Pattern: Batch Processing and Reverse ETL

The “Modern” Data Stack leverages traditional ETL tools or data streaming for ingestion into a data lake, data warehouse or lakehouse. The consequence is a spaghetti architecture with various integration tools for batch and real-time workloads mixing analytical and operational technologies:

Reverse ETL is required to get information out of the data lake into operational applications and other analytical tools. As I have written about it previously, the combination of data lakes and Reverse ETL is an anti-pattern for the enterprise architecture largely due to the economic and organizational inefficiencies Reverse ETL creates. Event-driven data products enable a much simpler and more cost-efficient architecture.

One key reason for the need for batch processing and Reverse ETL patterns is the common use of the Lambda architecture: a data processing architecture that handles real-time and batch processing separately using different layers. This still widely exists in enterprise architectures, not just for big data use cases like Hadoop/Spark and Kafka, but also for the integration with transactional systems like file-based legacy monoliths or Oracle databases.

In contrast, the Kappa Architecture handles both real-time and batch processing using a single technology stack. Learn more about “Kappa replacing Lambda Architecture” in its own article. TL;DR: The Kappa architecture becomes possible by bringing even legacy technologies into an event-driven architecture using a data streaming platform. Change Data Capture (CDC) is one of the most common helpers for this.
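
To make the Kappa idea more tangible, here is a minimal sketch in plain Python (no Kafka client involved). It assumes a hypothetical CDC event shape with `op`, `key`, and `row` fields; the function names and sample data are illustrative and not part of any CDC tool's API. The point is that one code path serves both a full replay of the change log (the "batch" case) and the incremental processing of newly arriving events (the real-time case).

```python
from typing import Dict, Iterable, Optional

# Hypothetical CDC event shape (an assumption for this sketch):
# {"op": "insert" | "update" | "delete", "key": str, "row": dict or None}
ChangeEvent = dict


def apply_change(state: Dict[str, dict], event: ChangeEvent) -> None:
    """Apply one change event to the materialized view in place."""
    if event["op"] == "delete":
        state.pop(event["key"], None)
    else:  # insert and update are both treated as upserts on the key
        state[event["key"]] = event["row"]


def materialize(events: Iterable[ChangeEvent],
                state: Optional[Dict[str, dict]] = None) -> Dict[str, dict]:
    """Fold events into a view; the same code handles replay and live data."""
    state = {} if state is None else state
    for event in events:
        apply_change(state, event)
    return state


# Illustrative change log as it might be captured from a legacy database.
change_log = [
    {"op": "insert", "key": "order-1", "row": {"status": "NEW"}},
    {"op": "insert", "key": "order-2", "row": {"status": "NEW"}},
    {"op": "update", "key": "order-1", "row": {"status": "PAID"}},
    {"op": "delete", "key": "order-2", "row": None},
]

# "Batch": rebuild the view by replaying the whole log from the beginning.
replayed = materialize(change_log)

# "Real-time": process the first events, then continue as new ones arrive.
live = materialize(change_log[:2])
live = materialize(change_log[2:], live)

print(replayed)  # {'order-1': {'status': 'PAID'}}
```

Both paths end in the same state, which is the property a Lambda architecture has to work hard for with two separate layers.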

Traditional ELT in the Data Lake, Data Warehouse, Lakehouse

It seems like nobody does data warehouse anymore today. Everyone talks about a lakehouse merging data warehouse and data lake. Whatever term you use or prefer… The integration process these days looks like the following:

Just ingesting all the raw data into a data warehouse / data lake / lakehouse has several challenges:

Slower Updates: The longer the data pipeline and the more tools are used, the slower the update of the data product.
Longer Time-to-Market: Development efforts are repeated because each business unit needs to do the same or similar processing steps again instead of consuming from a curated data product.
Increased Cost: The cash cow of analytics platforms is compute, not storage. The more your business units use dbt, the better for the analytics SaaS provider.
Repeating Efforts: Most enterprises have multiple analytics platforms, including different data warehouses, data lakes, and AI platforms. ELT means doing the same processing again, again, and again.
Data Inconsistency: Reverse ETL, Zero ETL, and other integration patterns make sure that your analytical and especially operational applications see inconsistent information. You cannot connect a real-time consumer or mobile app API to a batch layer and expect consistent results.

Data Integration, Zero ETL and Reverse ETL with Kafka, Snowflake, Databricks, BigQuery, etc.

These disadvantages are real! I have not met a single customer in the past months who disagreed and told me these challenges do not exist. To learn more, check out my blog series about data streaming with Apache Kafka and analytics with Snowflake:

The blog series can be applied to any other analytics engine. It is a worthwhile read, no matter if you use Snowflake, Databricks, Google BigQuery, or a combination of several analytics and AI platforms.

The solution for this data mess creating data inconsistency, outdated information, and ever-growing cost is the Shift Left Architecture…

Shift Left to Data Streaming for Operational AND Analytical Data Products

The Shift Left Architecture enables consistent information from reliable, scalable data products, reduces the compute cost, and allows much faster time-to-market for operational and analytical applications with any kind of technology (Java, Python, iPaaS, Lakehouse, SaaS, “you-name-it”) and communication paradigm (real-time, batch, request-response API):

Shifting the data processing to the data streaming platform enables:

Capture and stream data continuously when the event is created
Create data contracts for downstream compatibility and promotion of trust with any application or analytics / AI platform
Continuously cleanse, curate and quality check data upstream with data contracts and policy enforcement
Shape data into multiple contexts on-the-fly to maximize reusability (and still allow downstream consumers to choose between raw and curated data products)
Build trustworthy data products that are instantly valuable, reusable and consistent for any transactional and analytical consumer (no matter if consumed in real-time or later via batch or request-response API)
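
The idea of enforcing a data contract upstream can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not how a schema registry or any vendor product implements contracts: the `DataContract` class, the `curate` function, and the payment fields are all hypothetical. Events that satisfy the schema and the rules go to the curated output; everything else is set aside together with the reasons.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]


@dataclass
class DataContract:
    """Hypothetical contract: required fields with types, plus named rules."""
    schema: Dict[str, type]
    rules: Dict[str, Callable[[Event], bool]]

    def violations(self, event: Event) -> List[str]:
        problems = []
        for name, expected in self.schema.items():
            if name not in event:
                problems.append(f"missing field: {name}")
            elif not isinstance(event[name], expected):
                problems.append(f"wrong type: {name}")
        # Only evaluate business rules on structurally valid events.
        if not problems:
            problems += [rule_name for rule_name, rule in self.rules.items()
                         if not rule(event)]
        return problems


def curate(events: List[Event],
           contract: DataContract) -> Tuple[List[Event], List[Tuple[Event, List[str]]]]:
    """Split events into curated ones and rejected ones with their reasons."""
    curated, rejected = [], []
    for event in events:
        problems = contract.violations(event)
        if problems:
            rejected.append((event, problems))
        else:
            curated.append(event)
    return curated, rejected


payment_contract = DataContract(
    schema={"payment_id": str, "amount": float, "currency": str},
    rules={
        "amount must be positive": lambda e: e["amount"] > 0,
        "currency must have three letters": lambda e: len(e["currency"]) == 3,
    },
)

events = [
    {"payment_id": "p-1", "amount": 19.99, "currency": "EUR"},
    {"payment_id": "p-2", "amount": -5.0, "currency": "EUR"},
    {"payment_id": "p-3", "currency": "USD"},
]
curated, rejected = curate(events, payment_contract)
```

Because the check runs once, right after the event is created, every downstream consumer (real-time or batch) reads the same curated data instead of repeating the validation in its own platform.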

While shifting to the left with some workloads, it is crucial to understand that developers/data engineers/data scientists can usually still use their favourite interface like SQL or a programming language such as Java or Python.

Shift Left Architecture with Apache Kafka, Flink and Iceberg

Data Streaming is the core foundation of the Shift Left Architecture to enable reliable, scalable real-time data products with good data quality. The following architecture shows how Apache Kafka and Flink connect any data source, curate data sets (aka stream processing / Streaming ETL) and share the processed events with any operational or analytical data sink:

The architecture shows an Apache Iceberg table as an alternative consumer. Apache Iceberg is an open table format designed for managing large-scale datasets in a highly performant and reliable way, providing ACID transactions, schema evolution, and partitioning capabilities. It optimizes data storage and query performance, making it ideal for data lakes and complex analytical workflows. Iceberg is evolving into the de facto standard with support from most major vendors in the cloud and data management space, including AWS, Azure, GCP, Snowflake, Confluent, and many more coming (like Databricks after its acquisition of Tabular).

From the data streaming perspective, the Iceberg table is just a button click away from the Kafka Topic and its Schema (using Confluent’s Tableflow – I am sure other vendors will follow soon with their own solutions). The big advantage of Iceberg is that data needs to be stored only once (typically in a cost-efficient and scalable object store like Amazon S3). Each downstream application can consume the data with its own technology without any need for additional coding or connectors. This includes data lakehouses like Snowflake or Databricks AND data streaming engines like Apache Flink.

Video: Shift Left Architecture

I summarized the above architectures and examples for the Shift Left Architecture in a short ten-minute video if you prefer listening to content:

Apache Iceberg – The New De Facto Standard for Lakehouse Table Format?

Apache Iceberg is such a huge topic and a real game changer for enterprise architectures, end users and cloud vendors. I will write another dedicated blog, including interesting topics such as:

Confluent’s product strategy to embed Iceberg tables into its data streaming platform
Snowflake’s open source Iceberg project Polaris
Databricks’ acquisition of Tabular (the company founded by the original creators of Apache Iceberg) and the relation to Delta Lake and open sourcing its Unity Catalog
The (expected) future of table format standardization, catalog wars, and other additional solutions like Apache Hudi or Apache XTable for omni-directional interoperability across lakehouse table formats.

Stay tuned and subscribe to my newsletter to receive new articles.

Business Value of the Shift Left Architecture

Apache Kafka is the de facto standard for data streaming building a Kappa Architecture. The Data Streaming Landscape shows various open source technologies and cloud vendors. Data Streaming is a new software category. Forrester published “The Forrester Wave: Streaming Data Platforms, Q4 2023“. The leaders are Microsoft, Google and Confluent, followed by Oracle, Amazon, Cloudera, and a few others.

Building data products further left in the enterprise architecture with a data streaming platform and technologies such as Kafka and Flink creates huge business value:

Cost Reduction: Reducing the compute cost in one or even multiple data platforms (data lake, data warehouse, lakehouse, AI platform, etc.).
Less Development Effort: Streaming ETL, data curation and data quality control already executed instantly (and only once) after the event creation.
Faster Time to Market: Focus on new business logic instead of doing repeated ETL jobs.
Flexibility: Best of breed approach for choosing the best and/or most cost-efficient technology per use case.
Innovation: Business units can choose any programming language, tool or SaaS to do real-time or batch consumption from data products to try and fail or scale fast.

The unification of transactional and analytical workloads is finally possible to enable good data quality, faster time to market for innovation and reduced cost of the entire data pipeline. Data consistency matters across all applications and databases… A Kafka Topic with a data contract (= Schema with policies) brings data consistency out of the box!

What does your data architecture look like today? Does the Shift Left Architecture make sense to you? What is your strategy to get there? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming appeared first on Kai Waehner.

SAP Datasphere and Apache Kafka as Data Fabric for S/4HANA ERP Integration

Kai Waehner — Wed, 03 Jan 2024 09:06:02 +0000

SAP is the leading ERP solution across industries around the world. Data integration with other data platforms, applications, databases, and APIs is one of the hardest challenges in the IT and software landscape. This blog post explores how SAP Datasphere in conjunction with the data streaming platform Apache Kafka enables a reliable, scalable and open data fabric for connecting SAP business objects of ECC and S/4HANA ERP with other real-time, batch, or request-response interfaces.

What is SAP ERP?

SAP is a German multinational software corporation that develops enterprise software to manage business operations and customer relations. SAP is best known for its ERP (Enterprise Resource Planning) software, which helps organizations integrate and streamline their business processes.

A wide range of industries and companies of all sizes use it. SAP ERP is one of the most widely used ERP solutions globally. SAP is not a single product, as many people think. Over the years, SAP has expanded its product portfolio. It includes cloud-based solutions, analytics, database management, and other enterprise software applications.

SAP ECC, S/4HANA, and more ERP Products

SAP offers a range of ERP products that cater to different business needs and industries. Some of the key SAP ERP products include:

SAP S/4HANA: SAP S/4HANA is the flagship ERP suite that represents the next generation of SAP’s ERP solutions. The product is built on the SAP HANA in-memory database and provides a simplified data model, improved user experience, and advanced analytics capabilities. It covers core business functions, such as finance, supply chain, manufacturing, procurement, and more.
SAP ERP Central Component (ECC): ECC is the predecessor to SAP S/4HANA and is still widely used by many organizations. It includes various modules, such as SAP ERP Financials, SAP ERP Human Capital Management (HCM), SAP ERP Operations, and others.
SAP Business ByDesign: This is a cloud-based ERP solution designed for small to medium-sized enterprises (SMEs). It integrates core business functions, such as financials, human resources, procurement, supply chain management, and customer relationship management (CRM).
SAP Business One: Another ERP solution targeted at small and medium-sized businesses, SAP Business One is an integrated suite that covers areas such as accounting, sales, purchasing, inventory, and production.
SAP S/4HANA Cloud: This is a cloud-based version of SAP S/4HANA, offering similar functionalities but with the advantages of cloud deployment, including scalability, accessibility, and reduced infrastructure management.
SAP Business Suite: This is a set of business applications that includes SAP ERP and other related products. It comprises different modules to address various business processes.
SAP All-in-One: This is an industry-specific version of SAP ERP designed for midsize companies. It provides pre-configured industry solutions for sectors such as manufacturing, retail, and healthcare.

This product list might be out of date when you read it. SAP continuously develops its product offerings. Products get new names from time to time, are consolidated, or are deprecated. In other words, SAP modernization, integration, and migration are usually an ongoing effort that never ends.

What is SAP Datasphere?

SAP Datasphere is the next generation of SAP Data Warehouse Cloud. The platform provides a comprehensive data service that enables data professionals to deliver seamless and scalable access to critical business data.

SAP Datasphere is a cloud-based product packaged within SAP’s Business Technology Platform (BTP). Datasphere brings together two previously standalone products, SAP Data Intelligence Cloud (DIC) and SAP Data Warehouse Cloud (DWC), into one cloud-native data integration and data management platform. The solution allows SAP customers to ingest, integrate, store, and analyze core SAP ERP data, as well as to share this data with other analytical services and downstream applications.

SAP Datasphere = Cloud Data Warehouse and Analytics Platform

Datasphere is the core part of a new solution, known as Business Data Fabric, to simplify data integration and management involving SAP ERP backend data. A key focus of SAP Datasphere is business intelligence and analytics.

I see Datasphere as similar to Snowflake or Databricks, i.e., a general data warehouse / data lake / lakehouse, but focusing on SAP data with deep integration into the SAP ERP ecosystem and surrounding applications.

However, the out-of-the-box availability of SAP ERP data from SAP ECC, S/4HANA, and other SAP apps enables a simple but powerful opportunity for data integration beyond the SAP landscape. No need to use legacy SAP protocols like BAPI or IDoc anymore. Instead, SAP Datasphere provides a unified way to discover, connect, and manage data across different data sources, systems, and landscapes.

Features of SAP Datasphere and Complementary Software Partnerships

The key features of SAP Datasphere include:

Data Connectivity: SAP Datasphere enables organizations to connect to and access data from various sources, whether they are on-premises or in the cloud. It supports integration with different databases, data lakes, and other data repositories.
Data Orchestration: The platform allows organizations to orchestrate data flows and processing across different data environments. This can be essential for managing complex data pipelines and ensuring data consistency and coherence.
Data Governance: SAP Datasphere includes features for data governance, providing tools for managing metadata, ensuring data quality, and enforcing data policies across the distributed landscape.
Unified Data Discovery: The platform offers a unified view of data assets, helping organizations discover and understand the available data resources across their entire landscape.
Multi-Cloud and Edge Support: SAP Datasphere works in multi-cloud and edge computing environments, providing flexibility and scalability for organizations with diverse data storage and processing needs.

This sounds like any other data management platform, doesn’t it?

But the above features are focusing mainly on SAP environments. Therefore, Datasphere has a few strategic software partnerships:

Confluent (data streaming)
Databricks (data lakehouse)
Collibra (data governance)
DataRobot (automated machine learning)

This emphasizes the strength of Datasphere around the SAP ecosystem. The other partners connect non-SAP IT infrastructure and applications with SAP environments bidirectionally.

SAP Datasphere = One-Stop-Shop for Multi-Generation SAP ERP Systems

SAP Datasphere is more than just an analytical platform for SAP ERP data.

Datasphere leverages SAP internal tooling to access data directly from SAP systems. It is a complete data integration and analytics solution optimized for collecting and preparing data from all SAP ERP systems of multiple generations. For the first time in their history, SAP is making core ERP data from numerous back-end systems available in a one-stop-shop fashion through Datasphere.

This brings us to the excellent opportunity of combining SAP business objects with Apache Kafka and the rest of the enterprise architecture.

Why Apache Kafka for SAP Integration?

Apache Kafka is a distributed streaming platform that has gained widespread popularity for its ability to handle large-scale, real-time data streaming and event processing. When it comes to SAP integration, there are several reasons organizations choose to use Apache Kafka:

Real-time Data Streaming
- Apache Kafka is designed for real-time data streaming, making it well-suited for scenarios where timely and continuous data updates are crucial. This is important in SAP environments where real-time integration is essential for various business processes.
Scalability
- Kafka is highly scalable and can handle large volumes of data and high-throughput requirements. SAP systems often handle massive amounts of data. Kafka’s scalability enables efficient management and processing of this data.
Reliability and Fault Tolerance
- Kafka is known for its reliability and fault-tolerance features. It ensures data durability and availability, which is essential for critical applications like those in SAP environments, e.g., in finance or supply chain business processes. Features like rolling upgrades allow continuous operation with zero downtime.
Decoupling Systems
- Kafka facilitates the decoupling of systems by acting as an intermediary for data exchange. This decoupling allows SAP systems and other applications to communicate without being directly connected, leading to a more flexible and modular architecture. Kafka ensures data consistency across real-time and non-real-time systems.
Event-Driven Architecture
- Kafka supports an event-driven architecture, which aligns well with modern integration patterns. The streaming platform efficiently propagates events, such as changes in SAP data or system events. This enables a more responsive and agile IT landscape. Kafka Connect enables integration with other plain messaging platforms like IBM MQ, TIBCO EMS, or Solace.
Integration with Big Data Ecosystem
- Kafka integrates well with the broader big data ecosystem, including tools like Apache Hadoop, Apache Spark, Elasticsearch, MongoDB, and others. This can be beneficial for organizations looking to analyze and derive insights from SAP data in combination with other data sources and data sinks. Kafka is a much more flexible, scalable and reliable middleware compared to other data integration tools (including SAP middleware like SAP PI).
Message Retention
- Kafka stores messages for a configurable period, allowing systems to catch up on missed messages in case of temporary disruptions. This is particularly useful in scenarios where SAP systems may be temporarily offline, unreachable, or cannot handle the throughput. It also helps if the transaction cost needs to be reduced by offloading the consumption of downstream applications to a cheaper platform like Kafka. Tiered Storage for Kafka is a significant change, enabling a long-term event store for ERP information.
Support for Multiple Protocols
- The Kafka ecosystem supports various communication protocols (like Kafka, HTTP, File, WebSockets, and more), making it versatile for integration with different systems and technologies. This flexibility is crucial in heterogeneous IT environments, where SAP systems coexist with other technologies.
Open Source Community and Ecosystem
- Kafka has a vibrant open-source community and a rich ecosystem of connectors and tools. This ecosystem can simplify integration efforts by providing pre-built connectors for SAP systems and other common technologies.
Analytical and Operational Workloads
- Kafka was initially built for big data analytics use cases. However, most organizations leverage the technology for operational workloads, like orders or payments. Kafka evolved over the years and even introduced a transaction API for exactly-once semantics.
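
The decoupling and message retention points above are easier to see in a toy model. The following plain-Python sketch is not Kafka's implementation or client API; the `EventLog` class and its methods are hypothetical and only mimic the concept of an append-only log in which each consumer group tracks its own offset while old records expire after a while.

```python
from typing import Any, Dict, List


class EventLog:
    """Toy append-only log; offsets stay stable when old records expire."""

    def __init__(self) -> None:
        self._records: List[Any] = []
        self._base_offset = 0  # offset of the oldest retained record
        self._committed: Dict[str, int] = {}  # next offset per consumer group

    def append(self, record: Any) -> int:
        """Store a record and return its offset."""
        self._records.append(record)
        return self._base_offset + len(self._records) - 1

    def poll(self, group: str, max_records: int = 100) -> List[Any]:
        """Return the next records for a group and advance its offset."""
        # A new or lagging group starts at the oldest record still retained.
        start = max(self._committed.get(group, self._base_offset),
                    self._base_offset)
        index = start - self._base_offset
        batch = self._records[index:index + max_records]
        self._committed[group] = start + len(batch)
        return batch

    def expire_oldest(self, count: int) -> None:
        """Simulate retention by dropping the oldest records."""
        count = min(count, len(self._records))
        del self._records[:count]
        self._base_offset += count


log = EventLog()
for i in range(5):
    log.append(f"order-{i}")

# A real-time consumer reads everything immediately.
fast = log.poll("realtime-app")

log.append("order-5")

# A slow consumer (think: an SAP system under load) reads at its own pace.
slow_first = log.poll("nightly-batch", max_records=2)

# Retention kicks in and removes order-0 to order-2.
log.expire_oldest(3)

# The slow consumer resumes at the oldest retained record; order-2 is gone.
slow_rest = log.poll("nightly-batch")

# The fast consumer only sees what is new since its last poll.
fast_next = log.poll("realtime-app")
```

Producers never wait for consumers, and each consumer reads at its own pace. The sketch also shows the limit of the approach: a consumer that falls behind the retention window misses records, which is why retention settings (or Tiered Storage for long-term history) need to match the slowest consumer.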

An ERP environment should be real-time, scalable, and open. SAP ERP is not just one product or technology. And organizations always combine it with other open source frameworks, proprietary standard software, and SaaS. “Building a Postmodern ERP with Apache Kafka” explores how SAP ERP and other technologies provide the most value together in a flexible, open environment. Many next-generation ERP systems use Kafka under the hood, too, even if you don’t see it because it is a proprietary product or SaaS. Event-driven architectures are as helpful for software products as they are for any other software project.

Continuous SAP Migration and Cutover with Kafka

Integration between SAP ERP and other applications is crucial. Another kind of project is the migration and ERP modernization, e.g., from SAP ECC to S/4HANA or the migration between SAP and another software vendor.

An SAP migration project involves moving an SAP system or landscape from one environment to another. This could include moving from an on-premises environment to the cloud, upgrading to a newer version of SAP software, or consolidating multiple SAP instances. The exact steps and considerations for an SAP migration can vary based on the specific migration scenario.

Most SAP ERP migrations these days are from SAP ECC to SAP S/4HANA. These projects usually take years. Apache Kafka can provide valuable help in different SAP integration and migration scenarios.

The combination of real-time capabilities, an event store for true decoupling and data consistency across real-time and non-real-time systems, and data integration with non-SAP systems and APIs makes Kafka the perfect middleware for SAP modernization and ERP migrations.

I covered such a migration via Apache Kafka in a data warehouse modernization story where legacy and modern applications live in parallel for some months or even years until the final cutover is done.

Until the completion of the S/4HANA migration in the cloud, SAP ECC on-premise continues to exist for years. The hybrid deployment and synchronization capabilities of Kafka make it unique for SAP migration and modernization projects.

Confluent’s Fully Managed SAP Integration and Strategic Partnership

Data streaming defines a new software category. Confluent leads the data streaming industry. It provides a serverless cloud offering on all major public clouds and an offering for self-managed deployments powered by Apache Kafka and Flink. In December 2023, the research company Forrester published “The Forrester Wave: Streaming Data Platforms, Q4 2023“. Get free access to the report here. The report explores what Confluent and other vendors like AWS, Microsoft, Google, Oracle and Cloudera provide.

Confluent is now available in the SAP® Store, the online marketplace for SAP and partner offerings. The data streaming platform integrates with SAP Datasphere. The combination delivers a secure, governed solution for accessing SAP data as fully managed data streams for customers.

Confluent provides businesses that use SAP solutions with a cloud-native and complete data streaming platform available everywhere it’s needed – in the cloud, across clouds, on-premises, and hybrid environments. Configured directly within SAP Datasphere, the new Confluent integration allows businesses to:

Build real-time applications at a lower cost with fully managed data streams powered by Confluent’s Kora Engine, which reduces the total cost of ownership for Kafka by up to 60%.
Move SAP data anywhere it needs to go. Merge with third-party sources in real time via many pre-built connectors, including AWS Redshift, AWS S3, Databricks, Google Cloud BigQuery, MongoDB, and Snowflake paired with a serverless offering for Apache Flink.
Maintain strict security, compliance, and governance standards with enterprise-grade data streaming security controls, and the industry’s only fully managed governance suite for Kafka.

Confluent in the SAP PartnerEdge Program

Confluent is a partner in the SAP PartnerEdge program. The SAP PartnerEdge program provides the enablement tools, benefits and support to facilitate building high-quality, innovative applications focused on specific business needs – quickly and cost-effectively.

Here is an example architecture connecting SAP ERP and non-SAP applications (Flink and Snowflake in this example) with Datasphere and Confluent:

Confluent and SAP Datasphere are the perfect combination for building a data fabric for all enterprise data. Similarly, many companies leverage Apache Kafka as a data fabric for AI and Machine Learning.

Alternative Integration Options for SAP and Kafka

Is SAP Datasphere the new silver bullet for SAP ERP integration scenarios? No! As you learned in the above sections, Datasphere enables easy access to old and new SAP ERP data objects. However, Datasphere might have some drawbacks, too:

New technology: The product has only been available for a few months at the time of writing this blog post in early 2024. It will mature and features will strengthen.
Heavyweight: A direct integration with a proprietary SAP API call, e.g., BAPI, IDoc or the more modern Operational Data Provisioning (ODP) might be easier to implement and more cost-efficient from a TCO perspective for some projects.
Vendor lock-in: Choosing an SAP product as middleware and/or analytics platform might not be the right strategy. Many organizations choose a best-of-breed approach for different domains and use cases instead of relying on a single vendor from a technology and licensing perspective.

One solution does not fit all integration use cases. Know the different options and make your evaluation.

Plenty of other options exist for SAP-Kafka integration. I explored tens of APIs, tools, and connectors for data integration between SAP ERP and Apache Kafka.

For instance, look at the Confluent Hub and search for SAP Kafka integration. You will find many mature, lightweight and innovative solutions from various vendors. INIT, Asapio, Advantco, KaTe, Onibex, and Qlik, among others, provide integrations via different open and proprietary SAP interfaces like ODP, OData, REST, BAPI, or IDoc.

SAP Datasphere and Kafka Connect the Entire Enterprise (and Hybrid Cloud)

It has never been easier to integrate the SAP ecosystem with the rest of the IT world in an enterprise architecture. SAP Datasphere supports straightforward access to SAP S/4HANA, SAP BW/4HANA, SAP BW, SAP ECC, and SAP HANA ERP data without the need for complex integration projects. In addition, SAP supports connectivity to Business Warehouse, SAP’s on-premise data warehouse solution.

Apache Kafka enables data consistency across SAP and non-SAP applications across the data center and public cloud, no matter if the data source or sink is real-time, near-real-time, batch, file-based, or a request-response API like HTTP/REST. The heart of the data fabric is event-based, scalable, and reliable.

Confluent is the leading vendor of data streaming technologies like Apache Kafka. The strategic partnership and deep product integration between SAP Datasphere and Confluent provides an excellent opportunity for any organization that needs to integrate SAP and the rest of the IT infrastructure.

Some people might tell you that Kafka is great for analytical use cases but not suited for operational, critical use cases (because some folks want to pitch another product for SAP integrations). That’s not accurate. Apache Kafka supports analytical AND transactional workloads. Actually, almost all customers I work with around the world use Confluent for transactional data from the SAP ERP for orders, payments, fraud detection, and similar operational use cases.

How do you integrate with your SAP systems today? Do you already use modern technologies like Apache Kafka? What connectors or solutions do you use? Will you use SAP Datasphere in the future? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post SAP Datasphere and Apache Kafka as Data Fabric for S/4HANA ERP Integration appeared first on Kai Waehner.

The Heart of the Data Mesh Beats Real-Time with Apache Kafka

Kai Waehner — Thu, 28 Jul 2022 06:08:38 +0000

If there were a buzzword of the hour, it would undoubtedly be “data mesh“! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. Learn how the de facto standard for data streaming, Apache Kafka, plays a crucial role in building a data mesh.

There is no single technology or product for a data mesh!

This post explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and – complemented by many other data platforms like a data warehouse, data lake, and lakehouse – solve real business problems.

There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The key outcome of a data mesh architecture is the ability to build data products; with the right tool for the job. A good data mesh combines data streaming technology like Apache Kafka or Confluent Cloud with cloud-native data warehouse and data lake architectures from Snowflake, Databricks, Google BigQuery, et al.

What is a data mesh?

I won’t write yet another article describing the concepts of a data mesh. Zhamak Dehghani coined the term in 2019. The following data mesh architecture from a 30,000-foot view explains the basic idea well:

I summarize data mesh in the following four bullet points:

An architecture paradigm with several historical influences (domain-driven design, microservices, data marts, data streaming)
Not specific to a single technology or product; no single vendor can implement a data mesh alone
Handling data as a product is a fundamental change, enabling a more flexible architecture and independent solving of separate business problems
Decentralized services, not just analytics but also transactional workloads

Why handle data as a product?

Talking about innovative technology is insufficient to introduce a new architectural paradigm. Consequently, measuring the business value of the enterprise architecture is critical, too.

According to McKinsey, the benefits of this approach can be significant:

New business use cases can be delivered as much as 90 percent faster.
The total cost of ownership, including technology, development, and maintenance, can decline by 30 percent.
The risk and data-governance burden can be reduced.

What is data streaming with Apache Kafka and its relation to data mesh?

A data mesh enables flexibility through decentralization and best-of-breed data products. The heart of data sharing requires reliable real-time data at any scale between data producers and data consumers. Additionally, true decoupling between the decentralized data products is key to the success of the data mesh paradigm. Each domain must have access to shared data but also the ability to choose the right tool (i.e., technology, API, product, or SaaS) to solve its business problems.

That’s where data streaming fits into the data mesh story:

The de facto standard for data streaming is Apache Kafka. A cloud-native data streaming infrastructure that can link clusters with each other out-of-the-box enables building a modern data mesh. No Data Mesh will use just one technology or vendor. Learn from inspiring posts from your favorite data products vendors like AWS, Snowflake, Databricks, Confluent, and many more to successfully define and build your custom Data Mesh. Data Mesh is a journey, not a big bang. A data warehouse or data lake (or in modern days, a lakehouse) cannot be the only infrastructure for data mesh and data products.

I covered how to leverage the capabilities of Apache Kafka and its ecosystem like Kafka Connect, ksqlDB, Cluster Linking, etc. to build the heart of a data mesh in a separate blog post: Streaming Data Exchange with Kafka and a Data Mesh in Motion.

Example: Real-time data fabric in hybrid cloud

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

This example shows all the characteristics discussed in the above sections for a Data Mesh:

Decentralized real-time infrastructure across domains and infrastructures
True decoupling between domains within and between the clouds
Several communication paradigms, including data streaming, RPC, and batch
Data integration with legacy and cloud-native technologies
Continuous stream processing where it adds value, and batch processing in some analytics sinks

Presentation: Building a decentralized data mesh with data streaming at its heart

The following slide deck walks you through the motivation, principles, and architectures of building a real-time data mesh powered by Apache Kafka using the Kappa architecture, hybrid cloud, and stream data sharing:

The data mesh provides flexibility and freedom of technology choice for each data product

The heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. As the de facto standard for data streaming, Apache Kafka plays a crucial role in a cloud-native data mesh architecture. Nevertheless, data mesh is not bound to a specific technology. The beauty of the decentralized architecture is the freedom of technology choice for each business unit when building its data products.

Data sharing between domains within and across organizations is another aspect where data streaming helps in a data mesh. Real-time data beats slow data. That is not just true for most business problems across industries but also for replicating data between data centers, clouds, regions, or organizations. A streaming data exchange enables data sharing in real-time to build a data mesh in motion.

Did you already start building your Data Mesh? What does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Is the heart of your data mesh real-time in motion or some lakehouse at rest? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Heart of the Data Mesh Beats Real-Time with Apache Kafka appeared first on Kai Waehner.

Best Practices for Building a Cloud-Native Data Warehouse or Data Lake

Kai Waehner — Thu, 21 Jul 2022 09:40:51 +0000

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 5: Best Practices for Building a Cloud-Native Data Warehouse or Data Lake.

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
THIS POST: Best Practices for Building a Cloud-Native Data Warehouse or Data Lake

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Best practices for building a cloud-native data warehouse or data lake

Let’s explore the following lessons learned from building cloud-native data analytics infrastructure with data warehouses, data lakes, data streaming, and lakehouses:

Lesson 1: Process and store data in the right place.
Lesson 2: Don’t design for data at rest to reverse it.
Lesson 3: There is no need for a lambda architecture to separate batch and real-time workloads
Lesson 4: Understand the trade-offs between data sharing at rest and a streaming data exchange.
Lesson 5: Data mesh is not a single product or technology.

Let’s get started…

Lesson 1: Process and store data in the right place.

Ask yourself: What is the use case for your data?

Here are a few use case examples for data and exemplary tools to implement the business case:

Recurring reporting for management => Data warehouse and its out-of-the-box reporting tools
Interactive analysis of structured and unstructured data => Business intelligence tools like Tableau, Power BI, Qlik, or TIBCO Spotfire on top of the data warehouse or another data store
Transactional business workloads => Custom Java application running in a Kubernetes environment or serverless cloud infrastructure
Advanced analytics to find insights in historical data => Raw data sets stored in a data lake for applying powerful algorithms with AI / Machine Learning technologies such as TensorFlow
Real-time actions on new events => Streaming applications to process and correlate data continuously while it is relevant

Real-time or batch compute on the right platform as needed

Batch workloads run best in an infrastructure that was built for them, for instance, Hadoop or Apache Spark. Real-time workloads run best in an infrastructure that was built for them, for example, Apache Kafka.

However, sometimes, both platforms could be used. Understand the underlying infrastructure to leverage it the best way. Apache Kafka can replace a database! Nevertheless, it should only be done in the few scenarios where it makes sense (i.e., simplifies the architecture or adds business value).

For example, replayability as a sequence of events (with guaranteed ordering within a partition, plus timestamps) is built into the immutable Kafka log. Replaying and re-processing historical data from Kafka is straightforward and fits many scenarios, including:

New consumer application
Error-handling
Compliance / regulatory processing
Query and analyze existing events
Schema changes in the analytics platform
Model training
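The replay scenarios above can be sketched in a few lines. This is a minimal in-memory model of an append-only log, not the Kafka client API; `EventLog`, `append`, and `read_from` are hypothetical names chosen for illustration, and the payment events are made up.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class EventLog:
    """Hypothetical in-memory stand-in for one partition of an immutable log."""
    _events: List[Any] = field(default_factory=list)

    def append(self, event: Any) -> int:
        """Append an event and return its offset (its position in the log)."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int) -> List[Tuple[int, Any]]:
        """Return (offset, event) pairs from the given offset, in append order."""
        return list(enumerate(self._events))[offset:]


log = EventLog()
for payment in ({"id": 1, "amount": 20}, {"id": 2, "amount": 35}, {"id": 3, "amount": 5}):
    log.append(payment)

# An existing consumer has already processed offsets 0 and 1 and continues at 2.
live_view = log.read_from(2)

# A new consumer (or a fixed, redeployed one) replays the full history from offset 0.
replayed_total = sum(event["amount"] for _, event in log.read_from(0))
```

Because events are never updated in place, the new consumer sees exactly the same sequence the first consumer saw, which is what makes error handling, compliance processing, and model training on historical events possible.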

On the other hand, if you need complex analytics like map-reduce or shuffling, SQL queries with tens of JOINs, a robust time-series analysis of sensor events, a search index based on ingested log information, and so on, then you should choose Spark, Rockset, Apache Druid, or Elasticsearch for that use case.

Tiered storage with cloud-native object storage for cost-efficiency

A single storage infrastructure cannot solve all these problems, even if the “lakehouse vendors” tell you so. Hence, ingesting all data into a single system will fail to succeed with the above use cases. Choose a best-of-breed approach instead with the right tools for the job.

Modern, cloud-native systems decouple storage and compute. This is true for data streaming platforms like Apache Kafka and analytics platforms like Apache Spark, Snowflake, or Google BigQuery. SaaS solutions implement innovative tiered storage solutions (under the hood so you don’t see them) for cost-efficient separation between storage and compute.
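To make the tiered storage idea more tangible, here is a minimal sketch under simplifying assumptions. `TieredLog`, `roll_segment`, and the `hot_segments` parameter are hypothetical names and do not correspond to any vendor's implementation or configuration. The point is only that older data moves to cheaper storage while readers keep using one interface.

```python
from typing import Dict, List


class TieredLog:
    """Hypothetical model: recent segments on local disk, older ones in object storage."""

    def __init__(self, hot_segments: int) -> None:
        self.hot_segments = hot_segments          # how many newest segments stay local
        self.local: Dict[int, List[str]] = {}     # segment id -> records (fast, expensive)
        self.remote: Dict[int, List[str]] = {}    # segment id -> records (cheap object store)
        self._next_id = 0

    def roll_segment(self, records: List[str]) -> None:
        """Close a segment, then offload the oldest local segments beyond the hot limit."""
        self.local[self._next_id] = records
        self._next_id += 1
        while len(self.local) > self.hot_segments:
            oldest = min(self.local)
            self.remote[oldest] = self.local.pop(oldest)

    def read(self, segment_id: int) -> List[str]:
        """Readers use one call; the tier that serves the data is an internal detail."""
        if segment_id in self.local:
            return self.local[segment_id]
        return self.remote[segment_id]


log = TieredLog(hot_segments=2)
for day in range(5):
    log.roll_segment([f"event-{day}-{i}" for i in range(3)])
```

After five segments, only the two newest remain on the local tier, yet `log.read(0)` still returns the oldest records.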

Even modern data streaming services leverage tiered storage:

Lesson 2: Don’t design for data at rest to reverse it.

Ask yourself: Is there any added business value if you process data now instead of later (whatever later means)?

If yes, then don’t store the data in a database, data lake, or data warehouse as the first step. The data is then stored at rest and no longer available for real-time processing. A data streaming platform like Apache Kafka is the right choice if real-time data beats slow data in your use case!

I find it amazing how many people put all their raw data into data storage just to find out later that they could leverage the data in real-time. Reverse ETL tools are then spun up to access the data in the lakehouse again via change data capture (CDC) or similar approaches. Similarly, using Spark Structured Streaming (= “real-time”) is unfitting if the first step of the “real-time stream processing” is reading the data from S3 object storage (= “at rest and too late”).

Reverse ETL is NOT the right approach for real-time use cases…

If you store data in a data warehouse or data lake, you cannot process it in real-time anymore as it is already stored at rest. These data stores are built for indexing, search, batch processing, reporting, model training, and other use cases that make sense in the storage system. But you cannot consume the data in real-time in motion from storage at rest:

… data streaming is built for continuously processing data in real-time

That’s where event streaming comes into play. Platforms like Apache Kafka enable processing data in motion in real-time for transactional and analytical workloads.

Reverse ETL is not needed in modern event-driven architecture! It is “built-in” into the architecture out-of-the-box. Each consumer directly consumes the data in real-time, if appropriate and technically feasible. And data warehouses or data lakes still consume it at their own pace in near-real-time or batch:

Again, this does not mean you should not put data at rest in your data warehouse or data lake. But only do that if you need to analyze the data later. The data storage at rest is NOT appropriate for real-time workloads.
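The following sketch illustrates how consumers read the same events at their own pace. It is a simplified in-memory model, not a Kafka client; `SharedLog`, `poll`, the consumer names, and the fraud threshold are all assumptions made for the example.

```python
from typing import Dict, List


class SharedLog:
    """Hypothetical log that tracks one read position per consumer."""

    def __init__(self) -> None:
        self.events: List[dict] = []
        self.positions: Dict[str, int] = {}

    def publish(self, event: dict) -> None:
        self.events.append(event)

    def poll(self, consumer: str, max_events: int) -> List[dict]:
        """Return up to max_events unread events and advance only this consumer."""
        start = self.positions.get(consumer, 0)
        batch = self.events[start:start + max_events]
        self.positions[consumer] = start + len(batch)
        return batch


log = SharedLog()
alerts = []
for i in range(6):
    log.publish({"order": i, "amount": 100 * i})
    # The fraud check polls after every event (real-time).
    for event in log.poll("fraud-check", max_events=1):
        if event["amount"] >= 400:
            alerts.append(event["order"])

# The warehouse loader polls once, later, and receives the same events as one batch.
warehouse_batch = log.poll("warehouse-loader", max_events=1000)
```

The real-time consumer reacts while each event is still relevant, and the warehouse still gets every event for later analysis. No reverse ETL step is required because neither consumer depends on the other.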

Learn more about this topic in my blog post “When to Use Reverse ETL and when it is an Anti-Pattern“.

Lesson 3: There is no need for a lambda architecture to separate batch and real-time workloads

Ask yourself: What is the easiest way to consume and process incoming data with my favorite data analytics technology?

Real-time data beats slow data, but NOT always!

Think about your industry, business units, problems you solve, and innovative new applications you build. Real-time data beats slow data. This statement is almost always true. Either to increase revenue, reduce cost, reduce risk, or improve the customer experience.

Data at rest means storing data in a database, data warehouse, or data lake. This way, data is processed too late in many use cases, even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at rest is not a bad thing. Several use cases, such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) work very well with this approach. But real-time beats batch in almost all other use cases.

The Kappa architecture simplifies the infrastructure for batch AND real-time workloads

The Kappa architecture is an event-based software architecture that can handle all data at any scale in real-time for transactional AND analytical workloads.

The central premise behind the Kappa architecture is that you can perform both real-time and batch processing with a single technology stack. That’s a very different approach than the well-known Lambda architecture. The latter separates batch and real-time workloads in separate infrastructures and technology stacks.

The heart of a Kappa infrastructure is a streaming architecture. First, the event streaming platform log stores incoming data. From there, a stream processing engine processes the data continuously in real-time or ingests the data into any other analytics database or business application via any communication paradigm and speed, including real-time, near real-time, batch, and request-response.
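A minimal sketch of the Kappa premise, assuming a plain Python list as a stand-in for the log: the same processing function serves both the full reprocessing run (“batch”) and the incremental path (“real-time”). The function name and the sensor data are illustrative only.

```python
from typing import Dict, Iterable, List

# The log is the single source of truth (here: a plain list standing in for a topic).
log: List[dict] = [
    {"sensor": "a", "temp": 20.0},
    {"sensor": "b", "temp": 31.0},
    {"sensor": "a", "temp": 22.0},
]


def max_temp_per_sensor(events: Iterable[dict], state: Dict[str, float]) -> Dict[str, float]:
    """The one and only processing logic: fold events into a running maximum."""
    for event in events:
        sensor = event["sensor"]
        state[sensor] = max(state.get(sensor, float("-inf")), event["temp"])
    return state


# "Batch": a new job reprocesses the whole history by reading the log from the start.
historical = max_temp_per_sensor(log, {})

# "Real-time": the same function keeps folding events as they arrive.
live = dict(historical)
new_event = {"sensor": "b", "temp": 35.5}
log.append(new_event)
live = max_temp_per_sensor([new_event], live)

# Reprocessing the full log again yields the same result as the incremental path.
recomputed = max_temp_per_sensor(log, {})
```

In a Lambda architecture, the batch and the real-time result would come from two separate code bases that need to be kept consistent. Here there is only one.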

Learn more about the benefits and trade-offs of the Kappa architecture in my article “Kappa Architecture is Mainstream Replacing Lambda“.

Lesson 4: Understand the trade-offs between data sharing at rest and a streaming data exchange.

Ask yourself: How do I need to share data with other internal business units or external organizations?

Use cases for hybrid and multi-cloud replication with data streaming, data lakes, data warehouses, and lakehouses

Many good reasons exist to replicate data across data centers, regions, or cloud providers:

Disaster recovery and high availability: Create a disaster recovery cluster and failover during an outage.
Global and multi-cloud replication: Move and aggregate data across regions and clouds.
Data sharing: Share data with other teams, lines of business, or organizations.
Data migration: Migrate data and workloads from one cluster to another (like from a legacy on-premise data warehouse to a cloud-native data lakehouse).

The story around internal or external data sharing is not different from other applications. Real-time replication beats slow data exchanges. Hence, storing data at rest and then replicating it to another data center, region, or cloud provider is an anti-pattern if real-time information adds business value.
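As a rough sketch of continuous replication, the following model forwards only the records the destination has not seen yet. `Mirror` and `sync` are hypothetical names, and the sketch assumes source and destination keep identical positions, which not every replication tool guarantees.

```python
from typing import List


class Mirror:
    """Hypothetical one-way replication from a source log to a destination log."""

    def __init__(self, source: List[str], destination: List[str]) -> None:
        self.source = source
        self.destination = destination

    def sync(self) -> int:
        """Copy only records the destination has not seen yet; return how many."""
        already_copied = len(self.destination)
        new_records = self.source[already_copied:]
        self.destination.extend(new_records)
        return len(new_records)


on_prem: List[str] = ["order-1", "order-2"]
cloud: List[str] = []
mirror = Mirror(on_prem, cloud)

first = mirror.sync()        # initial catch-up copies the existing history
on_prem.append("order-3")    # new event lands in the source cluster
second = mirror.sync()       # continuous replication forwards just the delta
third = mirror.sync()        # nothing new, nothing copied
```

The same mechanism covers migration (the initial catch-up) and ongoing data sharing (the deltas), which is why one replication layer can serve several of the use cases listed above.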

The following example shows how independent stakeholders (= domains in different enterprises) use a cross-company streaming data exchange:

Innovation does not stop at your own border. Streaming replication is relevant for all use cases where real-time is better than slow data (valid for most scenarios). A few examples:

End-to-end supply chain optimization from suppliers to the manufacturer to the intermediary to the aftersales
Track and trace across countries
Integration of 3rd party add-on services into your own digital product
Open APIs for embedding and combining external services to build a new product

Read “Streaming Data Exchange with Kafka and a Data Mesh in Motion vs. Data Sharing at Rest in the Data Warehouse or Data Lake” for more details.

Also, understand why APIs (= REST / HTTP) and data streaming (= Apache Kafka) are complementary, not competitive!

Lesson 5: Data mesh is not a single product or technology.

Ask yourself: How do I create a flexible and agile enterprise architecture to innovate more efficiently and solve my business problems faster?

Data Mesh is a Logical View, not Physical!

Data mesh shifts to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create a self-serve data infrastructure, treating data as a product, and implementing open standardization to enable an ecosystem of interoperable distributed data products.

Here is an example of a Data Mesh:

TL;DR: Data Mesh combines existing paradigms, including Domain-driven Design, Data Marts, Microservices, and Event Streaming.

A data warehouse or data lake is NOT and CAN NOT BECOME the entire data mesh!

The heart of a Data Mesh infrastructure should be real-time, decoupled, reliable, and scalable. Kafka is a modern cloud-native enterprise integration platform (also often called iPaaS today). Therefore, Kafka provides all the capabilities for the foundation of a Data Mesh.

However, not all components can or should be Kafka-based. The beauty of microservice architectures is that every application can choose the right technologies. An application might or might not include databases, analytics tools, or other complementary components. The input and output data ports of the data product should be independent of the chosen solutions:

Kafka can be a strategic component of a cloud-native data mesh, not more and not less. But even if you do not use data streaming and build a data mesh only with data at rest, there is still no silver bullet. Don’t try to build a data mesh with a single product, technology, or vendor. No matter if that tool focuses on real-time data streaming, batch processing and analytics, or API-based interfaces. Tools like Starburst – a SQL-based MPP query engine powered by open source Trino (formerly Presto) – enable analytics on top of different data stores.
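Coming back to the input and output data ports of a data product, here is a minimal sketch of the idea that the ports stay stable while each domain picks its own internals. `DataProduct`, `input_port`, and `output_port` are hypothetical names, and the two domains and their fields are invented for illustration.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, object]


class DataProduct:
    """Hypothetical data product: fixed ports outside, free technology choice inside."""

    def __init__(self, name: str, transform: Callable[[Event], Event]) -> None:
        self.name = name
        self._transform = transform   # internal detail: could be SQL, Python, a service call
        self._output: List[Event] = []

    def input_port(self, events: Iterable[Event]) -> None:
        """Accept events from upstream domains."""
        self._output.extend(self._transform(e) for e in events)

    def output_port(self) -> List[Event]:
        """Expose results to downstream domains."""
        return list(self._output)


# Two domains with different internals, connected only through their ports.
orders = DataProduct("orders", lambda e: {**e, "total": e["qty"] * e["price"]})
billing = DataProduct("billing", lambda e: {"order": e["id"], "amount_due": e["total"]})

orders.input_port([{"id": 7, "qty": 3, "price": 10.0}])
billing.input_port(orders.output_port())
invoices = billing.output_port()
```

Swapping the internals of `orders` for another technology would not affect `billing` as long as the events on the output port keep their shape.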

Best practices for a cloud-native data warehouse go beyond a SaaS product

Building a cloud-native data warehouse or data lake is an enormous project. It requires data ingestion, data integration, connectivity to analytics platforms, data privacy and security patterns, and much more. All of that is needed before the actual tasks like reporting or analytics can even begin.

The complete enterprise architecture beyond the scope of the data warehouse or data lake is even more complex. Best practices must be applied to build a resilient, scalable, elastic, and cost-efficient data analytics infrastructure. SLAs, latencies, and uptime have very different requirements across business domains. Best of breed approaches choose the right tool for the job. True decoupling between business units and applications allows focusing on solving specific business problems.

Separation of storage and compute, unified real-time pipelines instead of separating batch and real-time, avoiding anti-patterns like Reverse ETL, and appropriate data sharing concepts enable a successful journey to cloud-native data analytics.

For more details, browse other posts of this blog series:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
THIS POST: Lessons Learned from Building a Cloud-Native Data Warehouse

How did you modernize your data warehouse or data lake? Do you agree with the lessons I learned? What other experiences did you have? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Best Practices for Building a Cloud-Native Data Warehouse or Data Lake appeared first on Kai Waehner.

Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization

Kai Waehner — Mon, 18 Jul 2022 08:48:48 +0000

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 4: Case Studies for cloud-native data streaming and data warehouse modernization.

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
THIS POST: Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
Lessons Learned from Building a Cloud-Native Data Warehouse

Case studies: Cloud-native data streaming for data warehouse modernization

Every project is different. This is true for data streaming, analytics, and other software development. The following shows several case studies with significantly different architectures and technologies for data warehouse modernization. The examples come from various verticals: software and cloud business, financial services, logistics and transportation, travel and accommodation, and food delivery.

Confluent: Data warehouse modernization from batch ETL with Stitch to streaming ETL with Kafka

The article “Streaming ETL SFDC Data for Real-Time Customer Analytics” explores how Confluent eats its own dog food to modernize the internal data warehouse pipeline.

The use case is straightforward and standard across most organizations: Extract, transform, and load (ETL) Salesforce data into a Google BigQuery data warehouse, so that the business can use the data. But it is more complex than it sounds:

Organizations often rely on a third-party ETL tool to periodically load data from a CRM and other applications to their data warehouse. These batch tools introduce a lag between when the business events are captured in Salesforce and when they are made available for consumption and processing. The batch workloads commonly result in discrepancies between Salesforce reports and internal dashboards, leading to concerns about the integrity and reliability of the data.
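The lag introduced by periodic batch loads can be estimated with simple arithmetic. The sketch below assumes events arrive uniformly over time and that each run loads everything that arrived before the run started. The function name and the example numbers (a load every 6 hours that takes 30 minutes) are hypothetical and not taken from the case study.

```python
def batch_staleness_minutes(interval_minutes: float, job_runtime_minutes: float) -> dict:
    """Delay between an event being captured and being available in the warehouse."""
    # Best case: the event arrived right before a run started.
    best = job_runtime_minutes
    # Worst case: the event arrived right after a run started and waits for the next one.
    worst = interval_minutes + job_runtime_minutes
    # Average case under uniform arrivals: half an interval of waiting plus the runtime.
    average = interval_minutes / 2 + job_runtime_minutes
    return {"best": best, "average": average, "worst": worst}


# Hypothetical numbers for illustration only.
lag = batch_staleness_minutes(interval_minutes=360, job_runtime_minutes=30)
```

With these assumed numbers, a record is 3.5 hours old on average and up to 6.5 hours old in the worst case when it shows up in a dashboard, which explains the discrepancies between the source system and the warehouse.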

Confluent used Talend’s Stitch batch ETL tool in the beginning. The old architecture looked like this:

Batch ETL and a 3rd party tool in the middle led to insufficient and inconsistent information updates.

Over the past few years, Confluent has invested in building stream processing capabilities into the internal data warehouse pipeline. Confluent leverages its own fully managed Confluent Cloud connectors (in this case, the Salesforce CDC source and BigQuery sink connectors), Schema Registry for data governance, and ksqlDB + Kafka Streams for reliable streaming ETL to send SFDC data to BigQuery. Here is the modernized architecture:

PayPal: Reducing the time for readouts from 12 hours to a few seconds for 30 billion events per day

PayPal has plenty of Kafka projects for many critical and analytical workloads. In this use case, it scales its Kafka consumers to 30-35 billion events per day to migrate its analytical workloads to the Google Cloud Platform (GCP).

A streaming application ingests the events from Kafka directly to BigQuery. This is a critical project for PayPal as most of the analytical readouts are based on this. The outcome of the data warehouse modernization and building a cloud-native architecture: Reduce the time for readouts from 12 hours to a few seconds.
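To put the daily volume into perspective, a quick back-of-the-envelope calculation converts it into an average rate per second. This is plain arithmetic on the numbers quoted above; real traffic is rarely uniform, so peaks will be higher than the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def avg_events_per_second(events_per_day: int) -> float:
    """Average rate if the daily volume were spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


low = avg_events_per_second(30_000_000_000)
high = avg_events_per_second(35_000_000_000)
print(f"{low:,.0f} to {high:,.0f} events per second on average")
```

That is roughly 350,000 to 400,000 events per second, sustained around the clock.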

Read more about this success story in the PayPal Technology Blog.

Shippeo: From on-premise databases to multiple cloud-native data lakes

Shippeo provides real-time and multimodal transportation visibility for logistics providers, shippers, and carriers. Its software uses automation and artificial intelligence to share real-time insights, enable better collaboration, and unlock your supply chain’s full potential. The platform can give instant access to predictive, real-time information for every delivery.

Shippeo described how they integrated traditional databases (MySQL and PostgreSQL) and cloud-native data warehouses (Snowflake and BigQuery) with Apache Kafka and Debezium:

This is an excellent example of cloud-native enterprise architecture leveraging a “best of breed” approach for data warehousing and analytics. Kafka decouples the analytical workloads from the transactional systems and handles the backpressure for slow consumers.
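The following sketch shows the general idea behind CDC-based integration: change events from a transactional database are applied in order to keep a replica up to date. The events are simplified in the spirit of Debezium's envelope (`op`, `before`, `after`); real events carry additional metadata, and `apply_change` is a hypothetical helper, not part of any library.

```python
from typing import Dict, List

# Simplified change events ("op" is c=create, u=update, d=delete).
changes: List[dict] = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "created"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "created"}},
    {"op": "u", "before": {"id": 1, "status": "created"}, "after": {"id": 1, "status": "shipped"}},
    {"op": "d", "before": {"id": 2, "status": "created"}, "after": None},
]


def apply_change(table: Dict[int, dict], change: dict) -> None:
    """Apply one change event to a keyed replica of the source table."""
    if change["op"] == "d":
        table.pop(change["before"]["id"], None)
    else:  # create or update: upsert the new row image
        row = change["after"]
        table[row["id"]] = row


replica: Dict[int, dict] = {}
for change in changes:
    apply_change(replica, change)
```

Because the change events sit in the log, each warehouse can apply them at its own speed without putting extra load on the source database.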

Sykes Cottages: Fully-managed end-to-end pipeline with Confluent Cloud, Kafka Connect, Snowflake

Sykes Holiday Cottages are one of the UK’s leading and fastest-growing independent holiday cottage rental agencies representing over 19,000 cottages across the UK, Ireland, and New Zealand.

The experience of its customers on the web is a top priority and is one way to stay competitive. The goal is to match customers to their perfect holiday cottage experience and delight at each stage along the way. Getting the data pipeline to fuel this innovation is critical. Data warehouse modernization and data streaming enabled new ways to further innovate the web experience through a data-driven approach.

From inconsistent and slow batch workloads…

While serving its purpose for several years, the existing pipeline had problems impairing this cycle. Very early in this pipeline, the ETL process turned the data into rows and columns (structured data). Various copies were made, and the results were presented via a static report. Data engineers were needed for changes, such as new events or contextual information. Scaling was also challenging, as it mostly had to be done manually.

By keeping the data in a semi-structured format until it is ingested into the warehouse and then using ELT to do a single transformation of the data, Sykes Holiday Cottages can simplify the pipeline and make it much more agile.

… to event-based real-time updates and continuous stream processing

New web events (and any context that goes with it) can be wrapped up within a message and can flow all the way to the warehouse without a single code change. The new events are then available to the web teams either through a query or the visualization tool.
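A small sketch shows why semi-structured events combined with ELT need no pipeline change for new event types. The ingest path stores the raw payload untouched, and the structure is applied only at query time. The event names and fields are invented for illustration, and the in-memory list stands in for a warehouse table with a semi-structured column.

```python
import json
from typing import List

# Raw events are stored as-is (semi-structured JSON strings), so the pipeline
# does not need to know about individual fields.
raw_events: List[str] = []


def ingest(event: dict) -> None:
    """Load step: persist the untouched payload."""
    raw_events.append(json.dumps(event))


ingest({"type": "page_view", "page": "/cottages/cornwall"})
# A new event type with a new attribute flows through the same code path.
ingest({"type": "search", "query": "dog friendly", "filters": {"sleeps": 4}})


def searches_by_party_size() -> dict:
    """Transform step, done at query time inside the warehouse in an ELT setup."""
    counts: dict = {}
    for line in raw_events:
        event = json.loads(line)
        if event.get("type") == "search":
            size = event.get("filters", {}).get("sleeps")
            counts[size] = counts.get(size, 0) + 1
    return counts


result = searches_by_party_size()
```

Only the query had to learn about the new `search` event. The `ingest` function stayed the same.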

The current throughput is around 50k (peaking at over 300k) messages per minute. As new events are captured, this will grow considerably. Additionally, each of the above components must scale accordingly.

The new architecture enables the web teams to capture new events and analyze the data using self-service tools with no dependency on data engineering.

In conclusion, the business case for doing this is compelling. Based on our testing and projections, we expect at least 10x ROI over three years for this investment.

Learn more details in Sykes Holiday Cottages’ blog post: Why Sykes Cottages partnered with Snowflake and Confluent to drive enhanced customer experience.

Doordash: From multiple pipelines to data streaming for Snowflake integration

Even digital natives, which started their business in the cloud without legacy applications in their own data centers, need to modernize their enterprise architecture to improve business processes, reduce costs, and provide real-time information to their downstream applications.

It is cost inefficient to build multiple pipelines that are trying to achieve similar purposes. Doordash used cloud-native AWS messaging and streaming systems like Amazon SQS and Amazon Kinesis for data ingestion into the Snowflake data warehouse:

Mixing different kinds of data transport and going through multiple messaging/queueing systems without carefully designed observability around it leads to difficulties in operations.

These issues resulted in high data latency, significant cost, and operational overhead at Doordash. Therefore, Doordash moved to a cloud-native streaming platform powered by Apache Kafka and Apache Flink for continuous stream processing before ingesting data into Snowflake:

The move to a data streaming platform provides many benefits to Doordash:

Heterogeneous data sources and destinations, including REST APIs using the Confluent REST Proxy
Easily accessible
End-to-end data governance with schema enforcement and schema evolution with Confluent Schema Registry
Scalable, fault-tolerant, and easy to operate for a small team
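Schema enforcement and schema evolution can be illustrated with a strongly simplified compatibility check. This is not the Schema Registry API. It is a sketch of one common rule (backward compatibility: data written with the old schema must remain readable with the new one), and the schema representation and function name are assumptions made for the example.

```python
from typing import Dict, List

# A schema is modeled as field name -> {"type": ..., optional "default": ...}.
Schema = Dict[str, dict]


def backward_compatibility_errors(old: Schema, new: Schema) -> List[str]:
    """Return reasons why data written with `old` cannot be read with `new`."""
    errors = []
    for name, spec in new.items():
        if name not in old:
            # Old records lack this field, so the reader needs a default value.
            if "default" not in spec:
                errors.append(f"new field '{name}' has no default")
        elif old[name]["type"] != spec["type"]:
            errors.append(f"field '{name}' changed type")
    return errors


v1: Schema = {"order_id": {"type": "string"}, "amount": {"type": "double"}}
v2_ok: Schema = {**v1, "currency": {"type": "string", "default": "USD"}}
v2_bad: Schema = {**v1, "currency": {"type": "string"}}

ok_errors = backward_compatibility_errors(v1, v2_ok)
bad_errors = backward_compatibility_errors(v1, v2_bad)
```

Rejecting `v2_bad` before any producer can use it is what keeps downstream consumers, such as the warehouse ingestion, from breaking on a schema change.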

All the details about this cloud-native infrastructure optimization are in Doordash’s engineering blog post: “Building Scalable Real Time Event Processing with Kafka and Flink“.

Real-world case studies for cloud-native projects prove the business value

Data warehouse and data lake modernization only makes sense if there is business value. Elastic scale, reduced operations complexity, and faster time to market are significant advantages of cloud services like Snowflake, Databricks, or Google BigQuery.

Data streaming plays a vital role in these initiatives: it integrates with legacy and cloud-native data sources, provides continuous streaming ETL, truly decouples the data sources, and feeds multiple data sinks (lakes, warehouses, business applications).

The case studies of Confluent, PayPal, Shippeo, Sykes Cottages, and Doordash showed their different success stories of moving into cloud-native infrastructure to gain real-time visibility and analytics capabilities. Elastic scale and fully-managed end-to-end pipelines are crucial success factors in gaining business value with consistently up-to-date information.

For more details, browse other posts of this blog series:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
THIS POST: Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
Lessons Learned from Building a Cloud-Native Data Warehouse

Do you have another success story to share? Or are your projects for data lake and data warehouse modernization still ongoing? Do you use separate infrastructure for specific use cases or build a monolithic lakehouse instead? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization appeared first on Kai Waehner.

Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure

Kai Waehner — Fri, 15 Jul 2022 06:03:28 +0000

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
Lessons Learned from Building a Cloud-Native Data Warehouse

Data warehouse modernization: From legacy on-premise to cloud-native infrastructure

Many people talk about data warehouse modernization when they move to a cloud-native data warehouse. But what does data warehouse modernization mean? Why do people move away from their on-premise data warehouse? What are the benefits?

Many projects I have seen in the wild went through the following steps:

Select a cloud-native data warehouse
Get data into the new data warehouse
[Optional] Migrate from the old to the new data warehouse

Let’s explore these steps in more detail and understand the technology and architecture options.

1. Selection of a cloud-native data warehouse

Many years ago, cloud computing was a game-changer for operating infrastructure. AWS innovated by providing not just EC2 virtual machines but also storage as a service, like Amazon S3.

Cloud-native data warehouse offerings are built on the same fundamental change. Cloud providers brought their analytics cloud services, such as AWS Redshift, Azure Synapse, or GCP BigQuery. Independent vendors rolled out a cloud-native data warehouse or data lake SaaS such as Snowflake, Databricks, and many more. While each solution has its trade-offs, a few general characteristics are true for most of them:

Cloud-native: A modern data warehouse is elastic, scales for small up to extreme workloads, and automates most business processes around development, operations, and monitoring.
Fully managed: The vendor takes over the operations burden. This includes scaling, failover handling, upgrades, and performance tuning. Some offerings are truly serverless, while many services require capacity planning and manual or automated scaling up and down.
Consumption-based pricing: Pay-as-you-go enables getting started in minutes, with costs that grow only as usage grows. Most enterprise deployments allow committing to a usage volume in exchange for price discounts.
Data sharing: Replicating data sets across regions and environments is a common feature to offer data locality, privacy, lower latency, and regulatory compliance.
Multi-cloud and hybrid deployments: While cloud providers usually only offer the 1st party service on their cloud infrastructure, 3rd party vendors provide a multi-cloud strategy. Some vendors even offer hybrid environments, including on-premise and edge deployments.

Plenty of comparisons exist in the community, plus analyst research from Gartner, Forrester, et al. Looking at vendor information and trying out the various cloud products using free credits is crucial, too. Finding the right cloud-native data warehouse is its own challenge and is not covered in this blog post.

2. Data streaming as (near) real-time and hybrid integration layer

Data ingestion into data warehouses and data lakes was already covered in part two of this blog series. The more real-time, the better for most business applications. Near real-time ingestion is possible with specific tools (like AWS Kinesis or Kafka) or as part of the data fabric (the streaming data hub where a tool like Kafka plays a bigger role than just data ingestion).

The often more challenging part is data integration. Most data warehouse and data lake pipelines require ETL to ingest data. As the next-generation analytics platform is crucial for making the right business decisions, the data ingestion and integration platform must also be cloud-native! Tools like Kafka provide the reliable and scalable integration layer to get all required data into the data warehouse.

Integration of legacy on-premise data into the cloud-native data warehouse

In a greenfield project, the project team is lucky. Data sources run in the same cloud, using open and modern APIs, and scale as well as the cloud-native data warehouse.

Unfortunately, reality is almost always brownfield, even if all applications run in public cloud infrastructure. Therefore, the integration and replication of data from legacy and on-premise applications is a general requirement.

Data is typically consumed from legacy databases, data warehouses, applications, and other proprietary platforms. The replication into the cloud data warehouse usually needs to be near real-time and reliable.

A data streaming platform like Kafka is well suited for replicating data across data centers, regions, and environments because of its elastic scalability and true decoupling capabilities. Kafka enables connectivity to modern AND legacy systems via connectors, proprietary APIs, programming languages, or open REST interfaces.

A common scenario in such a brownfield project is the clear separation of concerns and true decoupling between legacy on-premise and modern cloud workloads. Here, Kafka is deployed on-premise to connect to legacy applications.

Tools like MirrorMaker, Replicator, or Confluent Cluster Linking replicate events in real-time into the Kafka cluster in the cloud. The Kafka brokers provide access to the incoming events. Downstream consumers read the data into the data sinks at their own pace: real-time, near real-time, batch, or request-response via API. Streaming ETL is possible at any site, wherever it makes the most sense from a business or security perspective and is the most cost-efficient.
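To make "consumers read at their own pace" concrete, here is a minimal sketch in plain Python. It does not use Kafka at all: the `EventLog` class, the group names, and the event shape are hypothetical stand-ins that only mimic the idea of an append-only log with one read position (offset) per consumer group.

```python
from collections import defaultdict


class EventLog:
    """In-memory stand-in for a single topic partition (illustrative only, not Kafka)."""

    def __init__(self):
        self._events = []                  # append-only list of events
        self._offsets = defaultdict(int)   # next offset to read, per consumer group

    def append(self, event):
        """Producer side: add an event and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, group, max_events=10):
        """Consumer side: return up to max_events unread events for this group."""
        start = self._offsets[group]
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch

    def lag(self, group):
        """Number of events this group has not read yet."""
        return len(self._events) - self._offsets[group]


log = EventLog()
for order_id in range(5):
    log.append({"order_id": order_id})

# A real-time consumer reads everything as soon as it arrives ...
realtime_batch = log.poll("fraud-detection", max_events=5)

# ... while a slower, batch-oriented sink has only read part of the log so far.
warehouse_batch = log.poll("warehouse-loader", max_events=2)
```

Because each group tracks its own offset, the slow warehouse loader never blocks the real-time consumer, and it can catch up later without the producer sending anything again.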

Example: Confluent Cloud + Snowflake = Cloud-native Data Warehouse Modernization

Here is a concrete example of data warehouse modernization using cloud-native data streaming and data warehousing with Confluent Cloud and Snowflake:

For modernizing the data warehouse, data is ingested from various legacy and modern data sources using different communication paradigms, APIs, and integration strategies. The data is transmitted in motion from data sources via Kafka (and optional preprocessing) into the Snowflake data warehouse. The whole end-to-end pipeline is scalable, reliable, and fully managed, including the connectivity and ingestion between the Kafka and Snowflake clusters.

However, there is more to the integration and ingestion layer: The data streaming platform stores the data for true decoupling and slow downstream applications; not every consumer is or can be real-time. Most enterprise architectures do not ingest data into a single data warehouse, data lake, or lakehouse. The reality is that different downstream applications need access to the same information, even though vendors of data warehouses and data lakes tell you differently, of course.

By consuming events from the streaming data hub, each application domain decides for itself whether it

processes events within Kafka with stream processing tools like Kafka Streams or ksqlDB
builds its own downstream applications with its own code and technologies (like Java, .NET, Golang, Python)
integrates with 3rd party business applications like Salesforce or SAP
ingests the raw or preprocessed and curated data from Kafka into the sink system (like a data warehouse or data lake)
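The difference between ingesting raw events and ingesting preprocessed, curated events can be sketched in a few lines of plain Python. This is only an illustration of the idea of a streaming ETL step, not Kafka Streams or ksqlDB code; the `curate` function, the field names, and the sample events are assumptions made up for this example.

```python
def curate(events):
    """Streaming ETL step: drop incomplete events and normalize the rest, one at a time."""
    for event in events:
        if event.get("amount") is None:
            continue  # skip incomplete records instead of loading them into the sink
        yield {
            "customer": event["customer"].strip().lower(),
            "amount_cents": round(float(event["amount"]) * 100),
        }


# Hypothetical raw events as they might arrive from a legacy source system.
raw_events = [
    {"customer": " Alice ", "amount": "12.50"},
    {"customer": "Bob", "amount": None},
    {"customer": "CAROL", "amount": "3"},
]

# One domain ingests the raw events unchanged (e.g., into a data lake) ...
raw_sink = list(raw_events)

# ... while another domain ingests the curated stream (e.g., into a data warehouse).
curated_sink = list(curate(raw_events))
```

Both sinks are fed from the same source events; the cleaning logic runs once in the stream instead of being reimplemented in every downstream system.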

3. Data warehouse modernization and migration from legacy to cloud-native

An often misunderstood concept is the buzz around data warehouse modernization: Companies rarely take the data of the existing on-premise data warehouse or data lake, write a few ETL jobs, and put the data into a cloud-native data warehouse for the sake of doing it.

If you think about a one-time lift-and-shift from an on-premise data warehouse to the cloud, then a traditional ETL tool or a replication cloud service might be the easiest. However, usually, data warehouse modernization is more than that!

What is data warehouse modernization?

A data warehouse modernization can mean many things, including replacing and migrating the existing data warehouse, building a new cloud-native data warehouse from scratch, or optimizing a legacy ETL pipeline of a cloud-native data warehouse.

In all these cases, data warehouse modernization requires business justification, for instance:

Application issues in the legacy data warehouse, such as too slow data processing with legacy batch workloads, result in wrong or conflicting information for the business people.
Scalability issues in the on-premise data warehouse as the data volume grows too much.
Cost issues because the legacy data warehouse does not offer reasonable pricing with pay-as-you-go models.
Connectivity issues as legacy data warehouses were not built with an open API and data sharing in mind. Cloud-native data warehouses run on cost-efficient and scalable object storage, separate storage from compute, and allow data consumption and sharing (but keep in mind that Reverse ETL is often an anti-pattern!).
A strategic move to the cloud with all infrastructure. The analytics platform is no exception if all old and new applications go to the cloud.

Cloud-native applications usually come with innovation, i.e., new business processes, data formats, and data-driven decision-making. From a data warehouse perspective, the best modernization is to start from scratch. Consume data directly from the existing data sources, ETL it, and do business intelligence on top of the new data structures.

I have seen many more projects where customers use change data capture (CDC) from Oracle databases (i.e., the leading core system) instead of trying to replicate data from the legacy data warehouse (i.e., the analytics platform), as scalability, cost, and the later shutdown of legacy infrastructure all benefit from this approach.
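As a rough illustration of what change data capture delivers to the new data warehouse, the sketch below replays a sequence of change events against an empty target table. The event shape (`op`, `key`, `row`) is a simplified assumption for this example and not the format of any specific CDC tool or of Oracle itself.

```python
def apply_change(table, change):
    """Apply one change event to a table modeled as a dict keyed by primary key."""
    op, key = change["op"], change["key"]
    if op in ("insert", "update"):
        # Merge the changed columns into the existing row (or create the row).
        table[key] = {**table.get(key, {}), **change["row"]}
    elif op == "delete":
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")
    return table


# Hypothetical change events captured from the source database, in commit order.
changes = [
    {"op": "insert", "key": 1, "row": {"name": "Alice", "city": "Berlin"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob", "city": "Munich"}},
    {"op": "update", "key": 1, "row": {"city": "Hamburg"}},
    {"op": "delete", "key": 2},
]

replica = {}
for change in changes:
    apply_change(replica, change)
```

Replaying the events in order reproduces the current state of the source table in the target, which is why the core system can serve as the source of truth and the legacy data warehouse does not need to be part of the replication path.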

Data warehouse migration: Continuous vs. cut-over

The project is usually a cut-over when you need to do a real modernization (i.e., migration) from a legacy data warehouse to a cloud-native one. This way, the first project phase integrates the legacy data sources with the new data warehouse. The old and new data warehouse platforms operate in parallel, so that old and new business processes continue to run. After some time (months or years later), when the business is ready to move, the old data warehouse will be shut down after legacy applications are either migrated to the new data warehouse or replaced with new applications.

My article “Mainframe Integration, Offloading and Replacement with Apache Kafka” illustrates this offloading and long-term migration process. Just scroll to the section “Mainframe Offloading and Replacement in the Next 5 Years” in that post and replace the term ‘mainframe’ with ‘legacy data warehouse’ in your mind.

A migration and cut-over is its own project and may or may not include the legacy data warehouse. Data lake modernization (e.g., from a self- or partially managed Cloudera cluster running on-premise in the data center to a fully managed Databricks or Snowflake cloud infrastructure) follows the same principles. And mixing the data warehouse (reporting) and data lake (big data analytics) into a single infrastructure does not change this either.

Data warehouse modernization is NOT a big bang and NOT a single tool approach!

Most data warehouse modernization projects are ongoing efforts over a long period. You must select a cloud-native data warehouse, get data into the new data warehouse from various sources, and optionally migrate away from legacy on-premise infrastructure.

Data streaming for data ingestion, business applications, or data sharing in real-time should always be a separate component in the enterprise architecture. It has very different requirements regarding SLAs, uptime, throughput, latency, etc. Putting all real-time and analytical workloads into the same cluster makes little sense from a cost, risk, or business value perspective. The idea of a modern data flow and building a data mesh is the separation of concerns with domain-driven design and focusing on data products (using different, independent APIs, technologies, and cloud services).

For more details, browse other posts of this blog series:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
Lessons Learned from Building a Cloud-Native Data Warehouse

What cloud-native data warehouse(s) do you use? How does data streaming fit into your journey? Did you integrate or replace your legacy on-premise data warehouse(s), or start from greenfield in the cloud? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure appeared first on Kai Waehner.

Data Lake Archives - Kai Waehner

The Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming)

The Issues of Batch Processing

Real-time Data Streaming Beats Slow Data and Batch Processing

Corrupted Data and Null Values

Thousands of Batch Jobs and Complexity

Missing Data and Manual Backfilling

Data Inconsistencies and Data Copies

Exactly-Once Not Guaranteed

Invalid and Incompatible Schemas

Compliance Challenges

Duplicated Data and Small Files

High Latency and Outdated Information

Brittle Pipelines and Manual Fixes

Logically and Semantically Invalid Data

Exhausted Deduplication and Inaccurate Results

Schema Evolution Compatibility Issues

Similar Yet Different Datasets

Inaccurate Data

Data Streaming Is the New Standard to Avoid Batch Processing

How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.)

Data at Rest (Lakehouse) vs. Data in Motion (Data Streaming)

Data Streaming in an Event-driven Architecture with Apache Kafka and Flink

Event-driven Architecture for Operational and Analytical Workloads

Apache Kafka – The De Facto Standard for Event-driven Messaging and Integration

Apache Flink – The De Facto Standard for Stream Processing

Kafka and Flink are a “Match Made in Heaven” for Data Streaming

Microsoft Fabric and Data Streaming (Kafka, Flink): Complementary Forces, NOT Competitors

Microsoft Fabric’s Streaming Ingestion: A Common Feature Among Lakehouses

The Fabric Real Time Intelligence Hub: Understanding Its Capabilities and Limitations

Enterprise Architecture with Data Streaming like Kafka / Flink and Lakehouse(s) like Microsoft Fabric

Reverse ETL is an Anti-Pattern

Apache Iceberg as the De Facto Standard for an Open Table Format – Store Once, Analyze with any Tool

Shift Left Architecture to Support Operational and Analytical Workloads with an Event-driven Architecture

Example: Confluent (Data Streaming) + Microsoft Fabric (Lakehouse) + Snowflake (Another Lakehouse)

Microsoft Fabric Lakehouse + Data Streaming (Kafka, Flink) = Match Made in Heaven

What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks

What is Microsoft Fabric?

GenAI Definition (= Sales and Marketing)

Microsoft Fabric is a Data Analytics Platform ( = NOT for Operational / Transactional Workloads)

OneLake – Cloud-based Storage Layer on Top of Azure Data Lake Storage (ADLS)

Microsoft Fabric Connects to Many Existing Azure Services

Microsoft Fabric is a Lakehouse Competing with Snowflake and Databricks

Microsoft Fabric is Azure’s Future Lakehouse

Apache Iceberg – The Open Table Format for Lakehouse AND Data Streaming

What is an Open Table Format for a Data Platform?

Benefits of a “Lakehouse Table Format” like Apache Iceberg

Table Format AND Catalog Interface

First-Class Iceberg Support vs. Iceberg Connector

Open Table Format for a Data Lake / Lakehouse using Apache Iceberg, Apache Hudi, Delta Lake

Comparison of Apache Iceberg, Hudi, Paimon, Delta Lake?

Apache XTable as Interoperable Cross-Table Framework supporting Iceberg, Hudi and Delta Lake

Market Adoption of Table Format Frameworks

Flashback: Container Wars – Kubernetes vs. Mesosphere vs. Cloud Foundry

Present: Table Format Wars – Apache Iceberg vs. Hudi vs. Delta Lake

Data Platform and Cloud Vendor Strategies for Apache Iceberg

The Shift Left Architecture with Kafka, Flink and Iceberg to Unify Operational and Analytical Workloads

Apache Iceberg as Open Table Format and Catalog for Seamless Data Sharing across Analytics Engines

The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming

Data Products – The Foundation of a Data Mesh

Data Product from a Technical Perspective

Anti-Pattern: Batch Processing and Reverse ETL

Traditional ELT in the Data Lake, Data Warehouse, Lakehouse

Data Integration, Zero ETL and Reverse ETL with Kafka, Snowflake, Databricks, BigQuery, etc.

Shift Left to Data Streaming for Operational AND Analytical Data Products

Shift Left Architecture with Apache Kafka, Flink and Iceberg

Video: Shift Left Architecture

Apache Iceberg – The New De Facto Standard for Lakehouse Table Format?

Business Value of the Shift Left Architecture

SAP Datasphere and Apache Kafka as Data Fabric for S/4HANA ERP Integration

What is SAP ERP?

SAP ECC, S/4HANA, and more ERP Products

What is SAP Datasphere?

SAP Datasphere = Cloud Data Warehouse and Analytics Platform

Features of SAP Datasphere and Complementary Software Partnerships

SAP Datasphere = One-Stop-Shop for Multi-Generation SAP ERP Systems

Why Apache Kafka for SAP Integration?

Continuous SAP Migration and Cutover with Kafka

Confluent’s Fully Managed SAP Integration and Strategic Partnership

Confluent in the SAP PartnerEdge Program