data warehouse Archives - Kai Waehner https://www.kai-waehner.de/blog/tag/data-warehouse/ Technology Evangelist - Big Data Analytics - Middleware - Apache Kafka Sun, 18 May 2025 15:47:53 +0000 en-US hourly 1 https://wordpress.org/?v=6.7.2 https://www.kai-waehner.de/wp-content/uploads/2020/01/cropped-favicon-32x32.png data warehouse Archives - Kai Waehner https://www.kai-waehner.de/blog/tag/data-warehouse/ 32 32 Shift Left Architecture for AI and Analytics with Confluent and Databricks https://www.kai-waehner.de/blog/2025/05/09/shift-left-architecture-for-ai-and-analytics-with-confluent-and-databricks/ Fri, 09 May 2025 06:03:07 +0000 https://www.kai-waehner.de/?p=7774 Confluent and Databricks enable a modern data architecture that unifies real-time streaming and lakehouse analytics. By combining shift-left principles with the structured layers of the Medallion Architecture, teams can improve data quality, reduce pipeline complexity, and accelerate insights for both operational and analytical workloads. Technologies like Apache Kafka, Flink, and Delta Lake form the backbone of scalable, AI-ready pipelines across cloud and hybrid environments.

The post Shift Left Architecture for AI and Analytics with Confluent and Databricks appeared first on Kai Waehner.

]]>
Modern enterprise architectures are evolving. Traditional batch data pipelines and centralized processing models are being replaced by more flexible, real-time systems. One of the driving concepts behind this change is the Shift Left approach. This blog compares Databricks’ Medallion Architecture with a Shift Left Architecture popularized by Confluent. It explains where each concept fits best—and how they can work together to create a more complete, flexible, and scalable architecture.

Shift Left Architecture with Confluent Data Streaming and Databricks Lakehouse Medallion

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Learn how these platforms will affect data use in businesses in future articles. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including more details about the shift left architecture with data streaming and lakehouses.

Medallion Architecture: Structured, Proven, but Not Always Optimal

The Medallion Architecture, popularized by Databricks, is a well-known design pattern for organizing and processing data within a lakehouse. It provides structure, modularity, and clarity across the data lifecycle by breaking pipelines into three logical layers:

  • Bronze: Ingest raw data in its original format (often semi-structured or unstructured)
  • Silver: Clean, normalize, and enrich the data for usability
  • Gold: Aggregate and transform the data for reporting, dashboards, and machine learning
Databricks Medallion Architecture for Lakehouse ETL
Source: Databricks

This layered approach is valuable for teams looking to establish governed and scalable data pipelines. It supports incremental refinement of data and enables multiple consumers to work from well-defined stages.

Challenges of the Medallion Architecture

The Medallion Architecture also introduces challenges:

  • Pipeline delays: Moving data from Bronze to Gold can take minutes or longer—too slow for operational needs
  • Infrastructure overhead: Each stage typically requires its own compute and storage footprint
  • Redundant processing: Data transformations are often repeated across layers
  • Limited operational use: Data is primarily at rest in object storage; using it for real-time operational systems often requires inefficient reverse ETL pipelines.

For use cases that demand real-time responsiveness and/or critical SLAs—such as fraud detection, personalized recommendations, or IoT alerting—this traditional batch-first model may fall short. In such cases, an event-driven streaming-first architecture, powered by a data streaming platform like Confluent, enables faster, more cost-efficient pipelines by performing validation, enrichment, and even model inference before data reaches the lakehouse.

Importantly, this data streaming approach doesn’t replace the Medallion pattern—it complements it. It allows you to “shift left” critical logic, reducing duplication and latency while still feeding trusted, structured data into Delta Lake or other downstream systems for broader analytics and governance.

In other words, shifting data processing left (i.e., before it hits a data lake or Lakehouse) is especially valuable when the data needs to serve multiple downstream systems—operational and analytical alike—because it avoids duplication, reduces latency, and ensures consistent, high-quality data is available wherever it’s needed.

Shift Left Architecture: Process Earlier, Share Faster

In a Shift Left Architecture, data processing happens earlier—closer to the source, both physically and logically. This often means:

  • Transforming and validating data as it streams in
  • Enriching and filtering in real time
  • Sharing clean, usable data quickly across teams AND different technologies/applications

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

This is especially useful for:

  • Reducing time to insight
  • Improving data quality at the source
  • Creating reusable, consistent data products
  • Operational workloads with critical SLAs

How Confluent Enables Shift Left with Databricks

In a Shift Left setup, Apache Kafka provides scalable, low-latency, and truly decoupled ingestion of data across operational and analytical systems, forming the backbone for unified data pipelines.

Schema Registry and data governance policies enforce consistent, validated data across all streams, ensuring high-quality, secure, and compliant data delivery from the very beginning.

Apache Flink enables early data processing — closer to where data is produced. This reduces complexity downstream, improves data quality, and allows real-time decisions and analytics.

Shift Left Architecture with Confluent Databricks and Delta Lake

Data Quality Governance via Data Contracts and Schema Validation

Flink can enforce data contracts by validating incoming records against predefined schemas (e.g., using JSON Schema, Apache Avro or Protobuf with Schema Registry). This ensures structurally valid data continues through the pipeline. In cases where schema violations occur, records can be automatically routed to a Dead Letter Queue (DLQ) for inspection.

Confluent Schema Registry for good Data Quality, Policy Enforcement and Governance using Apache Kafka

Additionally, data contracts can enforce policy-based rules at the schema level—such as field-level encryption, masking of sensitive data (PII), type coercion, or enrichment defaults. These controls help maintain compliance and reduce risk before data reaches regulated or shared environments.

Flink can perform the following tasks before data ever lands in a data lake or warehouse:

Filtering and Routing

Events can be filtered based on business rules and routed to the appropriate downstream system or Kafka topic. This allows different consumers to subscribe only to relevant data, optimizing both performance and cost.

Metric Calculation

Use Flink to compute rolling aggregates (e.g., counts, sums, averages, percentiles) over windows of data in motion. This is useful for business metrics, anomaly detection, or feeding real-time dashboards—without waiting for batch jobs.

Real-Time Joins and Enrichment

Flink supports both stream-stream and stream-table joins. This enables real-time enrichment of incoming events with contextual information from reference data (e.g., user profiles, product catalogs, pricing tables), often sourced from Kafka topics, databases, or external APIs.

Streaming ETL with Apache Flink SQL

By shifting this logic to the beginning of the pipeline, teams can reduce duplication, avoid unnecessary storage and compute costs in downstream systems, and ensure that data products are clean, policy-compliant, and ready for both operational and analytical use—as soon as they are created.

Example: A financial application might use Flink to calculate running balances, detect anomalies, and enrich records with reference data before pushing to Databricks for reporting and training analytic models.

In addition to enhancing data quality and reducing time-to-insight in the lakehouse, this approach also makes data products immediately usable for operational workloads and downstream applications—without building separate pipelines.

Learn more about stateless and stateful stream processing in real-time architectures using Apache Flink in this in-depth blog post.

Combining Shift Left with Medallion Architecture

These architectures are not mutually exclusive. Shift Left is about processing data earlier. Medallion is about organizing data once it arrives.

You can use Shift Left principles to:

  • Pre-process operational data before it enters the Bronze layer
  • Ensure clean, validated data enters Silver with minimal transformation needed
  • Reduce the need for redundant processing steps between layers

Confluent’s Tableflow bridges the two worlds. It converts Kafka streams into Delta tables, integrating cleanly with the Medallion model while supporting real-time flows.

Shift Left with Delta Lake, Iceberg, and Tableflow

Confluent Tableflow makes it easy to publish Kafka streams into Delta Lake or Apache Iceberg formats. These can be discovered and queried inside Databricks via Unity Catalog.

This integration:

  • Simplifies integration, governance and discovery
  • Enables live updates to AI features and dashboards
  • Removes the need to manage Spark streaming jobs

This is a natural bridge between a data streaming platform and the lakehouse.

Confluent Tableflow to Unify Operational and Analytical Workloads with Apache Iceberg and Delta Lake
Source: Confluent

AI Use Cases for Shift Left with Confluent and Databricks

The Shift Left model benefits both predictive and generative AI:

  • Model training: Real-time data pipelines can stream features to Delta Lake
  • Model inference: In some cases, predictions can happen in Confluent (via Flink) and be pushed back to operational systems instantly
  • Agentic AI: Real-time event-driven architectures are well suited for next-gen, stateful agents

Databricks supports model training and hosting via MosaicML. Confluent can integrate with these models, or run lightweight inference directly from the stream processing application.

Data Warehouse Use Cases for Shift Left with Confluent and Databricks

  • Batch reporting: Continue using Databricks for traditional BI
  • Real-time analytics: Flink or real-time OLAP engines (e.g., Apache Pinot, Apache Druid) may be a better fit for sub-second insights
  • Hybrid: Push raw events into Databricks for historical analysis and use Flink for immediate feedback

Where you do the data processing depends on the use case.

Architecture Benefits Beyond Technology

Shift Left also brings architectural benefits:

  • Cost Reduction: Processing early can lower storage and compute usage
  • Faster Time to Market: Data becomes usable earlier in the pipeline
  • Reusability: Processed streams can be reused and consumed by multiple technologies/applications (not just Databricks teams)
  • Compliance and Governance: Validated data with lineage can be shared with confidence

These are important for strategic enterprise data architectures.

Bringing in New Types of Data

Shift Left with a data streaming platform supports a wider range of data sources:

  • Operational databases (like Oracle, DB2, SQL Server, Postgres, MongoDB)
  • ERP systems (SAP et al)
  • Mainframes and other legacy technologies
  • IoT interfaces (MQTT, OPC-UA, proprietary IIoT gateway, etc.)
  • SaaS platforms (Salesforce, ServiceNow, and so on)
  • Any other system that does not directly fit into the “table-driven analytics perspective” of a Lakehouse

With Confluent, these interfaces can be connected in real time, enriched at the edge or in transit, and delivered to analytics platforms like Databricks.

This expands the scope of what’s possible with AI and analytics.

Shift Left Using ONLY Databricks

A shift left architecture only with Databricks is possible, too. A Databricks consultant took my Shift Left slide and adjusted it that way:

Shift Left Architecture with Databricks and Delta Lake

 

Relying solely on Databricks for a “Shift Left Architecture” can work if all workloads (should) stay within the platform — but it’s a poor fit for many real-world scenarios.

Databricks focuses on ELT, not true ETL, and lacks native support for operational workloads like APIs, low-latency apps, or transactional systems. This forces teams to rely on reverse ETL tools – a clear anti-pattern in the enterprise architecture – just to get data where it’s actually needed. The result: added complexity, latency, and tight coupling.

The Shift Left Architecture is valuable, but in most cases it requires a modular approach, where streaming, operational, and analytical components work together — not a monolithic platform.

That said, shift left principles still apply within Databricks. Processing data as early as possible improves data quality, reduces overall compute cost, and minimizes downstream data engineering effort. For teams that operate fully inside the Databricks ecosystem, shifting left remains a powerful strategy to simplify pipelines and accelerate insight.

Meesho: Scaling a Real-Time Commerce Platform with Confluent and Databricks

Many high-growth digital platforms adopt a shift-left approach out of necessity—not as a buzzword, but to reduce latency, improve data quality, and scale efficiently by processing data closer to the source.

Meesho, one of India’s largest online marketplaces, relies on Confluent and Databricks to power its hyper-growth business model focused on real-time e-commerce. As the company scaled rapidly, supporting millions of small businesses and entrepreneurs, the need for a resilient, scalable, and low-latency data architecture became critical.

To handle massive volumes of operational events — from inventory updates to order management and customer interactions — Meesho turned to Confluent Cloud. By adopting a fully managed data streaming platform using Apache Kafka, Meesho ensures real-time event delivery, improved reliability, and faster application development. Kafka serves as the central nervous system for their event-driven architecture, connecting multiple services and enabling instant, context-driven customer experiences across mobile and web platforms.

Alongside their data streaming architecture, Meesho migrated from Amazon Redshift to Databricks to build a next-generation analytics platform. Databricks’ lakehouse architecture empowers Meesho to unify operational data from Kafka with batch data from other sources, enabling near real-time analytics at scale. This migration not only improved performance and scalability but also significantly reduced costs and operational overhead.

With Confluent managing real-time event processing and ingestion, and Databricks providing powerful, scalable analytics, Meesho is able to:

  • Deliver real-time personalized experiences to customers
  • Optimize operational workflows based on live data
  • Enable faster, data-driven decision-making across business teams

By combining real-time data streaming with advanced lakehouse analytics, Meesho has built a flexible, future-ready data infrastructure to support its mission of democratizing online commerce for millions across India.

Shift Left: Reducing Complexity, Increasing Value for the Lakehouse (and other Operational Systems)

Shift Left is not about replacing Databricks. It’s about preparing better data earlier in the pipeline—closer to the source—and reducing end-to-end complexity.

  • Use Confluent for real-time ingestion, enrichment, and transformation
  • Use Databricks for advanced analytics, reporting, and machine learning
  • Use Tableflow and Delta Lake to govern and route high-quality data to the right consumers

This architecture not only improves data quality for the lakehouse, but also enables the same real-time data products to be reused across multiple downstream systems—including operational, transactional, and AI-powered applications.

The result: increased agility, lower costs, and scalable innovation across the business.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and follow me on LinkedIn or X (former Twitter) to stay in touch. And download my free book about data streaming use cases, including more details about the shift left architecture with data streaming and lakehouses.

The post Shift Left Architecture for AI and Analytics with Confluent and Databricks appeared first on Kai Waehner.

]]>
How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.) https://www.kai-waehner.de/blog/2024/10/12/how-microsoft-fabric-lakehouse-complements-data-streaming-apache-kafka-flink-et-al/ Sat, 12 Oct 2024 06:58:00 +0000 https://www.kai-waehner.de/?p=6870 In today's data-driven world, understanding data at rest versus data in motion is crucial for businesses. Data streaming frameworks like Apache Kafka and Apache Flink enable real-time data processing. Meanwhile, lakehouses like Snowflake, Databricks, and Microsoft Fabric excel in long-term data storage and detailed analysis, perfect for reports and AI training. This blog post explores how these technologies complement each other in enterprise architecture.

The post How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.) appeared first on Kai Waehner.

]]>
In today’s data-driven world, understanding data at rest versus data in motion is crucial for businesses. Data streaming frameworks like Apache Kafka and Apache Flink enable real-time data processing, offering quick insights and seamless system integration. They are ideal for applications that require immediate responses and handle transactional workloads. Meanwhile, lakehouses like Snowflake, Databricks, and Microsoft Fabric excel in long-term data storage and detailed analysis, perfect for reports and AI training. By leveraging both data streaming and lakehouse systems, businesses can effectively meet both short-term and long-term data needs. This blog post explores how these technologies complement each other in enterprise architecture.

Lakehouse and Data Streaming - Competitor or Complementary - Kafka Flink Confluent Microsoft Fabric Snowflake Databricks

This is part two of a blog series about Microsoft Fabric and its relation to other data platforms on the Azure cloud:

  1. What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks
  2. How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.)
  3. When to Choose Apache Kafka vs. Azure Event Hubs vs. Confluent Cloud for a Microsoft Fabric Lakehouse

Subscribe to my newsletter to get an email about a new blog post every few weeks.

Data at Rest (Lakehouse) vs. Data in Motion (Data Streaming)

Data streaming technologies like Apache Kafka and Apache Flink enable continuous data processing while the data is in motion in an event-driven architecture. Data streaming enables immediate insights and seamless integration of data across systems. Kafka provides a robust real-time messaging and persistence platform, while Flink excels in low-latency stream processing, making them ideal for dynamic, stateful applications. A data streaming platform supports operational/transactional and analytical use cases.

Data lakes and data warehouses store data at rest before processing the data. The platforms are optimized for batch processing and long-term analytics, including AI/ML use cases such as model training. Some components provide near real-time capabilities, e.g. data ingestion or dashboards. Data lakes offer scalable, flexible storage for raw data, and data warehouses provide structured, high-performance environments for business intelligence and reporting, complementing the real-time capabilities of streaming technologies. Most leading data platforms provide a unified combination of data lake and data warehouse called lakehouse. Lakehouses are almost exclusively used for analytical workloads as they typically lack the SLAs and tight latency required for operational/transactional use cases.

Data Streaming with Apache Kafka Flink and Lakehouse with Microsoft Fabric Databricks Snowflake Apache Iceberg

Data streaming and lakehouses are complementary, with some overlaps but different sweet spots. If you want to learn more, check out these articles:

I also created a short ten minute video explaining the above concepts:

Let’s explore why data streaming and a lakehouse like Microsoft Fabric are complementary (with a few overlaps). I explained in the first blog of this series what Microsoft Fabric is. To understand the differences, it is important to understand what a data streaming platform really is.

There is a lot of confusion in the market. For instance, some folks still compare Apache Kafka to a message broker like RabbitMQ or IBM MQ. I mainly focus on Apache Kafka and Apache Flink as these are the de facto standards for data streaming across industries. Before talking about technologies and solutions, we need to start with the concept of an event-driven architecture as the foundation of data streaming.

Event-driven Architecture for Operational and Analytical Workloads

In today’s digital world, getting real-time data quickly is more important than ever. Traditional methods that process data in batches or via request-response APIs often cannot keep up when you need immediate insights.

Event-driven architecture offers a different approach by focusing on handling events – like transactions or user actions – as they happen. One of the key benefits of an event-driven architecture is its ability to decouple systems, meaning that different parts of a system can work independently. This makes it easier to scale and adapt to changes. An event-driven architecture excels in handling both operational and analytical workloads.

Event-driven Architecture with Data Streaming using Apache Kafka and Flink
Source: Confluent

For operational tasks, the event-driven architecture enables real-time data processing, automating processes, enhancing customer experiences, and boosting efficiency. In e-commerce, for example, an event-driven system can instantly update inventory, trigger marketing campaigns, and detect fraud.

On the analytical side, the event-driven architecture allows organizations to derive insights from data as it flows, enabling real-time analytics and trend identification without the delays of batch processing. This is invaluable in sectors like finance and healthcare, where timely insights are crucial.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Building an event-driven architecture with data streaming technologies like Apache Kafka and Apache Flink enhances its potential. These platforms provide the infrastructure for high-throughput, low-latency data streams, enabling scalable and resilient event-driven systems.

Apache Kafka – The De Facto Standard for Event-driven Messaging and Integration

Apache Kafka has become the go-to platform for event-driven messaging and integration, transforming how organizations manage data in motion. Developed by LinkedIn and open-sourced, Kafka is a distributed streaming platform adept at handling real-time data feeds. Over 150,000 organizations use Kafka in the meantime.

Kafka’s architecture is based on a distributed commit log, ensuring data durability and consistency. It decouples data producers and consumers, allowing for flexible and scalable data architectures. Producers publish data to topics, and consumers subscribe independently, facilitating system evolution.

Apache Kafka's Commit Log for Real Time and Batch Data Streaming and Persistence Layer

Beyond messaging, Kafka serves as a robust integration platform, connecting diverse systems and enabling seamless data flow. Its ecosystem of connectors allows integration with databases, cloud services, and legacy systems. This helps organizations in modernizing their data infrastructure step-by-step.

Kafka’s stream processing capabilities, through Kafka Streams and integration with Apache Flink, further enhance the value of the streaming data pipelines. Kafka Streams allows real-time data processing within Kafka to enable complex transformations and enrichments, driving data-driven innovation.

Apache Flink stands out as the leading framework for stream processing. It offers a versatile platform for streaming ETL and stateful business applications. Flink processes data streams with low latency and high throughput, suitable for diverse use cases.

Apache Flink for Real-Time Business Applications and Analytics
Source: Apache

Flink provides a unified programming model for batch and stream processing that allows developers to use the same API for real-time transactional or analytical batch tasks. This flexibility is a significant advantage as it enables varied data processing without separate tools.

A key feature of Flink is its stateful stream processing. This is crucial for maintaining state across events in real-time applications. Flink’s state management ensures accurate processing in complex scenarios. In contrast to many other stream processing solutions, Flink can do stateful processing even at an extreme scale (i.e., with a throughput of gigabytes per second).

Flink’s event time processing capabilities handle out-of-order or delayed events and ensure consistent results. Developers can define windows and triggers based on event timestamps, accommodating late-arriving data.

Apache Flink supports multiple programming languages, including Java, Python, and SQL, offering developers the flexibility to use their preferred language for building stream processing applications. This is a key differentiator to other stream processing engines, such as Kafka Streams or KSQL.

The integration of Flink with Apache Kafka enhances its capabilities. Kafka serves as a reliable data source for Flink to enable seamless real-time data ingestion and processing. With Kafka’s persistent commit log, you can travel back in time and replay historical data in guaranteed ordering for analytical use cases. This combination supports high-volume, low-latency data pipelines, unlocking transactional real-time scenarios and batch analytics.

In summary, Apache Flink’s robust stream processing, combined with Apache Kafka, offers a powerful solution for organizations seeking to leverage real-time data. Whether for operational tasks, real-time analytics, or complex event processing (CEP), Flink provides the necessary flexibility and performance for data-driven innovation.

In the growing landscape of data management, it’s crucial to understand the complementary roles of Microsoft Fabric and data streaming technologies. While some may perceive these technologies as competitors, they actually serve distinct yet interconnected purposes that enhance an organization’s data strategy. And keep in mind that Microsoft Fabric is not just an offering for Azure cloud. Hybrid edge scenarios in the IoT space are perfect for Microsoft Fabric and data streaming together.

Microsoft Fabric’s Streaming Ingestion: A Common Feature Among Lakehouses

Microsoft Fabric, like other modern lakehouse platforms such as Snowflake and Databricks, offers streaming ingestion capabilities. This feature is essential for handling near real-time data flows. It allows organizations to capture and process data as it arrives in the lakehouse. However, it’s important to distinguish between operational and analytical workloads when considering the role of streaming ingestion.

Operational workloads benefit from the immediacy of streaming data, enabling real-time decision-making and process automation. In contrast, analytical workloads often require data to be stored at rest for in-depth analysis and reporting. Microsoft Fabric’s architecture focuses on streaming ingestion into robust storage solutions for analytical purposes.

The Fabric Real Time Intelligence Hub: Understanding Its Capabilities and Limitations

The integration of streaming ingestion into Microsoft Fabric is part of its Real Time Intelligence Hub, which aims to provide a comprehensive platform for managing real-time data. However, beyond the marketing and buzz around Fabric Real Time Intelligence Hub, it’s important to note that it doesn’t operate in true real-time.

Instead, Fabric’s “Real Time Intelligence Hub” uses Spark Streaming jobs to manipulate data, which can introduce some latency. And the infrastructure is not meant for critical SLAs that are required by operational / transactional systems. Additionally, the ingestion process is throttled when using Power BI and other batch analytics tools via an API gateway with a Kafka client.

Microsoft is strong in introducing new names for products or feature for Fabric that are actually just a new brand of existing services. If you find new terms such as “eventhouse” or “event streams feature in the Microsoft Fabric Real-Time Intelligence”, make sure to evaluate if this is really a new component or just some Fabric marketing.

Therefore, despite some overlapping with a data streaming platform, the collaboration between Microsoft Fabric and data streaming vendors like Confluent (Kafka, Flink) underscores the complementary nature of these platforms. By leveraging the strengths of both, organizations can build a robust data infrastructure that supports real-time operations and comprehensive analytics.

In conclusion, Microsoft Fabric and data streaming technologies such as Kafka and Flink are not competitors but complementary tools that, when used together, can significantly enhance an organization’s ability to manage and analyze data. By understanding the distinct roles each plays, businesses can create a more agile and responsive data strategy that meets both operational and analytical needs.

In the modern enterprise architecture landscape, data streaming and lakehouse platforms are pivotal in creating a robust and flexible data ecosystem. Data streaming technologies enable continuous data ingestion and processing for operational and analytical use cases.

Lakehouse platforms, like Microsoft Fabric, Snowflake and Databricks, provide a unified architecture that combines the best of data lakes and data warehouses, offering scalable storage and advanced analytics capabilities.

Together, these technologies empower businesses to handle both operational and analytical workloads efficiently, breaking down data silos and fostering a data-driven culture. By integrating data streaming with lakehouse architectures, enterprises can achieve seamless data flow and comprehensive insights across their operations.

Reverse ETL is an Anti-Pattern

Reverse ETL is the process of moving data from a data store at rest back into operational systems. It is often considered an anti-pattern in modern data architecture. This approach can lead to data inconsistencies, increased complexity, and higher maintenance costs, as it essentially reverses the natural flow of data in motion. Do NOT store data in MIcrosoft Fabric Lakehouse just to reverse it later into other operational systems!

Data at Rest and Reverse ETL

Instead of relying on reverse ETL, organizations should focus on building real-time data pipelines that enable direct integration between data sources and operational systems. By leveraging an event-driven architecture and data streaming technologies, businesses can ensure that data is consistently updated and available where it’s needed most. This approach not only simplifies data management, but also enhances the accuracy and timeliness of insights.

Apache Iceberg as the De Facto Standard for an Open Table Format – Store Once, Analyze with any Tool

Apache Iceberg has emerged as the de facto standard for an open table format. It offers a opportunity for storing data once in an object store like Amazon S3 and analyzing data across various tools. With its ability to handle large-scale datasets and support ACID transactions, Iceberg provides a reliable and efficient way to manage data in a lakehouse environment.

Apache Iceberg Open Table Format for Data Lake Lakehouse Streaming wtih Kafka Flink Databricks Snowflake AWS GCP Azure

Organizations can use their preferred analytics and processing tools without being locked into a specific vendor. This flexibility is crucial for businesses looking to maximize their data investments and adapt to changing technological landscapes. By adopting Apache Iceberg together with data streaming, enterprises can ensure data consistency and accessibility across all business units to drive better data-quality, insights and decision-making.

Shift Left Architecture to Support Operational and Analytical Workloads with an Event-driven Architecture

Traditionally, many organizations use data streaming with Kafka as a dumb pipeline to ingest all raw data into a data lake. The consequences are high compute cost for multiple (re-)processing of the raw data, inconsistencies across business units, and slow time to market for new applications.

ETL and ELT Data Integration to Data Lake Warehouse Lakehouse in Batch - Ingestion via Kafka into Microsoft Fabric

The Shift Left Architecture is a forward-thinking approach that integrates operational and analytical workloads within an event-driven architecture. By shifting data processing closer to the source, this architecture enables real-time data ingestion and analysis, improving data quality for lakehouse ingestion, reducing latency and improving responsiveness.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse - Processing with Kafka and Flink before Microsoft Fabric

Event-driven architectures, powered by technologies like Apache Kafka and Flink, facilitate the seamless flow and processing of data across systems. Shift Left ensures that both operational and analytical needs are met. This approach not only enhances the agility of data-driven applications, but also supports continuous improvement and innovation. By adopting a Shift Left Architecture, organizations can streamline their data processes, improve efficiency, and gain a competitive edge in the market.

Example: Confluent (Data Streaming) + Microsoft Fabric (Lakehouse) + Snowflake (Another Lakehouse)

An example of integrating data streaming and lakehouse technologies is the combination of Confluent, Microsoft Fabric, and Snowflake.

Confluent, built on Apache Kafka and Flink, provides a robust platform for real-time data streaming, enabling organizations to integrate with operational and analytical workloads.

Microsoft Fabric and Snowflake, both lakehouse platforms, offer scalable storage and advanced analytics capabilities to allow businesses performing in-depth analysis and reporting on historical data, near real-time analytics and AI model training.

Shift Left Architecture with Confluent Kafka Snowflake and Microsoft Fabric for Data Streaming and Lakehouse

Apache Iceberg enables storing data once and connects any analytical engine to the data, including lakehouses such as Microsoft Fabric or Snowflake, and unified batch and streaming frameworks such as Apache Flink. Iceberg improves the overall data quality for data sharing, reduces storage cost and enables a much faster rollout of new analytical applications.

By leveraging Confluent for data streaming and integrating it with Microsoft Fabric and Snowflake, organizations can create a comprehensive data architecture that supports both real-time operations and long-term analytics. This synergy not only enhances data accessibility and consistency but also empowers businesses to make data-driven decisions with confidence.

In conclusion, the synergy between Microsoft Fabric and data streaming technologies like Apache Kafka and Apache Flink creates a powerful combination for modern data management. While Microsoft Fabric excels in providing robust analytics and storage capabilities, data streaming platforms offer real-time data processing and integration, ensuring that businesses can respond swiftly to operational demands.

By leveraging both technologies together, organizations can build a comprehensive data architecture that supports both immediate and long-term needs, enhancing their ability to make informed, data-driven decisions. This complementary relationship not only breaks down data silos, but also fosters a more agile and responsive data strategy. As businesses continue to navigate the complexities of data management, understanding and using the strengths of both Microsoft Fabric and data streaming with data streaming vendors like Confluent will be key to achieving a competitive edge.

The Shift Left Architecture, when paired with Apache Iceberg’s open table format, simplifies the integration of data streaming with one or more lakehouses. This combination enhances data quality for all data consumers and significantly reduces overall storage costs.

In part three of this blog series, I will dig deeper into the data streaming alternatives. When to choose open source frameworks such as Apache Kafka and Flink, a leading data streaming platform such as Confluent, or a native Azure service like Event Hubs. Primer: The trade-offs are huge. Do a proper evaluation BEFORE choosing your data streaming solution.

How do you see the combination of a lakehouse like Microsoft Fabric with data streaming? Do you already use both together? And what is your strategy for other data lakes and data warehouses you already have in your enterprise architecture, such as Databricks or Snowflake? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.) appeared first on Kai Waehner.

]]>
What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks https://www.kai-waehner.de/blog/2024/10/04/what-is-microsoft-fabric-for-azure-cloud-beyond-the-buzz-and-how-it-competes-with-snowflake-and-databricks/ Fri, 04 Oct 2024 05:16:31 +0000 https://www.kai-waehner.de/?p=6833 If you ask your favorite large language model, Microsoft Fabric appears to be the ultimate solution for any data challenge you can imagine. That’s also the impression many people get from Microsoft’s sales teams. But is it really the silver bullet it’s made out to be? This article takes a closer look exploring the glossy marketing and sales definition of the platform and then deconstructing it from a more practical perspective. Learn what Microsoft Fabric is truly built for, and how it fits into the wider data landscape, especially in comparison to other major players in the data analytics market like Databricks and Snowflake.

The post What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks appeared first on Kai Waehner.

]]>
If you ask your favorite large language model, Microsoft Fabric appears to be the ultimate solution for any data challenge you can imagine. That’s also the impression many people get from Microsoft’s sales teams. But is it really the silver bullet it’s made out to be? This article takes a closer look. The first part explores the glossy marketing and sales definition of the platform. The second part looks the layers and deconstructs it from a more practical perspective. By doing so, the third part uncovers what Microsoft Fabric is truly built for, and how it fits into the wider data landscape, especially in comparison to other major players in the data analytics market like Databricks and Snowflake.

Microsoft Fabric and OneLake Azure Lakehouse vs Databricks and Snowflake Cloud

This is part one of a blog series about Microsoft Fabric and its relation to other data platforms on the Azure cloud:

  1. What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Compares (or Competes) with Snowflake and Databricks
  2. How Microsoft Fabric Complements Data Streaming (Apache Kafka, Flink, et al.)
  3. When to Choose Apache Kafka vs. Azure Event Hubs vs. Confluent Cloud for a Microsoft Fabric Lakehouse

Subscribe to my newsletter to get an email about a new blog post every few weeks.

What is Microsoft Fabric?

If you listen to Microsoft’s sales and marketing, then Microsoft Fabric is a silver bullet for every use case. Let’s take a two-step approach. Look at the sales and marketing definition. Then deconstruct it a bit from a more realistic point of view…

GenAI Definition (= Sales and Marketing)

If you ask your favourite large language model or search engine, it you something like the following when I ask “What is Microsoft Fabric” (based on sales and marketing content):

Microsoft Fabric is an end-to-end analytics platform designed to integrate various data services and enable businesses to manage, analyze, and act on their data seamlessly. It was launched as part of Microsoft’s data ecosystem and builds upon key features from platforms like Power BI, Azure Synapse Analytics, and Azure Data Factory.

Microsoft Fabric Marketing Website
Source: Microsoft

Here are some key aspects of Microsoft Fabric:

  • Unified Platform: It combines data engineering, data science, data warehousing, and real-time analytics into a single platform. This helps businesses eliminate the need to use multiple services for data management and analysis.
  • Lakehouse Architecture: Fabric is designed around the lakehouse concept, which merges the best of data lakes and data warehouses. It allows for both structured and unstructured data to be stored and processed together.
  • Tightly Integrated with Microsoft 365 and Azure: Microsoft Fabric connects seamlessly with other Microsoft services like Microsoft 365, Power BI, and Azure Machine Learning, enabling better collaboration, reporting, and AI-driven insights.
  • Low-code/No-code Experience: The platform provides intuitive tools for data analysts, developers, and business users, allowing non-technical users to work with data through drag-and-drop interfaces, while also enabling more complex scenarios for advanced users.
  • AI and Machine Learning Integration: Microsoft Fabric incorporates AI tools, making it easier for businesses to build predictive models and automate data-driven decisions.
  • End-to-End Security and Governance: The platform supports robust security measures and compliance requirements, offering features like data encryption, role-based access control, and regulatory compliance support.
  • Real-time Data Processing: With support for real-time analytics, Fabric enables organizations to derive insights from live data streams, improving decision-making speed and accuracy.

Microsoft Fabric is designed to streamline how businesses use data, combining the power of analytics with cloud-scale capabilities.

Wow. Just wow. Microsoft Fabric seems to be everything you ever need for your data challenges.

Microsoft Developer has an excellent 45 minute presentation about OneLake and Microsoft Fabric with a few more technical details. This video is also the source of the screenshots below.

Well, let’s dig deeper. What is Microsoft Fabric really? Let’s deconstruct it a bit…

Microsoft Fabric is a Data Analytics Platform ( = NOT for Operational / Transactional Workloads)

Microsoft Fabric is part of Microsoft’s data analytics portfolio. That’s already the first alarm signal when you consider building operational workloads. This is not a criticism, but important to understand!

Azure Data Analytics SaaS Platform

Microsoft Fabric is NOT a platform for transactional workloads like payments, fraud detection, order management or ERP integration. You should not build an operational application like an Azure Serverless Function or a self-managed Spring Boot container for Fabric.

Furthermore, within the data analytics layer, the foundation of Microsoft Fabric is (only) an optimized storage layer. And this storage layer called OneLake is a SaaS offering, i.e., the storage is part of the Microsoft tenant. Contrary to many other data lakes and lakehouses like Databricks, you do not control or own the storage.

While the conversation is usually around cloud analytics, Microsoft Fabric is a unified analytics platform that integrates with Azure Cloud but is sold independently. This allows organizations to deploy it in various environments, edge, and hybrid setups. For instance, Microsoft sells Fabric for hybrid IoT projects where data needs to be processed both locally and in the cloud.

OneLake – Cloud-based Storage Layer on Top of Azure Data Lake Storage (ADLS)

Microsoft OneLake is a unified, cloud-based data lake that acts as the central storage layer within Microsoft Fabric:

Microsoft OneLake is built on top of Azure Data Lake Storage (ADLS), using its scalable and secure data storage capabilities for long-term data retention. OneLake inherits ADLS’s features like hierarchical namespaces and advanced security, while adding a unified data lake experience across multiple clouds and deep integration with Microsoft’s analytics and data tools through Microsoft Fabric.

OneLake for Azure Lakehouse, Databricks with Delta Lake, Snowflake with Apache Iceberg Lakehouse
Source: Microsoft

The message is obvious: Store all data in OneLake and connect your favourite compute engines, such as Microsoft Fabric, Azure Databricks and Snowflake. Open Table Formats like Delta Lake and Apache Iceberg allow simple integration without the need to copy data again.

Microsoft Fabric Connects to Many Existing Azure Services

On top of the storage layer OneLake, Microsoft Fabric connects to plenty of different existing Microsoft Azure services, including Power BI, Data Explorer, various Synapse services, and so on. This explains why Microsoft Fabric can magically provide every capability you are looking for a few months after the initial announcement.

Microsoft Fabric Architecture - OneLake as Storage Layer and Azure SaaS Analytics Cloud Services
Source: Microsoft

Here are a few integrations of Azure services into the unified storage of Microsoft Fabric:

  • Power BI: A critical component of Microsoft Fabric, enabling data visualization and business intelligence. It allows users to create interactive dashboards and reports directly from data stored in the lakehouse, providing real-time insights with minimal data movement.
  • Azure Data Explorer: Used for analyzing large volumes of streaming and historical data. Microsoft Fabric connects to Data Explorer, allowing users to perform fast, complex, real-time queries on structured and semi-structured data.
  • Azure Synapse Analytics: Fabric integrates Synapse’s data engineering capabilities, allowing users to prepare, transform, and orchestrate data pipelines. It provides a unified workspace to manage end-to-end data engineering workflows, reducing the need for complex data movement.
  • Synapse Data Warehousing: Fabric connects with Synapse’s data warehousing services, making it easy to run massively parallel processing (MPP) queries for large-scale analytics on structured data.
  • Synapse Spark Pools: Fabric integrates with Apache Spark in Synapse, supporting big data processing, AI, and machine learning workloads. Users can leverage Spark’s distributed computing power within Fabric for data transformation, advanced analytics, and machine learning.
  • Azure Machine Learning (AML): Enables data scientists to build, train, and deploy machine learning models on data stored within the Fabric lakehouse. Users can perform machine learning experiments, automate ML model training, and deploy models with an unified data platform.
  • Azure Data Factory: Used for data ingestion, ETL (extract, transform, load), and data orchestration. Fabric connects with Azure Data Factory, making it easy to create data pipelines that move and transform data from a wide variety of sources, including on-premises databases, cloud storage, and third-party systems.
  • Azure Purview: Provides a unified data catalog, allowing users to discover, classify, and govern data assets across the Fabric ecosystem. It also provides compliance and auditing capabilities.
  • Azure Event Hubs and Stream Analytics: Real-time data processing and analytics. Event Hubs enables streaming data ingestion from sources like IoT devices, applications, and logs, while Stream Analytics allows for real-time data querying and analysis.

Expect more Azure services to be integrated with Microsoft Fabric in the coming months to provide a “complete lakehouse experience”. Also expect more fancy marketing brands, such as the new “Real Time Intelligence Hub” that is built by connecting / re-using existing Microsoft Azure services.

So, what is the main idea behind building this lakehouse product and brand within Microsoft’s huge cloud portfolio?

Microsoft Fabric is a Lakehouse Competing with Snowflake and Databricks

A lakehouse is a data architecture that combines the features of data lakes and data warehouses, allowing for both structured and unstructured data to be stored and processed together. It provides the scalability and flexibility of a data lake with the data management, governance, and performance features of a data warehouse. This unified approach enables real-time analytics and machine learning on diverse types of data, reducing the need for separate infrastructures.

Most analytical data vendors transition to a full-blown lakehouse. While Databricks moved from the data lake foundation powered by Apache Spark into the lakehouse, Snowflake comes from the data warehouse approach but has incorporated a lot of lakehouse features over time (even though Snowflake calls it a more general “data cloud”).

Microsoft Fabric competes with platforms like Databricks and Snowflake in the realm of data analytics, data engineering, and data warehousing by providing an integrated, cloud-native solution for data management and analytics.

Microsoft Fabric positions itself as a more holistic and integrated platform, offering a unified solution for businesses that need to handle everything from data ingestion to real-time analytics and AI. Its Microsoft ecosystem integration is a key competitive advantage.

There are also trade-offs. For instance:

  • Microsoft Fabric is only available on Azure cloud
  • Not a mature product yet
  • Starting a much more competitive approach with strategic partners like Databricks

The support of open table formats like Delta Lake and Apache Iceberg is great. But this is coming in all lakehouses because of market pressure. Not because the data cloud vendors like Databricks, Snowflake and now Microsoft with Fabric have a new business model. All of these vendors still want to collect all the data, store it forever, and put (their own!) compute services on top.

Microsoft Fabric is Azure’s Future Lakehouse

Microsoft Fabric’s integration with many Azure services allows it to offer a broad range of capabilities – from data ingestion, storage, and transformation to real-time analytics, machine learning, and governance. This interconnected ecosystem explains how Fabric can quickly meet diverse enterprise needs by leveraging Microsoft’s existing suite of powerful tools, providing a comprehensive data platform with minimal friction and seamless workflows.

In the end, Microsoft Fabric is a new lakehouse built on top of the optimized cloud storage OneLake. It directly competes with other lakehouses and data clouds such as Databricks and Snowflake to become the leading unified solution for all the things analytics. The future will show where this competition goes. Snowflake and Databricks have a very strong product and customer base already. They will not give up to Microsoft Fabric voluntarily.

Microsoft Fabric includes integrations with Azure Event Hubs (based on the Kafka protocol) and is building a brand around real-time intelligence. In the next article of this blog series, I will explore how this new lakehouse on Azure cloud competes or overlaps with data streaming technologies such as Apache Kafka, Flink, et al. Primer: Data Streaming and Microsoft Fabric are mostly complementary and have very different sweet spots.

How do you see the future of Microsoft Fabric? Do you already use it? What is the plan in the future, also keeping in mind that you likely already have other lakehouses in your enterprise architecture? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks appeared first on Kai Waehner.

]]>
Apache Kafka and Tinybird (ClickHouse) for Streaming Analytics HTTP APIs https://www.kai-waehner.de/blog/2024/04/04/apache-kafka-and-tinybird-clickhouse-for-streaming-analytics-http-apis/ Thu, 04 Apr 2024 06:54:12 +0000 https://www.kai-waehner.de/?p=6103 Apache Kafka became the de facto standard for data streaming. However, the combination of an event-driven architecture with request-response APIs is crucial for most enterprise architectures. This blog post explores how Tinybird innovates with a REST/HTTP layer on top of the open source analytics database ClickHouse in the cloud. Integrating Kafka with Tinybird, the benefits of fully managed services like Confluent Cloud, and customer stories from Factorial and FanDuel show why Kafka and analytics databases complement each other for more innovation and faster time-to-market.

The post Apache Kafka and Tinybird (ClickHouse) for Streaming Analytics HTTP APIs appeared first on Kai Waehner.

]]>
Apache Kafka became the de facto standard for data streaming. However, the combination of an event-driven architecture with request-response APIs is crucial for most enterprise architectures. This blog post explores how Tinybird innovates with a REST/HTTP layer on top of the open source analytics database ClickHouse in the cloud. Integrating Kafka with Tinybird, the benefits of fully managed services like Confluent Cloud, and customer stories from Factorial and FanDuel show why Kafka and analytics databases complement each other for more innovation and faster time-to-market.

Streaming Analytics SQL API with Apache Kafka Confluent ClickHouse Tinybird

Tinybird = ClickHouse Analytics + HTTP APIs

Tinybird is powered by the open source database ClickHouse, but differentiates itself in the analytics market by providing a platform for providing real-time analytics as API.

ClickHouse and the Overwhelming Analytics Market

ClickHouse is an open-source column-oriented database management system designed for handling large-scale analytical workloads. The internal architecture is optimized for high performance and scalability, particularly for real-time analytics and OLAP (Online Analytical Processing) queries.

ClickHouse supports SQL queries and is optimized for fast data ingestion and querying of large volumes of data. The database is commonly used for data warehousing, time-series data analysis, and ad hoc analytics in various industries, including e-commerce, finance, and telecommunications.

ClickHouse is a great technology. However, in the analytics space, each database solution competes with various other open source frameworks, commercial products and fully managed cloud services. If you evaluate ClickHouse, you probably also evaluate:

  • Analytics services of the cloud providers (like Amazon Redshift, Azure Synapse Analytics, Google BigQuery, etc.)
  • Open source frameworks like Apache Druid (Imply), Pinot (StarTree), and others
  • Data analytics platforms like Snowflake, Databricks, et al.

The overlapping between these products is huge. Of course, sweet spots exist for each technology. But it is a mass market and often only the big players grow consumption significantly while small vendors end up in a niche.

Tinybird does NOT try to compete directly with all the other analytic databases. Instead, it added an intuitive layer on top of ClickHouse to differentiate its product offering.

Tinybird Adds APIs on Top of ClickHouse for Simple Integration and Publishing of Applications

Tinybird is a real-time data pipeline platform designed to help developers build and deploy scalable data products quickly and efficiently. It offers a range of tools and features for ingesting, processing, and serving real-time data streams, including data transformation, aggregation, and analytics.

Tinybird earns its reputation for simplicity and ease of use. It enables users to create custom data pipelines, applications, and APIs without the need for extensive coding or infrastructure setup. The analytics platform focuses on building operational and user-facing use cases with requirements for fresh data, fast queries, and high concurrency. Tinybird is NOT focused on traditional BI use cases served by data warehouses.

Tinybird differentiates from the competition by being more than just a managed database. The platform provides an underlying database powered by ClickHouse. It meets performance and scale needs while focusing on solving the pain around the database: data ingestion, data publication, and development workflow.

Data Streaming to APIs in Minutes
Source: Tinybird

Tinybird helps users to productize their data as HTTP APIs for integration with external systems (lakes, data mesh, user-facing applications). The product covers the entire end-to-end experience, from data ingestion to analytics to publishing APIs. An intuitive UI for interactive development and prototyping in conjunction with a full ‘data as config’ development life cycle makes the development of analytics applications straightforward. Data integration, analytics, and publication are defined as config files, stored in a Git repository. Tools help with automatic CI/CD for testing and deployment, plus a CLI and plugins for developer IDEs.

Relation of ClickHouse and Tinybird to Data Streaming with Apache Kafka

Data streaming with Apache Kafka refers to the process of ingesting, processing, and analyzing real-time data with a distributed streaming platform. Kafka enables the creation of real-time data pipelines that can handle high volumes of data from various sources, allowing for continuous data ingestion, processing, and delivery.

Kafka became the de facto standard protocol for data streaming, like Amazon S3 is the de facto standard for object storage. However…

Apache Kafka is NOT Complex Analytics

Apache Kafka is a database (even though many people disagree or don’t like this statement). But Kafka is not the right platform for complex analytics. Kafka is complementary to other databases like MySQL, MongoDB, Elasticsearch, ClickHouse, et al.

Learn more about this in my article “When NOT to use Apache Kafka” or the following lightboard video:

To be very clear: Data streaming also includes analytics capabilities. Stream processing enables continues processing of data in motion in real-time at scale. Use cases include simple stateless streaming ETL and advanced stateful computations, including embedding AI and machine learning into a stream processor. However, the workloads differ from analytical databases like ClickHouse. For a better understanding of when to use stream processing, check out my blog about the perfect match between Apache Kafka and stream processing with Kafka Streams or Apache Flink.

Apache Kafka is NOT Request-Response APIs

Request-response communication with REST / HTTP is simple, well understood, and supported by most technologies, products, and SaaS cloud services. Contrarily, data streaming with Apache Kafka is a fundamental change to process data continuously.

Data streaming with Apache Kafka and request-response APIs like HTTP/REST complement each other in most enterprise architectures. I wrote a lot about this in the past:

TL;DR: The question is not if you need Kafka or APIs, but how to combine them the best way in your architecture. Vendors like Confluent provide native HTTP APIs and connectors to produce and consume with the Kafka API using HTTP. But for more advanced use cases, solutions like Tinybird are perfect in combination with Kafka.

Apache Kafka + Tinybird = Streaming Analytics APIs

The Tinybird website explains the relation between data streaming with Kafka and Tinybird very well:

Turn your Kafka Streams into actionable API Endpoints your teams can consume. Instead of building a new consumer every time you want to make sense of your Data Streams, write SQL queries and expose them as API endpoints. Easy to maintain. Always up-to-date. Fast as can be.”

Apache Kafka Tinybird ClickHouse Integration with Kafka Connect
Source: Tinybird

The Tinybird application development process is very simple:

  1. Connect to Kafka Topics via out-of-the-box Kafka integration.
  2. Store the events in a managed, columnar data source with schemas and a data contract for good data quality.
  3. Query the events with SQL to filter, aggregate, join and enrichments for fast and scalable analytics.
  4. Publish queries as dynamic, low-latency REST/HTTP APIs that scale.

Fully Managed Cloud Services for Streaming Analytics: Confluent Cloud + Tinybird

Fully managed cloud services offer simplified operations, scalability, high availability, security, compliance, cost savings, and performance optimization. They streamline infrastructure management, ensure reliability, enhance security, reduce costs, and optimize performance for businesses. Project teams focus on business logic and much faster time-to-market of new products and innovation.

Working for Confluent, I enjoy our fast growing “Connect with Confluent” partner program to see how customers build innovative applications in a cloud-native fully managed environment, including end-to-end integration and data governance.

Confluent Cloud has fully integrated Tinybird. Developers build new applications with HTTP APIs on top of data streaming with Kafka faster than ever before. Check out this screencast to learn how you can publish an API from a Kafka stream in four minutes leveraging Confluent Cloud and Tinybird.

Tinybird Confluent Cloud Integration Connector
Source: Tinybird

Data streaming truly decoupled the business domains. In the above diagram, you see a very common scenario: various consumers of the same business information (i.e., a single Kafka Topic). Some in real-time (like Tinybird), some in near real-time or batch (like Snowflake or Databricks).

Each developer can use different programming languages, databases, or SaaS analytics platforms. Apache Kafka unifies operational and analytical workloads. As a result, a Tinybird application aggregates transactional and analytical workloads to build new APIs.

Customer Stories using Kafka in Confluent Cloud and Tinybird for Real-Time Analytics

This section explores a few real world case studies that combine Apache Kafka and ClickHouse under the hood of Confluent Cloud’s and Tinybird’s SaaS cloud solutions.

Factorial Human Resources (HR) Software: Data Freshness for Users

Factorial is provides human resources (HR) software solutions for over 8000 small and medium-sized businesses. Their platform offers features such as employee onboarding, time tracking, leave management, performance reviews, and HR analytics. Factorial streamlines HR processes, boosts employee productivity, and allows businesses to effectively manage their workforce. People leaders can focus on people, not paperwork.

Factorial HR Software for Real-Time Analytics
Source: Factorial

With data streaming and real-time analytics leveraging fully managed cloud services from Confluent and Tinybird, Factorial has improved its data freshness and reduced query latency, leading to significantly faster user feature launches. Read the detailed success story on the Confluent blog.

FanDuel: Customer Personalization and Fraud Prevention in Gambling and Sports Betting

FanDuel is operating in the online gambling and sports betting industry in the United States. I had the pleasure of hosting the company as guest speakers at executive dinner events in London.

FanDuel is a popular sports betting and daily fantasy sports company. It provides online platforms and mobile apps for users to place bets on various sports events and take part in fantasy sports contests. FanDuel has gained a reputation for its user-friendly interface, a wide range of betting options, and innovative features in the online gambling market.

The entire gambling and betting industry leverages data streaming for real-time data procession, transactional payment processing, fraud detection, customer loyalty platforms, and many other use cases. Read more about this topic in the state of data streaming for the betting industry and Kafka case studies for betting.

FanDuel leverages the combination of Confluent Cloud and Tinybird for real-time analytics use cases with user-facing applications and mobile apps.

Fanduel’s case study quotes: “Fanduel uses Confluent and Tinybird to power real-time personalization across all their sports betting solutions to improve time-to-first-bet and reduce the risk of fraud.

If you need APIs on Top of Data Streaming and Analytics, choose Kafka and Tinybird

… and if you want a fully managed end-to-end data pipeline with out-of-the-box connectivity, critical SLAs and cloud-native elasticity and pricing, go with Confluent Cloud and Tinybird. Of course, the data streaming landscape 2024 is broad. Absolutely fine to evaluate other vendors, too. 🙂

The conversations I had with FanDuel at our customer dinner showed that the real world challenge is not choosing the right technologies, but making them easy to use for fast time-to-market and elastic scale with reliable fully managed cloud services.

The motivation for this blog post were my meetings with these joint customers. I will do similar customer dinners in the next months with other Confluent partners like Rockset and StarTree. I can’t wait to hear from joint customers about their benefits of leveraging a specific analytics engine together with a fully managed data streaming platform for product innovation and better customer experiences.

Do you already use Tinybird together with Kafka? Or how do you build scalable real-time APIs on top of your favorite analytics database? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka and Tinybird (ClickHouse) for Streaming Analytics HTTP APIs appeared first on Kai Waehner.

]]>
When NOT to Use Apache Kafka? (Lightboard Video) https://www.kai-waehner.de/blog/2024/03/26/when-not-to-use-apache-kafka-lightboard-video/ Tue, 26 Mar 2024 06:45:11 +0000 https://www.kai-waehner.de/?p=6262 Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post contains a lightboard video that gives you a twenty-minute explanation of the DOs and DONTs.

The post When NOT to Use Apache Kafka? (Lightboard Video) appeared first on Kai Waehner.

]]>
Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post contains a lightboard video that gives you a twenty-minute explanation of the DOs and DONTs.

When NOT to Use Apache Kafka?

DisclaimeR: This blog post shares a lightboard video to watch an explanation about when NOT to use Apache Kafka. For a much more detailed and technical blog post with various use cases and case studies, check out this blog post from 2022 (which is still valid today – whenever you read it).

What is Apache Kafka, and what is it NOT?

Kafka is often misunderstood. For instance, I still hear way too often that Kafka is a message queue. Part of the reason is that some vendors only pitch it for a specific problem (such as data ingestion into a data lake or data warehouse) to sell their products. So, in short:

Kafka is…

  • a scalable real-time messaging platform to process millions of messages per second.
  • a data streaming platform for massive volumes of big data analytics and small volumes of transactional data processing.
  • a distributed storage provides true decoupling for backpressure handling, support of various communication protocols, and replayability of events with guaranteed ordering.
  • a data integration framework (Kafka Connect) for streaming ETL.
  • a data processing framework (Kafka Streams) for continuous stateless or stateful stream processing.

This combination of characteristics in a single platform makes Kafka unique (and successful).

Kafka is NOT…

  • a proxy for millions of clients (like mobile apps) – but Kafka-native proxies (like REST or MQTT) exist for some use cases.
  • an API Management platform – but these tools are usually complementary and used for the creation, life cycle management, or the monetization of Kafka APIs.
  • a database for complex queries and batch analytics workloads – but good enough for transactional queries and relatively simple aggregations (especially with ksqlDB).
  • an IoT platform with features such as device management  – but direct Kafka-native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the appropriate approach for (some) use cases.
  • a technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different software!

For these reasons, Kafka is complementary, not competitive, to these other technologies. Choose the right tool for the job and combine them!

Lightboard Video: When NOT to use Apache Kafka

The following video explores the key concepts of Apache Kafka. Afterwards, the DOs and DONTs of Kafka show how to complement data streaming with other technologies for analytics, APIs, IoT, and other scenarios.

Data Streaming Vendors and Cloud Services

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations.

Plenty of vendors offer Kafka platforms and cloud services. Many complementary open source stream processing frameworks like Apache Flink and related cloud offerings emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to get market share leveraging the Kafka protocol. Check out the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook to potential new entrants in 2025.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

Apache Kafka is a Data Streaming Platform: Combine it with other Platforms when needed!

Over 150,000 organizations use Apache Kafka in the meantime. The Kafka protocol is the de facto standard for many open source frameworks, commercial products and serverless cloud SaaS offerings.

However, Kafka is not an allrounder for every use case. Many projects combine Kafka with other technologies, such as databases, data lakes, data warehouses, IoT platforms, and so on. Additionally, Apache Flink is becoming the de facto standard for stream processing (but Kafka Streams is not going away and is the better choice for specific use cases).

Where do you (not) use Apache Kafka? What other technologies do you combine Kafka with? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When NOT to Use Apache Kafka? (Lightboard Video) appeared first on Kai Waehner.

]]>
Top 5 Data Streaming Trends for 2023 https://www.kai-waehner.de/blog/2022/12/15/top-5-data-streaming-trends-for-2023-with-apache-kafka/ Thu, 15 Dec 2022 06:14:03 +0000 https://www.kai-waehner.de/?p=4847 Data Streaming is one of the most relevant buzzwords in tech to build scalable real-time applications in the cloud and innovative business models. Do you wonder about my predicted TOP 5 data streaming trends in 2023 to set data in motion? Check out the following presentation and learn what role Apache Kafka plays. Learn about decentralized Data Mesh, cloud-native lakehouse, data sharing, improved user experience, and advanced data governance.

The post Top 5 Data Streaming Trends for 2023 appeared first on Kai Waehner.

]]>
Data Streaming is one of the most relevant buzzwords in tech to build scalable real-time applications in the cloud and innovative business models. Do you wonder about my predicted TOP 5 data streaming trends in 2023 to set data in motion? Check out the following presentation and learn what role Apache Kafka plays. Learn about decentralized Data Mesh, cloud-native lakehouse, data sharing, improved user experience, and advanced data governance.

Some followers might notice that this became a series with past posts about the top 5 data streaming trends for 2021 and the top 5 for 2022. Data streaming with Apache Kafka is a journey and evolution to set data in motion. Trends change over time, but the core value of a scalable real-time infrastructure as the central data hub stays.

Top Use Cases and Architectures for Data Streaming with Apache Kafka in 2023

The research and consulting company Gartner defines the top strategic technology trends every year. This time, the trends are more focused on particular niche concepts. On a higher level, it is all about optimizing, scaling, and pioneering. Here is what Gartner expects for 2023:

Gartner Strategic Technology Trends for 2023
Source: Gartner

It is funny (but not surprising): Gartner’s predictions overlap and complement the five trends I focus on for data streaming with Apache Kafka looking forward to 2023. I explore how data streaming enables better time to market with decentralized optimized architectures, cloud-native infrastructure for elastic scale, and pioneering innovative use cases to build valuable data products.

Hence, here you go with the top 5 trends in data streaming for 2023.

The top 5 data streaming trends for 2023

I see the following topics coming up more regularly in conversations with customers, prospects, and the broader Kafka community across the globe:

  1. Cloud-native lakehouses
  2. Decentralized data mesh
  3. Data sharing in real-time
  4. Improved developer and user experience
  5. Advanced data governance and policy enforcement

The following sections describe each trend in more detail. The end of the article contains the complete slide deck. The trends are relevant for various scenarios. No matter if you use open source Apache Kafka, a commercial platform, or a fully-managed cloud service like Confluent Cloud.

Kafka as data fabric for cloud-native lakehouses

Many data platform vendors pitch the lakehouse vision today. That’s the same story as the data lake in the Hadoop era with few new nuances. Put all your data into a single data store to save the world and solve every problem and use case:

One data lake or lakehouse for all data

In the last ten years, most enterprises realized this strategy did not work. The data lake is great for reporting and batch analytics, but not the right choice for every problem. Besides technical challenges, new challenges emerged: data governance, compliance issues, data privacy, and so on.

Applying a best-of-breed enterprise architecture for real-time and batch data analytics using the right tool for each job is a much more successful, flexible, and future-ready approach:

Data Streaming with Apache Kafka as Data Fabric for Databases, Data Lake, and Lakehouse Architectures

Data platforms like Databricks, Snowflake, Elastic, MongoDB, BigQuery, etc., have their sweet spots and trade-offs.

Data streaming increasingly becomes the real-time data fabric between all the different data platforms and other business applications leveraging the real-time Kappa architecture instead of the much more batch-focused Lamba architecture.

Decentralized data mesh with valuable data products

Focusing on business value by building data products in independent domains with various technologies is key to success in today’s agile world with ever-changing requirements and challenges. Data mesh came to the rescue and emerged as a next-generation design pattern, succeeding service-oriented architectures and microservices.

Two main proposals exist by vendors for building a data mesh: Data integration with data streaming enables fully decentralized business products. On the other side, data virtualization provides centralized queries:

Data Mesh with Data Streaming using Apache Kafka vs. Data Virtualization

Centralized queries are simple but do not provide a clean architecture and decoupled domains and applications. It might work well to solve a single problem in a project. However, I highly recommend building a decentralized data mesh with data streaming to decouple the applications, especially for strategic enterprise architectures.

Collaboration within and across organizations in real-time

Collaborating within and outside the organization with data sharing using Open APIs, streaming data exchange, and cluster linking enable many innovative business models:

Stream Data Exchange and Sharing with Data Mesh in Motion

The difference between data streaming to a database, data warehouse, or data lake is crucial: All these platforms enable data sharing at rest. The data is stored on a disk before it is replicated and shared within the organization or with partners. This is not real-time. You cannot connect a real-time consumer to data at rest.

However, real-time data beats slow data. Hence, sharing data in real-time with data streaming platforms like Apache Kafka or Confluent Cloud enables accurate data as soon as a change happens. A consumer can be real-time, near real-time, or batch. A streaming data exchange puts data in motion within the organization or for B2B data sharing and Open API business models.

AsyncAPI spec for Apache Kafka API schemas

AsyncAPI allows developers to define the interfaces of asynchronous APIs. It is protocol agnostic. Features include:

  • Specification of OpenAPI contracts (= schemas in the data streaming world)
  • Documentation of APIs
  • Code generation for many programming languages
  • Data governance
  • And much more…

Confluent Cloud recently added a feature for generating an AsyncAPI specification for Apache Kafka clusters.

AsyncAPI and Apache Kafka with Confluent Cloud

We don’t know yet where the market is going. Will AsynchAPI become the standard for OpenAPI in data streaming? Maybe. I see increasing demand for this specification by customers. Let’s review the status of AsynchAPI in a few quarters or years. But it has the potential.

Improved developer experience with low-code / no-code tools for Apache Kafka

Many analysts and vendors pitch low code/no code tools. Visual coding is nothing new. Very sophisticated, powerful, and easy-to-use solutions exist as IDE or cloud applications. The significant benefit is time-to-market for developing applications and easier maintenance. At least in theory.

These tools support various personas like developers, citizen integrators, and data scientists. At least in theory.

The reality is that:

  • Code is king
  • Development is about evolution
  • Open platforms win

Low code/no code is great for some scenarios and personas. But it is just one option of many. Let’s look at a few alternatives for building Kafka-native applications:

Kafka API vs Streams vs KSQL vs Visual Coding with Stream Designer

These Kafka-native technologies have their trade-offs. For instance, the Confluent Stream Designer is perfect for building streaming ETL pipelines between various data sources and sinks. Just click the pipeline and transformations together. Then deploy the data pipeline into a scalable, reliable, and fully-managed streaming application. The difference to separate tools like Apache Nifi is that the generated code run in the same streaming platform, i.e., one infrastructure end-to-end. This makes ensuring SLAs and latency requirements much more manageable and the whole data pipeline more cost-efficient.

However, the simpler a tool is, the less flexible it is. It is that easy. No matter which product or vendor you look at. This is not just true for Kafka-native tools.

And you are flexible with your tool choice per project or business problem. Add your favorite non-Kafka stream processing engine to the stack, for instance, Apache Flink. Or use a separate iPaaS middleware like Dell Boomi or SnapLogic.

Domain-driven design with dumb pipes and smart endpoints

The real benefit of data streaming is the freedom of choice for your favorite Kafka-native technology, open-source stream processing framework, or cloud-native iPaaS middleware.

Choose the proper library, tool, or SaaS for your project. Data streaming enables a decoupled domain-driven design with dumb pipes and smart endpoints:

Decentralized Data Mesh powered by data streaming and Apache Kafka

Data streaming with Apache Kafka is perfect for domain-driven design (DDD). On the contrary, often used point-to-point microservice architecture HTTP/REST web service or push-based message brokers like RabbitMQ create much stronger dependencies between applications.

Data governance across the data streaming pipeline

An enterprise architecture powered by data streaming enables easy access to data in real-time. Many enterprises leverage Apache Kafka as the central nervous system between all data sources and sinks.

The consequence of being able to access all data easily across business domains is two conflicting pressures on organizations: Unlock the data to enable innovation versus Lock up the data to keep it safe.

Achieving data governance across the end-to-end data streams with data lineage, event tracing, policy enforcement, and time travel to analyze historical events is critical for strategic data streaming in the enterprise architecture. Data governance on top of the streaming platform is required for end-to-end visibility, compliance, and security:

Data governance for streaming data with lineage, catalog, quality, policy management

Policy enforcement with schemas and API contracts

The foundation for data governance is the management of API contracts (so-called schemas in data streaming platforms like Apache Kafka). Solutions like Confluent enforce schemas along the data pipeline, including data producer, server, and consumer:

Data Governance and Policy Enforcement in Apache Kafka with Schema and API Contracts

Additional data governance tools like data lineage, catalog, or police enforcement are built on this foundation. The recommendation for any serious data streaming project is to use schema from the beginning. It is unnecessary for the first pipeline. But the following producers and consumers need a trusted environment with enforced policies to establish a decentralized data mesh architecture with independent but connected data products.

Slides and Video for Data Streaming Use Cases in 2023

Here is the slide deck from my presentation:

Fullscreen Mode

And here is the free on-demand video recording:

Video Recording - Top Trends for Data Streaming in 2023

Data streaming goes up in the maturity curve in 2023

It is still an early stage for data streaming in most enterprises. But the discussion goes beyond questions like “when to use Kafka?” or “which cloud service to use?”… In 2023, most enterprises look at more sophisticated challenges around their numerous data streaming projects.

The new trends are often related to each other. A data mesh enables the building of independent data products that focus on business value. Data sharing is a fundamental requirement for a data mesh. New personas access the data stream. Often, citizen developers or data scientists need easy tools to pioneer new projects. The enterprise architecture requires and enforces data governance across the pipeline for security, compliance, and privacy reasons.

Scalability and elasticity need to be there out of the box. Fully-managed data streaming is a brilliant opportunity for getting started in 2023 and moving up in the maturity curve from single projects to a central nervous system of real-time data.

What are your most relevant and exciting trends for data streaming and Apache Kafka in 2023 to set data in motion? What are your strategy and timeline? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Top 5 Data Streaming Trends for 2023 appeared first on Kai Waehner.

]]>
The Heart of the Data Mesh Beats Real-Time with Apache Kafka https://www.kai-waehner.de/blog/2022/07/28/the-heart-of-the-data-mesh-beats-real-time-with-apache-kafka/ Thu, 28 Jul 2022 06:08:38 +0000 https://www.kai-waehner.de/?p=4706 If there were a buzzword of the hour, it would undoubtedly be "data mesh"! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. Learn how the de facto standard for data streaming, Apache Kafka, plays a crucial role in building a data mesh.

The post The Heart of the Data Mesh Beats Real-Time with Apache Kafka appeared first on Kai Waehner.

]]>
If there were a buzzword of the hour, it would undoubtedly be “data mesh“! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. Learn how the de facto standard for data streaming, Apache Kafka, plays a crucial role in building a data mesh.

The Heart of the Data Mesh Beats Real Time with Apache Kafka

There is no single technology or product for a data mesh!

This post explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and – complemented by many other data platforms like a data warehouse, data lake, and lakehouse – solve real business problems.

There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The key outcome of a data mesh architecture is the ability to build data products; with the right tool for the job. A good data mesh combines data streaming technology like Apache Kafka or Confluent Cloud with cloud-native data warehouse and data lake architectures from Snowflake, Databricks, Google BigQuery, et al.

What is a data mesh?

I won’t write yet another article describing the concepts of a data mesh. Zhamak Dehghani coined the term in 2019. The following data mesh architecture from 30,000-foot view explains the basic idea well:

Data Mesh for Domain Driven Design Microservices Data Mart Data Streaming

I summarize data mesh as the following three bullet points:

  • An architecture paradigm with several historical influences (domain-driven design, microservices, data marts, data streaming)
  • Not specific to a single technology or product; no single vendor can implement a data mesh alone
  • Handling data as a product is a fundamental change, enabling a more flexible architecture and independent solving of separate business problems
  • Decentralized services, not just analytics but also transactional workloads

Why handle data as a product?

Talking about innovative technology is insufficient to introduce a new architectural paradigm. Consequently, measuring the business value of the enterprise architecture is critical, too.

McKinsey finds that “when companies instead manage data like a consumer product—be it digital or physical—they can realize near-term value from their data investments and pave the way for quickly getting more value tomorrow. Creating reusable data products and patterns for piecing together data technologies enables companies to derive value from data today and tomorrow”:

McKinsey - Why Handle Data as a Product

For McKinsey, the benefits of this approach can be significant:

  • New business use cases can be delivered as much as 90 percent faster.
  • The total cost of ownership, including technology, development, and maintenance, can decline by 30 percent.
  • The risk and data-governance burden can be reduced.

What is data streaming with Apache Kafka and its relation to data mesh?

A data mesh enables flexibility through decentralization and best-of-breed data products. The heart of data sharing requires reliable real-time data at any scale between data producers and data consumers. Additionally, true decoupling between the decentralized data products is key to the success of the data mesh paradigm. Each domain must have access to shared data but also the ability to choose the right tool (i.e., technology, API, product, or SaaS) to solve its business problems.

That’s where data streaming fits into the data mesh story:

Flexibility through Decentralization and Best-of-Breed with Data Streaming

The de facto standard for data streaming is Apache Kafka. A cloud-native data streaming infrastructure that can link clusters with each other out-of-the-box enables building a modern data mesh. No Data Mesh will use just one technology or vendor. Learn from inspiring posts from your favorite data products vendors like AWS, Snowflake, Databricks, Confluent, and many more to successfully define and build your custom Data Mesh. Data Mesh is a journey, not a big bang. A data warehouse or data lake (or in modern days, a lakehouse) cannot be the only infrastructure for data mesh and data products.

I covered how to leverage the capabilities of Apache Kafka and its ecosystem like Kafka Connect, ksqlDB, Cluster Linking, etc. to build the heart of a data mesh in a separate blog post: Streaming Data Exchange with Kafka and a Data Mesh in Motion.

Example: Real-time data fabric in hybrid cloud

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

Hybrid Cloud Streaming Data Mesh powered by Apache Kafka and Cluster Linking

This example shows all the characteristics discussed in the above sections for a Data Mesh:

  • Decentralized real-time infrastructure across domains and infrastructures
  • True decoupling between domains within and between the clouds
  • Several communication paradigms, including data streaming, RPC, and batch
  • Data integration with legacy and cloud-native technologies
  • Continuous stream processing where it adds value, and batch processing in some analytics sinks

Presentation: Building a decentralized data mesh with data streaming at its heart

The following slide deck walks you through the motivation, principles, and architectures of building a real-time data mesh powered by Apache Kafka using the Kappa architecture, hybrid cloud, and stream data sharing:

The data mesh provides flexibility and freedom of technology choice for each data product

The heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. As the de facto standard for data streaming, Apache Kafka plays a crucial role in a cloud-native data mesh architecture. Nevertheless, data mesh is not bound to a specific technology. The beauty of the decentralized architecture is the freedom of technology choice for each business unit when building its data products.

Data sharing between domains within and across organizations is another aspect where data streaming helps in a data mesh. Real-time data beats slow data. That is not just true for most business problems across industries but also for replicating data between data centers, clouds, regions, or organizations. A streaming data exchange enables data sharing in real-time to build a data mash in motion.

Did you already start building your Data Mesh? What does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Is the heart of your data mesh real-time in motion or some lakehouse at rest? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Heart of the Data Mesh Beats Real-Time with Apache Kafka appeared first on Kai Waehner.

]]>
Best Practices for Building a Cloud-Native Data Warehouse or Data Lake https://www.kai-waehner.de/blog/2022/07/21/best-practices-for-building-a-cloud-native-data-warehouse-data-lake-lakehouse/ Thu, 21 Jul 2022 09:40:51 +0000 https://www.kai-waehner.de/?p=4666 The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let's explore this dilemma in a blog series. This is part 5: Best Practices for Building a Cloud-Native Data Warehouse or Data Lake.

The post Best Practices for Building a Cloud-Native Data Warehouse or Data Lake appeared first on Kai Waehner.

]]>
The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 5: Best Practices for Building a Cloud-Native Data Warehouse or Data Lake.

Best Practices for Data Analytics with AWS Azure Googel BigQuery Spark Kafka Confluent Databricks

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. THIS POST: Best Practices for Building a Cloud-Native Data Warehouse or Data Lake

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Best practices for building a cloud-native data warehouse or data lake

Let’s explore the following lessons learned from building cloud-native data analytics infrastructure with data warehouses, data lakes, data streaming, and lakehouses:

  • Lesson 1: Process and store data in the right place.
  • Lesson 2: Don’t design for data at rest to reverse it.
  • Lesson 3: There is no need for a lambda architecture to separate batch and real-time workloads
  • Lesson 4: Understand the trade-offs between data sharing at rest and a streaming data exchange.
  • Lesson 5: Data mesh is not a single product or technology.

Let’s get started…

Lesson 1: Process and store data in the right place.

Ask yourself: What is the use case for your data?

Here are a few use case examples for data and exemplary tools to implement the business case:

  • Recurring reporting for management =>  Data warehouse and its out-of-the-box reporting tools
  • Interactive analysis of structured and unstructured data => Business intelligence tools like Tableau, Power BI, Qlik, or TIBCO Spotfire on top of the data warehouse or another data store
  • Transactional business workloads => Custom Java application running in a Kubernetes environment or serverless cloud infrastructure
  • Advanced analytics to find insights in historical data => Raw data sets stored in a data lake for applying powerful algorithms with AI / Machine Learning technologies such as TensorFlow
  • Real-time actions on new events => Streaming applications to process and correlate data continuously while it is relevant

Real-time or batch compute on the right platform as needed

Batch workloads run best in an infrastructure that was built for this. For instance, Hadoop or Apache Spark. Real-time workloads run best in an infrastructure that was built for this. For example, Apache Kafka.

However, sometimes, both platforms could be used. Understand the underlying infrastructure to leverage it the best way. Apache Kafka can replace a database! Nevertheless, it should only be done in the few scenarios where it makes sense (i.e., simplifies the architecture or adds business value).

For example, replayability as a sequence of events (with guaranteed ordering with time stamps) is built into the immutable Kafka log. Replaying and re-processing historical data from Kafka are straightforward and a perfect use case for many scenarios, including:

  • New consumer application
  • Error-handling
  • Compliance / regulatory processing
  • Query and analyze existing events
  • Schema changes in the analytics platform
  • Model training

On the other side, if you need to do complex analytics like map-reduce or shuffling, SQL queries with tens of JOINs, a robust time-series analysis of sensor events, a search index based on ingested log information, and so on. Then you better choose Spark, Rockset, Apache Druid, or Elasticsearch for that use case 

Tiered storage with cloud-native object storage for cost-efficiency

A single storage infrastructure cannot solve all these problems, even if the “lakehouse vendors” tell you so. Hence, ingesting all data into a single system will fail to succeed with the above use cases. Choose a best-of-breed approach instead with the right tools for the job.

Modern, cloud-native systems decouple storage and compute. This is true for data streaming platforms like Apache Kafka and analytics platforms like Apache Spark, Snowflake, or Google BigQuery. SaaS solutions implement innovative tiered storage solutions (under the hood so you don’t see them) for cost-efficient separation between storage and compute.

Even modern data streaming services leverage tiered storage:

Confluent Tiered Storage for Kafka for Digital Forensics of Historical Data

Lesson 2: Don’t design for data at rest to reverse it.

Ask yourself: Is there any added business value if you process data now instead of later (whatever later means)?

If yes, then don’t store the data in a database or data lake, or data warehouse as the first step. The data is stored at rest then and not available for real-time processing anymore. A data streaming platform like Apache Kafka is the right choice if real-time data beats slow data in your use case!

I find it amazing how many people put all their raw data into data storage just to find out that they could leverage the data in real-time later. Reverse ETL tools are spun up then to access the data in the lakehouse again via change data capture (CDC) or similar approaches. Or if you use Spark Structured Streaming (= “real-time”), but the first thing to get the data for “real-time stream processing” is reading it from an S3 object storage (= “at rest and too late”) is unfitting.

Reverse ETL is NOT the right approach for real-time use cases…

If you store data in a data warehouse or data lake, you cannot process it in real-time anymore as it is already stored at rest. These data stores are built for indexing, search, batch processing, reporting, model training, and other use cases that make sense in the storage system. But you cannot consume the data in real-time in motion from storage at rest:

Data at Rest and Reverse ETL

… data streaming is built for continuously processing data in real-time

That’s where event streaming comes into play. Platforms like Apache Kafka enable processing data in motion in real-time for transactional and analytical workloads.

Reverse ETL is not needed in modern event-driven architecture! It is “built-in” into the architecture out-of-the-box. Each consumer directly consumes the data in real-time, if appropriate and technically feasible. And data warehouses or data lakes still consume it at their own pace in near-real-time or batch:

Data in Motion and Data at Rest

Again, this does not mean you should not put data at rest in your data warehouse or data lake. But only do that if you need to analyze the data later. The data storage at rest is NOT appropriate for real-time workloads.

Learn more about this topic in my blog post “When to Use Reverse ETL and when it is an Anti-Pattern“.

Lesson 3: There is no need for a lambda architecture to separate batch and real-time workloads

Ask yourself: What is the easiest way to consume and process incoming data with my favorite data analytics technology?

Real-time data beats slow data, but NOT always!

Think about your industry, business units, problems you solve, and innovative new applications you build. Real-time data beats slow data. This statement is almost always true. Either to increase revenue, reduce cost, reduce risk, or improve the customer experience.

Data at rest means to store data in a database, data warehouse, or data lake. This way, data is processed too late in many use cases – even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at rest is not a bad thing. Several use cases, such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) work very well with this approach. But real-time beats batch in almost all other use cases.

The Kappa architecture simplifies the infrastructure for batch AND real-time workloads

The Kappa architecture is an event-based software architecture that can handle all data at any scale in real-time for transactional AND analytical workloads.

The central premise behind the Kappa architecture is that you can perform both real-time and batch processing with a single technology stack. That’s a very different approach than the well-known Lambda architecture. The latter separates batch and real-time workloads in separate infrastructures and technology stacks.

The heart of a Kappa infrastructure is streaming architecture. First, the event streaming platform log stores incoming data. From there, a stream processing engine processes the data continuously in real-time or ingests the data into any other analytics database or business application via any communication paradigm and speed, including real-time, near real-time, batch, and request-response.

Kappa Architecture with one Pipeline for Real Time and Batch

Learn more about the benefits and trade-offs of the Kappa architecture in my article “Kappa Architecture is Mainstream Replacing Lambda“.

Lesson 4: Understand the trade-offs between data sharing at rest and a streaming data exchange.

Ask yourself: How do I need to share data with other internal business units or external organizations?

Use cases for hybrid and multi-cloud replication with data streaming, data lakes, data warehouses, and lakehouses

Many good reasons exist to replicate data across data centers, regions, or cloud providers:

  • Disaster recovery and high availability: Create a disaster recovery cluster and failover during an outage.
  • Global and multi-cloud replication: Move and aggregate data across regions and clouds.
  • Data sharing: Share data with other teams, lines of business, or organizations.
  • Data migration: Migrate data and workloads from one cluster to another (like from a legacy on-premise data warehouse to a cloud-native data lakehouse).

Real-time data replication beats slow data sharing

The story around internal or external data sharing is not different from other applications. Real-time replication beats slow data exchanges. Hence, storing data at rest and then replicating it to another data center, region, or cloud provider is an anti-pattern if real-time information adds business value.

The following example shows how independent stakeholders (= domains in different enterprises) use a cross-company streaming data exchange:

Stream Data Exchange and Sharing with Data Mesh in Motion

Innovation does not stop at the own border. Streaming replication is relevant for all use cases where real-time is better than slow data (valid for most scenarios). A few examples:

  • End-to-end supply chain optimization from suppliers to the manufacturer to the intermediary to the aftersales
  • Track and trace across countries
  • Integration of 3rd party add-on services to the own digital product
  • Open APIs for embedding and combining external services to build a new product

Read the details about a “Streaming Data Exchange with Kafka and a Data Mesh in Motion vs. Data Sharing at Rest in the Data Warehouse or Data Lake” for more details.

Also, understand why APIs (= REST / HTTP) and data streaming (= Apache Kafka) are complementary, not competitive!

Lesson 5: Data mesh is not a single product or technology.

Ask yourself: How do I create a flexible and agile enterprise architecture to innovate more efficiently and solve my business problems faster?

Data Mesh is a Logical View, not Physical!

Data mesh shifts to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create a self-serve data infrastructure, treating data as a product, and implementing open standardization to enable an ecosystem of interoperable distributed data products.

Here is an example of a Data Mesh:

Data Mesh with Apache Kafka

TL;DR: Data Mesh combines existing paradigms, including Domain-driven Design, Data Marts, Microservices, and Event Streaming.

A data warehouse or data lake is NOT and CAN NOT BECOME the entire data mesh!

The heart of a Data Mesh infrastructure should be real-time, decoupled, reliable, and scalableKafka is a modern cloud-native enterprise integration platform (also often called iPaaS today). Therefore, Kafka provides all the capabilities for the foundation of a Data Mesh.

However, not all components can or should be Kafka-based. The beauty of microservice architectures is that every application can choose the right technologies. An application might or might not include databases, analytics tools, or other complementary components. The input and output data ports of the data product should be independent of the chosen solutions:

Data Stores within the Data Product with Snowflake MongoDB Oracle et al

Kafka can be a strategic component of a cloud-native data mesh, not more and not less. But even if you do not use data streaming and build a data mesh only with data at rest, there is still no silver bullet. Don’t try to build a data mesh with a single product, technology, or vendor. No matter if that tool focuses on real-time data streaming, batch processing and analytics, or API-based interfaces. Tools like Starburst – a SQL-based MPP query engine powered by open source Trino (formerly Presto) – enable analytics on top of different data stores.

Best practices for a cloud-native data warehouse go beyond a SaaS product

Building a cloud-native data warehouse or data lake is an enormous project. It requires data ingestion, data integration, connectivity to analytics platforms, data privacy and security patterns, and much more. All of that is needed before the actual tasks like reporting or analytics can even begin.

The complete enterprise architecture beyond the scope of the data warehouse or data lake is even more complex. Best practices must be applied to build a resilient, scalable, elastic, and cost-efficient data analytics infrastructure. SLAs, latencies, and uptime have very different requirements across business domains. Best of breed approaches choose the right tool for the job. True decoupling between business units and applications allows focusing on solving specific business problems.

Separation of storage and compute, unified real-time pipelines instead of separating batch and real-time, avoiding anti-patterns like Reverse ETL, and appropriate data sharing concepts enable a successful journey to cloud-native data analytics.

For more details, browse other posts of this blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. THIS POST: Lessons Learned from Building a Cloud-Native Data Warehouse

How did you modernize your data warehouse or data lake? Do you agree with the lessons I learned? What other experiences did you have? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Best Practices for Building a Cloud-Native Data Warehouse or Data Lake appeared first on Kai Waehner.

]]>
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization https://www.kai-waehner.de/blog/2022/07/18/case-studies-cloud-native-data-streaming-for-data-warehouse-modernization/ Mon, 18 Jul 2022 08:48:48 +0000 https://www.kai-waehner.de/?p=4663 The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let's explore this dilemma in a blog series. This is part 4: Case Studies for cloud-native data streaming and data warehouses.

The post Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization appeared first on Kai Waehner.

]]>
The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 4: Case Studies for cloud-native data streaming and data warehouse modernization.

Case Studies for Cloud Native Analytics with Data Warehouse Data Lake Data Streaming Lakehouse

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. THIS POST: Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Case studies: Cloud-native data streaming for data warehouse modernization

Every project is different. This is true for data streaming, analytics, and other software development. The following shows three case studies with significantly different architectures and technologies for data warehouse modernization. The examples come from various verticals: Software and cloud business, financial services, logistics and transportation, and the travel and accommodation industry.

Confluent: Data warehouse modernization from batch ETL with Stitch to streaming ETL with Kafka

The article “Streaming ETL SFDC Data for Real-Time Customer Analytics” explores how Confluent eats its dog food to modernize the internal data warehouse pipeline.

The use case is straightforward and standard across most organizations: Extract, transform, and load (ETL) Salesforce data into a Google BigQuery data warehouse, so that the business can use the data. But it is more complex than it sounds:

Organizations often rely on a third-party ETL tool to periodically load data from a CRM and other applications to their data warehouse. These batch tools introduce a lag between when the business events are captured in Salesforce and when they are made available for consumption and processing. The batch workloads commonly result in discrepancies between Salesforce reports and internal dashboards, leading to concerns about the integrity and reliability of the data.

Confluent used Talend’s Stitch batch ETL tool in the beginning. The old architecture looked like this:

Batch ETL with Salesforce and Talend Stitch

The consequence of batch ETL and a 3rd party tool in the middle lead to insufficient and inconsistent information updates.

Over the past few years, Confluent has invested in building stream processing capabilities into the internal data warehouse pipeline. Confluent leverages its own fully managed Confluent Cloud connectors (in this case, the Salesforce CDC source and BigQuery sink connectors), Schema Registry for data governance, and ksqlDB + Kafka Streams for reliable streaming ETL to send SFDC data to BigQuery. Here is the modernized architecture:

Real time streaming ETL architecture with Salesforce CDC and Apache Kafka Connect to Google BigQuery

Paypal: Reducing the time for readouts from 12 hours to a few seconds for 30 billion events per day

Paypal has plenty of Kafka projects for many critical and analytical workloads. In this use case, it scales the Kafka Consumer for 30-35 Billion events per day to migrate its analytical workloads to the Google Cloud Platform (GCP).

Paypal Cloud-native Data Warehouse with Apache Kafka and Google BigQuery

A streaming application ingests the events from Kafka directly to BigQuery. This is a critical project for PayPal as most of the analytical readouts are based on this. The outcome of the data warehouse modernization and building a cloud-native architecture: Reduce the time for readouts from 12 hours to a few seconds.

Read more about this success story in the PayPal Technology Blog.

Shippeo: From on-premise databases to multiple cloud-native data lakes

Shippeo provides real-time and multimodal transportation visibility for logistics providers, shippers, and carriers. Its software uses automation and artificial intelligence to share real-time insights, enable better collaboration, and unlock your supply chain’s full potential. The platform can give instant access to predictive, real-time information for every delivery.

Shippeo described how they integrated traditional databases (MySQL and PostgreSQL) and cloud-native data warehouses (Snowflake and BigQuery) with Apache Kafka and Debezium:

From Mysql and Postgresql to Snowflake and BigQuery with Kafka and Debezium at Shippeo

This is an excellent example of cloud-native enterprise architecture leveraging a “best of breed” approach for data warehousing and analytics. Kafka decouples the analytical workloads from the transactional systems and handles the backpressure for slow consumers.

Sykes Cottages: Fully-managed end-to-end pipeline with Confluent Cloud, Kafka Connect, Snowflake

Sykes Holiday Cottages are one of the UK’s leading and fastest-growing independent holiday cottage rental agencies representing over 19,000 cottages across the UK, Ireland, and New Zealand.

The experience of its customers on the web is a top priority and is one way to stay competitive. The goal is to match customers to their perfect holiday cottage experience and delight at each stage along the way. Getting the data pipeline to fuel this innovation is critical. Data warehouse modernization and data streaming enabled new ways to further innovate the web experience through a data-driven approach.

From inconsistent and slow batch workloads…

While serving its purpose for several years, the existing pipeline had problems impairing this cycle. Very early in this pipeline, the ETL process turned the data into rows and columns (structured data). Various copies were made, and the results were presented via a static report. Data engineers were needed for changes, such as new events or contextual information. The scale was also challenging as this has to be done manually in the main.

Sykes Cottages Modern Cloud Native Data Pipeline

Critically keeping the data in a semi-structured format until it is ingested into the warehouse and then using ELT to do a single transformation of the data, Sykes Holiday Cottages can simplify the pipeline and make it much more agile.

… to event-based real-time updates and continuous stream processing

New web events (and any context that goes with it) can be wrapped up within a message and can flow all the way to the warehouse without a single code change. The new events are then available to the web teams either through a query or the visualization tool.

The current throughput is around 50k (peaking at over 300k) messages per minute. As new events are captured, this will grow considerably. Additionally, each of the above components must scale accordingly.

The new architecture enables the web teams to capture new events. And analyze the data using self-service tools with no dependency on data engineering.

Sykes Cottages Legacy Data Pipeline

In conclusion, the business case for doing this is compelling. Based on our testing and projections, we expect at least 10x ROI over three years for this investment.

In Sykes Holiday Cottages’ blog post, learn more details: Why Sykes Cottages partnered with Snowflake and Confluent to drive enhanced customer experience.

Doordash: From multiple pipelines to data streaming for Snowflake integration

Even digital natives – that started their business in the cloud without legacy applications in their own data centers – need to modernize the enterprise architecture to improve business processes, reduce costs, and provide real-time information to its downstream applications.

It is cost inefficient to build multiple pipelines that are trying to achieve similar purposes. Doordash used cloud-native AWS messaging and streaming systems like Amazon SQS and Amazon Kinesis for data ingestion into the Snowflake data warehouse:

Legacy data pipeline at DoorDashMixing different kinds of data transport and going through multiple messaging/queueing systems without carefully designed observability around it leads to difficulties in operations.

These issues resulted in high data latency, significant cost, and operational overhead at Doordash. Therefore, Doordash moved to a cloud-native streaming platform powered by Apache Kafka and Apache Flink for continuous stream processing before ingesting data into Snowflake:

Cloud-native data streaming powered by Apache Kafka and Apache Flink for Snowflake integration at Doordash

The move to a data streaming platform provides many benefits to Doordash:

  • Heterogeneous data sources and destinations, including REST APIs using the Confluent rest proxy
  • Easily accessible
  • End-to-end data governance with schema enforcement and schema evolution with Confluent Schema Registry
  • Scalable, fault-tolerant, and easy to operate for a small team

All the details about this cloud-native infrastructure optimization are in Doordash’s engineering blog post: “Building Scalable Real Time Event Processing with Kafka and Flink“.

Real-world case studies for cloud-native projects prove the business value

Data warehouse and data lake modernization only make sense if there is a business value. Elastic scale, reduced operations complexity, and faster time to market are significant advantages of cloud services like Snowflake, Databricks, or Google BigQuery.

Data streaming plays a vital role in these initiatives to integrate with legacy and cloud-native data sources, continuous streaming ETL, true decoupling between the data sources, and multiple data sinks (lakes, warehouses, business applications).

The case studies of Confluent, Paypal, Shippeo, and Sykes Cottages showed their different success stories of moving into cloud-native infrastructure to rain real-time visibility and analytics capabilities. Elastic scale and fully-managed end-to-end pipelines are crucial success factors in gaining business value with consistently up-to-date information.

For more details, browse other posts of this blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. THIS POST: Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Do you have another success story to share? Or are your projects for data lake and data warehouse modernization still ongoing? Do you use separate infrastructure for specific use cases or build a monolithic lakehouse instead? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization appeared first on Kai Waehner.

]]>
Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure https://www.kai-waehner.de/blog/2022/07/15/data-warehouse-data-lake-modernization-from-legacy-on-premise-to-cloud-native-saas-with-data-streaming/ Fri, 15 Jul 2022 06:03:28 +0000 https://www.kai-waehner.de/?p=4649 The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let's explore this dilemma in a blog series. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.

The post Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure appeared first on Kai Waehner.

]]>
The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.

Data Warehouse and Data Lake Modernization with Data Streaming

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Data warehouse modernization: From legacy on-premise to cloud-native infrastructure

Many people talk about data warehouse modernization when they move to a cloud-native data warehouse. Though, what does data warehouse modernization mean? Why do people move away from their on-premise data warehouse? What are the benefits?

Many projects I have seen in the wild went through the following steps:

  1. Select a cloud-native data warehouse
  2. Get data into the new data warehouse
  3. [Optional] Migrate from the old to the new data warehouse

Let’s explore these steps in more detail and understand the technology and architecture options.

1. Selection of a cloud-native data warehouse

Many years ago, cloud computing was a game-changer for operating infrastructure. AWS innovated by providing not just EC2 virtual machines but also storage, like AWS S3 as a service.

Cloud-native data warehouse offerings are built on the same fundamental change. Cloud providers brought their analytics cloud services, such as AWS Redshift, Azure Synapse, or GCP BigQuery. Independent vendors rolled out a cloud-native data warehouse or data lake SaaS such as Snowflake, Databricks, and many more. While each solution has its trade-offs, a few general characteristics are true for most of them:

  • Cloud-native: A modern data warehouse is elastic, scales for small up to extreme workloads, and automates most business processes around development, operations, and monitoring.
  • Fully managed: The vendor takes over the operations burden. This includes scaling, failover handling, upgrades, and performance tuning. Some offerings are truly serverless, while many services require capacity planning and manual or automated scaling up and down.
  • Consumption-based pricing: Pay-as-you-go enables getting started in minutes and scaling costs with broader software usage. Most enterprise deployments allow commitment to getting price discounts.
  • Data sharing: Replicating data sets across regions and environments is a common feature to offer data locality, privacy, lower latency, and regulatory compliance.
  • Multi-cloud and hybrid deployments: While cloud providers usually only offer the 1st party service on their cloud infrastructure, 3rd party vendors provide a multi-cloud strategy. Some vendors even offer hybrid environments, including on-premise and edge deployments.

Plenty of comparisons exist in the community, plus analyst research from Gartner, Forrester, et al. Looking at vendor information and trying out the various cloud products using free credits is crucial, too. Finding the right cloud-native data warehouse is its own challenge and not in this blog post.

2. Data streaming as (near) real-time and hybrid integration layer

Data ingestion into data warehouses and data lakes was already covered in part two of this blog series. The more real-time, the better for most business applications. Near real-time ingestion is possible with specific tools (like AWS Kinesis or Kafka) or as part of the data fabric (the streaming data hub where a tool like Kafka plays a bigger role than just data ingestion).

The often more challenging part is data integration. Most data warehouse and data lake pipelines require ETL to ingest data. As the next-generation analytics platform is crucial for making the right business decisions, the data ingestion and integration platform must also be cloud-native! Tools like Kafka provide the reliable and scalable integration layer to get all required data into the data warehouse.

Integration of legacy on-premise data into the cloud-native data warehouse

In a greenfield project, the project team is lucky. Data sources run in the same cloud, using open and modern APIs, and scale as well as the cloud-native data warehouse.

Unfortunately, the reality is brownfield almost always, even if all applications run in public cloud infrastructure. Therefore, the integration and replication of data from legacy and on-premise applications is a general requirement.

Data is typically consumed from legacy databases, data warehouses, applications, and other proprietary platforms. The replication into the cloud data warehouse usually needs to be near real-time and reliable.

A data streaming platform like Kafka is perfect for replicating data across data centers, regions, and environments because of its elastic scalability and true decoupling capabilities. Kafka enables connectivity to modern AND legacy systems via connectors, proprietary APIs, programming languages, or open REST interfaces:

Accelerate modernization from on-prem to AWS with Kafka and Confluent Cluster Linking

A common scenario in such a brownfield project is the clear separation of concerns and true decoupling between legacy on-premise and modern cloud workloads. Here, Kafka is deployed on-premise to connect to legacy applications.

Tools like MirrorMaker, Replicator, or Confluent Cluster Linking replicate events in real-time into the Kafka cluster in the cloud. The Kafka brokers provide access to the incoming events. Downstream consumers read the data into the data sinks at their own pace; real-time, near real-time, batch, or request-response via API. Streaming ETL is possible at any site – where it makes the most sense from a business or security perspective and is the most cost-efficient.

Example: Confluent Cloud + Snowflake = Cloud-native Data Warehouse Modernization

Here is a concrete example of data warehouse modernization using cloud-native data streaming and data warehousing with Confluent Cloud and Snowflake:

Cloud-native Data Warehouse Modernization with Apache Kafka Confluent Snowflake

For modernizing the data warehouse, data is ingested from various legacy and modern data sources using different communication paradigms, APIs, and integration strategies. The data is transmitted in motion from data sources via Kafka (and optional preprocessing) into the Snowflake data warehouse. The whole end-to-end pipeline is scalable, reliable, and fully managed, including the connectivity and ingestion between the Kafka and Snowflake clusters.

However, there is more to the integration and ingestion layer: The data streaming platform stores the data for true decoupling and slow downstream applications; not every consumer is or can be real-time. Most enterprise architectures do not ingest data into a single data warehouse or data lake or lakehouse. The reality is that different downstream applications need access to the same information; even though vendors of data warehouses and data lakes tell you differently, of course 🙂

By consuming events from the streaming data hub, each application domain decides by itself if it

  • processes events within Kafka with stream processing tools like Kafka Streams or ksqlDB
  • builds own downstream applications with its code and technologies (like Java, .NET, Golang, Python)
  • integrates with 3rd party business applications like Salesforce or SAP
  • ingests the raw or preprocessed and curated data from Kafka into the sink system (like a data warehouse or data lake)

3. Data warehouse modernization and migration from legacy to cloud-native

An often misunderstood concept is the buzz around data warehouse modernization: Companies rarely take the data of the existing on-premise data warehouse or data lake, write a few ETL jobs, and put the data into a cloud-native data warehouse for the sake of doing it.

If you think about a one-time lift-and-shift from an on-premise data warehouse to the cloud, then a traditional ETL tool or a replication cloud service might be the easiest. However, usually, data warehouse modernization is more than that! 

What is data warehouse modernization?

A data warehouse modernization can mean many things, including replacing and migrating the existing data warehouse, building a new cloud-native data warehouse from scratch, or optimizing a legacy ETL pipeline of a cloud-native data warehouse.

In all these cases, data warehouse modernization requires business justification, for instance:

  • Application issues in the legacy data warehouse, such as too slow data processing with legacy batch workloads, result in wrong or conflicting information for the business people.
  • Scalability issues in the on-premise data warehouse as the data volume grows too much.
  • Cost issues because the legacy data warehouse does not offer reasonable pricing with pay-as-you-go models.
  • Connectivity issues as legacy data warehouses were not built with an open API and data sharing in mind. Cloud-native data warehouses run on cost-efficient and scalable object storage, separate storage from computing, and allow data consumption and sharing. (but keep in mind that Reverse ETL is often an anti-pattern!)
  • A strategic move to the cloud with all infrastructure. The analytics platform is no exception if all old and new applications go to the cloud.

Cloud-native applications usually come with innovation, i.e., new business processes, data formats, and data-driven decision-making. From a data warehouse perspective, the best modernization is to start from scratch. Consume data directly from the existing data sources, ETL it, and do business intelligence on top of the new data structures.

I have seen many more projects where customers use change data capture (CDC) from Oracle databases (i.e., the leading core system) instead of trying to replicate data from the legacy data warehouse (i.e., the analytics platform) as scalability, cost, and later shutdown of legacy infrastructure benefits from this approach.

Data warehouse migration: Continuous vs. cut-over

The project is usually a cut-over when you need to do a real modernization (i.e., migration) from a legacy data warehouse to a cloud-native one. This way, the first project phase integrates the legacy data sources with the new data warehouse. The old and new data warehouse platforms operate in parallel, so that old and new business processes go on. After some time (months or years later), when the business is ready to move, the old data warehouse will be shut down after legacy applications are either migrated to the new data warehouse or replaced with new applications:

Data Warehouse Offloading Integration and Replacement with Data Streaming 

My article “Mainframe Integration, Offloading and Replacement with Apache Kafka” illustrates this offloading and long-term migration process. Just scroll to the section “Mainframe Offloading and Replacement in the Next 5 Years” in that post and replace the term ‘mainframe’ with ‘legacy data warehouse’ in your mind.

A migration and cut-over is its project and can include the legacy data warehouse; or not. Data lake modernization (e.g., from a self- or partially managed Cloudera cluster running on-premise in the data center to a fully managed Databricks or Snowflake cloud infrastructure) follows the same principles. And mixing the data warehouse (reporting) and data lake (big data analytics) into a single infrastructure does not change this either.

Data warehouse modernization is NOT a big bang and NOT a single tool approach!

Most data warehouse modernization projects are ongoing efforts over a long period. You must select a cloud-native data warehouse, get data into the new data warehouse from various sources, and optionally migrate away from legacy on-premise infrastructure.

Data streaming for data ingestion, business applications, or data sharing in real-time should always be a separate component in the enterprise architecture. It has very different requirements regarding SLAs, uptime, through, latency, etc. Putting all real-time and analytical workloads into the same cluster makes little sense from a cost, risk, or business value perspective. The idea of a modern data flow and building a data mesh is the separation of concerns with domain-driven design and focusing on data products (using different, independent APIs, technologies, and cloud services).

For more details, browse other posts of this blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

What cloud-native data warehouse(s) do you use? How does data streaming fit into your journey? Did you integrate or replace your legacy on-premise data warehouse(s); or start from greenfield in the cloud? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure appeared first on Kai Waehner.

]]>