Data Mesh Archives - Kai Waehner

Policy Enforcement and Data Quality for Apache Kafka with Schema Registry

Kai Waehner — Mon, 16 Oct 2023 06:18:22 +0000

Good data quality is one of the most critical requirements in decoupled architectures, like microservices or data mesh. Apache Kafka became the de facto standard for these architectures. But Kafka is a dumb broker that only stores byte arrays. The Schema Registry enforces message structures. This blog post looks at enhancements to leverage data contracts for policies and rules to enforce good data quality on field-level and advanced use cases like routing malicious messages to a dead letter queue.

From point-to-point and spaghetti to decoupled microservices with Apache Kafka

Point-to-point HTTP / REST APIs create tightly couple services. Data lakes and lakehouses enforce a monolithic architecture instead of open-minded data sharing and choice of the best technology for a problem. Hence, Apache Kafka became the de facto standard for microservice and data mesh architectures. And data streaming with Kafka complementary (not competitive!) to APIs, data lakes / lakehouses, and other data platforms.

What makes Apache Kafka so popular

A scalable and decoupled architecture as a single source of record for high-quality, self-service access to real-time data streams, but also batch and request-response communication.

Difference between Kafka and ETL / ESB / iPaaS

Enterprise integration is more challenging than ever before. The IT evolution requires the integration of more and more technologies. Companies deploy applications across the edge, hybrid, and multi-cloud architectures.

Point-to-point integration is not good enough. Traditional middleware such as MQ, ETL, ESB does not scale well enough or only processes data in batch instead of real-time. Integration Platform as a Service (iPaaS) solutions are cloud-native but only allow point-to-point integration.

Apache Kafka is the new black for integration projects. Data streaming is a new software category.

Domain-driven design, microservices, data mesh…

The approaches use different principles and best practices. But reality is that the key for a long-living and flexible enterprise architecture is decoupled, independent applications. However, these applications need to share data in good quality with each other.

Apache Kafka shines here. It decouples applications because of its event store. Consumers don’t need to know producers. Domains build independent applications with its own technologies, APIs and cloud services:

Replication between different Kafka clusters enables a global data mesh across data centres and multiple cloud providers or regions. But unfortunately, Apache Kafka itself misses data quality capabilities. That’s where the Schema Registry comes into play.

The need for good data quality and data governance in Kafka Topics

To ensure data quality in a Kafka architecture, organizations need to implement data quality checks, data cleansing, data validation, and monitoring processes. These measures help in identifying and rectifying data quality issues in real time, ensuring that the data being streamed is reliable, accurate, and consistent.

Why you want good data quality in Kafka messages

Data quality is crucial for most Kafka-based data streaming use cases for several reasons:

Real-time decision-making: Data streaming involves processing and analyzing data as it is generated. This real-time aspect makes data quality essential because decisions or actions based on faulty or incomplete data can have immediate and significant consequences.
Data accuracy: High-quality data ensures that the information being streamed is accurate and reliable. Inaccurate data can lead to incorrect insights, flawed analytics, and poor decision-making.
Timeliness: In data streaming, data must be delivered in a timely manner. Poor data quality can result in delays or interruptions in data delivery, affecting the effectiveness of real-time applications.
Data consistency: Inconsistent data can lead to confusion and errors in processing. Data streaming systems must ensure that data adheres to a consistent schema and format to enable meaningful and accurate analysis. No matter if a producer or consumer uses real-time data streaming, batch processing, or request-response communication with APIs.
Data integration: Data streaming often involves combining data from various sources, such as sensors, databases, and external feeds. High-quality data is essential for seamless integration and for ensuring that data from different sources can be harmonized for analysis.
Regulatory compliance: In many industries, compliance with data quality and data governance regulations is mandatory. Failing to maintain data quality in data streaming processes can result in legal and financial repercussions.
Cost efficiency: Poor data quality can lead to inefficiencies in data processing and storage. Unnecessary processing of low-quality data can strain computational resources and lead to increased operational costs.
Customer satisfaction: Compromised data quality in applications directly impacts customers, it can lead to dissatisfaction, loss of trust, and even attrition.

Rules engine and policy enforcement in Kafka Topics with Schema Registry

Confluent designed the Schema Registry to manage and store the schemas of data that are shared between different systems in a Kafka-based data streaming environment. Messages from Kafka producers are validated against the schema.

The Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving your Avro, JSON Schema, and Protobuf schemas. It stores a versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings, and allows evolution of schemas according to the configured compatibility settings and expanded support for these schema types.

Schema Registry provides serializers that plug into Apache Kafka clients that handle schema storage and retrieval for Kafka messages that are sent in any of the supported formats.

Schema Registry is available on GitHub under the Confluent Community License that allows deployment in production scenarios with no licensing costs. It became the de facto standard for ensuring data quality and governance in Kafka projects across all industries.

Enforcing the message structure as the foundation of good data quality

Confluent Schema Registry enforces message structure by serving as a central repository for schemas in a Kafka-based data streaming ecosystem. Here’s how it enforces message structure and rejects invalid messages:

Data messages produced by Kafka producers must adhere to the registered schema. A message is rejected if a message doesn’t match the schema. This behaviour ensures that only well-structured data are published and processes.

Schema Registry even supports schema evolution for data interoperability using different schema versions in producers and consumers. Find a detailed explanation and the limitations in the Confluent documentation.

Validation of schemas happens on the client side in Schema Registry. This is not good enough for some scenarios, like regulated markets, where the infrastructure provider cannot trust each data producer. Hence, Confluent’s commercial offering added broker-side schema validation.

Attribute-based policies and rules in data contracts

The validation of message schema is a great first step. However, many use cases require schema validation and policy enforcement on field level, i.e. validating each attribute of the message by itself with custom rules. Welcome to Data Contracts:

Disclaimer: The following add-on for Confluent Schema Registry is only available for Confluent Platform and Confluent Cloud. If you use any other Kafka service and schema registry, take this solution as an inspiration for building your own data governance suite – or migrate to Confluent

Data contracts support various rules, including data quality rules, field-level transformations, event-condition-action rules, and complex schema evolution. Look at the Confluent documentation “Data Contracts for Schema Registry” to learn all the details.

Data contracts and data quality rules for Kafka messages

As described in the Confluent documentation, a data contract specifies and supports the following aspects of an agreement:

Structure: This is the part of the contract that is covered by the schema, which defines the fields and their types.
Integrity constraints: This includes declarative constraints or data quality rules on the domain values of fields, such as the constraint that an age must be a positive integer.
Metadata: Metadata is additional information about the schema or its constituent parts, such as whether a field contains sensitive information. Metadata can also include documentation for a data contract, such as who created it.
Rules or policies: These data rules or policies can enforce that a field that contains sensitive information must be encrypted, or that a message containing an invalid age must be sent to a dead letter queue.
Change or evolution: This implies that data contracts are versioned, and can support declarative migration rules for how to transform data from one version to another, so that even changes that would normally break downstream components can be easily accommodated.

Example: PII data enforcing encryption and error-handling with a dead letter queue

One of the built-in rule types is Google Common Expression (CEL), which supports data quality rules.

Here is an example where a specific field is tagged as PII data. Rules can enforce good data quality or encryption of an attribute like the credit card number:

You can also configure advanced routing logic. For instance, error handling: If the expression “size(message.id) == 9” is not validated, then the streaming platform forwards the message to a dead letter queue for further processing with the configuration: “dlq.topic”: “bad-data”.

Dead letter queue (DLQ) is its own complex (but very important) topic. Check out the article “Error Handling via Dead Letter Queue in Apache Kafka” to learn from real-world implementations of Uber, CrowdStrike, Santander Bank, and Robinhood.

Data contracts as a foundation for new data streaming products and integration with Apache Flink

Schema Registry should be the foundation of any Kafka project. Data contracts enforce good data quality and interoperability between independent microservices. Each business unit and its data products can choose any technology or API. But data sharing with others works only with good (enforced) data quality.

No matter if you use Confluent Cloud or not, you can learn from this SaaS offering how schemas and data contracts enable data consistency and faster time to market for innovation. Products like Data Catalog, Data Lineage, Confluent Stream Sharing, or the out-of-the-box integration with serverless Apache Flink rely on a good internal data governance strategy with schemas and data contracts.

Do you already leverage data contracts in your Confluent environment? If you are not a Confluent user, how do you solve data consistency issues and enforce good data quality? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Policy Enforcement and Data Quality for Apache Kafka with Schema Registry appeared first on Kai Waehner.

Apache Kafka for Data Consistency (and Real-Time Data Streaming)

Kai Waehner — Tue, 27 Dec 2022 08:46:04 +0000

Real-time data beats slow data in almost all use cases. But as essential is data consistency across all systems, including non-real-time legacy systems and modern request-response APIs. Apache Kafka’s most underestimated feature is the storage component based on the append-only commit log. It enables loose coupling for domain-driven design with microservices and independent data products in a data mesh. This blog post explores how Kafka enables data consistency with a real-world case study from financial services.

Apache Kafka = Real-time data streaming

Real-time beats slow data. It is that easy in almost all use cases. Ask any executive or business person: What’s better? If you can consume and use information now or later? The value of data goes down over time:

Apache Kafka is the de facto standard for real-time data streaming. Check out the data streaming landscape 2023 to learn more about Kafka-related products and cloud services.

So far, so good. However, one valid question always comes up: “Why is Apache Kafka different from a real-time message broker like RabbitMQ, IBM MQ, NATS, or Amazon SQS?”

TL;DR: Message brokers provide real-time messaging capabilities to produce and consume messages. Apache Kafka is a data streaming platform that combines messaging, storage, data integration, and stream processing capabilities.

My comparison of message brokers and data streaming explored the differences using ten characteristics. It is all about the storage component of Apache Kafka in the discussion of data consistency. Let’s explore why…

Real-time means many things…

… from deterministic systems with hard real-time up to minutes or even hours. Always define your requirements for real-time data processing and the end-to-end latency (not just the messaging component).

I clarified when to use Apache Kafka for real-time workloads in separate blog posts:

Data consistency = The biggest challenge of the enterprise architecture

Data consistency refers to whether the same data kept at different places does or does not match. The data is processed in many ways across the enterprise architecture:

Real-time: Message brokers or data streaming platforms transfer or process data when it is in motion.
Near real-time: Platforms ingest data into data lakes and data warehouses in seconds or minutes.
Batch: Reporting and analytics of historical data.
Request-response: Interactive API or SQL queries to collect specific information.
A point-in-time replay of historical data: Troubleshooting, incident management, regulatory reporting, and similar scenarios.

The applications and data platforms use very different (old and new) technologies, products, cloud services, and APIs. Integration and data consistency across the different communication paradigms is a massive challenge within the spaghetti architecture:

The consequence of inconsistent data is obvious:

Bad customer experience, e.g., late notification about flight delays or cancellations.
Revenue loss, e.g., inventory not up-to-date, missed or too late detection of fraud.
Increased cost, e.g., slow or wrong decisions in logistics across the supply chain.
Increased risk, e.g., unrecognized data breaches, compliance issues.

This is where the storage component of Apache Kafka makes the difference…

Apache Kafka = Streaming platform to decouple any application and communication paradigm

Kafka is an append-only commit log. Consumers are independent of each other and independent of producers. They interact at their own pace with their own communication paradigm and pull the information from the Kafka log.

This enables independent consumption and processing of data consistently. It does not matter what technology or communication paradigm the downstream consumer application uses:

This example of a real-time locating system (RTLS) for asset tracking can be built with Kafka straightforwardly. Downstream applications consume events in real-time, near real-time, batch, or via request-response HTTP/REST APIs. With other technologies, like real-time message queues, you need to add additional platforms for storage, integration, and data processing. Data streaming provides a single (scalable and reliable) platform.

Apache Kafka enables domain-driven design and data mesh architectures

Domain-driven Design (DDD) needs decoupled applications. Push-based message brokers or HTTP/REST web services enforce tight coupling between the systems. This creates the above spaghetti architecture.

On the other side, Apache Kafka truly decouples the domains, no matter what technologies or communication paradigms each domain uses. Loose coupling is the norm with Kafka:

This is why the storage component using the combination of real-time messaging and a distributed commit log enables data consistency across technologies and communication paradigms.

With Tiered Storage for Kafka, storage and compute are separated to enable long-term storage in Kafka for the replayability of historical data. This is not needed for every use case. Kafka does not replace your favorite database. But it is beneficial for some use cases, e.g., model training with TensorFlow and Kafka to leverage machine learning with a Kappa architecture.

Domain-driven design is the foundation of modern microservice architecture or data mesh. For that reason, many cloud-native enterprise architectures are built using Kafka as the heart of the infrastructure for real-time data sharing plus loose coupling between the data products.

Let’s look at a practical real-world example where the added value of Kafka was not its real-time capability but enabling data consistency across systems…

Erste Group Bank: A case study for data consistency with Apache Kafka

Erste Group Bank AG (Erste Group) is an Austrian financial service provider in Central and Eastern Europe serving 15.7 million clients in over 2,700 branches in seven countries. The bank presented its data streaming journey at Confluent’s Data in Motion 2022 tour in Zurich.

Having a strong mission to increase its customer experience, Erste Group is putting more and more data into action. Making this happen can be summarized as a race for consistency across our channels, squaring the circle between latency and data volumes.

The digital transformation at Erste Group required challenging integration across different technologies and communication paradigms. The following sections describe why Apache Kafka was chosen.

Hyper-personalized mobile banking

A great user experience in the new mobile app “Georg” is a crucial strategic component of Erste Group’s digital transformation to increase customer experience and revenue.

Here is how Erste Group promotes its mobile app: “For 8 million people in 6 countries, banking has a name. George. George empowers everyone to understand, manage, and improve their financial health. Simple. Intelligent. Personal. Unique.”

The following diagram shows the positive and negative reviews of customers, including Erste Group’s mobile app “George” and competitive banking apps:

Source: Erste Group Bank AG

A great mobile app user experience requires the combination of many technologies in the backend. Data streaming enables the foundation in the backend to build an intuitive mobile app across various European countries.

Scalability and accurate information at the right time in the right context make the difference:

Source: Erste Group Bank AG

Fully managed data streaming for omnichannel data consistency

Here comes the surprising part of why I chose this case study for the blog post: While the real-time capability of Apache Kafka at any scale is essential, the critical aspect of the technology choice was data consistency across platforms and communication paradigms:

Source: Erste Group Bank AG

Erste Group built an enterprise architecture with data streaming to enable asynchronous decoupled domains with event sourcing. The responsibility is split across cross-functional teams like Digital and Business Intelligence. However, consistent data is served in different ways for various downstream consumers:

Stream processing for real-time subscriptions
A serving layer for API integration via request-response protocols like HTTP/REST
Tiered Storage for long-term replayability of historical data to enable analytical queries

Source: Erste Group Bank AG

The infrastructure is fully managed in Confluent Cloud to enable focusing on business problems and innovation. DevOps and MLOps automate the development and monitoring lifecycle of the applications.

Data consistency is as critical as real-time data

Apache Kafka is the de facto standard for real-time data streaming. In addition, most enterprise architectures leverage the append-only commit log for loosely coupling to enable agile and elastic microservice architectures. The vision of building data products in a decentralized data mesh is made possible with Apache Kafka.

This post showed the case study of Erste Group to enable data consistency across domains and technologies with fully managed Kafka in the cloud. Obviously, we just explored the foundation. Data sharing across organizations and enforced data governance, including access control, encryption, and audit logging, are mandatory to realize a data mesh in the real world. I discussed these topics in my overview of the top 5 trends for data streaming in 2023.

Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka for Data Consistency (and Real-Time Data Streaming) appeared first on Kai Waehner.

Decentralized Data Mesh with Data Streaming in Financial Services

Kai Waehner — Fri, 28 Oct 2022 02:55:52 +0000

Digital transformation requires agility and fast time to market as critical factors for success in any enterprise. The decentralization with a data mesh separates applications and business units into independent domains. Data sharing in real-time with data streaming helps to provide information in the proper context to the correct application at the right time. This blog post explores a case study from the financial services sector where a data mesh was built across countries for loosely coupled data sharing but standardized enterprise-wide data governance.

Data mesh and the need for real-time data streaming

If there were a buzzword of the hour, it would undoubtedly be “data mesh“! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable:

The digital transformation in financial services

The new enterprise reality in the financial services sector: Innovate or be disrupted!

A few initiatives I have seen in banks around the world with real-time data leveraging data streaming:

Legacy modernization, e.g., mainframe offloading and replacement with Apache Kafka
Middleware modernization with scalable, open infrastructures replacing ETL, ESB, and iPaaS platforms
Hybrid cloud data replication for disaster recovery, migration, and other scenarios
Transactions and analytics in real-time at any scale
Fraud detection in real-time with Kafka and Flink to prevent fraud before it happens
Cloud-native core banking to enable modern business processes
Digital payment innovation with Apache Kafka as the data hub for Cryptocurrency, DeFi, NFT, and Metaverse (beyond the buzz)

Let’s look at a practical example from the real world.

Raiffeisen Bank International – A bank transformation across 12 countries

Raiffeisen Bank International (RBI) is scaling an event-driven architecture across the group as part of a bank-wide transformation program. This includes the creation of a reference architecture and the re-use of technology and concepts across 12 countries.

The universal bank is headquartered in Vienna, Austria. It has decades of experience (and related legacy infrastructure) in retail, corporate and markets, and investment banking.

Let’s explore the journey of Raiffeisen Bank’s digital transformation. If you want to listen to the story told by themselves, watch the free on-demand webinar.

Building a data mesh without knowing it…

Raiffeisen Bank, operating across 12 countries, has all the apparent challenges and requirements for data sharing across applications, platforms, and governments.

Raiffeisen Bank built a decentralized data mesh enterprise architecture with real-time data sharing as the fundamental key to its digital transformation. They did not even know about it because the buzzword did not exist when they started making it… But there are good reasons for using data streaming as the data hub:

The enterprise architecture of RBI’s data mesh with data streaming

The reference architecture includes data streaming as the heart of the infrastructure. It is real-time, scalable, and decoupled independent domains and applications. Open Banking APIs exist for request-response communication:

Source: Raiffeisen Bank International

The three core principles of the enterprise architecture ensure an agile, scalable, and future-ready infrastructure across the countries:

API: Internal APIs standardized based on domain-driven design
Group integration: Live, connected with 11 countries, 320 APIs available, constantly increasing
EDA: Event-driven reference architecture created and roll-out ongoing, group Layer live with the first use cases

The combination of data streaming with Apache Kafka and request-response with REST / HTTP is prevalent in enterprise architectures. Having said that, more and more use cases directly leverage a stream data exchange for data sharing across business units or organizations.

Decoupling with decentralized data streaming as the integration layer

The whole IT platform and technology stack is built for re-use in the group:

Source: Raiffeisen Bank International

Raiffeisen Bank’s reference architecture has all the characteristics that define a data mesh:

Loose coupling between applications, databases, and business units with domain-driven design
Independent microservices and data products (like different core banking platforms or individual analytics in the countries)
Data sharing in real-time via a decentralized data streaming platform (fully-managed in the cloud where possible, but freedom of choice for each country
Enterprise-wide API contacts (= schemas in the Kafka world)

Data governance in regulated banking across the data mesh

Financial service is a regulated market around the world. PCI, GDPR, and other compliance requirements are mandatory, whether you build monoliths or a decentralized data mesh.

Raiffeisenbank international built its data mesh with data governance, legal compliance, and data privacy in mind from the beginning:

Source: Raiffeisen Bank International

Here are the fundamental principles of Raiffeisen Bank’s data governance strategy:

Central integration layer for data sharing across the independent groups in real-time for transactional and analytical workloads
Cloud-first strategy (when it makes sense) with fully-managed Confluent Cloud for data streaming
Group-wide standardized event taxonomy and API contracts with Schema Registry
Group-wide governance with event product owners across the group
Platform as a service for self-service for internal customers within the different groups

Combining these paradigms and rules enables independent data processing and innovation while still being compliant and enabling data sharing across different groups.

The heart of a data mesh beats in real-time

Independent applications, domains, and organizations built separate data products in a data mesh. Real-time data sharing across these units with standardized and loosely coupled events is a critical success factor. Each downstream consumer gets the data as needed: Real-time, near real-time, batch, or request-response.

The case study from Raiffeisen Bank International showed how to build a powerful and flexible data mesh leveraging cloud-native data streaming powered by Apache Kafka. While this example comes from financial services, the principles and architectures apply to any vertical. The business objects and interfaces look different. But the significant challenges are very similar across industries.

How do you build a data mesh? Do you use batch technology like ETL tools and data lakes or rely on real-time data streaming for data sharing and integration? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Decentralized Data Mesh with Data Streaming in Financial Services appeared first on Kai Waehner.

A Real-Time Supply Chain Control Tower powered by Kafka

Kai Waehner — Fri, 23 Sep 2022 05:55:26 +0000

A modern supply chain requires just-in-time production, global logistics, and complex manufacturing processes. Intelligent control of the supply network is becoming increasingly demanding, not just because of Covid. At the same time, digitalization is generating exponentially growing data streams along the value chain. This blog post explores a solution that links the control with the data streams leveraging Apache Kafka to provide end-to-end visibility of your supply chain. All the digital information flows into a unified central nervous system enabling comprehensive control and timely response. The idea of the Supply Chain Control Tower becomes a reality: An integrated data cockpit with real-time access to all levels and systems of the supply chain.

What is a supply chain control tower?

A supply chain is a network of companies and people that are involved in the production and delivery of a product or service. Today, many supply chains are global and involve intra-logistics, widespread enterprise logistics, and B2B data sharing for end-to-end supply chains.

Supply chain management (SCM)

Supply Chain Management (SCM) involves planning and coordinating all the people, processes, and technology involved in creating value for a company. This includes cross-cutting processes, including purchasing/procurement, logistics, operations/manufacturing, and others. Automation, robustness, flexibility, real-time, and hybrid deployment (edge + cloud) are essential for future success and a pre-requirement for end-to-end visibility across the supply chain, regardless of industry.

Source: Aadini (YouTube Channel)

The challenge with logistics and supply chain is that you have a lot of commercial off-the-shelf applications (ERP, WMS, TMS, DPS, CRM, SRM, etc.) in place that are highly specialized and highly advanced in their function.

Challenges of batch workloads across the supply chain

Batch workloads create many issues in a supply chain. The Covid pandemic showed this:

Missing information: Intra-logistics within a distribution or fulfillment center, across buildings, and regions, and between business partners.
Cost rising: Lower overall equipment efficiency (OEE). Lower availability. Increase cost for production and buyers (B2B and B2C).
Customers churning: Bad customer experience and contract discussion as a consequence of failing with delivery guarantees and other SLAs.
Revenue decreasing: Less production and/or sales means less revenue.

The specialized systems like an ERP, WMS, TMS, DPS, CRM, or SRM are often modernized and real-time or nearly real-time in their operation. However, the integrations between them are often still not real-time. For instance, batch waves in a WMS are being replaced with real-time order allocation processes, but the link between the WMS and the ERP is still batch.

Real-time end-to-end monitoring

A supply chain control tower provides end-to-end visibility and real-time monitoring across the supply chain:

Source: Aadini (YouTube Channel)

The control tower helps answer questions such as

What is happening now?
Why is this happening?
What might happen next?
How can we perform better?
What if…?

A supply chain control tower combines technology, processes, and people. Check out this tremendous seven-minute YouTube video to get a simple explanation of the supply chain control tower “for dummies”. Most importantly, the evolution of software enables real-time automation instead of just human visual monitoring.

A Kafka-native supply chain control tower

Apache Kafka is the de facto standard for data streaming. It enables real-time data integration and processing at any scale. This is a crucial requirement for building a supply chain control tower.

I blogged about Apache Kafka for Supply Chain Management (SCM) Optimization before. Check it out to learn about real-time applications for logistics, inventory management, track and trace, and other real-time deployments from BMW, Bosch, Baader, and more.

Let’s recap the added value of Kafka for improving the supply chain in one picture of business value across use cases:

If you look at the definitions of a supply chain control tower, well, that’s more or less the same A supply chain control tower is only possible with real-time data correlation. That’s why Kafka is the perfect fit.

Supply chains are global and rely on the collaboration between independent business units within enterprises, and B2B communication:

Source: Aadini (YouTube Channel)

The software industry pitches a new paradigm and architecture pattern these days: The Data Mesh. It allows independent, decoupled domains using its own technology, API, and data structures. But the data sharing is standardized, compatible, real-time, and reliable. The heart of such a data mesh beats real-time with the power of Apache Kafka:

Kafka-native tools like MirrorMaker or Confluent Cluster Linking enable reliable real-time replication across data centers, clouds, regions, and even between independent companies.

Kafka as the data hub for a real-time supply chain control tower

A supply chain control tower coordinates demand, supply, trading, and logistics across various domains, technologies, APIs, and business processes.

The control tower is not just a Kafka cluster – but Kafka is underpinning business logic, integrations, and visualizations that aggregate all the data from the various sources and make it more valuable from end to end in real-time:

The data communication happens in real-time, near real-time, batch, or request-response. Whatever the source and sink applications support. However, the heart of the enterprise architecture is real-time and scalable. This enables future modernization and replacing legacy batch systems with modern real-time services.

Visualization AND automatic resolutions as core components of a modern control tower

The first goal of a supply chain control tower is end-to-end visibility in real-time. Creating real-time dashboards, alerts, and integration with 3rd party monitoring tools is a massive value.

However, the built pipeline enables several additional new use cases. Data integration and correlation across legacy and modern applications is the significant change for innovation regarding supply chain optimization: Instead of just visualizing in real-time what happens, new applications can take automated actions and decisions to solve problems automatically based on real-time information.

Real-world Kafka examples across the supply chain for visibility and automation

Modern supply chains rely on real-time data across suppliers, fulfillment centers, warehouses, stores, and consumers. Here are a few real-world examples across verticals using Apache Kafka as a real-time supply chain control tower:

Retail: Walmart monitors information across the supply chain, including distribution centers, fulfillment centers, stores, mobile apps, and B2B interfaces. Examples for use cases are local and global inventory management and replenishment systems.
Manufacturing: BMW ingests data from its smart factories into the cloud to provide access to the data for visibility and new automation applications. Baader built a real-time locating system (RTLS) to optimize end-to-end monitoring and advanced calculations like routing and estimated arrival in real time.
Logistics: DHL, Swiss Post, Hermes, SBB, and many other companies in the logistics and transportation space built real-time track and trace platforms to optimize the visibility and efficiency of their business processes.
Mobility Services: Almost any ride-hailing or food delivery app across the planet like Uber, Lyft, Grab, or Free Now (former MyTaxi) leverages Apache Kafka for real-time notifications to the front-end and data correlation in the backend to implement real-time driver-rider-matching, routing, payment, fraud detection, and many other use cases.
Food: Albertsons, Instacart, Domino’s Pizza, Migros, and many other food-related enterprises improve business processes across the food value chain with real-time services.

No matter what industry you work in. Learn from other companies across verticals. Challenges to improving the supply chain are very similar everywhere.

Postmodern ERP / MES / TMS powered by Apache Kafka

While the supply chain control tower provides end-to-end visibility, each system has similar requirements. Enterprise resource planning (ERP) has existed for many years. It is often monolithic, complex, proprietary, batch, and not scalable. The same is true for MES, TMS, and many other platforms for logistics and supply chain management.

Postmodern ERP represents the next generation of ERP architectures. It is real-time, scalable, and open. A Postmodern ERP combines open source technologies and proprietary standard software. Many solutions are cloud-native or even offered as fully managed SaaS.

Like end-users, software vendors leverage data streaming with Apache Kafka to implement a Postmodern ERP, MES, CRM, SRM, or TMS:

Logistics and supply chain management require real-time data

Data streams grow exponentially along the value chain. End-to-end visibility of the supply chain is crucial for optimized business processes. Real-time order management and inventory management are just two examples.

Visualization and taking actions in real-time via automated processes either reduce cost and risk or increases the revenue and customer experience. A real-time supply chain control tower enables this kind of innovation. The foundation of such a strategic component needs to be real-time, scalable, and reliable. That’s why the data streaming platform Apache Kafka is the perfect fit for building a control tower.

Various success stories across industries prove the value of data streaming across the supply chain. Even software vendors of products for the supply chain like ERP, WMS, TMS, DPS, CRM, or SRM build their next-generation software on top of Apache Kafka.

How do you bring visibility into your supply chain? Did you already build a real-time control tower? What role plays data streaming in these scenarios? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post A Real-Time Supply Chain Control Tower powered by Kafka appeared first on Kai Waehner.

The Heart of the Data Mesh Beats Real-Time with Apache Kafka

Kai Waehner — Thu, 28 Jul 2022 06:08:38 +0000

There is no single technology or product for a data mesh!

This post explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and – complemented by many other data platforms like a data warehouse, data lake, and lakehouse – solve real business problems.

There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The key outcome of a data mesh architecture is the ability to build data products; with the right tool for the job. A good data mesh combines data streaming technology like Apache Kafka or Confluent Cloud with cloud-native data warehouse and data lake architectures from Snowflake, Databricks, Google BigQuery, et al.

What is a data mesh?

I won’t write yet another article describing the concepts of a data mesh. Zhamak Dehghani coined the term in 2019. The following data mesh architecture from 30,000-foot view explains the basic idea well:

I summarize data mesh as the following three bullet points:

An architecture paradigm with several historical influences (domain-driven design, microservices, data marts, data streaming)
Not specific to a single technology or product; no single vendor can implement a data mesh alone
Handling data as a product is a fundamental change, enabling a more flexible architecture and independent solving of separate business problems
Decentralized services, not just analytics but also transactional workloads

Why handle data as a product?

Talking about innovative technology is insufficient to introduce a new architectural paradigm. Consequently, measuring the business value of the enterprise architecture is critical, too.

McKinsey finds that “when companies instead manage data like a consumer product—be it digital or physical—they can realize near-term value from their data investments and pave the way for quickly getting more value tomorrow. Creating reusable data products and patterns for piecing together data technologies enables companies to derive value from data today and tomorrow”:

For McKinsey, the benefits of this approach can be significant:

New business use cases can be delivered as much as 90 percent faster.
The total cost of ownership, including technology, development, and maintenance, can decline by 30 percent.
The risk and data-governance burden can be reduced.

What is data streaming with Apache Kafka and its relation to data mesh?

A data mesh enables flexibility through decentralization and best-of-breed data products. The heart of data sharing requires reliable real-time data at any scale between data producers and data consumers. Additionally, true decoupling between the decentralized data products is key to the success of the data mesh paradigm. Each domain must have access to shared data but also the ability to choose the right tool (i.e., technology, API, product, or SaaS) to solve its business problems.

That’s where data streaming fits into the data mesh story:

The de facto standard for data streaming is Apache Kafka. A cloud-native data streaming infrastructure that can link clusters with each other out-of-the-box enables building a modern data mesh. No Data Mesh will use just one technology or vendor. Learn from inspiring posts from your favorite data products vendors like AWS, Snowflake, Databricks, Confluent, and many more to successfully define and build your custom Data Mesh. Data Mesh is a journey, not a big bang. A data warehouse or data lake (or in modern days, a lakehouse) cannot be the only infrastructure for data mesh and data products.

I covered how to leverage the capabilities of Apache Kafka and its ecosystem like Kafka Connect, ksqlDB, Cluster Linking, etc. to build the heart of a data mesh in a separate blog post: Streaming Data Exchange with Kafka and a Data Mesh in Motion.

Example: Real-time data fabric in hybrid cloud

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

This example shows all the characteristics discussed in the above sections for a Data Mesh:

Decentralized real-time infrastructure across domains and infrastructures
True decoupling between domains within and between the clouds
Several communication paradigms, including data streaming, RPC, and batch
Data integration with legacy and cloud-native technologies
Continuous stream processing where it adds value, and batch processing in some analytics sinks

Presentation: Building a decentralized data mesh with data streaming at its heart

The following slide deck walks you through the motivation, principles, and architectures of building a real-time data mesh powered by Apache Kafka using the Kappa architecture, hybrid cloud, and stream data sharing:

The data mesh provides flexibility and freedom of technology choice for each data product

The heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. As the de facto standard for data streaming, Apache Kafka plays a crucial role in a cloud-native data mesh architecture. Nevertheless, data mesh is not bound to a specific technology. The beauty of the decentralized architecture is the freedom of technology choice for each business unit when building its data products.

Data sharing between domains within and across organizations is another aspect where data streaming helps in a data mesh. Real-time data beats slow data. That is not just true for most business problems across industries but also for replicating data between data centers, clouds, regions, or organizations. A streaming data exchange enables data sharing in real-time to build a data mash in motion.

Did you already start building your Data Mesh? What does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Is the heart of your data mesh real-time in motion or some lakehouse at rest? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Heart of the Data Mesh Beats Real-Time with Apache Kafka appeared first on Kai Waehner.

Best Practices for Building a Cloud-Native Data Warehouse or Data Lake

Kai Waehner — Thu, 21 Jul 2022 09:40:51 +0000

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 5: Best Practices for Building a Cloud-Native Data Warehouse or Data Lake.

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
THIS POST: Best Practices for Building a Cloud-Native Data Warehouse or Data Lake

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Best practices for building a cloud-native data warehouse or data lake

Let’s explore the following lessons learned from building cloud-native data analytics infrastructure with data warehouses, data lakes, data streaming, and lakehouses:

Lesson 1: Process and store data in the right place.
Lesson 2: Don’t design for data at rest to reverse it.
Lesson 3: There is no need for a lambda architecture to separate batch and real-time workloads
Lesson 4: Understand the trade-offs between data sharing at rest and a streaming data exchange.
Lesson 5: Data mesh is not a single product or technology.

Let’s get started…

Lesson 1: Process and store data in the right place.

Ask yourself: What is the use case for your data?

Here are a few use case examples for data and exemplary tools to implement the business case:

Recurring reporting for management => Data warehouse and its out-of-the-box reporting tools
Interactive analysis of structured and unstructured data => Business intelligence tools like Tableau, Power BI, Qlik, or TIBCO Spotfire on top of the data warehouse or another data store
Transactional business workloads => Custom Java application running in a Kubernetes environment or serverless cloud infrastructure
Advanced analytics to find insights in historical data => Raw data sets stored in a data lake for applying powerful algorithms with AI / Machine Learning technologies such as TensorFlow
Real-time actions on new events => Streaming applications to process and correlate data continuously while it is relevant

Real-time or batch compute on the right platform as needed

Batch workloads run best in an infrastructure that was built for this. For instance, Hadoop or Apache Spark. Real-time workloads run best in an infrastructure that was built for this. For example, Apache Kafka.

However, sometimes, both platforms could be used. Understand the underlying infrastructure to leverage it the best way. Apache Kafka can replace a database! Nevertheless, it should only be done in the few scenarios where it makes sense (i.e., simplifies the architecture or adds business value).

For example, replayability as a sequence of events (with guaranteed ordering with time stamps) is built into the immutable Kafka log. Replaying and re-processing historical data from Kafka are straightforward and a perfect use case for many scenarios, including:

New consumer application
Error-handling
Compliance / regulatory processing
Query and analyze existing events
Schema changes in the analytics platform
Model training

On the other side, if you need to do complex analytics like map-reduce or shuffling, SQL queries with tens of JOINs, a robust time-series analysis of sensor events, a search index based on ingested log information, and so on. Then you better choose Spark, Rockset, Apache Druid, or Elasticsearch for that use case.

Tiered storage with cloud-native object storage for cost-efficiency

A single storage infrastructure cannot solve all these problems, even if the “lakehouse vendors” tell you so. Hence, ingesting all data into a single system will fail to succeed with the above use cases. Choose a best-of-breed approach instead with the right tools for the job.

Modern, cloud-native systems decouple storage and compute. This is true for data streaming platforms like Apache Kafka and analytics platforms like Apache Spark, Snowflake, or Google BigQuery. SaaS solutions implement innovative tiered storage solutions (under the hood so you don’t see them) for cost-efficient separation between storage and compute.

Even modern data streaming services leverage tiered storage:

Lesson 2: Don’t design for data at rest to reverse it.

Ask yourself: Is there any added business value if you process data now instead of later (whatever later means)?

If yes, then don’t store the data in a database or data lake, or data warehouse as the first step. The data is stored at rest then and not available for real-time processing anymore. A data streaming platform like Apache Kafka is the right choice if real-time data beats slow data in your use case!

I find it amazing how many people put all their raw data into data storage just to find out that they could leverage the data in real-time later. Reverse ETL tools are spun up then to access the data in the lakehouse again via change data capture (CDC) or similar approaches. Or if you use Spark Structured Streaming (= “real-time”), but the first thing to get the data for “real-time stream processing” is reading it from an S3 object storage (= “at rest and too late”) is unfitting.

Reverse ETL is NOT the right approach for real-time use cases…

If you store data in a data warehouse or data lake, you cannot process it in real-time anymore as it is already stored at rest. These data stores are built for indexing, search, batch processing, reporting, model training, and other use cases that make sense in the storage system. But you cannot consume the data in real-time in motion from storage at rest:

… data streaming is built for continuously processing data in real-time

That’s where event streaming comes into play. Platforms like Apache Kafka enable processing data in motion in real-time for transactional and analytical workloads.

Reverse ETL is not needed in modern event-driven architecture! It is “built-in” into the architecture out-of-the-box. Each consumer directly consumes the data in real-time, if appropriate and technically feasible. And data warehouses or data lakes still consume it at their own pace in near-real-time or batch:

Again, this does not mean you should not put data at rest in your data warehouse or data lake. But only do that if you need to analyze the data later. The data storage at rest is NOT appropriate for real-time workloads.

Learn more about this topic in my blog post “When to Use Reverse ETL and when it is an Anti-Pattern“.

Lesson 3: There is no need for a lambda architecture to separate batch and real-time workloads

Ask yourself: What is the easiest way to consume and process incoming data with my favorite data analytics technology?

Real-time data beats slow data, but NOT always!

Think about your industry, business units, problems you solve, and innovative new applications you build. Real-time data beats slow data. This statement is almost always true. Either to increase revenue, reduce cost, reduce risk, or improve the customer experience.

Data at rest means to store data in a database, data warehouse, or data lake. This way, data is processed too late in many use cases – even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at rest is not a bad thing. Several use cases, such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) work very well with this approach. But real-time beats batch in almost all other use cases.

The Kappa architecture simplifies the infrastructure for batch AND real-time workloads

The Kappa architecture is an event-based software architecture that can handle all data at any scale in real-time for transactional AND analytical workloads.

The central premise behind the Kappa architecture is that you can perform both real-time and batch processing with a single technology stack. That’s a very different approach than the well-known Lambda architecture. The latter separates batch and real-time workloads in separate infrastructures and technology stacks.

The heart of a Kappa infrastructure is streaming architecture. First, the event streaming platform log stores incoming data. From there, a stream processing engine processes the data continuously in real-time or ingests the data into any other analytics database or business application via any communication paradigm and speed, including real-time, near real-time, batch, and request-response.

Learn more about the benefits and trade-offs of the Kappa architecture in my article “Kappa Architecture is Mainstream Replacing Lambda“.

Ask yourself: How do I need to share data with other internal business units or external organizations?

Use cases for hybrid and multi-cloud replication with data streaming, data lakes, data warehouses, and lakehouses

Many good reasons exist to replicate data across data centers, regions, or cloud providers:

Disaster recovery and high availability: Create a disaster recovery cluster and failover during an outage.
Global and multi-cloud replication: Move and aggregate data across regions and clouds.
Data sharing: Share data with other teams, lines of business, or organizations.
Data migration: Migrate data and workloads from one cluster to another (like from a legacy on-premise data warehouse to a cloud-native data lakehouse).

The story around internal or external data sharing is not different from other applications. Real-time replication beats slow data exchanges. Hence, storing data at rest and then replicating it to another data center, region, or cloud provider is an anti-pattern if real-time information adds business value.

The following example shows how independent stakeholders (= domains in different enterprises) use a cross-company streaming data exchange:

Innovation does not stop at the own border. Streaming replication is relevant for all use cases where real-time is better than slow data (valid for most scenarios). A few examples:

End-to-end supply chain optimization from suppliers to the manufacturer to the intermediary to the aftersales
Track and trace across countries
Integration of 3rd party add-on services to the own digital product
Open APIs for embedding and combining external services to build a new product

Read the details about a “Streaming Data Exchange with Kafka and a Data Mesh in Motion vs. Data Sharing at Rest in the Data Warehouse or Data Lake” for more details.

Also, understand why APIs (= REST / HTTP) and data streaming (= Apache Kafka) are complementary, not competitive!

Lesson 5: Data mesh is not a single product or technology.

Ask yourself: How do I create a flexible and agile enterprise architecture to innovate more efficiently and solve my business problems faster?

Data Mesh is a Logical View, not Physical!

Data mesh shifts to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create a self-serve data infrastructure, treating data as a product, and implementing open standardization to enable an ecosystem of interoperable distributed data products.

Here is an example of a Data Mesh:

TL;DR: Data Mesh combines existing paradigms, including Domain-driven Design, Data Marts, Microservices, and Event Streaming.

A data warehouse or data lake is NOT and CAN NOT BECOME the entire data mesh!

The heart of a Data Mesh infrastructure should be real-time, decoupled, reliable, and scalable. Kafka is a modern cloud-native enterprise integration platform (also often called iPaaS today). Therefore, Kafka provides all the capabilities for the foundation of a Data Mesh.

However, not all components can or should be Kafka-based. The beauty of microservice architectures is that every application can choose the right technologies. An application might or might not include databases, analytics tools, or other complementary components. The input and output data ports of the data product should be independent of the chosen solutions:

Kafka can be a strategic component of a cloud-native data mesh, not more and not less. But even if you do not use data streaming and build a data mesh only with data at rest, there is still no silver bullet. Don’t try to build a data mesh with a single product, technology, or vendor. No matter if that tool focuses on real-time data streaming, batch processing and analytics, or API-based interfaces. Tools like Starburst – a SQL-based MPP query engine powered by open source Trino (formerly Presto) – enable analytics on top of different data stores.

Best practices for a cloud-native data warehouse go beyond a SaaS product

Building a cloud-native data warehouse or data lake is an enormous project. It requires data ingestion, data integration, connectivity to analytics platforms, data privacy and security patterns, and much more. All of that is needed before the actual tasks like reporting or analytics can even begin.

The complete enterprise architecture beyond the scope of the data warehouse or data lake is even more complex. Best practices must be applied to build a resilient, scalable, elastic, and cost-efficient data analytics infrastructure. SLAs, latencies, and uptime have very different requirements across business domains. Best of breed approaches choose the right tool for the job. True decoupling between business units and applications allows focusing on solving specific business problems.

Separation of storage and compute, unified real-time pipelines instead of separating batch and real-time, avoiding anti-patterns like Reverse ETL, and appropriate data sharing concepts enable a successful journey to cloud-native data analytics.

For more details, browse other posts of this blog series:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
THIS POST: Lessons Learned from Building a Cloud-Native Data Warehouse

How did you modernize your data warehouse or data lake? Do you agree with the lessons I learned? What other experiences did you have? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Best Practices for Building a Cloud-Native Data Warehouse or Data Lake appeared first on Kai Waehner.

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?

Kai Waehner — Mon, 27 Jun 2022 12:35:22 +0000

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 1: Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?

Blog Series: Data Warehouse vs. Data Lake vs Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using a data warehouse, data lake, and data streaming together:

THIS POST: Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
Lessons Learned from Building a Cloud-Native Data Warehouse

I also created a short ten minute video that explains the differences between data streaming and a lakehouse (and why they are complementary):

The value of data: Transactional vs. analytical workloads

The last decade offered many articles, blogs, and presentations about data becoming the new oil. Today, nobody questions that data-driven business processes change the world and enable innovation across industries.

Data-driven business processes require both real-time data processing and batch processing. Think about the following flow of events across applications, domains, and organizations:

An event is business information or technical information. Events happen all the time. A business process in the real world requires the correlation of various events.

How critical is an event?

The criticality of an event defines the outcome. Potential impacts can be increased revenue, reduced risk, reduced cost, or improved customer experience.

Business transaction: Ideally, zero downtime and zero data loss. Example: Payments need to be processed exactly once.
Critical analytics: Ideally, zero downtime. Data loss of a single sensor event might be okay. Alerting on the aggregation of events is more critical. Example: Continuous monitoring of IoT sensor data and a (predictive) machine failure alert.
Non-critical analytics: Downtime and data loss are not good but do not kill the whole business. It is an accident, but not a disaster. Example: Reporting and business intelligence to forecast demand.

When to process an event?

Real-time usually means end-to-end processing within milliseconds or seconds. If you don’t need real-time decisions, batch processing (i.e., after minutes, hours, days) or on-demand (i.e., request-reply) is sufficient.

Business transactions are often real-time: A transaction like a payment usually requires real-time processing (e.g., before the customer leaves the store; before you ship the item; before you leave the ride-hailing car).
Critical analytics is usually real-time: Critical analytics very often requires real-time processing (e.g., detecting the fraud before it happens; predicting a machine failure before it breaks; upselling to a customer before he leaves the store).
Non-critical analytics is usually not real-time: Finding insights in historical data is usually done in a batch process using paradigms like complex SQL queries, map-reduce, or complex algorithms (e.g., reporting; model training with machine learning algorithms; forecasting).

With these basics about processing events, let’s understand why storing all events in a single, central data lake is not the solution to all problems.

Flexibility through decentralization and best-of-breed

The traditional data warehouse respectively data lake approach is to ingest all data from all sources into a central storage system for centralized data ownership. The sky (and your budget) is the limit with current big data and cloud technologies.

However, architectural concepts like domain-driven design, microservices, and data mesh show that decentralized ownership is the right choice for modern enterprise architecture:

No worries. The data warehouse and the data lake are not dead but more relevant than ever before in a data-driven world. Both make sense for many use cases. Even in one of these domains, larger organizations don’t use a single data warehouse or data lake. Selecting the right tool for the job (in your domain or business unit) is the best way to solve the business problem.

There are good reasons people are pleased with Databricks for batch ETL, machine learning, and now even data warehouse, but still prefer a lightweight cloud SQL database like AWS RDS (fully managed PostgreSQL) for some use cases.

There are good reasons happy Splunk users also ingest some data into Elasticsearch instead. And why Cribl is getting more and more traction in this space, too.

There are good reasons some projects leverage Apache Kafka as the database. Storing data long-term in Kafka makes only sense for some specific use cases (like compacted topics, key/value queries, streaming analytics). Kafka does not replace other databases or data lakes.

Choose the right tool for the job with decentralized data ownership!

With this in mind, let’s explore the use cases and added value of a modern data warehouse (and how it relates to data lakes and the new buzz lakehouse).

Data warehouse: Reporting and business intelligence with data at rest

A data warehouse (DWH) provides capabilities for reporting and data analysis. It is considered a core component of business intelligence.

Use cases for data at rest

No matter if you use a product called a data warehouse, data lake, or lakehouse. The data is stored at rest for further processing:

Reporting and Business Intelligence: Fast and flexible availability of reports, statistics, and key figures, for example, to identify correlations between market and service offering
Data Engineering: Integration of data from differently structured and distributed data sets to enable the identification of hidden relationships between data
Big Data Analytics and AI / Machine Learning: Global view of the source data and thus overarching evaluations to find unknown insights to improve business processes and interrelationships.

Some readers might say: Only the first is a use case for a data warehouse, and the other two are for a data lake or lakehouse! It all depends on the definition.

Data warehouse architecture

DWHs are central repositories of integrated data from disparate sources. They store historical data in one storage system. The data is stored at rest, i.e., saved for later analysis and processing. Business users analyze the data to find insights.

The data is uploaded from operational systems, such as IoT data, ERP, CRM, and many other applications. Data cleansing and data quality assurance are crucial pieces in the DWH pipeline. Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) are the two main approaches to building a data warehouse system. Data marts help to focus on a single subject or line of business within the data warehouse ecosystem.

The relation of the data warehouse to data lake and lakehouse

The focus of a data warehouse is reporting and business intelligence using structured data. Contrarily, the data lake is a synonym for storing and processing raw big data. A data lake was built with technologies like Hadoop, HDFS, and Hive in the past. Today, the data warehouse and data lake merged into a single solution. A cloud-native DWH supports big data. Similarly, a cloud-native data lake needs business intelligence with traditional tools.

Databricks: The evolution from the data lake to the data warehouse

That’s true for almost all vendors. For example, look at the history of one of the leading big data vendors: Databricks, known for being THE Apache Spark company. The company started as a commercial vendor behind Apache Spark, a big data batch processing platform. The platform was enhanced with (some) real-time workloads using micro-batching. A few milestones later, Databricks is an entirely different company today, focusing on cloud, data analytics, and data warehousing. Databricks’ strategy changed from:

open source to the cloud
self-managed software to fully-managed serverless offerings
focus on Apache Spark to AI / Machine Learning and later added data warehousing capabilities
a single product to a vast product portfolio around data analytics, including standardized data formats (“Delta Lake”), governance, ETL tooling (Delta Live Tables), and more,

Vendors like Databricks and AWS also coined a new buzzword for this merge of the data lake, data warehouse, business intelligence, and real-time capabilities: The Lakehouse.

The Lakehouse (sometimes called Data Lakehouse) is nothing new. It combines characteristics of separate platforms. I wrote an article about building a cloud-native serverless lakehouse on AWS using Kafka in conjunction with AWS analytics platforms.

Snowflake: The evolution from the data warehouse to the data lake

Snowflake came from the other direction. It was the first genuinely cloud-native data warehouse available on all major clouds. Today, Snowflake provides many more capabilities beyond the traditional business intelligence spectrum. For instance, data and software engineers have features to interact with Snowflake’s data lake through other technologies and APIs. The data engineer requires a Python interface to analyze historical data, while the software engineer prefers real-time data ingestion and analysis at any scale.

No matter if you build a data warehouse, data lake, or lakehouse: The crucial point is understanding the difference between data in motion and data at rest to find the right enterprise architecture and components for your solution. The following sections explore why a good data warehouse architecture requires both and how they complement each other very well.

Transactional real-time workloads should not run within the data warehouse or data lake! Separation of concerns is critical due to different uptime SLAs, regulatory and compliance laws, and latency requirements.

Data streaming: Supplementing the modern data warehouse with data in motion

Let’s clarify: Data streaming is NOT the same as data ingestion! You can use a data streaming technology like Apache Kafka for data ingestion into a data warehouse or data lake. Most companies do this. Fine and valuable.

BUT: A data streaming platform like Apache Kafka is so much more than just an ingestion layer. Hence, it differs significantly from ingestion engines like AWS Kinesis, Google Pub/Sub, and similar tools.

Data streaming is NOT the same as data ingestion

Data streaming provides messaging, persistence, integration, and processing capabilities. High scalability for millions of messages per second, high availability including backward compatibility and rolling upgrades for mission-critical workloads, and cloud-native features are some of the built-in features.

The de facto standard for data streaming is Apache Kafka. Therefore, I mainly use Kafka for data streaming architectures and use cases.

Transactional and analytics use cases for data streaming with Apache Kafka

The number of different use cases for data streaming is almost endless. Please remember that data streaming is NOT just a message queue for data ingestion! While data ingestion into a data lake was the first prominent use case, this implies <5% of actual Kafka deployments. Business applications, streaming ETL middleware, real-time analytics, and edge/hybrid scenarios are some of the other examples:

The persistence layer of Kafka enables decentralized microservice architectures for agile and truly decoupled applications.

Keep in mind that Apache Kafka supports transactional and analytics workloads. Both usually have very different uptime, latency, and data loss SLAs. Check out this post and slide deck to learn more about data streaming use cases across industries powered by Apache Kafka.

The next post of this blog series explores why data streaming with tools like Apache Kafka is the prevalent technology for data ingestion into data warehouses and data lakes.

Don’t (try to) use the data warehouse or data lake for data streaming

This blog post explored the difference between data at rest and data in motion:

A data warehouse is excellent for reporting and business intelligence.
A data lake is perfect for big data analytics and AI / Machine Learning.
Data streaming enables real-time use cases.
A decentralized, flexible enterprise architecture is required to build a modern data stack around microservices and data mesh.

None of the technologies is a silver bullet. Choose the right tool for the problem. Monolithic architectures do not solve today’s business problems. Storing all data only at rest does not help with the demand for real-time use cases.

The Kappa architecture is a modern approach for real-time and batch workloads to avoid a much more complex infrastructure using the Lambda architecture. Data streaming complements the data warehouse and data lake. The connectivity between these systems is available out-of-the-box if you choose the right vendors (that are often strategic partners, not competitors, as some people think).

For more details, browse to other posts of this blog series:

THIS POST: Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
Lessons Learned from Building a Cloud-Native Data Warehouse

How do you combine data warehouse and data streaming today? Is Kafka just your ingestion layer into the data lake? Do you already leverage data streaming for additional real-time use cases? Or is Kafka already the strategic component in the enterprise architecture for decoupled microservices and a data mesh? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies? appeared first on Kai Waehner.

Streaming Data Exchange with Kafka and a Data Mesh in Motion

Kai Waehner — Sun, 14 Nov 2021 13:25:45 +0000

Data Mesh is a new architecture paradigm that gets a lot of buzzes these days. Every data and platform vendor describes how to build the best Data Mesh with their platform. The Data Mesh story includes cloud providers like AWS, data analytics vendors like Databricks and Snowflake, and Event Streaming solutions like Confluent. This blog post looks into this principle deeper to explore why no single technology is the perfect fit to build a Data Mesh. Examples show why an open and scalable decentralized real-time platform like Apache Kafka is often the heart of the Data Mesh infrastructure, complemented by many other data platforms, to solve business problems.

Data at Rest vs. Data in Motion

Before we get into the Data Mesh discussion, it is crucial to clarify the difference and relevance of Data at Rest and Data in Motion:

Data at Rest: Data is ingested and stored in a storage system (database, data warehouse, data lake). Business logic and queries execute against the storage. Everyday use cases include reporting with business intelligence tools, model training in machine learning, and complex batch analytics like shuffling or map and reduce. As the data is at rest, the processing is too late for real-time use cases.
Data in Motion: Data is processed and correlated continuously while new events are fed into the platform. Business logic and queries execute in real-time. Common real-time use cases include inventory management, order processing, fraud detection, predictive maintenance, and many other use cases.

Real-time Data Beats Slow Data

Real-time beats slow data in almost all use cases across industries. Hence, ask yourself or your business team how they want or need to consume and process the data in the next project. Data at Rest and Data in Motion have trade-offs. Therefore, both concepts are complementary. For this reason, modern cloud infrastructures leverage both in their architecture. Serverless Event Streaming with Kafka combined with the AWS Lakehouse is a great resource to learn more.

However, while connecting a batch system to a real-time nervous system is possible, the other way round – connecting a real-time consumer to batch storage – is not possible. The Kappa vs. Lambda Architecture discussion gives more insights into this.

Kafka is a database. So, it is also possible to use it for data at rest. For instance, the replayability of historical events in guaranteed ordering is essential and helpful for many use cases. However, long-term storage in Kafka has several limitations, like limited query capabilities. Hence, for many use cases, event streaming and other storage systems are complementary, not competitive.

Data Mesh – An Architecture Paradigm

Data mesh is an implementation pattern (not unlike microservices or domain-driven design) but applied to data. Thoughtworks coined the term. You will find tons of resources on the web. Zhamak Dehghani gave a great introduction about “How to build the Data Mesh Foundation and its Relation to Event Streaming” at the Kafka Summit Europe 2021.

Domain-driven + Microservices + Event Streaming

Data Mesh is not an entirely new paradigm. It has several historical influences:

The architectural paradigm unlocks analytical data at scale, rapidly unlocking access to an ever-growing number of distributed domain data sets for a proliferation of consumption scenarios such as machine learning, analytics, or data-intensive applications across the organization. A data mesh addresses the common failure modes of the traditional centralized data lake or data platform architecture.

Data Mesh is a Logical View, not Physical!

Here is an example of a Data Mesh:

TL;DR: Data Mesh combines existing paradigms, including Domain-driven Design, Data Marts, Microservices, and Event Streaming.

Data as the Product

However, the differentiating aspect focuses on product thinking (“Microservice for Data”) with data as a first-class product. Data products are a perfect fit for Event Streaming with Data in Motion to build innovative new real-time use cases.

A Data Mesh with Event Streaming

Why is Event Streaming a good fit for data mesh?

Streams are real-time, so you can propagate data throughout the mesh immediately, as soon as new information is available. Streams are also persisted and replayable, so they let you capture both real-time AND historical data with one infrastructure. And because they are immutable, they make for a great source of record, which is helpful for governance.

Data in Motion is crucial for most innovative use cases. And as discussed before, real-time data beats slow data in almost all scenarios. Hence, it makes sense that the heart of a Data Mesh architecture is an Event Streaming platform. It provides true decoupling, scalable real-time data processing, and highly reliable operations across the edge, data center, and multi-cloud.

Kafka Streaming API – The De Facto Standard for Data in Motion

The Kafka API is the de facto standard for Event Streaming. I won’t explore this discussion again and again. Here are a few references before we move to the “Kafka + Data Mesh” content…

A Kafka-powered Data Mesh

I highly recommend watching Ben Stopford’s and Michael Noll’s talk about “Apache Kafka and the Data Mesh“. Several of the screenshots in this post are from that presentation, too. Kudos to my two colleagues! The talk explores the key concepts of a Data Mesh and how they are related to Event Streaming:

Domain-driven Decentralization
Data as a Self-serve Product
First-class Data Platform
Federated Governance

Let’s now explore how Event Streaming with Kafka fits into the Data Mesh architecture and how other solutions like a database or data lake complement it.

Data product, a “microservice for the data world”:

A node on the data mesh, situated within a domain.
Produces and possibly consumes high-quality data within the mesh.
Encapsulates all the elements required for its function, namely data plus code plus infrastructure.

A Data Mesh is not just one Technology!

The heart of a Data Mesh infrastructure must be real-time, decoupled, reliable, and scalable. Kafka is a modern cloud-native enterprise integration platform (also often called iPaaS today). Therefore, Kafka provides all the capabilities for the foundation of a Data Mesh.

However, not all components can or should be Kafka-based. Choose the right tool for a problem. Let’s explore in the following subsections how Kafka-native technologies and other solutions are used in a Data Mesh together.

Stream Processing within the Data Product with Kafka Streams and ksqlDB

An event-based data product aggregates and correlates information from one or more data sources in real-time. Stateless and stateful stream processing is implemented with Kafka-native tools such as Kafka Streams or ksqlDB:

Variety of Protocols and Communication Paradigms within the Data Product – HTTP, gRPC, MQTT, and more

Obviously, not every application uses just Event Streaming as a technology and communication paradigm. The above picture shows how one consumer application could also be a request/response technology like HTTP or gRPC to do a pull query. In contrast, another application continuously consumes the streaming push query with a native Kafka consumer in any programming language, such as Java, Scala, C, C++, Python, Go, etc.

The data product often includes complementary technologies. For instance, if you built a connected car infrastructure, you likely use MQTT for the last-mile integration, ingest the data into Kafka, and further processing with Event Streaming. The “Kafka + MQTT Blog Series” is an excellent example from the IoT space to learn about building a data product with complementary technologies.

Variety of Solutions within the Data Product – Event Streaming, Data Warehouse, Data Lake, and more

The beauty of microservice architectures is that every application can choose the right technologies. An application might or might not include databases, analytics tools, or other complementary components. The input and output data ports of the data product should be independent of the chosen solutions:

Kafka Connect is the right Kafka-native technology to connect other technologies and communication paradigms with the Event Streaming platform. Evaluate if you need another integration middleware (like an ETL or ESB) or if the Kafka infrastructure is the better enterprise integration platform (iPaaS) for your data product within the data mesh.

A Global Streaming Data Exchange

The Data Mesh concept is relevant for global deployments, not just within a single project or region. Multiple Kafka clusters are the norm, not an exception. I wrote about customers using Event Streaming with Kafka in global architectures a long time ago.

Various architectures exist to deploy Kafka across data centers and multiple clouds. Some use cases require low latency and deploy some Kafka instances at the edge or in a 5G zone. Other use cases replicate data between regions, countries, or continents across the globe for disaster recovery, aggregation, or analytics use cases.

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

This example shows all the characteristics discussed in the above sections for a Data Mesh:

Decentralized real-time infrastructure across domains and infrastructures
True decoupling between domains within and between the clouds
Several communication paradigms, including data streaming, RPC, and batch
Data integration with legacy and cloud-native technologies
Continuous stream processing where it adds value, and batch processing in some analytics sinks

Example: A Streaming Data Exchange across Domains in the Automotive Industry

The following example from the automotive industry shows how independent stakeholders (= domains in different enterprises) use a cross-company streaming data exchange:

Innovation does not stop at the own border. Streaming replication is relevant for all use cases where real-time is better than slow data (valid for most scenarios). A few examples:

End-to-end supply chain optimization from suppliers to the OEM to the intermediary to the aftersales
Track and trace across countries
Integration of 3rd party add-on services to the own digital product
Open APIs for embedding and combining external services to build a new product

I could go on and on with the list. Many data products need to be accessible by 3rd party in real-time at scale. Some API gateway or API management tool comes into play in such a situation.

A real-world example of a streaming data exchange powered by Kafka is the mobility service Here Technologies. They expose the Kafka API to directly consume streaming data from their mapping services (as an alternative option to their HTTP API):

However, even if all collaborating partners use Kafka under the hood in their architecture, exposing the Kafka API directly to the outside world does not always make sense. Some technical capabilities (e.g., access control or connectivity to thousands of devices) and missing business functions (e.g., for monetization or reporting) of the Kafka ecosystem bring an API layer on top of the Event Streaming infrastructure into play in many real-world deployments.

Open API for 3rd Party Integration and Streaming API Management

API Gateways and API Management tools exist in many varieties, including open-source frameworks, commercial products, and SaaS cloud offerings. Features include technical routing, access control, monetization, and reporting.

However, most people still implement the Open API concept with RPC in mind. I guess 95+% still use HTTP(S) to make APIs accessible to other stakeholders (e.g., other business units or external parties). RPC makes little sense in a streaming Data Mesh architecture if the data needs to be processed at scale in real-time.

There is still an impedance mismatch between Event Streaming and API Management. But it gets better these days. Specifications like AsyncAPI, calling itself the “industry standard for defining asynchronous APIs”, and similar approaches bring Open API to the data streaming world. My post “Kafka versus API Management with tools like MuleSoft, Kong, or Apigee” is still pretty much accurate if you want to dive deeper into this discussion. IBM API Connect was one of the first vendors that integrated Kafka via Async API.

A great example of the evolution from RPC to streaming APIs is the machine learning space. “Streaming Machine Learning with Kafka-native Model Deployment” explores how model servers such as Seldon enhance their product with a native Kafka API besides HTTP and gRPC request-response communication:

Journey to the Streaming Data Mesh with Kafka

The paradigm shift is enormous. Data Mesh is not a free lunch. The same was and still is true for microservice architectures, domain-driven design, Event Streaming, and other modern design principles.

In analogy to Confluent’s maturity model for Event Streaming, our team has described the journey for deploying a streaming Data Mesh:

The efforts likely take a few years in most scenarios. The shift is not just about technologies, but, as necessary are adjustments to organizations and business processes. I guess most companies are still in a very early stage. Let me know where you are on this journey!

Streaming Data Exchange as Foundation for a Data Mesh

A Data Mesh is an implementation pattern, not a specific technology. However, most modern enterprise architectures require a decentralized streaming data infrastructure to build valuable and innovative data products in independent, truly decoupled domains. Hence, Kafka, being the de facto standard for Event Streaming, comes into play in many Data Mesh architectures.

Many Data Mesh architectures span across many domains in various regions or even continents. The deployments run at the edge, on-prem, and multi-cloud. The integration connects to many solutions, technologies with different communication paradigms.

A cloud-native Event Streaming infrastructure with the capability to link clusters with each other out-of-the-box enables building a modern Data Mesh. No Data Mesh will use just one technology or vendor. Learn from the inspiring posts from your favorite data products vendors like AWS, Snowflake, Databricks, Confluent, and many more to define and build your custom Data Mesh successfully. Data Mesh is a journey, not a big bang.

Did you already start building your Data Mesh? How does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Streaming Data Exchange with Kafka and a Data Mesh in Motion appeared first on Kai Waehner.

Data Mesh Archives - Kai Waehner

Policy Enforcement and Data Quality for Apache Kafka with Schema Registry

From point-to-point and spaghetti to decoupled microservices with Apache Kafka

What makes Apache Kafka so popular

Difference between Kafka and ETL / ESB / iPaaS

Domain-driven design, microservices, data mesh…

The need for good data quality and data governance in Kafka Topics

Why you want good data quality in Kafka messages

Rules engine and policy enforcement in Kafka Topics with Schema Registry

Enforcing the message structure as the foundation of good data quality

Attribute-based policies and rules in data contracts

Data contracts and data quality rules for Kafka messages

Example: PII data enforcing encryption and error-handling with a dead letter queue

Data contracts as a foundation for new data streaming products and integration with Apache Flink

Apache Kafka for Data Consistency (and Real-Time Data Streaming)

Apache Kafka = Real-time data streaming

Real-time means many things…

Data consistency = The biggest challenge of the enterprise architecture

Apache Kafka = Streaming platform to decouple any application and communication paradigm

Apache Kafka enables domain-driven design and data mesh architectures

Erste Group Bank: A case study for data consistency with Apache Kafka

Hyper-personalized mobile banking

Fully managed data streaming for omnichannel data consistency

Data consistency is as critical as real-time data

Decentralized Data Mesh with Data Streaming in Financial Services

Data mesh and the need for real-time data streaming

The digital transformation in financial services

Raiffeisen Bank International – A bank transformation across 12 countries

Building a data mesh without knowing it…

The enterprise architecture of RBI’s data mesh with data streaming

Decoupling with decentralized data streaming as the integration layer

Data governance in regulated banking across the data mesh

The heart of a data mesh beats in real-time

A Real-Time Supply Chain Control Tower powered by Kafka

What is a supply chain control tower?

Supply chain management (SCM)

Challenges of batch workloads across the supply chain

Real-time end-to-end monitoring

A Kafka-native supply chain control tower

Global data mesh for real-time data sharing

Kafka as the data hub for a real-time supply chain control tower

Visualization AND automatic resolutions as core components of a modern control tower

Real-world Kafka examples across the supply chain for visibility and automation

Postmodern ERP / MES / TMS powered by Apache Kafka

Logistics and supply chain management require real-time data

The Heart of the Data Mesh Beats Real-Time with Apache Kafka

There is no single technology or product for a data mesh!

What is a data mesh?

Why handle data as a product?

What is data streaming with Apache Kafka and its relation to data mesh?

Example: Real-time data fabric in hybrid cloud

Presentation: Building a decentralized data mesh with data streaming at its heart

The data mesh provides flexibility and freedom of technology choice for each data product

Best Practices for Building a Cloud-Native Data Warehouse or Data Lake

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

Best practices for building a cloud-native data warehouse or data lake

Lesson 1: Process and store data in the right place.

Real-time or batch compute on the right platform as needed

Tiered storage with cloud-native object storage for cost-efficiency

Lesson 2: Don’t design for data at rest to reverse it.

Reverse ETL is NOT the right approach for real-time use cases…

… data streaming is built for continuously processing data in real-time

Lesson 3: There is no need for a lambda architecture to separate batch and real-time workloads

Real-time data beats slow data, but NOT always!

The Kappa architecture simplifies the infrastructure for batch AND real-time workloads

Lesson 4: Understand the trade-offs between data sharing at rest and a streaming data exchange.

Use cases for hybrid and multi-cloud replication with data streaming, data lakes, data warehouses, and lakehouses

Real-time data replication beats slow data sharing

Lesson 5: Data mesh is not a single product or technology.

Data Mesh is a Logical View, not Physical!

A data warehouse or data lake is NOT and CAN NOT BECOME the entire data mesh!

Best practices for a cloud-native data warehouse go beyond a SaaS product

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?

Blog Series: Data Warehouse vs. Data Lake vs Data Streaming

The value of data: Transactional vs. analytical workloads

How critical is an event?

When to process an event?

Flexibility through decentralization and best-of-breed

Data warehouse: Reporting and business intelligence with data at rest

Use cases for data at rest