Real-Time Data Sharing in the Telco Industry for MVNO Growth and Beyond with Data Streaming
https://www.kai-waehner.de/blog/2025/04/30/real-time-data-sharing-in-the-telco-industry-for-mvno-growth-and-beyond-with-data-streaming/
Wed, 30 Apr 2025

The telecommunications industry is transforming rapidly as Telcos expand partnerships with MVNOs, IoT platforms, and enterprise customers. Traditional batch-driven architectures can no longer meet the demands for real-time, secure, and flexible data access. This blog explores how real-time data streaming technologies like Apache Kafka and Flink, combined with hybrid cloud architectures, enable Telcos to build trusted, scalable data ecosystems. It covers the key components of a modern data sharing platform, critical use cases across the Telco value chain, and how policy-driven governance and tailored data products drive new business opportunities, operational excellence, and regulatory compliance. Mastering real-time data sharing positions Telcos to turn raw events into strategic advantage faster and more securely than ever before.

The telecommunications industry is entering a new era. Partnerships with MVNOs, IoT platforms, and enterprise customers demand flexible, secure, and real-time access to network and customer data. Traditional batch-driven architectures are no longer sufficient. Instead, real-time data streaming combined with policy-driven data sharing provides a powerful foundation for building scalable data products for internal and external consumers. A modern Telco must manage data collection, processing, governance, data sharing, and distribution with the same rigor as its core network services. Leading Telcos now operate centralized real-time data streaming platforms to integrate and share network events, subscriber information, billing records, and telemetry from thousands of data sources across the edge and core networks.

Data Sharing for MVNO Growth and Beyond with Data Streaming in the Telco Industry

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including a dedicated chapter about the telco industry.

Data Streaming in the Telco Industry

Telecommunications networks generate vast amounts of data every second. Every call, message, internet session, device interaction, and network event produces valuable information. Historically, much of this data was processed in batches — often hours or even days after it was collected. This delayed model no longer meets the needs of modern Telcos, partners, and customers.

Data streaming transforms how Telcos handle information. Instead of storing and processing data later, it is ingested, processed, and acted upon in real time as it is generated. This enables continuous intelligence across all parts of the network and business.

Learn more about “The Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming)“.

Business Value of Data Streaming in the Telecom Sector

Key benefits of data streaming for Telcos include:

  • Real-Time Visibility: Immediate insight into network health, customer behavior, fraud attempts, and service performance.
  • Operational Efficiency: Faster detection and resolution of issues reduces downtime, improves customer satisfaction, and lowers operating costs.
  • New Revenue Opportunities: Real-time data enables new services such as dynamic pricing, personalized offers, and proactive customer support.
  • Enhanced Security and Compliance: Immediate anomaly detection and instant auditability support regulatory requirements and protect against cyber threats.

Technologies like Apache Kafka and Apache Flink are now core components of Telco IT architectures. They allow Telcos to integrate massive, distributed data flows from radio access networks (RAN), 5G core systems, IoT ecosystems, billing and support platforms, and customer devices.

Modern Telcos use data streaming to not only improve internal operations but also to deliver trusted, secure, and differentiated services to external partners such as MVNOs, IoT platforms, and enterprise customers.

Learn More about Data Streaming in Telco

Learn more about data streaming in the telecommunications sector:

Data streaming is not an all-rounder that solves every problem. Hence, a modern enterprise architecture combines data streaming with purpose-built telco-specific platforms and SaaS solutions, as well as data lakes, warehouses, and lakehouses like Snowflake or Databricks for analytical workloads.

I already wrote about the combination of data streaming platforms like Confluent together with Snowflake and Microsoft Fabric. A blog series about data streaming with Confluent combined with AI and analytics using Databricks will follow right after this post.

Building a Real-Time Data Sharing Platform in the Telco Industry with Data Streaming

By mastering real-time data streaming, Telcos unlock the ability to share valuable insights securely and efficiently with internal divisions, IoT platforms, and enterprise customers.

Mobile Virtual Network Operators (MVNOs) — companies that offer mobile services without owning their own network infrastructure — are an equally important group of consumers. Because MVNOs deliver niche services, competitive pricing, and tailored customer experiences, real-time data sharing is essential to support their growth and enable differentiation in a highly competitive market.

Real-Time Data Sharing Between Organizations Is Necessary in the Telco Industry

A strong real-time data sharing platform in the telco industry integrates multiple types of components and stakeholders, organized into four critical areas:

Data Sources

A real-time data platform aggregates information from a wide range of technical systems across the Telco infrastructure.

  • Radio Access Network (RAN) Metrics: Capture real-time information about signal quality, handovers, and user session performance.
  • 5G Core Network Functions: Manage traffic flows, session lifecycles, and device mobility through UPF, SMF, and AMF components.
  • Operational Support Systems (OSS) and Business Support Systems (BSS): Provide data for service assurance, provisioning, customer management, and billing processes.
  • IoT Devices: Send continuous telemetry data from connected vehicles, industrial assets, healthcare monitors, and consumer electronics.
  • Customer Premises Equipment (CPE): Supply performance and operational data from routers, gateways, modems, and set-top boxes.
  • Billing Events: Stream usage records, real-time charging information, and transaction logs to support accurate billing.
  • Customer Profiles: Update subscription plans, user preferences, device types, and behavioral attributes dynamically.
  • Security Logs: Capture authentication events, threat detections, network access attempts, and audit trail information.
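
To make this tangible, the sketch below shows how one of these sources could publish events into Kafka with the plain Java producer API. The broker address, topic name, and JSON payload are placeholder assumptions; a production pipeline would typically use Avro or Protobuf with a schema registry instead of raw JSON strings.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class RanMetricsProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key by cell ID so all events of one cell land in the same partition (ordering per cell).
                String cellId = "cell-4711";
                String event = "{\"cellId\":\"" + cellId + "\",\"rsrpDbm\":-95,\"handoverSuccessRate\":0.987}";
                producer.send(new ProducerRecord<>("ran.metrics.raw", cellId, event));
            }
        }
    }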

Stream Processing

Stream processing technologies ensure raw events are turned into enriched, actionable data products as they move through the system.

  • Real-Time Data Ingestion: Continuously collect and process events from all sources with low latency and high reliability.
  • Data Aggregation and Enrichment: Transform raw network, billing, and device data into structured, valuable datasets.
  • Actionable Data Products: Create enriched, ready-to-consume information for operational and business use cases across the ecosystem.
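
As a minimal sketch of such an enrichment step, the Kafka Streams topology below joins a raw usage stream with a compacted subscriber profile topic to produce an enriched data product. The topic names and JSON-string values are assumptions for illustration.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class UsageEnrichmentApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "usage-enrichment");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> usage = builder.stream("usage.raw");            // keyed by subscriber ID
            KTable<String, String> profiles = builder.table("subscriber.profiles"); // compacted topic

            // Enrich each usage event with the latest profile of the same subscriber.
            usage.join(profiles, (usageEvent, profile) ->
                            "{\"usage\":" + usageEvent + ",\"profile\":" + profile + "}")
                 .to("usage.enriched"); // the data product consumed downstream

            new KafkaStreams(builder.build(), props).start();
        }
    }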

Data Governance

Effective governance frameworks guarantee that data sharing is secure, compliant, and aligned with commercial agreements.

  • Policy-Based Access Control: Enforce business, regulatory, and contractual rules on how data is shared internally and externally.
  • Data Protection Techniques: Apply masking, anonymization, and encryption to secure sensitive information at every stage.
  • Compliance Assurance: Meet regulatory requirements like GDPR, CCPA, and telecom-specific standards through real-time monitoring and enforcement.
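
One common protection technique, field-level masking, can be implemented directly in the stream before a topic is exposed externally. The following sketch assumes JSON string values containing an msisdn field; the topic names are hypothetical.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class PiiMaskingApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pii-masking");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> events = builder.stream("billing.events"); // internal topic with PII
            // Keep the first four digits of the subscriber number and redact the rest.
            events.mapValues(v -> v.replaceAll("(\"msisdn\"\\s*:\\s*\"\\d{4})\\d+(\")", "$1****$2"))
                  .to("billing.events.shared"); // masked topic exposed to partners
            new KafkaStreams(builder.build(), props).start();
        }
    }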

Data Consumers

Multiple internal and external stakeholders rely on tailored, policy-controlled access to real-time data streams to achieve business outcomes.

  • MVNO Partners: Consume real-time network metrics, subscriber insights, and fraud alerts to offer better customer experiences and safeguard operations.
  • Internal Telco Divisions: Use operational data to improve network uptime, optimize marketing initiatives, and detect revenue leakage early.
  • IoT Platform Services: Rely on device telemetry and mobility data to improve fleet management, predictive maintenance, and automated operations.
  • Enterprise Customers: Integrate real-time network insights and SLA compliance monitoring into private network and corporate IT systems.
  • Regulatory and Compliance Bodies: Access live audit streams, security incident data, and privacy-preserving compliance reports as required by law.

Key Data Products Driving Value for Data Sharing in the Telco Industry

In modern Telco architectures, data products act as the building blocks for a data mesh approach, enabling decentralized ownership, scalable integration with microservices, and direct access for consumers across the business and partner ecosystem.

Data Sharing in Telco with a Data Mesh and Data Products using Data Streaming with Apache Kafka

The right data products accelerate time-to-insight and enable additional revenue streams. Leading Telcos typically offer:

  • Network Quality Metrics: Monitoring service degradation, latency spikes, and coverage gaps continuously.
  • Customer Behavior Analytics: Tracking app usage, mobility patterns, device types, and engagement trends.
  • Fraud and Anomaly Detection Feeds: Capturing unusual usage, SIM swaps, or suspicious roaming activities in real time.
  • Billing and Charging Data Streams: Delivering session records and consumption details instantly to billing systems or MVNO partners.
  • Device Telemetry and Health Data: Providing operational status and error signals from smartphones, CPE, and IoT devices.
  • Subscriber Profile Updates: Streaming changes in service plans, device upgrades, or user preferences.
  • Location-Aware Services Data: Powering geofencing, smart city applications, and targeted marketing efforts.
  • Churn Prediction Models: Scoring customer retention risks based on usage behavior and network experience.
  • Network Capacity and Traffic Forecasts: Helping optimize resource allocation and investment planning.
  • Policy Compliance Monitoring: Ensuring real-time validation of internal and external SLAs, privacy agreements, and regulatory requirements.

These data products can be offered via APIs, secure topics, or integrated into partner platforms for direct consumption.
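
As one illustration, a fraud and anomaly detection feed can be derived directly from the usage stream. The sketch below counts events per subscriber in five-minute windows with Kafka Streams and emits an alert above a threshold; the topic names and the threshold of 1,000 events are assumptions, not a recommendation.

    import java.time.Duration;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class FraudAlertApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-alerts");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            builder.<String, String>stream("usage.events") // keyed by subscriber ID
                   .groupByKey()
                   .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
                   .count()
                   .toStream()
                   // More than 1,000 events per subscriber within five minutes is flagged as suspicious here.
                   .filter((windowedId, count) -> count > 1000)
                   .map((windowedId, count) -> KeyValue.pair(windowedId.key(),
                           "{\"alert\":\"unusual_usage\",\"count\":" + count + "}"))
                   .to("fraud.alerts");
            new KafkaStreams(builder.build(), props).start();
        }
    }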

How Each Data Consumer Gains Strategic Value

Real-time data streaming empowers each data consumer within the Telco ecosystem to achieve specific business outcomes, drive operational excellence, and unlock new growth opportunities based on continuous, trusted insights.

Internal Telco Divisions

Real-time insights into network behavior allow proactive incident management and customer support. Marketing teams optimize campaigns based on live subscriber data, while finance teams minimize revenue leakage by tracking billing and usage patterns instantly.

MVNO Partners

Access to live network quality indicators helps MVNOs improve customer satisfaction and loyalty. Real-time fraud monitoring protects against financial losses. Tailored subscriber insights enable MVNOs to offer personalized plans and upsells based on actual usage.

IoT Platform Services

Large-scale telemetry streaming enables better device management, predictive maintenance, and operational automation. Real-time geolocation data improves logistics, fleet management, and smart infrastructure performance. Event-driven alerts help detect and resolve device malfunctions rapidly.

Enterprise Customers

Private 5G networks and managed services depend on live analytics to meet SLA obligations. Enterprises integrate real-time network telemetry into their own systems for smarter decision-making. Data-driven optimizations ensure higher uptime, better resource utilization, and enhanced customer experiences.

Building a Trusted Data Ecosystem for Telcos with Real-Time Streaming and Hybrid Cloud

Real-time data sharing is no longer a luxury for Telcos — it is a competitive necessity. A successful platform must balance openness with control, ensuring that every data exchange respects privacy, governance, and commercial boundaries.

Hybrid cloud architectures play a critical role in this evolution. They enable Telcos to process, govern, and share real-time data seamlessly across on-premises infrastructure, edge environments, and public clouds. By combining the flexibility of cloud-native services with the security and performance of on-premises systems, hybrid cloud ensures that data remains accessible, scalable, cost-efficient, and compliant wherever it is needed.

Hybrid 5G Telco Architecture with Data Streaming with AWS Cloud and Confluent Edge and Cloud

By deploying scalable data streaming solutions across a hybrid cloud environment, Telcos enable secure, real-time data sharing with MVNOs, IoT platforms, enterprise customers, and internal business units. This empowers critical use cases such as dynamic quality of service monitoring, real-time fraud detection, customer behavior analytics, predictive maintenance for connected devices, and SLA compliance reporting — all without compromising performance or regulatory requirements.

The future of telecommunications belongs to those who implement real-time data streaming and controlled data sharing — turning raw events into strategic advantage faster, more securely, and more effectively than ever before.

How do you share data in your organization? Do you already leverage data streaming or still operate in batch mode? Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

Data Streaming as the Technical Foundation for a B2B Marketplace
https://www.kai-waehner.de/blog/2025/03/05/data-streaming-as-the-technical-foundation-for-a-b2b-marketplace/
Wed, 05 Mar 2025

A B2B data marketplace empowers businesses to exchange, monetize, and leverage real-time data through self-service platforms featuring subscription management, usage-based billing, and secure data sharing. Built on data streaming technologies like Apache Kafka and Flink, these marketplaces deliver scalable, event-driven architectures for seamless integration, real-time processing, and compliance. By exploring successful implementations like AppDirect, this post highlights how organizations can unlock new revenue streams and foster innovation with modern data marketplace solutions.

A B2B data marketplace is a groundbreaking platform enabling businesses to exchange, monetize, and use data in real time. Beyond the basic promise of data sharing, these marketplaces are evolving into self-service platforms with features such as subscription management, usage-based billing, and secure data monetization. This post explores the core technical and functional aspects of building a data marketplace for subscription commerce using data streaming technologies like Apache Kafka. Drawing inspiration from real-world implementations like AppDirect, the post examines how these capabilities translate into a robust and scalable architecture.

Data Streaming with Apache Kafka and Flink as the Backbone for a B2B Data Marketplace

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

Subscription Commerce with a Digital Marketplace

Subscription commerce refers to business models where customers pay recurring fees—monthly, annually, or usage-based—for access to products or services, such as SaaS, streaming platforms, or subscription boxes.

Digital marketplaces are online platforms where multiple vendors can sell their products or services to customers, often incorporating features like catalog management, payment processing, and partner integrations.

Together, subscription commerce and digital marketplaces enable businesses to monetize recurring offerings efficiently, manage customer relationships, and scale through multi-vendor ecosystems. These solutions enable organizations to sell their own or third-party recurring technology services through a white-labeled marketplace, or to streamline procurement with an internal IT marketplace for managing and acquiring services. Such a platform empowers digital growth for businesses of all sizes across direct and indirect go-to-market channels.

The Competitive Landscape for Subscription Commerce

The subscription commerce and digital marketplace space includes several prominent players offering specialized solutions.

Zuora leads in enterprise-grade subscription billing and revenue management, while Chargebee and Recurly focus on flexible billing and automation for SaaS and SMBs. Paddle provides global payment and subscription management tailored to SaaS businesses. AppDirect stands out for enabling SaaS providers and enterprises to manage subscriptions, monetize offerings, and build partner ecosystems through a unified platform.

For marketplace platforms, CloudBlue (from Ingram Micro) enables as-a-service ecosystems for telcos and cloud providers, and Mirakl excels at building enterprise-level B2B and B2C marketplaces.

Solutions like ChannelAdvisor and Vendasta cater to resellers and localized businesses with marketplace and subscription tools. Each platform offers unique capabilities, making the choice dependent on specific needs like scalability, industry focus, and integration requirements.

What Makes a B2B Data Marketplace Technically Unique?

A data marketplace is more than a repository; it is a dynamic, decentralized platform that enables the continuous exchange of data streams across organizations. Its key distinguishing features include:

  1. Real-Time Data Sharing: Enables instantaneous exchange and consumption of data streams.
  2. Decentralized Design: Avoids reliance on centralized data hubs, reducing latency and risk of single points of failure.
  3. Fine-Grained Access Control: Ensures secure and compliant data sharing.
  4. Self-Service Capabilities: Simplifies the discovery and consumption of data through APIs and portals.
  5. Usage-Based Billing and Monetization: Tracks data usage in real time to enable flexible pricing models.

These characteristics require a scalable, fault-tolerant, and real-time data processing backbone. Enter data streaming with the de facto standard Apache Kafka.

Data Streaming as the Backbone of a B2B Data Marketplace

At the heart of a B2B data marketplace lies data streaming, a technology paradigm enabling continuous data flow and processing. Kafka’s publish-subscribe architecture aligns seamlessly with the marketplace model, where data producers publish streams that consumers can subscribe to in real time.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Why Apache Kafka for a Data Marketplace?

A data streaming platform uniquely combines different characteristics and capabilities:

  1. Scalability and Fault Tolerance: Kafka’s distributed architecture allows for handling large volumes of data streams, ensuring high availability even during failures.
  2. Event-Driven Design: Kafka provides a natural fit for event-driven architectures, where data exchanges trigger workflows, such as subscription activation or billing.
  3. Stream Processing with Kafka Streams or ksqlDB: Real-time transformation, filtering, and enrichment of data streams can be performed natively, ensuring the data is actionable as it flows.
  4. Integration with Ecosystem: Kafka’s connectors enable seamless integration with external systems such as billing platforms, monitoring tools, and data lakes.
  5. Security and Compliance: Built-in features like TLS encryption, SASL authentication, and fine-grained ACLs ensure the marketplace adheres to strict security standards.

I wrote a separate article that explores how an Event-driven Architecture (EDA) and Apache Kafka build the foundation of a streaming data exchange.

Architecture Overview

Modern architectures for data marketplaces are often inspired by Domain-Driven Design (DDD), microservices, and the principles of a data mesh.

  • Domain-Driven Design helps structure the platform around distinct business domains, ensuring each part of the marketplace aligns with its core functionality, such as subscription management or billing.
  • Microservices decompose the marketplace into independently deployable services, promoting scalability and modularity.
  • A data mesh decentralizes data ownership, empowering individual teams or providers to manage and share their datasets while adhering to shared governance policies.

Decentralised Data Products with Data Streaming leveraging Apache Kafka in a Data Mesh

Together, these principles create a flexible, scalable, and business-aligned architecture. A high-level architecture for such a marketplace involves:

  1. Data Providers: Publish real-time data streams to Kafka Topics. Use Kafka Connect to ingest data from external sources.
  2. Data Marketplace Platform: A front-end portal backed by Kafka for subscription management, search, and discovery. Kafka Streams or Apache Flink for real-time processing (e.g., billing, transformation). Integration with billing systems, identity management, and analytics platforms.
  3. Data Consumers: Subscribe to Kafka Topics, consuming data tailored to their needs. Integrate the marketplace streams into their own analytics or operational workflows.

Data Sharing Beyond Kafka with Stream Sharing and Self-Service Data Portal

A data streaming platform enables simple and secure data sharing within or across organizations, with built-in chargeback capabilities to support cost APIs and new business models. The following is an implementation leveraging Confluent’s Stream Sharing functionality in Confluent Cloud:

Confluent Stream Sharing for Data Sharing Beyond Apache Kafka
Source: Confluent

Data Marketplace Features and Their Technical Implementation

A robust B2B data marketplace should offer the following vendor-agnostic features:

Self-Service Data Discovery

  • Functionality: Lets consumers browse, search, and evaluate available data streams through a catalog or portal, without manual onboarding.
  • Technical Implementation: Expose topic metadata and schemas through a schema registry and REST APIs, indexed in a searchable catalog behind the marketplace front end.

Real-Time Subscription Management

  • Functionality: Enables users to subscribe to data streams with customizable preferences, such as data filters or frequency of updates.
  • Technical Implementation: Use Kafka’s consumer groups to manage subscriptions. Implement filtering logic with Kafka Streams or ksqlDB to tailor streams to user preferences.
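
A minimal sketch of the filtering side: the Kafka Streams topology below tailors a shared raw stream to one subscriber's preferences (here, a simple region filter) and writes it to a dedicated topic that the subscriber's applications read with their own consumer group. The topic names and JSON layout are assumptions.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class SubscriptionFilterApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "subscription-filter-acme");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Tailor the shared stream to one subscription: only events for the EU region.
            builder.<String, String>stream("marketplace.events.raw")
                   .filter((key, value) -> value.contains("\"region\":\"EU\""))
                   .to("marketplace.events.acme"); // per-subscriber topic, read by its own consumer group
            new KafkaStreams(builder.build(), props).start();
        }
    }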

Usage-Based Billing

  • Functionality: Tracks the volume or type of data consumed by each user and generates invoices dynamically.
  • Technical Implementation: Use Kafka’s log retention and monitoring tools to track data consumption. Integrate with a billing engine via Kafka Connect or RESTful APIs for real-time invoice generation.
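
To sketch the tracking side: the Kafka Streams application below sums the payload size per tenant in hourly windows and publishes the result for a downstream billing engine. The topic names are hypothetical, and payload length is only a simple proxy for consumption; real deployments would also meter at the broker level.

    import java.time.Duration;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class UsageMeteringApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "usage-metering");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            builder.<String, String>stream("data.deliveries") // keyed by consumer/tenant ID
                   .groupByKey()
                   .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
                   // Sum the payload size per tenant and hour as a simple consumption metric.
                   .aggregate(() -> 0L, (tenant, payload, bytes) -> bytes + payload.length(),
                           Materialized.with(Serdes.String(), Serdes.Long()))
                   .toStream()
                   .map((windowedTenant, bytes) -> KeyValue.pair(windowedTenant.key(),
                           "{\"windowStart\":" + windowedTenant.window().start() + ",\"bytes\":" + bytes + "}"))
                   .to("billing.usage.hourly"); // consumed by the billing engine
            new KafkaStreams(builder.build(), props).start();
        }
    }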

Monetization and Revenue Sharing

  • Functionality: Facilitates revenue sharing between data providers and marketplace operators.
  • Technical Implementation: Build a revenue-sharing logic layer using Kafka Streams or Apache Flink, processing data usage metrics. Store provider-specific pricing models in a database connected via Kafka Connect.

Compliance and Data Governance

  • Functionality: Ensures data sharing complies with regulations (e.g., GDPR, HIPAA) and provides an audit trail.
  • Technical Implementation: Leverage Kafka’s immutable event log as an auditable record of all data exchanges. Implement data contracts for Kafka Topics with policies, role-based access control (RBAC), and encryption for secure sharing.

Dynamic Pricing Models

  • Functionality: Adjusts prices dynamically based on demand, consumption volume, or subscription tier.
  • Technical Implementation: Process usage metrics in real time with Kafka Streams or Apache Flink and publish updated prices to the relevant topics and APIs.

Marketplace Analytics

  • Functionality: Offers insights into usage patterns, revenue streams, and operational metrics.
  • Technical Implementation: Aggregate Kafka stream data into analytics platforms such as Snowflake, Databricks, Elasticsearch, or Microsoft Fabric.

Real-World Success Story: AppDirect’s Subscription Commerce Platform Powered by a Data Streaming Platform

AppDirect is a leading subscription commerce platform that helps businesses monetize and manage software, services, and data through a unified digital marketplace. It provides tools for subscription billing, usage tracking, partner management, and revenue sharing, enabling seamless B2B transactions.

AppDirect B2B Data Marketplace for Subscription Commerce
Source: AppDirect

AppDirect serves customers across industries such as telecommunications (e.g., Telstra, Deutsche Telekom), technology (e.g., Google, Microsoft), and cloud services, powering ecosystems for software distribution and partner-driven monetization.

The Challenge

AppDirect enables SaaS providers to monetize their offerings, but it faced significant challenges in scaling its platform to handle the growing complexity of real-time subscription billing and data flow management.

As the number of vendors and consumers on the platform increased, ensuring accurate, real-time tracking of usage and billing became increasingly difficult. Additionally, its legacy systems struggled to support the seamless integration, dynamic pricing models, and real-time updates required for a competitive marketplace experience.

The Solution

AppDirect implemented a data streaming backbone with Apache Kafka leveraging Confluent’s data streaming platform. This enabled:

  • Real-time billing for subscription services.
  • Accurate usage tracking and monetization.
  • Improved scalability with a distributed, event-driven architecture.

The Outcome

  • 90% reduction in time-to-market for new features.
  • Enhanced customer experience with real-time updates.
  • Seamless scaling to handle increasing vendor participation and data loads.

Advantages Over Competitors in the Subscription Commerce and Data Marketplace Business

Powered by the event-driven architecture and a data streaming platform, AppDirect distinguishes itself from competitors in the subscription commerce and data marketplace business:

  • A unified approach to subscription management, billing, and partner ecosystem enablement.
  • Strong focus on the telecommunications and technology sectors.
  • Deep integrations for vendor and reseller ecosystems.

Data Streaming Revolutionizes B2B Data Sharing

The technical backbone of a B2B data marketplace relies on data streaming to deliver real-time data sharing, scalable subscription management, and secure monetization. Platforms like Apache Kafka and Confluent enable these features through their distributed, event-driven architecture, ensuring resilience, compliance, and operational efficiency.

By implementing these principles, organizations can build a modern, self-service data marketplace that fosters innovation and collaboration. The success of AppDirect highlights the potential of this approach, offering a blueprint for businesses looking to capitalize on the power of data streaming.

Whether you’re a data provider seeking additional revenue streams or a business aiming to harness external insights, a well-designed data marketplace is your gateway to unlocking value in the digital economy.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And make sure to download my free book about data streaming use cases.

Data Streaming in Healthcare and Pharma: Use Cases and Insights from Cardinal Health
https://www.kai-waehner.de/blog/2024/11/28/data-streaming-in-healthcare-and-pharma-use-cases-cardinal-health/
Thu, 28 Nov 2024

This blog explores Cardinal Health’s journey, showing how its event-driven architecture and data streaming power use cases like supply chain optimization and medical device and equipment management. By integrating Apache Kafka with platforms like Apigee, Dell Boomi, and SAP, Cardinal Health sets a benchmark for IT modernization and innovation in the healthcare and pharma sectors.

Top 5 Trends for Data Streaming with Kafka and Flink in 2024
https://www.kai-waehner.de/blog/2023/12/02/top-5-trends-for-data-streaming-with-apache-kafka-and-flink-in-2024/
Sat, 02 Dec 2023

Do you wonder about my predicted TOP 5 data streaming trends with Apache Kafka and Flink in 2024 to set data in motion? Discover new technology trends and best practices for event-driven architectures, including data sharing, data contracts, serverless stream processing, multi-cloud architectures, and GenAI.

Data Streaming is one of the most relevant buzzwords in tech to build scalable real-time applications and innovative business models. Do you wonder about my predicted TOP 5 data streaming trends in 2024 to set data in motion? Learn what role Apache Kafka and Apache Flink play. Discover new technology trends and best practices for event-driven architectures, including data sharing, data contracts, serverless stream processing, multi-cloud architectures, and GenAI.

Some followers might notice that this became a series with past posts about the top 5 data streaming trends for 2021, the top 5 for 2022, and the top 5 for 2023. Trends change over time, but the huge value of having a scalable real-time infrastructure as the central data hub stays. Data streaming with Apache Kafka is a journey and evolution to set data in motion.

Top 5 Trends for Data Streaming with Apache Kafka and Flink in 2024

The research and consulting company Gartner defines the top strategic technology trends every year. This time, the trends are around building new (AI) platforms and delivering value through automation, but also protecting investments. On a higher level, it is all about automating, scaling, and pioneering. Here is what Gartner expects for 2024:

Gartner Top Strategic Technology Trends 2024

It is funny (but not surprising): Gartner’s predictions overlap with and complement the five trends I focus on for data streaming with Apache Kafka looking forward to 2024. I explore how data streaming enables faster time to market, good data quality across independent data products, and innovation with technologies like Generative AI.

The top 5 data streaming trends for 2024

I see the following topics coming up more regularly in conversations with customers, prospects, and the broader data streaming community across the globe:

  1. Data sharing for faster innovation with independent data products
  2. Data contracts for better data governance and policy enforcement
  3. Serverless stream processing for easier building of scalable and elastic streaming apps
  4. Multi-cloud deployments for cost-efficiently delivering value where the customers sit
  5. Reliable Generative AI (GenAI) with embedded accurate, up-to-date information to avoid hallucination

The following sections describe each trend in more detail. The trends are relevant for many scenarios, no matter whether you use open-source Apache Kafka or Apache Flink, a commercial platform, or a fully managed cloud service like Confluent Cloud. I start each section with a real-world case study. The end of the article contains the complete slide deck and video recording.

Data sharing across business units and organizations

Data sharing refers to the process of exchanging or providing access to data among different individuals, organizations, or systems. This can involve sharing data within an organization or sharing data with external entities. The goal of data sharing is to make information available to those who need it, whether for collaboration, analysis, decision-making, or other purposes. Obviously, real-time data beats slow data for almost all data sharing use cases.

NASA: Real-time data sharing with Apache Kafka

NASA enables real-time data sharing between space- and ground-based observatories. The General Coordinates Network (GCN) allows real-time alerts in the astronomy community. With this system, NASA researchers, private space companies, and even backyard astronomy enthusiasts can publish and receive information about current activity in the sky.

NASA enables real-time data from Mars with Apache Kafka

Apache Kafka plays an essential role in astronomy research for data sharing. Particularly where black holes and neutron stars are involved, astronomers are increasingly seeking out the “time domain” and want to study explosive transients and variability. In response, observatories are increasingly adopting streaming technologies to send alerts to astronomers and to get their data to their science users in real time.

The talk “General Coordinates Network: Harnessing Kafka for Real-Time Open Astronomy at NASA” explores architectural choices, challenges, and lessons learned in adapting Kafka for open science and open data sharing at NASA.

NASA’s approach to OpenID Connect / OAuth2 in Kafka is designed to securely scale Kafka from access inside a single organization to access by the general public.

Stream data exchange with Kafka using cluster linking, stream sharing, and AsyncAPI

The Kafka ecosystem provides various functions to share data in real time at any scale. Some are vendor-specific. I look at this from the perspective of Confluent, so that you see a lot of innovative options (even if you want to build it yourself with open-source Kafka):

  • Kafka Connect connector ecosystem to integrate with other data sources and sinks out-of-the-box
  • HTTP/REST proxies and connectors for Kafka to use simple and well-understood request-response communication (HTTP is, unfortunately, also an anti-pattern for streaming data)
  • Cluster Linking for replication between Kafka clusters using the native Kafka protocol (instead of separate infrastructure like MirrorMaker)
  • Stream Sharing for exposing a Kafka Topic through a simple button click with access control, encryption, quotas, and chargeback billing APIs
  • Generation of AsyncAPI specs to share data with non-Kafka applications (like other message brokers or API gateways that support AsyncAPI, an open data contract for asynchronous event-based messaging, similar to Swagger for HTTP/REST APIs)

Here is an example for Cluster Linking for bi-directional replication between Kafka clusters in the automotive industry:

Stream Data Exchange with Apache Kafka and Confluent Cluster Linking

And another example of stream sharing for easy access to a Kafka Topic in financial services:

Confluent Stream Sharing for Data Sharing Beyond Apache Kafka

To learn more, check out the article “Streaming Data Exchange with Kafka and a Data Mesh in Motion“.

Data contracts for data governance and policy enforcement

A data contract is an agreement or understanding that defines the terms and conditions governing the exchange or sharing of data between parties. It is a formal arrangement specifying how data will be handled, used, protected, and shared among entities. Data contracts are crucial when multiple parties need to interact with and utilize shared data, ensuring clarity and compliance with agreed-upon rules.

Raiffeisen Bank International: Data contracts for data sharing across countries

Raiffeisen Bank International (RBI) is scaling an event-driven architecture across the group as part of a bank-wide transformation program. This includes the creation of a reference architecture and the re-use of technology and concepts across 12 countries.

Data Mesh powered by Data Streaming at Raiffeisen Bank International

Learn more in the article “Decentralized Data Mesh with Data Streaming in Financial Services“.

Policy enforcement and data quality for Apache Kafka with Schema Registry

Good data quality is one of the most critical requirements in decoupled architectures like microservices or data mesh. Apache Kafka became the de facto standard for these architectures. But Kafka is a dumb broker that only stores byte arrays. The Schema Registry for Apache Kafka enforces message structures.

This blog post examines Schema Registry enhancements to leverage data contracts for policies and rules to enforce good data quality on field-level and advanced use cases like routing malicious messages to a dead letter queue.

Data Governance and Policy Enforcement with Data Contracts for Apache Kafka

For more details: Building a data mesh with decoupled data products and good data quality, governance, and policy enforcement.
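
As a small illustration of schema enforcement at produce time, the sketch below uses Confluent's KafkaAvroSerializer so that events are validated against a registered Avro schema before they reach the topic. The schema, topic name, and endpoint URLs are placeholders; the field-level rules described above build on top of this basic validation.

    import java.util.Properties;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ContractEnforcedProducer {
        private static final String PAYMENT_SCHEMA =
                "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"string\"},"
              + "{\"name\":\"amount\",\"type\":\"double\"}]}";

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
            props.put("schema.registry.url", "http://localhost:8081"); // placeholder registry endpoint

            Schema schema = new Schema.Parser().parse(PAYMENT_SCHEMA);
            GenericRecord payment = new GenericData.Record(schema);
            payment.put("id", "p-1001");
            payment.put("amount", 42.0);

            try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
                // The serializer validates the record against the registered schema, so
                // incompatible or malformed events fail here instead of polluting the topic.
                producer.send(new ProducerRecord<>("payments", "p-1001", payment));
            }
        }
    }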

Serverless stream processing for scalable and elastic streaming apps

Serverless stream processing refers to a computing architecture where developers can build and deploy applications without having to manage the underlying infrastructure.

In the context of stream processing, it involves the real-time processing of data streams without the need to provision or manage servers explicitly. This approach allows developers to focus on writing code and building applications. The cloud service takes care of the operational aspects, such as scaling, provisioning, and maintaining servers.

Sencrop: Smart agriculture with Kafka and Flink

Designed to answer professional farmers’ needs, Sencrop offers a range of connected weather stations that bring you precision agricultural weather data straight from your plots.

  • Over 20,000 connected ag-weather stations throughout Europe.
  • An intuitive, user-friendly application: Access accurate, ultra-local data to optimize your daily actions.
  • Prevent risks, reduce costs: Streamline inputs and reduce your environmental impact and associated costs.

Smart Agriculture with Kafka and Flink at Sencrop

Apache Kafka and Apache Flink increasingly join forces to build innovative real-time stream processing applications.

The Rise of Open Source Stream Processing with Apache Kafka and Apache Flink

The Y-axis in the diagram shows the monthly unique users (based on statistics of Maven downloads).

“Apache Kafka + Apache Flink = Match Made in Heaven” explores the benefits of combining both open-source frameworks. The article shows unique differentiators of Flink versus Kafka, and discusses when to use a Kafka-native streaming engine like Kafka Streams instead of Flink.

Unfortunately, operating a Flink cluster is hard, even harder than operating Kafka, because Flink is not just a distributed system: it also has to keep application state for hours or even longer. Hence, serverless stream processing takes over the operations burden. And it makes the life of the developer easier, too.

Stay tuned for exciting cloud products offering serverless Flink in 2024. But be aware that some vendors use the same trick as for Kafka: provisioning a Flink cluster and handing it over to you is NOT a serverless or fully managed offering! For that reason, I compared Kafka products as cars you still have to drive yourself vs. self-driving cars, i.e., cloud-based Kafka clusters you operate vs. truly fully managed services.

Multi-cloud for cost-efficient and reliable customer experience

Multi-cloud refers to a cloud computing strategy that uses services from multiple cloud providers to meet specific business or technical requirements. In a multi-cloud environment, organizations distribute their workloads across two or more cloud platforms, including public clouds, private clouds, or a combination of both.

The goal of a multi-cloud strategy is to avoid dependence on a single cloud provider and to leverage the strengths of different providers for various needs. Cost efficiency and regional laws (like operating in the United States or China) require different deployment strategies. Some countries do not provide a public cloud; a private cloud is then the only option.

New Relic: Multi-cloud Kafka deployments at extreme scale for real-time observability

New Relic is a software analytics company that provides monitoring and performance management solutions for applications and infrastructure. It’s designed to help organizations gain insights into the performance of their software and systems, allowing them to optimize and troubleshoot issues efficiently.

Observability has two key requirements: first, monitor data in real time at any scale; second, deploy the monitoring solution where the applications are running. The obvious consequence for New Relic is to process data with Apache Kafka and to deploy across multiple clouds, where its customers are.

Multi Cloud Observability in Real-Time at extreme Scale with Apache Kafka at New Relic

Hybrid and multi-cloud data replication for cost-efficiency, low latency, or disaster recovery

Multi-cloud deployments of Apache Kafka have become the norm rather than an exception. Several scenarios require multi-cluster solutions with specific requirements and trade-offs:

  • Regional separation because of legal requirements
  • Independence of a single cloud provider
  • Disaster recovery
  • Aggregation for analytics
  • Cloud migration
  • Mission-critical stretched deployments

Hybrid Cloud Architecture with Apache Kafka

The blog post “Architecture Patterns for Distributed, Hybrid, Edge and Global Apache Kafka Deployments” explores various architectures and best practices.

Reliable Generative AI (GenAI) with accurate context to avoid hallucination

Generative AI is a class of artificial intelligence systems that generate new content, such as images, text, or even entire datasets, often by learning patterns and structures from existing data. These systems use techniques such as neural networks to create content that is not explicitly programmed but is instead generated based on the patterns and knowledge learned during training.

Elemental Cognition: GenAI platform powered by Apache Kafka

Elemental Cognition’s AI platform provides responsible and transparent AI that helps solve problems and deliver expertise that can be understood and trusted.

Confluent Cloud powers the AI platform to enable scalable real-time data and data integration use cases. I recommend looking at their website to learn from various impressive use cases.

Elemental Cognition - Real Time GenAI Platform powered by Apache Kafka and Confluent Cloud

Apache Kafka serves thousands of enterprises as the mission-critical and scalable real-time data fabric for machine learning infrastructures. The evolution of Generative AI (GenAI) with large language models (LLM) like ChatGPT changed how people think about intelligent software and automation. The relationship between data streaming and GenAI has enormous opportunities.

Apache Kafka as Mission Critical Data Fabric for GenAI” explores the use cases for combining data streaming with Generative AI.

An excellent example, especially for Generative AI, is context-specific customer service. The following diagram shows an enterprise architecture leveraging event-driven data streaming for data ingestion and processing across the entire GenAI pipeline:

Event-driven Architecture with Apache Kafka and Flink as Data Fabric for GenAI

Stream processing with Kafka and Flink enables data correlation of real-time and historical data. A stateful stream processor takes existing customer information from the CRM, loyalty platform, and other applications, correlates it with the customer’s query to the chatbot, and makes an RPC call to an LLM.

Stream Processing with Apache Flink SQL UDF and GenAI with OpenAI LLM

The article “Apache Kafka + Vector Database + LLM = Real-Time GenAI” explores possible architectures, examples, and trade-offs between event streaming and traditional request-response APIs and databases.
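
To make the pattern concrete, here is a hedged sketch with Flink's Java DataStream API (the diagram above expresses the same idea with Flink SQL and a UDF): a KeyedCoProcessFunction keeps the latest CRM profile per customer in state and enriches each chatbot query with that context before calling an LLM. The stream wiring, message formats, and the callLlm() helper are hypothetical, and a production job would use Flink's Async I/O instead of a blocking RPC.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
    import org.apache.flink.util.Collector;

    public class GenAiEnrichmentJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Both streams use "customerId|payload" strings for brevity; real jobs read from Kafka.
            DataStream<String> crmUpdates = env.fromElements("42|{\"tier\":\"gold\"}");
            DataStream<String> chatQueries = env.fromElements("42|Where is my order?");

            crmUpdates.connect(chatQueries)
                      .keyBy(s -> s.split("\\|")[0], s -> s.split("\\|")[0])
                      .process(new EnrichAndPrompt())
                      .print();
            env.execute("genai-enrichment");
        }

        static class EnrichAndPrompt extends KeyedCoProcessFunction<String, String, String, String> {
            private transient ValueState<String> profile;

            @Override
            public void open(Configuration parameters) {
                profile = getRuntimeContext().getState(new ValueStateDescriptor<>("profile", String.class));
            }

            @Override
            public void processElement1(String crmUpdate, Context ctx, Collector<String> out) throws Exception {
                profile.update(crmUpdate.split("\\|", 2)[1]); // remember the latest customer context
            }

            @Override
            public void processElement2(String chatQuery, Context ctx, Collector<String> out) throws Exception {
                String context = profile.value() == null ? "{}" : profile.value();
                String prompt = "Customer context: " + context + "\nQuestion: " + chatQuery.split("\\|", 2)[1];
                out.collect(callLlm(prompt));
            }

            private String callLlm(String prompt) {
                // Hypothetical RPC to an LLM endpoint (e.g., an HTTP call to a hosted model).
                return "LLM answer for: " + prompt;
            }
        }
    }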

Slides and video recording for the data streaming trends in 2024 with Kafka and Flink

Do you want to look at more details? This section provides the entire slide deck and a video walking you through the content.

Slide deck

Here is the slide deck from my presentation:


Video recording

And here is the video recording of my presentation:

Video Recording: Top 5 Use Cases and Architectures for Data Streaming with Apache Kafka and Flink in 2024

2024 makes data streaming more mature, and Apache Flink becomes mainstream

I have two conclusions for data streaming trends in 2024:

  • Data streaming goes up in the maturity curve. More and more projects build streaming applications instead of just leveraging Apache Kafka as a dumb data pipeline between databases, data warehouses, and data lakes.
  • Apache Flink becomes mainstream. The open source framework shines with a scalable engine, multiple APIs like SQL, Java, and Python, and serverless cloud offerings from various software vendors. The latter makes building applications much more accessible.

Data sharing with data contracts is mandatory for a successful enterprise architecture with microservices or a data mesh. And data streaming is the foundation for innovation with technology trends like Generative AI. Therefore, we are just at the tipping point of adopting data streaming technologies such as Apache Kafka and Apache Flink.

What are your most relevant and exciting data streaming trends with Apache Kafka and Apache Flink in 2024 to set data in motion? What is your strategy and timeline? Do you use serverless cloud offerings or self-managed infrastructure? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka for Data Consistency (and Real-Time Data Streaming)
https://www.kai-waehner.de/blog/2022/12/27/apache-kafka-for-data-consistency-and-real-time-data-streaming/
Tue, 27 Dec 2022

Real-time data beats slow data in almost all use cases. But data consistency across all systems, including non-real-time legacy systems and modern request-response APIs, is just as essential. Apache Kafka’s most underestimated feature is the storage component based on the append-only commit log. It enables loose coupling for domain-driven design with microservices and independent data products in a data mesh. This blog post explores how Kafka enables data consistency with a real-world case study from financial services.

Apache Kafka for Data Consistency (and Real-Time Data Streaming)

Apache Kafka = Real-time data streaming

Real-time data beats slow data. It is that simple in almost all use cases. Ask any executive or business person: what is better, consuming and using information now or later? The value of data goes down over time:

Forrester - Business Value of Real Time Data

Apache Kafka is the de facto standard for real-time data streaming. Check out the data streaming landscape 2023 to learn more about Kafka-related products and cloud services.

So far, so good. However, one valid question always comes up: “Why is Apache Kafka different from a real-time message broker like RabbitMQ, IBM MQ, NATS, or Amazon SQS?”

TL;DR: Message brokers provide real-time messaging capabilities to produce and consume messages. Apache Kafka is a data streaming platform that combines messaging, storage, data integration, and stream processing capabilities.

My comparison of message brokers and data streaming explored the differences using ten characteristics. It is all about the storage component of Apache Kafka in the discussion of data consistency. Let’s explore why…

Real-time means many things…

… from deterministic systems with hard real-time up to minutes or even hours. Always define your requirements for real-time data processing and the end-to-end latency (not just the messaging component).

I clarified when to use Apache Kafka for real-time workloads in separate blog posts.

Data consistency = The biggest challenge of the enterprise architecture

Data consistency refers to whether the same data kept at different places does or does not match. The data is processed in many ways across the enterprise architecture:

  • Real-time: Message brokers or data streaming platforms transfer or process data when it is in motion.
  • Near real-time: Platforms ingest data into data lakes and data warehouses in seconds or minutes.
  • Batch: Reporting and analytics of historical data.
  • Request-response: Interactive API or SQL queries to collect specific information.
  • A point-in-time replay of historical data: Troubleshooting, incident management, regulatory reporting, and similar scenarios.

The applications and data platforms use very different (old and new) technologies, products, cloud services, and APIs. Integration and data consistency across the different communication paradigms is a massive challenge within the spaghetti architecture:

Integration Mess in a Spaghetti Enterprise Architecture

The consequence of inconsistent data is obvious:

  • Bad customer experience, e.g., late notification about flight delays or cancellations.
  • Revenue loss, e.g., inventory not up-to-date, missed or too late detection of fraud.
  • Increased cost, e.g., slow or wrong decisions in logistics across the supply chain.
  • Increased risk, e.g., unrecognized data breaches, compliance issues.

This is where the storage component of Apache Kafka makes the difference…

Apache Kafka = Streaming platform to decouple any application and communication paradigm

Kafka is an append-only commit log. Consumers are independent of each other and independent of producers. They interact at their own pace with their own communication paradigm and pull the information from the Kafka log.

This enables independent consumption and processing of data consistently. It does not matter what technology or communication paradigm the downstream consumer application uses:

Data Consistency with Real-Time, Batch or Request Response Consumption

This example of a real-time locating system (RTLS) for asset tracking can be built with Kafka straightforwardly. Downstream applications consume events in real-time, near real-time, batch, or via request-response HTTP/REST APIs. With other technologies, like real-time message queues, you need to add additional platforms for storage, integration, and data processing. Data streaming provides a single (scalable and reliable) platform.
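
A brief sketch of this decoupling with the plain Java consumer API, assuming a hypothetical asset.positions topic: two applications read the same topic fully independently, because each consumer group tracks its own offsets.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class IndependentConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            // Change the group ID ("realtime-dashboard" vs. "nightly-batch") and each application
            // tracks its own offsets: one can consume instantly, the other hours later.
            props.put("group.id", args.length > 0 ? args[0] : "realtime-dashboard");
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("asset.positions"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    records.forEach(r -> System.out.println(r.key() + " -> " + r.value()));
                }
            }
        }
    }

Running the same program twice with different group IDs demonstrates the independence: each instance receives all events at its own pace, without affecting the other.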

Apache Kafka enables domain-driven design and data mesh architectures

Domain-driven Design (DDD) needs decoupled applications. Push-based message brokers or HTTP/REST web services enforce tight coupling between the systems. This creates the above spaghetti architecture.

On the other hand, Apache Kafka truly decouples the domains, no matter what technologies or communication paradigms each domain uses. Loose coupling is the norm with Kafka:

Decentralized Data Mesh powered by data streaming and Apache Kafka

This is why the storage component using the combination of real-time messaging and a distributed commit log enables data consistency across technologies and communication paradigms.

With Tiered Storage for Kafka, storage and compute are separated to enable long-term storage in Kafka for the replayability of historical data. This is not needed for every use case. Kafka does not replace your favorite database. But it is beneficial for some use cases, e.g., model training with TensorFlow and Kafka to leverage machine learning with a Kappa architecture.
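
For replay, a consumer can simply rewind to the beginning of the retained log, which may reach back months when Tiered Storage is enabled. A minimal sketch with an assumed topic and partition layout:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class HistoricalReplay {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition partition = new TopicPartition("payments", 0); // assumed topic/partition
                consumer.assign(List.of(partition));
                consumer.seekToBeginning(List.of(partition)); // replay the full retained history
                ConsumerRecords<String, String> history = consumer.poll(Duration.ofSeconds(5));
                System.out.println("Replayed " + history.count() + " historical events");
            }
        }
    }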

Domain-driven design is the foundation of modern microservice architecture or data mesh. For that reason, many cloud-native enterprise architectures are built using Kafka as the heart of the infrastructure for real-time data sharing plus loose coupling between the data products.

Let’s look at a practical real-world example where the added value of Kafka was not its real-time capability but enabling data consistency across systems…

Erste Group Bank: A case study for data consistency with Apache Kafka

Erste Group Bank AG (Erste Group) is an Austrian financial service provider in Central and Eastern Europe serving 15.7 million clients in over 2,700 branches in seven countries. The bank presented its data streaming journey at Confluent’s Data in Motion 2022 tour in Zurich.

With a strong mission to improve its customer experience, Erste Group is putting more and more data into action. The bank summarizes this effort as “a race for consistency across our channels, squaring the circle between latency and data volumes.”

The digital transformation at Erste Group required challenging integration across different technologies and communication paradigms. The following sections describe why Apache Kafka was chosen.

Hyper-personalized mobile banking

A great user experience in the new mobile app “George” is a crucial strategic component of Erste Group’s digital transformation to improve customer experience and increase revenue.

Here is how Erste Group promotes its mobile app: “For 8 million people in 6 countries, banking has a name. George. George empowers everyone to understand, manage, and improve their financial health. Simple. Intelligent. Personal. Unique.”

The following diagram shows the positive and negative reviews of customers, including Erste Group’s mobile app “George” and competitive banking apps:

Erste Bank Mobile App Reviews and Feedback
Source: Erste Group Bank AG

A great mobile app user experience requires the combination of many technologies in the backend. Data streaming enables the foundation in the backend to build an intuitive mobile app across various European countries.

Scalability and accurate information at the right time in the right context make the difference:

Erste Bank Data Platform Landscape
Source: Erste Group Bank AG

Fully managed data streaming for omnichannel data consistency

Here comes the surprising part of why I chose this case study for the blog post: While the real-time capability of Apache Kafka at any scale is essential, the critical aspect of the technology choice was data consistency across platforms and communication paradigms:

Real Time and Data Consistency with Apache Kafka and Confluent Cloud at Erste Bank Group
Source: Erste Group Bank AG

Erste Group built an enterprise architecture with data streaming to enable asynchronous decoupled domains with event sourcing. The responsibility is split across cross-functional teams like Digital and Business Intelligence. However, consistent data is served in different ways for various downstream consumers:

  • Stream processing for real-time subscriptions
  • A serving layer for API integration via request-response protocols like HTTP/REST
  • Tiered Storage for long-term replayability of historical data to enable analytical queries

Data Consistency with Stream Processing powered by Apache Kafka
Source: Erste Group Bank AG

The infrastructure is fully managed in Confluent Cloud so that the teams can focus on business problems and innovation. DevOps and MLOps automate the development and monitoring lifecycle of the applications.

Data consistency is as critical as real-time data

Apache Kafka is the de facto standard for real-time data streaming. In addition, most enterprise architectures leverage the append-only commit log for loose coupling to enable agile and elastic microservice architectures. The vision of building data products in a decentralized data mesh is made possible with Apache Kafka.

This post showed the case study of Erste Group to enable data consistency across domains and technologies with fully managed Kafka in the cloud. Obviously, we just explored the foundation. Data sharing across organizations and enforced data governance, including access control, encryption, and audit logging, are mandatory to realize a data mesh in the real world. I discussed these topics in my overview of the top 5 trends for data streaming in 2023.

Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka for Data Consistency (and Real-Time Data Streaming) appeared first on Kai Waehner.

Top 5 Data Streaming Trends for 2023 https://www.kai-waehner.de/blog/2022/12/15/top-5-data-streaming-trends-for-2023-with-apache-kafka/ Thu, 15 Dec 2022 06:14:03 +0000 https://www.kai-waehner.de/?p=4847 Data Streaming is one of the most relevant buzzwords in tech to build scalable real-time applications in the cloud and innovative business models. Do you wonder about my predicted TOP 5 data streaming trends in 2023 to set data in motion? Check out the following presentation and learn what role Apache Kafka plays. Learn about decentralized Data Mesh, cloud-native lakehouse, data sharing, improved user experience, and advanced data governance.

The post Top 5 Data Streaming Trends for 2023 appeared first on Kai Waehner.

Some followers might notice that this became a series with past posts about the top 5 data streaming trends for 2021 and the top 5 for 2022. Data streaming with Apache Kafka is a journey and evolution to set data in motion. Trends change over time, but the core value of a scalable real-time infrastructure as the central data hub stays.

Top Use Cases and Architectures for Data Streaming with Apache Kafka in 2023

The research and consulting company Gartner defines the top strategic technology trends every year. This time, the trends are more focused on particular niche concepts. On a higher level, it is all about optimizing, scaling, and pioneering. Here is what Gartner expects for 2023:

Gartner Strategic Technology Trends for 2023
Source: Gartner

It is funny (but not surprising): Gartner’s predictions overlap with and complement the five trends I focus on for data streaming with Apache Kafka looking forward to 2023. I explore how data streaming enables better time to market with decentralized, optimized architectures, cloud-native infrastructure for elastic scale, and pioneering innovative use cases to build valuable data products.

Hence, here you go with the top 5 trends in data streaming for 2023.

The top 5 data streaming trends for 2023

I see the following topics coming up more regularly in conversations with customers, prospects, and the broader Kafka community across the globe:

  1. Cloud-native lakehouses
  2. Decentralized data mesh
  3. Data sharing in real-time
  4. Improved developer and user experience
  5. Advanced data governance and policy enforcement

The following sections describe each trend in more detail. The end of the article contains the complete slide deck. The trends are relevant for various scenarios, no matter if you use open-source Apache Kafka, a commercial platform, or a fully-managed cloud service like Confluent Cloud.

Kafka as data fabric for cloud-native lakehouses

Many data platform vendors pitch the lakehouse vision today. That’s the same story as the data lake in the Hadoop era, with a few new nuances. Put all your data into a single data store to save the world and solve every problem and use case:

One data lake or lakehouse for all data

In the last ten years, most enterprises realized this strategy did not work. The data lake is great for reporting and batch analytics, but not the right choice for every problem. Besides the technical hurdles, new challenges emerged: data governance, compliance issues, data privacy, and so on.

Applying a best-of-breed enterprise architecture for real-time and batch data analytics using the right tool for each job is a much more successful, flexible, and future-ready approach:

Data Streaming with Apache Kafka as Data Fabric for Databases, Data Lake, and Lakehouse Architectures

Data platforms like Databricks, Snowflake, Elastic, MongoDB, BigQuery, etc., have their sweet spots and trade-offs.

Data streaming increasingly becomes the real-time data fabric between all the different data platforms and other business applications, leveraging the real-time Kappa architecture instead of the much more batch-focused Lambda architecture.
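
A minimal Java sketch of this fabric pattern, using hypothetical topic and group names: two independent consumer groups read the very same event stream, one with low latency for fraud detection and one in micro-batches for lakehouse ingestion, without coordinating with each other:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class DataFabricConsumers {

    // Each downstream platform reads the same stream with its own group id,
    // at its own pace, without any coordination with the other consumers.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("orders")); // hypothetical shared topic
        return consumer;
    }

    public static void main(String[] args) {
        var fraudService = consumerFor("fraud-detection");      // low-latency consumer
        var lakehouseSink = consumerFor("lakehouse-ingestion"); // micro-batch consumer

        fraudService.poll(Duration.ofMillis(100)).forEach(r ->
                System.out.println("fraud check: " + r.value()));
        lakehouseSink.poll(Duration.ofSeconds(30)).forEach(r ->
                System.out.println("lakehouse batch: " + r.value()));
    }
}
```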

Decentralized data mesh with valuable data products

Focusing on business value by building data products in independent domains with various technologies is key to success in today’s agile world with ever-changing requirements and challenges. Data mesh came to the rescue and emerged as a next-generation design pattern, succeeding service-oriented architectures and microservices.

Vendors make two main proposals for building a data mesh: Data integration with data streaming enables fully decentralized data products. On the other side, data virtualization provides centralized queries:

Data Mesh with Data Streaming using Apache Kafka vs. Data Virtualization

Centralized queries are simple but do not provide a clean architecture and decoupled domains and applications. It might work well to solve a single problem in a project. However, I highly recommend building a decentralized data mesh with data streaming to decouple the applications, especially for strategic enterprise architectures.

Collaboration within and across organizations in real-time

Collaborating within and outside the organization with data sharing using Open APIs, streaming data exchange, and cluster linking enables many innovative business models:

Stream Data Exchange and Sharing with Data Mesh in Motion

The difference between data streaming to a database, data warehouse, or data lake is crucial: All these platforms enable data sharing at rest. The data is stored on a disk before it is replicated and shared within the organization or with partners. This is not real-time. You cannot connect a real-time consumer to data at rest.

However, real-time data beats slow data. Hence, sharing data in real-time with data streaming platforms like Apache Kafka or Confluent Cloud provides accurate data as soon as a change happens. A consumer can be real-time, near real-time, or batch. A streaming data exchange puts data in motion within the organization or for B2B data sharing and Open API business models.

AsyncAPI spec for Apache Kafka API schemas

AsyncAPI allows developers to define the interfaces of asynchronous APIs. It is protocol agnostic. Features include:

  • Specification of API contracts (= schemas in the data streaming world)
  • Documentation of APIs
  • Code generation for many programming languages
  • Data governance
  • And much more…

Confluent Cloud recently added a feature for generating an AsyncAPI specification for Apache Kafka clusters.

AsyncAPI and Apache Kafka with Confluent Cloud

We don’t know yet where the market is going. Will AsyncAPI become the standard for API contracts in data streaming? Maybe. I see increasing demand for this specification from customers. Let’s review the status of AsyncAPI in a few quarters or years. But it has the potential.

Improved developer experience with low-code / no-code tools for Apache Kafka

Many analysts and vendors pitch low code/no code tools. Visual coding is nothing new. Very sophisticated, powerful, and easy-to-use solutions exist as IDE or cloud applications. The significant benefit is time-to-market for developing applications and easier maintenance. At least in theory.

These tools support various personas like developers, citizen integrators, and data scientists. At least in theory.

The reality is that:

  • Code is king
  • Development is about evolution
  • Open platforms win

Low code/no code is great for some scenarios and personas. But it is just one option of many. Let’s look at a few alternatives for building Kafka-native applications:

Kafka API vs Streams vs KSQL vs Visual Coding with Stream Designer

These Kafka-native technologies have their trade-offs. For instance, the Confluent Stream Designer is perfect for building streaming ETL pipelines between various data sources and sinks. Just click the pipeline and transformations together. Then deploy the data pipeline into a scalable, reliable, and fully-managed streaming application. The difference to separate tools like Apache NiFi is that the generated code runs in the same streaming platform, i.e., one infrastructure end-to-end. This makes ensuring SLAs and latency requirements much more manageable and the whole data pipeline more cost-efficient.

However, the simpler a tool is, the less flexible it is. It is as simple as that, no matter which product or vendor you look at. And this is not just true for Kafka-native tools.

And you are flexible with your tool choice per project or business problem. Add your favorite non-Kafka stream processing engine to the stack, for instance, Apache Flink. Or use a separate iPaaS middleware like Dell Boomi or SnapLogic.
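
For comparison, the kind of streaming ETL pipeline a visual tool clicks together boils down to a topology like this minimal Kafka Streams sketch, with hypothetical topic names and a trivial stand-in transformation:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class StreamingEtlPipeline {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-etl");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("payments-raw", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value != null && !value.isBlank()) // drop bad events
               .mapValues(String::toUpperCase)                            // stand-in transformation
               .to("payments-curated", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```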

Domain-driven design with dumb pipes and smart endpoints

The real benefit of data streaming is the freedom of choice for your favorite Kafka-native technology, open-source stream processing framework, or cloud-native iPaaS middleware.

Choose the proper library, tool, or SaaS for your project. Data streaming enables a decoupled domain-driven design with dumb pipes and smart endpoints:

Decentralized Data Mesh powered by data streaming and Apache Kafka

Data streaming with Apache Kafka is perfect for domain-driven design (DDD). In contrast, the often-used point-to-point microservice architectures based on HTTP/REST web services or push-based message brokers like RabbitMQ create much stronger dependencies between applications.

Data governance across the data streaming pipeline

An enterprise architecture powered by data streaming enables easy access to data in real-time. Many enterprises leverage Apache Kafka as the central nervous system between all data sources and sinks.

The consequence of being able to access all data easily across business domains is two conflicting pressures on organizations: Unlock the data to enable innovation versus Lock up the data to keep it safe.

Achieving data governance across the end-to-end data streams with data lineage, event tracing, policy enforcement, and time travel to analyze historical events is critical for strategic data streaming in the enterprise architecture. Data governance on top of the streaming platform is required for end-to-end visibility, compliance, and security:

Data governance for streaming data with lineage, catalog, quality, policy management

Policy enforcement with schemas and API contracts

The foundation for data governance is the management of API contracts (so-called schemas in data streaming platforms like Apache Kafka). Solutions like Confluent enforce schemas along the data pipeline, including data producer, server, and consumer:

Data Governance and Policy Enforcement in Apache Kafka with Schema and API Contracts

Additional data governance tools like data lineage, catalog, or policy enforcement are built on this foundation. The recommendation for any serious data streaming project is to use schemas from the beginning. They may seem unnecessary for the first pipeline. But subsequent producers and consumers need a trusted environment with enforced policies to establish a decentralized data mesh architecture with independent but connected data products.
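
Here is a minimal sketch of schema enforcement on the producer side with Confluent’s Avro serializer, assuming a local Schema Registry, a hypothetical `payments` topic, and a made-up Payment schema. The serializer registers and validates the schema on every send, so incompatible payloads are rejected before they ever reach a consumer:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SchemaEnforcedProducer {

    // Hypothetical Avro schema: the API contract for this data product.
    private static final String PAYMENT_SCHEMA = """
            {"type":"record","name":"Payment","fields":[
              {"name":"id","type":"string"},
              {"name":"amount","type":"double"}]}""";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The Avro serializer validates the payload against Schema Registry
        // on send; schema-incompatible messages fail at the producer.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(PAYMENT_SCHEMA);
        GenericRecord payment = new GenericData.Record(schema);
        payment.put("id", "tx-1");
        payment.put("amount", 42.0);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "tx-1", payment)); // hypothetical topic
        }
    }
}
```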

Slides and Video for Data Streaming Use Cases in 2023

Here is the slide deck from my presentation:

Slide Deck - Top 5 Data Streaming Trends for 2023

And here is the free on-demand video recording:

Video Recording - Top Trends for Data Streaming in 2023

Data streaming goes up in the maturity curve in 2023

Data streaming is still at an early stage in most enterprises. But the discussion goes beyond questions like “when to use Kafka?” or “which cloud service to use?”. In 2023, most enterprises look at more sophisticated challenges around their numerous data streaming projects.

The new trends are often related to each other. A data mesh enables the building of independent data products that focus on business value. Data sharing is a fundamental requirement for a data mesh. New personas access the data stream. Often, citizen developers or data scientists need easy tools to pioneer new projects. The enterprise architecture requires and enforces data governance across the pipeline for security, compliance, and privacy reasons.

Scalability and elasticity need to be there out of the box. Fully-managed data streaming is a brilliant opportunity for getting started in 2023 and moving up in the maturity curve from single projects to a central nervous system of real-time data.

What are your most relevant and exciting trends for data streaming and Apache Kafka in 2023 to set data in motion? What are your strategy and timeline? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Top 5 Data Streaming Trends for 2023 appeared first on Kai Waehner.
