Analytics Archives - Kai Waehner
https://www.kai-waehner.de/blog/tag/analytics/
Technology Evangelist - Big Data Analytics - Middleware - Apache Kafka

Databricks and Confluent Leading Data and AI Architectures – What About Snowflake, BigQuery, and Friends?
https://www.kai-waehner.de/blog/2025/05/15/databricks-and-confluent-leading-data-and-ai-architectures-what-about-snowflake-bigquery-and-friends/
Thu, 15 May 2025 09:57:25 +0000

Confluent, Databricks, and Snowflake are trusted by thousands of enterprises to power critical workloads—each with a distinct focus: real-time streaming, large-scale analytics, and governed data sharing. Many customers use them in combination to build flexible, intelligent data architectures. This blog highlights how Erste Bank uses Confluent and Databricks to enable generative AI in customer service, while Siemens combines Confluent and Snowflake to optimize manufacturing and healthcare with a shift-left approach. Together, these examples show how a streaming-first foundation drives speed, scalability, and innovation across industries.

The modern data landscape is shaped by platforms that excel in different but increasingly overlapping domains. Confluent leads in data streaming with enterprise-grade infrastructure for real-time data movement and processing. Databricks and Snowflake dominate the lakehouse and analytics space—each with unique strengths. Databricks is known for scalable AI and machine learning pipelines, while Snowflake stands out for its simplicity, governed data sharing, and performance in cloud-native analytics.

This final blog in the series brings together everything covered so far and highlights how these technologies power real customer innovation. At Erste Bank, Confluent and Databricks are combined to build an event-driven architecture for Generative AI use cases in customer service. At Siemens, Confluent and Snowflake support a shift-left architecture to drive real-time manufacturing insights and medical AI—using streaming data not just for analytics, but also to trigger operational workflows across systems.

Together, these examples show why so many enterprises adopt a multi-platform strategy—with Confluent as the event-driven backbone, and Databricks or Snowflake (or both) as the downstream platforms for analytics, governance, and AI.

Data Streaming Lake Warehouse and Lakehouse with Confluent Databricks Snowflake using Iceberg and Tableflow Delta Lake

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Future articles will explore how these platforms affect data use in businesses. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter). You can also download my free book about data streaming use cases, including technical architectures and their relation to analytical platforms like Databricks and Snowflake.

The Broader Data Streaming and Lakehouse Landscape

The data streaming and lakehouse space continues to expand, with a variety of platforms offering different capabilities for real-time processing, analytics, and storage.

Data Streaming Market

On the data streaming side, Confluent is the leader. Other cloud-native services like Amazon MSK, Azure Event Hubs, and Google Cloud Managed Kafka provide Kafka-compatible offerings, though they vary in protocol support, ecosystem maturity, and operational simplicity. StreamNative, based on Apache Pulsar, competes with the Kafka offerings, while Decodable and DeltaStream leverage Apache Flink for real-time stream processing using a complementary approach. Startups such as AutoMQ and BufStream pitch a reimagined Kafka infrastructure for improved scalability and cost efficiency in cloud-native architectures.

The data streaming landscape is growing year by year. Here is the latest overview of the data streaming market:

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

Lakehouse Market

In the lakehouse and analytics platform category, Databricks leads with its cloud-native model combining compute and storage, enabling modern lakehouse architectures. Snowflake is another leading cloud data platform, praised for its ease of use, strong ecosystem, and ability to unify diverse analytical workloads. Microsoft Fabric aims to unify data engineering, real-time analytics, and AI on Azure under one platform. Google BigQuery offers a serverless, scalable solution for large-scale analytics, while platforms like Amazon Redshift, ClickHouse, and Athena serve both traditional and high-performance OLAP use cases.

The Forrester Wave for Lakehouses analyzes the vendor options and names Databricks, Snowflake, and Google as the leaders. Unfortunately, republishing the Forrester Wave is not permitted, so you need to download it from one of the vendors.

Confluent + Databricks

This blog series highlights Databricks and Confluent because they represent a powerful combination at the intersection of data streaming and the lakehouse paradigm. Together, they enable real-time, AI-driven architectures that unify operational and analytical workloads across modern enterprise environments.

Each platform in the data streaming and lakehouse space has distinct advantages, but none offers the same combination of real-time capabilities, open architecture, and end-to-end integration as Confluent and Databricks.

It’s also worth noting that open source remains a big – if not the biggest – competitor to all of these vendors. Many enterprises still rely on open-source data lakes built on Elastic, legacy Hadoop, or open table formats such as Apache Hudi—favoring flexibility and cost control over fully managed services.

Confluent: The Leading Data Streaming Platform (DSP)

Confluent is the enterprise-standard platform for data streaming, built on Apache Kafka and extended for cloud-native, real-time operations at global scale. The data streaming platform (DSP) delivers a complete and unified platform with multiple deployment options to meet diverse needs and budgets:

  • Confluent Cloud – Fully managed, serverless Kafka and Flink service across AWS, Azure, and Google Cloud
  • Confluent Platform – Self-managed software for on-premises, private cloud, or hybrid environments
  • WarpStream – Kafka-compatible, cloud-native infrastructure optimized for BYOC (Bring Your Own Cloud) using low-cost object storage like S3

Together, these options offer cost efficiency and flexibility across a wide range of streaming workloads:

  • Small-volume, mission-critical use cases such as payments or fraud detection, where zero data loss, strict SLAs, and low latency are non-negotiable
  • High-volume, analytics-driven use cases like clickstream processing for real-time personalization and recommendation engines, where throughput and scalability are key

Confluent supports these use cases with:

  • Cluster Linking for real-time, multi-region and hybrid cloud data movement
  • 100+ prebuilt connectors for seamless integration with enterprise systems and cloud services
  • Apache Flink for rich stream processing at scale
  • Governance and observability with Schema Registry, Stream Catalog, role-based access control, and SLAs
  • Tableflow for native integration with Delta Lake, Apache Iceberg, and modern lakehouse architectures

While other providers offer fragments—such as Amazon MSK for basic Kafka infrastructure or Azure Event Hubs for ingestion—only Confluent delivers a unified, cloud-native data streaming platform with consistent operations, tooling, and security across environments.

Confluent is trusted by over 6,000 enterprises and backed by deep experience in large-scale streaming deployments, hybrid architectures, and Kafka migrations. It combines industry-leading technology with enterprise-grade support, expertise, and consulting services to help organizations turn real-time data into real business outcomes—securely, efficiently, and at any scale.

Databricks: The Leading Lakehouse for AI and Analytics

Databricks is the leading platform for unified analytics, data engineering, and AI—purpose-built to help enterprises turn massive volumes of data into intelligent, real-time decision-making. Positioned as the Data Intelligence Platform, Databricks combines a powerful lakehouse foundation with full-spectrum AI capabilities, making it the platform of choice for modern data teams.

Its core strengths include:

  • Delta Lake + Unity Catalog – A robust foundation for structured, governed, and versioned data at scale
  • Apache Spark – Distributed compute engine for ETL, data preparation, and batch/stream processing
  • MosaicML – End-to-end tooling for efficient model training, fine-tuning, and deployment of custom AI models
  • AI/ML tools for data scientists, ML engineers, and analysts—integrated across the platform
  • Native connectors to BI tools (like Power BI, Tableau) and MLOps platforms for model lifecycle management

Databricks directly competes with Snowflake, especially in the enterprise AI and analytics space. While Snowflake shines with simplicity and governed warehousing, Databricks differentiates by offering a more flexible and performant platform for large-scale model training and advanced AI pipelines.

The platform supports:

  • Batch and (sort of) streaming analytics
  • ML model training and inference on shared data
  • GenAI use cases, including RAG (Retrieval-Augmented Generation) with unstructured and structured sources
  • Data sharing and collaboration across teams and organizations with open formats and native interoperability

Databricks is trusted by thousands of organizations for AI workloads, offering not only powerful infrastructure but also integrated governance, observability, and scalability—whether deployed on a single cloud or across multi-cloud environments.

Combined with Confluent’s real-time data streaming capabilities, Databricks completes the AI-driven enterprise architecture by enabling organizations to analyze, model, and act on high-quality, real-time data at scale.

Stronger Together: A Strategic Alliance for Data and AI with Tableflow and Delta Lake

Confluent and Databricks are not trying to replace each other. Their partnership is strategic and product-driven.

Recent innovation: Tableflow + Delta Lake – this feature enables bi-directional data exchange between Kafka and Delta Lake.

  • Direction 1: Confluent streams → Tableflow → Delta Lake (via Unity Catalog)
  • Direction 2: Databricks insights → Tableflow → Kafka → Flink or other operational systems

This simplifies architecture, reduces cost and latency, and removes the need for Spark jobs to manage streaming data.

Confluent Tableflow for Open Table Format Integration with Databricks Snowflake BigQuery via Apache Iceberg Delta Lake
Source: Confluent

Confluent becomes the operational data backbone for AI and analytics. Databricks becomes the analytics and AI engine fed with data from Confluent.
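
To make direction 1 concrete, here is a minimal PySpark sketch of the Databricks side, assuming Tableflow has already materialized a Kafka topic as a Delta table governed by Unity Catalog. The catalog, schema, table, and column names are hypothetical, and the Tableflow configuration itself happens in Confluent Cloud and is not shown:

```python
# Minimal sketch of the consuming side in Databricks, assuming Tableflow has
# materialized the Kafka topic "orders" as a Delta table registered in Unity
# Catalog as streaming.kafka.orders. All names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-analytics").getOrCreate()

# The Delta table is continuously fed from the Kafka topic -- no custom Spark
# streaming job is needed to land the data.
orders = spark.table("streaming.kafka.orders")

# Example analytical query on the streamed events: revenue per country.
revenue_per_country = (
    orders.groupBy("country")
          .agg(F.sum("amount").alias("revenue"))
          .orderBy(F.desc("revenue"))
)

revenue_per_country.show()
```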

Where needed, operational or analytical real-time AI predictions can be done within Confluent’s data streaming platform: with embedded or remote model inference, native integration for search with vector databases, and built-in models for common predictive use cases such as forecasting.

Erste Bank: Building a Foundation for GenAI with Confluent and Databricks

Erste Group Bank AG, one of the largest financial services providers in Central and Eastern Europe, is leveraging Confluent and Databricks to transform its customer service operations with Generative AI. Recognizing that successful GenAI initiatives require more than just advanced models, Erste Bank first focused on establishing a solid foundation of real-time, consistent, and high-quality data leveraging data streaming and an event-driven architecture.

Using Confluent, Erste Bank connects real-time streams, batch workloads, and request-response APIs across its legacy and cloud-native systems in a decoupled way while ensuring data consistency through Kafka. This architecture ensures that operational and analytical data — whether from core banking platforms, digital channels, mobile apps, or CRM systems — flows reliably and consistently across the enterprise. By integrating event streams, historical data, and API calls into a unified data pipeline, Confluent enables Erste Bank to create a live, trusted digital representation of customer interactions.

With this real-time foundation in place, Erste Bank leverages Databricks as its AI and analytics platform to build and scale GenAI applications. At the Data in Motion Tour 2024 in Frankfurt, Erste Bank presented a pilot project where customer service chatbots consume contextual data flowing through Confluent into Databricks, enabling intelligent, personalized responses. Once a customer request is processed, the chatbot triggers a transaction back through Kafka into the Salesforce CRM, ensuring seamless, automated follow-up actions.

GenAI Chatbot with Confluent and Databricks AI in FinServ at Erste Bank
Source: Erste Group Bank AG
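
The closed loop described above, where contextual events flow into the GenAI application and a follow-up action flows back through Kafka toward the CRM, can be sketched with the confluent-kafka Python client. Topic names, the broker address, and the payload fields are illustrative assumptions, not Erste Bank's actual implementation:

```python
# Hypothetical sketch: consume chatbot results and emit a follow-up event that a
# downstream CRM integration (e.g. a Salesforce sink connector) could consume.
# Topic names, broker address, and payload fields are illustrative assumptions.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "crm-follow-up",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["chatbot.responses"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    response = json.loads(msg.value())

    # Derive the CRM follow-up action from the chatbot result.
    crm_event = {
        "customer_id": response["customer_id"],
        "action": "create_case",
        "summary": response["summary"],
    }
    producer.produce(
        "crm.actions",
        key=str(response["customer_id"]),
        value=json.dumps(crm_event),
    )
    producer.flush()
```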

By combining Confluent’s real-time data streaming capabilities with Databricks’ powerful AI infrastructure, Erste Bank is able to:

  • Deliver highly contextual, real-time customer service interactions
  • Automate CRM workflows through real-time event-driven architectures
  • Build a scalable, resilient platform for future AI-driven applications

This architecture positions Erste Bank to continue expanding GenAI use cases across financial services, from customer engagement to operational efficiency, powered by consistent, trusted, and real-time data.

Confluent: The Neutral Streaming Backbone for Any Data Stack

Confluent is not tied to a monolithic compute engine within a cloud provider. This neutrality is a strength:

  • Bridges operational systems (mainframes, SAP) with modern data platforms (AI, lakehouses, etc.)
  • An event-driven architecture built with a data streaming platform feeds multiple lakehouses at once
  • Works across all major cloud providers, including AWS, Azure, and GCP
  • Operates at the edge, on-prem, in the cloud and in hybrid scenarios
  • One size doesn’t fit all – follow best practices from microservices architectures and data mesh to tailor your architecture with purpose-built solutions.

This flexibility makes Confluent the best platform for data distribution—enabling decoupled teams to use the tools and platforms best suited to their needs.

Confluent’s Tableflow also supports Apache Iceberg to enable seamless integration from Kafka into lakehouses beyond Delta Lake and Databricks—such as Snowflake, BigQuery, Amazon Athena, and many other data platforms and analytics engines.

Example: A global enterprise uses Confluent as its central nervous system for data streaming. Customer interaction events flow in real time from web and mobile apps into Confluent. These events are then:

  • Streamed into Databricks once for multiple GenAI and analytics use cases.
  • Written to an operational PostgreSQL database to update order status and customer profiles
  • Pushed into a customer-facing analytics engine like StarTree (powered by Apache Pinot) for live dashboards and real-time customer behavior analytics
  • Shared with Snowflake through a lift-and-shift M&A use case to unify analytics from an acquired company

This setup shows the power of Confluent’s neutrality and flexibility: enabling real-time, multi-directional data sharing across heterogeneous platforms, without coupling compute and storage.
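
In Kafka terms, this kind of multi-directional sharing simply means independent consumer groups reading the same topic. Here is a minimal sketch with the confluent-kafka Python client; the broker address, topic, and group names are placeholders:

```python
# Minimal sketch: independent consumer groups read the same topic at their own
# pace, so the lakehouse feed and the operational service never block each other.
# Broker address, topic, and group names are placeholders.
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,            # each group tracks its own offsets
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["customer.interactions"])
    return consumer

lakehouse_feed = make_consumer("lakehouse-ingestion")     # e.g. towards Databricks
operational_feed = make_consumer("order-status-updater")  # e.g. towards PostgreSQL
```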

Snowflake: A Cloud-Native Companion to Confluent – Natively Integrated with Apache Iceberg and Polaris Catalog

Snowflake pairs naturally with Confluent to power modern data architectures. As a cloud-native SaaS from the start, Snowflake has earned broad adoption across industries thanks to its scalability, simplicity, and fully managed experience.

Together, Confluent and Snowflake unlock high-impact use cases:

  • Near real-time ingestion and enrichment: Stream operational data into Snowflake for immediate analytics and action.
  • Unified operational and analytical workloads: Combine Confluent’s Tableflow with Snowflake’s Apache Iceberg support through its open source Polaris catalog to bridge operational and analytical data layers.
  • Shift-left data quality: Improve reliability and reduce costs by validating and shaping data upstream, before it hits storage.

With Confluent as the streaming backbone and Snowflake as the analytics engine, enterprises get a cloud-native stack that’s fast, flexible, and built to scale. Many enterprises use Confluent as the data ingestion platform for Databricks, Snowflake, and other analytical and operational downstream applications.
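
Streaming ingestion into Snowflake is typically implemented with the Snowflake sink connector for Kafka Connect. The sketch below registers such a connector via the Kafka Connect REST API; the connector class and property names follow the Snowflake connector documentation as I recall it, so verify them against the version you deploy, and all endpoints, credentials, and topic names are placeholders:

```python
# Hypothetical sketch: register a Snowflake sink connector via the Kafka Connect
# REST API. Verify the property names against your connector version; URLs,
# credentials, and topic names are placeholders.
import requests

connector = {
    "name": "snowflake-sink-orders",
    "config": {
        "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
        "tasks.max": "2",
        "topics": "orders",
        "snowflake.url.name": "myaccount.snowflakecomputing.com:443",
        "snowflake.user.name": "KAFKA_CONNECTOR",
        "snowflake.private.key": "<private-key>",
        "snowflake.database.name": "STREAMING",
        "snowflake.schema.name": "PUBLIC",
    },
}

resp = requests.post(
    "http://connect:8083/connectors",  # Kafka Connect REST endpoint
    json=connector,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```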

Shift Left at Siemens: Real-Time Innovation with Confluent and Snowflake

Siemens is a global technology leader operating across industry, infrastructure, mobility, and healthcare. Its portfolio includes industrial automation, digital twins, smart building systems, and advanced medical technologies—delivered through units like Siemens Digital Industries and Siemens Healthineers.

To accelerate innovation and reduce operational costs, Siemens is embracing a shift-left architecture to enrich data early in the pipeline before it reaches Snowflake. This enables reusable, real-time data products in the data streaming platform leveraging an event-driven architecture for data sharing with analytical and operational systems beyond Snowflake.

Siemens Digital Industries applies this model to optimize manufacturing and intralogistics, using streaming ETL to feed real-time dashboards and trigger actions like automated inventory replenishment—while continuing to use Snowflake for historical analysis, reporting, and long-term data modeling.

Siemens Shift Left Architecture and Data Products with Data Streaming using Apache Kafka and Flink
Source: Siemens Digital Industries

Siemens Healthineers embeds AI directly in the stream processor to detect anomalies in medical equipment telemetry, improving response time and avoiding costly equipment failures—while leveraging Snowflake to power centralized analytics, compliance reporting, and cross-device trend analysis.

Machine Monitoring and Streaming Analytics with MQTT Confluent Kafka and TensorFlow AI ML at Siemens Healthineers
Source: Siemens Healthineers

These success stories are part of The Data Streaming Use Case Show, my new industry webinar series. Learn more about Siemens’ usage of Confluent and Snowflake and watch the video recording about “shift left”.

Open Outlook: Agentic AI with Model Context Protocol (MCP) and Agent2Agent Protocol (A2A)

While data and AI platforms like Databricks and Snowflake play a key role, some Agentic AI projects will likely rely on emerging, independent SaaS platforms and specialized tools. Flexibility and open standards are key for future success.

What better way to close a blog series on Confluent and Databricks (and Snowflake) than by looking ahead to one of the most exciting frontiers in enterprise AI: Agentic AI.

As enterprise AI matures, there is growing interest in bi-directional interfaces between operational systems and AI agents. Google’s Agent2Agent (A2A) protocol reinforces this shift—highlighting how intelligent agents can autonomously communicate, coordinate, and act across distributed systems.

Agent2Agent Protocol (A2A) and MCP via Apache Kafka as Event Broker for Truly Decoupled Agentic AI

Confluent + Databricks is an ideal combination to support these emerging Agentic AI patterns, where event-driven agents continuously learn from and act upon streaming data. Models can be embedded directly in Flink for low-latency applications or hosted and orchestrated in Databricks for more complex inference workflows.

The Model Context Protocol (MCP) is gaining traction as a design blueprint for standardized interaction between services, models, and event streams. In this context, Confluent and Databricks are well positioned to lead:

  • Confluent: Event-driven delivery of context, inputs, and actions
  • Databricks: Model hosting, training, inference, and orchestration
  • Jointly: Closed feedback loops between business systems and AI agents

Together with protocols like A2A and MCP, this architecture will shape the next generation of intelligent, real-time enterprise applications.

Confluent + Databricks: The Future-Proof Data Stack for AI and Analytics

Databricks and Confluent are not just partners. They are leaders in their respective domains. Together, they enable real-time, intelligent data architectures that support operational excellence and AI innovation.

Other AI and data platforms are part of the landscape, and many bring valuable capabilities. As explored in this blog series, true decoupling through an event-driven architecture with Apache Kafka allows combining any mix of vendors and cloud services. I see many enterprises integrating both Databricks and Snowflake with Confluent. However, the alignment between Confluent and Databricks stands out due to its combination of strengths:

  • Confluent’s category leadership in data streaming, powering thousands of production deployments across industries
  • Databricks’ strong position in the lakehouse and AI space, with broad enterprise adoption for analytics and machine learning
  • Shared product vision and growing engineering and go-to-market alignment across technical and field organizations

For enterprises shaping a long-term data and AI strategy, this combination offers a proven, scalable foundation—bridging real-time operations with analytical depth, without forcing trade-offs between speed, flexibility, or future-readiness.

Stay tuned for deep dives into how these platforms are shaping the future of data-driven enterprises. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter). You can also download my free book about data streaming use cases, including technical architectures and their relation to analytical platforms like Databricks and Snowflake.

The Role of Data Streaming in McAfee’s Cybersecurity Evolution
https://www.kai-waehner.de/blog/2025/01/27/the-role-of-data-streaming-in-mcafees-cybersecurity-evolution/
Mon, 27 Jan 2025 07:33:30 +0000

In today’s digital landscape, cybersecurity faces mounting challenges from sophisticated threats like ransomware, phishing, and supply chain attacks. Traditional defenses like antivirus software are no longer sufficient, prompting the adoption of real-time, event-driven architectures powered by data streaming technologies like Apache Kafka and Flink. These platforms enable real-time threat detection, prevention, and response by processing massive amounts of security data from endpoints and systems. A success story from McAfee highlights how transitioning to an event-driven architecture with Kafka in Confluent Cloud has enhanced scalability, operational efficiency, and real-time protection for millions of devices. As cybersecurity threats evolve, data streaming proves essential for organizations aiming to secure their digital assets and maintain trust in an interconnected world.

In today’s digital age, cybersecurity is more vital than ever. Businesses and individuals face escalating threats such as malware, ransomware, phishing attacks, and identity theft. Combatting these challenges requires cutting-edge solutions that protect computers, networks, and devices. Beyond safeguarding digital assets, modern cybersecurity tools ensure compliance, privacy, and trust in an increasingly interconnected world.

As threats grow more sophisticated, the technologies powering cybersecurity solutions must advance to stay ahead. Data streaming technologies like Apache Kafka and Apache Flink have become foundational in this evolution, enabling real-time threat detection, prevention, and rapid response. These tools transform cybersecurity from static defenses to dynamic systems capable of identifying and neutralizing threats as they occur.

A notable example is McAfee, a global leader in cybersecurity, which has embraced data streaming to revolutionize its operations. By transitioning to an event-driven architecture powered by Apache Kafka, McAfee processes massive amounts of real-time data from millions of endpoints, ensuring instant threat identification and mitigation. This integration has enhanced scalability, reduced infrastructure complexity, and accelerated innovation, setting a benchmark for the cybersecurity industry.

Real-time data streaming is not just an advantage—it’s now a necessity for organizations aiming to safeguard digital environments against ever-evolving threats.

Data Streaming with Apache Kafka and Flink as Backbone for Real Time Cybersecurity at McAfee

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch.

Antivirus is NOT Enough: Supply Chain Attack

A supply chain attack occurs when attackers exploit vulnerabilities in an organization’s supply chain, targeting weaker links such as vendors or service providers to indirectly infiltrate the target.

For example, an attacker compromises Vendor 1, a software provider, by injecting malicious code into their product. Vendor 2, a service provider using Vendor 1’s software, becomes infected. The attacker then leverages Vendor 2’s connection to the Enterprise to access sensitive systems, even though Vendor 1 has no direct interaction with the enterprise.

The Anatomy of a Supply Chain Attack in Cybersecurity

Traditional antivirus software is insufficient to prevent such complex, multi-layered attacks. Ransomware often plays a role in supply chain attacks, as attackers use it to encrypt data or disrupt operations across compromised systems.

Modern solutions focus on real-time monitoring and event-driven architecture to detect and mitigate risks across the supply chain. These solutions utilize behavioral analytics, zero trust policies, and proactive threat intelligence to identify and stop anomalies before they escalate.

By providing end-to-end visibility, they protect organizations from cascading vulnerabilities that traditional endpoint security cannot address. In today’s interconnected world, comprehensive supply chain security is critical to safeguarding enterprises.

The Role of Data Streaming in Cybersecurity

Cybersecurity platforms must rely on real-time data for detecting and mitigating threats. Data streaming provides a backbone for processing massive amounts of security event data as it happens, ensuring swift and effective responses. My blog series on Kafka and cybersecurity looks deeply into these use cases.

Cybersecurity for Situational Awareness and Threat Intelligence in Smart Buildings and Smart City

To summarize:

  • Data Collection: A data streaming platform powered by Apache Kafka collects logs, telemetry, and other data from devices and applications in real time.
  • Data Processing: Stream processing frameworks like Kafka Streams and Apache Flink continuously process this data with low latency at scale for analytics, identifying anomalies or potential threats.
  • Actionable Insights: The processed data feeds into Security Information and Event Management (SIEM) and Security Orchestration, Automation, and Response (SOAR) systems, enabling automated responses and better decision-making.

This approach transforms static, batch-driven cybersecurity operations into dynamic, real-time processes.
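
As a simplified illustration of the processing step, the following sketch consumes security events from Kafka and flags hosts with an unusual number of failed logins in a short time window. A production deployment would use Kafka Streams or Apache Flink for fault-tolerant, scalable state handling as described above; topic names, payload fields, and thresholds are assumptions:

```python
# Simplified sketch: flag hosts with too many failed logins within a 60-second
# window. Real deployments would use Kafka Streams or Apache Flink for
# fault-tolerant, scalable state; topic names, fields, and thresholds are
# illustrative assumptions.
import json
import time
from collections import defaultdict, deque
from confluent_kafka import Consumer, Producer

WINDOW_SECONDS = 60
THRESHOLD = 5

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "failed-login-detector",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["security.events"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

failed_logins = defaultdict(deque)  # host -> timestamps of recent failed logins

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event.get("type") != "login_failed":
        continue

    now = time.time()
    window = failed_logins[event["host"]]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()

    if len(window) >= THRESHOLD:
        alert = {"host": event["host"], "failed_logins": len(window)}
        producer.produce("security.alerts", value=json.dumps(alert))
        producer.poll(0)  # serve delivery callbacks without blocking
```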

McAfee: A Real-World Data Streaming Success Story

McAfee is a global leader in cybersecurity, providing software solutions that protect computers, networks, and devices. Founded in 1987, the company has evolved from traditional antivirus software to a comprehensive suite of products focused on threat prevention, identity protection, and data security.

McAfee Antivirus and Cybersecurity Solutions
Source: McAfee

McAfee’s products cater to both individual consumers and enterprises, offering real-time protection through partnerships with global internet service providers (ISPs) and telecom operators.

Mahesh Tyagarajan (VP, Platform Engineering and Architecture at McAfee) spoke with Confluent and Forrester about McAfee’s transition from a monolith to event-driven microservices leveraging Apache Kafka in Confluent Cloud.

Data Streaming at McAfee with Apache Kafka Leveraging Confluent Cloud

As cyber threats have grown more complex, McAfee’s reliance on real-time data streaming has become essential. The company transitioned from a monolithic architecture to a microservices-based ecosystem with the help of Confluent Cloud, powered by Apache Kafka. The fully managed data streaming platform simplified infrastructure management, boosted scalability, and accelerated feature delivery for McAfee.

Use Cases for Data Streaming

  1. Real-Time Threat Detection: McAfee processes security events from millions of endpoints, ensuring immediate identification of malware or phishing attempts.
  2. Subscription Management: Data streaming supports real-time customer notifications, updates, and billing processes.
  3. Analytics and Reporting: McAfee integrates real-time data streams into analytics systems, providing insights into user behavior, threat patterns, and operational efficiency.

Transition to an Event-Driven Architecture and Microservices

By moving to an event-driven architecture with Kafka using Confluent Cloud, McAfee:

  • Standardized its data streaming infrastructure.
  • Decoupled systems using microservices, enabling scalability and resilience.
  • Improved developer productivity by reducing infrastructure management overhead.

This transition to data streaming with a fully managed, complete and secure cloud service empowered McAfee to handle high data ingestion volumes, manage hundreds of millions of devices, and deliver new features faster.

Business Value of Data Streaming

The adoption of data streaming delivered significant business benefits:

  • Improved Customer Experience: Real-time threat detection and personalized updates enhance trust and satisfaction.
  • Operational Efficiency: Automation and reduced infrastructure complexity save time and resources.
  • Scalability: McAfee can now support a growing number of devices and data sources without compromising performance.

Data Streaming as the Backbone of an Event-Driven Cybersecurity Evolution in the Cloud

McAfee’s journey showcases the transformative potential of data streaming in cybersecurity. By leveraging Apache Kafka as a fully managed cloud service and the backbone of an event-driven microservices architecture, the company has enhanced its ability to detect threats, respond in real time, and deliver exceptional customer experiences.

For organizations looking to stay ahead in the cybersecurity race, investing in real-time data streaming technologies is not just an option—it’s a necessity. To learn more about how data streaming can revolutionize cybersecurity, explore my cybersecurity blog series and follow me for updates on LinkedIn or X (formerly Twitter).

How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.)
https://www.kai-waehner.de/blog/2024/10/12/how-microsoft-fabric-lakehouse-complements-data-streaming-apache-kafka-flink-et-al/
Sat, 12 Oct 2024 06:58:00 +0000

In today's data-driven world, understanding data at rest versus data in motion is crucial for businesses. Data streaming frameworks like Apache Kafka and Apache Flink enable real-time data processing. Meanwhile, lakehouses like Snowflake, Databricks, and Microsoft Fabric excel in long-term data storage and detailed analysis, perfect for reports and AI training. This blog post explores how these technologies complement each other in enterprise architecture.

In today’s data-driven world, understanding data at rest versus data in motion is crucial for businesses. Data streaming frameworks like Apache Kafka and Apache Flink enable real-time data processing, offering quick insights and seamless system integration. They are ideal for applications that require immediate responses and handle transactional workloads. Meanwhile, lakehouses like Snowflake, Databricks, and Microsoft Fabric excel in long-term data storage and detailed analysis, perfect for reports and AI training. By leveraging both data streaming and lakehouse systems, businesses can effectively meet both short-term and long-term data needs. This blog post explores how these technologies complement each other in enterprise architecture.

Lakehouse and Data Streaming - Competitor or Complementary - Kafka Flink Confluent Microsoft Fabric Snowflake Databricks

This is part two of a blog series about Microsoft Fabric and its relation to other data platforms on the Azure cloud:

  1. What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks
  2. How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.)
  3. When to Choose Apache Kafka vs. Azure Event Hubs vs. Confluent Cloud for a Microsoft Fabric Lakehouse

Subscribe to my newsletter to get an email about a new blog post every few weeks.

Data at Rest (Lakehouse) vs. Data in Motion (Data Streaming)

Data streaming technologies like Apache Kafka and Apache Flink enable continuous data processing while the data is in motion in an event-driven architecture. Data streaming enables immediate insights and seamless integration of data across systems. Kafka provides a robust real-time messaging and persistence platform, while Flink excels in low-latency stream processing, making them ideal for dynamic, stateful applications. A data streaming platform supports operational/transactional and analytical use cases.

Data lakes and data warehouses store data at rest before processing the data. The platforms are optimized for batch processing and long-term analytics, including AI/ML use cases such as model training. Some components provide near real-time capabilities, e.g. data ingestion or dashboards. Data lakes offer scalable, flexible storage for raw data, and data warehouses provide structured, high-performance environments for business intelligence and reporting, complementing the real-time capabilities of streaming technologies. Most leading data platforms provide a unified combination of data lake and data warehouse called lakehouse. Lakehouses are almost exclusively used for analytical workloads as they typically lack the SLAs and tight latency required for operational/transactional use cases.

Data Streaming with Apache Kafka Flink and Lakehouse with Microsoft Fabric Databricks Snowflake Apache Iceberg

Data streaming and lakehouses are complementary, with some overlaps but different sweet spots. If you want to learn more, check out these articles:

I also created a short ten minute video explaining the above concepts:

Let’s explore why data streaming and a lakehouse like Microsoft Fabric are complementary (with a few overlaps). I explained in the first blog of this series what Microsoft Fabric is. To understand the differences, it is important to understand what a data streaming platform really is.

There is a lot of confusion in the market. For instance, some folks still compare Apache Kafka to a message broker like RabbitMQ or IBM MQ. I mainly focus on Apache Kafka and Apache Flink as these are the de facto standards for data streaming across industries. Before talking about technologies and solutions, we need to start with the concept of an event-driven architecture as the foundation of data streaming.

Event-driven Architecture for Operational and Analytical Workloads

In today’s digital world, getting real-time data quickly is more important than ever. Traditional methods that process data in batches or via request-response APIs often cannot keep up when you need immediate insights.

Event-driven architecture offers a different approach by focusing on handling events – like transactions or user actions – as they happen. One of the key benefits of an event-driven architecture is its ability to decouple systems, meaning that different parts of a system can work independently. This makes it easier to scale and adapt to changes. An event-driven architecture excels in handling both operational and analytical workloads.

Event-driven Architecture with Data Streaming using Apache Kafka and Flink
Source: Confluent

For operational tasks, the event-driven architecture enables real-time data processing, automating processes, enhancing customer experiences, and boosting efficiency. In e-commerce, for example, an event-driven system can instantly update inventory, trigger marketing campaigns, and detect fraud.

On the analytical side, the event-driven architecture allows organizations to derive insights from data as it flows, enabling real-time analytics and trend identification without the delays of batch processing. This is invaluable in sectors like finance and healthcare, where timely insights are crucial.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Building an event-driven architecture with data streaming technologies like Apache Kafka and Apache Flink enhances its potential. These platforms provide the infrastructure for high-throughput, low-latency data streams, enabling scalable and resilient event-driven systems.

Apache Kafka – The De Facto Standard for Event-driven Messaging and Integration

Apache Kafka has become the go-to platform for event-driven messaging and integration, transforming how organizations manage data in motion. Developed by LinkedIn and open-sourced, Kafka is a distributed streaming platform adept at handling real-time data feeds. Today, more than 150,000 organizations use Kafka.

Kafka’s architecture is based on a distributed commit log, ensuring data durability and consistency. It decouples data producers and consumers, allowing for flexible and scalable data architectures. Producers publish data to topics, and consumers subscribe independently, facilitating system evolution.

Apache Kafka's Commit Log for Real Time and Batch Data Streaming and Persistence Layer

Beyond messaging, Kafka serves as a robust integration platform, connecting diverse systems and enabling seamless data flow. Its ecosystem of connectors allows integration with databases, cloud services, and legacy systems. This helps organizations modernize their data infrastructure step by step.

Kafka’s stream processing capabilities, through Kafka Streams and integration with Apache Flink, further enhance the value of the streaming data pipelines. Kafka Streams allows real-time data processing within Kafka to enable complex transformations and enrichments, driving data-driven innovation.

Apache Flink stands out as the leading framework for stream processing. It offers a versatile platform for streaming ETL and stateful business applications. Flink processes data streams with low latency and high throughput, suitable for diverse use cases.

Apache Flink for Real-Time Business Applications and Analytics
Source: Apache

Flink provides a unified programming model for batch and stream processing that allows developers to use the same API for real-time transactional or analytical batch tasks. This flexibility is a significant advantage as it enables varied data processing without separate tools.

A key feature of Flink is its stateful stream processing. This is crucial for maintaining state across events in real-time applications. Flink’s state management ensures accurate processing in complex scenarios. In contrast to many other stream processing solutions, Flink can do stateful processing even at an extreme scale (i.e., with a throughput of gigabytes per second).

Flink’s event time processing capabilities handle out-of-order or delayed events and ensure consistent results. Developers can define windows and triggers based on event timestamps, accommodating late-arriving data.
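
Here is a minimal PyFlink SQL sketch of exactly this kind of event-time windowing over a Kafka topic. It assumes the Flink Kafka SQL connector is available on the classpath; the topic, broker, and field names are placeholders:

```python
# Minimal PyFlink SQL sketch: event-time tumbling windows over a Kafka topic.
# Assumes the Flink Kafka SQL connector JAR is on the classpath; topic, broker,
# and field names are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND  -- tolerate late events
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'clickstream',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-clicks',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Clicks per user per one-minute tumbling window, based on event time.
result = t_env.sql_query("""
    SELECT window_start, window_end, user_id, COUNT(*) AS clicks
    FROM TABLE(TUMBLE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '1' MINUTES))
    GROUP BY window_start, window_end, user_id
""")

result.execute().print()
```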

Apache Flink supports multiple programming languages, including Java, Python, and SQL, offering developers the flexibility to use their preferred language for building stream processing applications. This is a key differentiator to other stream processing engines, such as Kafka Streams or KSQL.

The integration of Flink with Apache Kafka enhances its capabilities. Kafka serves as a reliable data source for Flink to enable seamless real-time data ingestion and processing. With Kafka’s persistent commit log, you can travel back in time and replay historical data in guaranteed ordering for analytical use cases. This combination supports high-volume, low-latency data pipelines, unlocking transactional real-time scenarios and batch analytics.

In summary, Apache Flink’s robust stream processing, combined with Apache Kafka, offers a powerful solution for organizations seeking to leverage real-time data. Whether for operational tasks, real-time analytics, or complex event processing (CEP), Flink provides the necessary flexibility and performance for data-driven innovation.

In the growing landscape of data management, it’s crucial to understand the complementary roles of Microsoft Fabric and data streaming technologies. While some may perceive these technologies as competitors, they actually serve distinct yet interconnected purposes that enhance an organization’s data strategy. And keep in mind that Microsoft Fabric is not just an offering for Azure cloud. Hybrid edge scenarios in the IoT space are perfect for Microsoft Fabric and data streaming together.

Microsoft Fabric’s Streaming Ingestion: A Common Feature Among Lakehouses

Microsoft Fabric, like other modern lakehouse platforms such as Snowflake and Databricks, offers streaming ingestion capabilities. This feature is essential for handling near real-time data flows. It allows organizations to capture and process data as it arrives in the lakehouse. However, it’s important to distinguish between operational and analytical workloads when considering the role of streaming ingestion.

Operational workloads benefit from the immediacy of streaming data, enabling real-time decision-making and process automation. In contrast, analytical workloads often require data to be stored at rest for in-depth analysis and reporting. Microsoft Fabric’s architecture focuses on streaming ingestion into robust storage solutions for analytical purposes.

The Fabric Real Time Intelligence Hub: Understanding Its Capabilities and Limitations

The integration of streaming ingestion into Microsoft Fabric is part of its Real Time Intelligence Hub, which aims to provide a comprehensive platform for managing real-time data. However, beyond the marketing and buzz around Fabric Real Time Intelligence Hub, it’s important to note that it doesn’t operate in true real-time.

Instead, Fabric’s “Real Time Intelligence Hub” uses Spark Streaming jobs to manipulate data, which can introduce some latency. And the infrastructure is not meant for critical SLAs that are required by operational / transactional systems. Additionally, the ingestion process is throttled when using Power BI and other batch analytics tools via an API gateway with a Kafka client.

Microsoft is quick to introduce new names for Fabric products or features that are actually just rebrands of existing services. If you find new terms such as “eventhouse” or “event streams feature in Microsoft Fabric Real-Time Intelligence”, make sure to evaluate whether this is really a new component or just Fabric marketing.

Therefore, despite some overlap with a data streaming platform, the collaboration between Microsoft Fabric and data streaming vendors like Confluent (Kafka, Flink) underscores the complementary nature of these platforms. By leveraging the strengths of both, organizations can build a robust data infrastructure that supports real-time operations and comprehensive analytics.

In conclusion, Microsoft Fabric and data streaming technologies such as Kafka and Flink are not competitors but complementary tools that, when used together, can significantly enhance an organization’s ability to manage and analyze data. By understanding the distinct roles each plays, businesses can create a more agile and responsive data strategy that meets both operational and analytical needs.

In the modern enterprise architecture landscape, data streaming and lakehouse platforms are pivotal in creating a robust and flexible data ecosystem. Data streaming technologies enable continuous data ingestion and processing for operational and analytical use cases.

Lakehouse platforms, like Microsoft Fabric, Snowflake and Databricks, provide a unified architecture that combines the best of data lakes and data warehouses, offering scalable storage and advanced analytics capabilities.

Together, these technologies empower businesses to handle both operational and analytical workloads efficiently, breaking down data silos and fostering a data-driven culture. By integrating data streaming with lakehouse architectures, enterprises can achieve seamless data flow and comprehensive insights across their operations.

Reverse ETL is an Anti-Pattern

Reverse ETL is the process of moving data from a data store at rest back into operational systems. It is often considered an anti-pattern in modern data architecture. This approach can lead to data inconsistencies, increased complexity, and higher maintenance costs, as it essentially reverses the natural flow of data in motion. Do NOT store data in the Microsoft Fabric lakehouse just to reverse it later into other operational systems!

Data at Rest and Reverse ETL

Instead of relying on reverse ETL, organizations should focus on building real-time data pipelines that enable direct integration between data sources and operational systems. By leveraging an event-driven architecture and data streaming technologies, businesses can ensure that data is consistently updated and available where it’s needed most. This approach not only simplifies data management, but also enhances the accuracy and timeliness of insights.

Apache Iceberg as the De Facto Standard for an Open Table Format – Store Once, Analyze with any Tool

Apache Iceberg has emerged as the de facto standard for an open table format. It offers the opportunity to store data once in an object store like Amazon S3 and analyze it across various tools. With its ability to handle large-scale datasets and support ACID transactions, Iceberg provides a reliable and efficient way to manage data in a lakehouse environment.

Apache Iceberg Open Table Format for Data Lake Lakehouse Streaming with Kafka Flink Databricks Snowflake AWS GCP Azure

Organizations can use their preferred analytics and processing tools without being locked into a specific vendor. This flexibility is crucial for businesses looking to maximize their data investments and adapt to changing technological landscapes. By adopting Apache Iceberg together with data streaming, enterprises can ensure data consistency and accessibility across all business units to drive better data quality, insights, and decision-making.
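
To make the "store once, analyze with any tool" idea tangible, here is a minimal PySpark sketch that creates, writes to, and queries an Iceberg table. It assumes the Spark session is already configured with an Iceberg catalog; the catalog, namespace, table, and column names are placeholders:

```python
# Minimal sketch: write data once to an Iceberg table and query it; the same
# table can later be read by Flink, Snowflake, Trino, or any other engine with
# Iceberg support. Assumes the Spark session is already configured with an
# Iceberg catalog; catalog, namespace, and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.events.orders (
        order_id BIGINT,
        country  STRING,
        amount   DOUBLE,
        ts       TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

spark.sql("""
    INSERT INTO lake.events.orders
    VALUES (1, 'DE', 99.90, TIMESTAMP '2024-06-15 10:00:00')
""")

spark.sql("""
    SELECT country, SUM(amount) AS revenue
    FROM lake.events.orders
    GROUP BY country
""").show()
```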

Shift Left Architecture to Support Operational and Analytical Workloads with an Event-driven Architecture

Traditionally, many organizations use data streaming with Kafka as a dumb pipeline to ingest all raw data into a data lake. The consequences are high compute costs from (re-)processing the raw data multiple times, inconsistencies across business units, and slow time to market for new applications.

ETL and ELT Data Integration to Data Lake Warehouse Lakehouse in Batch - Ingestion via Kafka into Microsoft Fabric

The Shift Left Architecture is a forward-thinking approach that integrates operational and analytical workloads within an event-driven architecture. By shifting data processing closer to the source, this architecture enables real-time data ingestion and analysis, improves data quality before lakehouse ingestion, reduces latency, and increases responsiveness.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse - Processing with Kafka and Flink before Microsoft Fabric

Event-driven architectures, powered by technologies like Apache Kafka and Flink, facilitate the seamless flow and processing of data across systems. Shift Left ensures that both operational and analytical needs are met. This approach not only enhances the agility of data-driven applications, but also supports continuous improvement and innovation. By adopting a Shift Left Architecture, organizations can streamline their data processes, improve efficiency, and gain a competitive edge in the market.

Example: Confluent (Data Streaming) + Microsoft Fabric (Lakehouse) + Snowflake (Another Lakehouse)

An example of integrating data streaming and lakehouse technologies is the combination of Confluent, Microsoft Fabric, and Snowflake.

Confluent, built on Apache Kafka and Flink, provides a robust platform for real-time data streaming, enabling organizations to integrate with operational and analytical workloads.

Microsoft Fabric and Snowflake, both lakehouse platforms, offer scalable storage and advanced analytics capabilities that allow businesses to perform in-depth analysis and reporting on historical data, near real-time analytics, and AI model training.

Shift Left Architecture with Confluent Kafka Snowflake and Microsoft Fabric for Data Streaming and Lakehouse

Apache Iceberg enables storing data once and connects any analytical engine to the data, including lakehouses such as Microsoft Fabric or Snowflake, and unified batch and streaming frameworks such as Apache Flink. Iceberg improves the overall data quality for data sharing, reduces storage cost and enables a much faster rollout of new analytical applications.

By leveraging Confluent for data streaming and integrating it with Microsoft Fabric and Snowflake, organizations can create a comprehensive data architecture that supports both real-time operations and long-term analytics. This synergy not only enhances data accessibility and consistency but also empowers businesses to make data-driven decisions with confidence.

In conclusion, the synergy between Microsoft Fabric and data streaming technologies like Apache Kafka and Apache Flink creates a powerful combination for modern data management. While Microsoft Fabric excels in providing robust analytics and storage capabilities, data streaming platforms offer real-time data processing and integration, ensuring that businesses can respond swiftly to operational demands.

By leveraging both technologies together, organizations can build a comprehensive data architecture that supports both immediate and long-term needs, enhancing their ability to make informed, data-driven decisions. This complementary relationship not only breaks down data silos, but also fosters a more agile and responsive data strategy. As businesses continue to navigate the complexities of data management, understanding and using the strengths of both Microsoft Fabric and data streaming vendors like Confluent will be key to achieving a competitive edge.

The Shift Left Architecture, when paired with Apache Iceberg’s open table format, simplifies the integration of data streaming with one or more lakehouses. This combination enhances data quality for all data consumers and significantly reduces overall storage costs.

In part three of this blog series, I will dig deeper into the data streaming alternatives: when to choose open source frameworks such as Apache Kafka and Flink, a leading data streaming platform such as Confluent, or a native Azure service like Event Hubs. A quick primer: the trade-offs are huge. Do a proper evaluation BEFORE choosing your data streaming solution.

How do you see the combination of a lakehouse like Microsoft Fabric with data streaming? Do you already use both together? And what is your strategy for other data lakes and data warehouses you already have in your enterprise architecture, such as Databricks or Snowflake? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming
https://www.kai-waehner.de/blog/2024/06/15/the-shift-left-architecture-from-batch-and-lakehouse-to-real-time-data-products-with-data-streaming/
Sat, 15 Jun 2024 06:12:44 +0000

Data integration is a hard challenge in every enterprise. Batch processing and Reverse ETL are common practices in a data warehouse, data lake or lakehouse. Data inconsistency, high compute cost, and stale information are the consequences. This blog post introduces a new design pattern to solve these problems: The Shift Left Architecture enables a data mesh with real-time data products to unify transactional and analytical workloads with Apache Kafka, Flink and Iceberg. Consistent information is handled with streaming processing or ingested into Snowflake, Databricks, Google BigQuery, or any other analytics / AI platform to increase flexibility, reduce cost and enable a data-driven company culture with faster time-to-market building innovative software applications.

The Shift Left Architecture

Data Products – The Foundation of a Data Mesh

A data product is a crucial concept in the context of a data mesh that represents a shift from traditional centralized data management to a decentralized approach.

McKinsey finds that “when companies instead manage data like a consumer product—be it digital or physical—they can realize near-term value from their data investments and pave the way for quickly getting more value tomorrow. Creating reusable data products and patterns for piecing together data technologies enables companies to derive value from data today and tomorrow”:

McKinsey - Why Handle Data as a Product

According to McKinsey, the benefits of the data product approach can be significant:

  • New business use cases can be delivered as much as 90 percent faster.
  • The total cost of ownership, including technology, development, and maintenance, can decline by 30 percent.
  • The risk and data-governance burden can be reduced.

Data Product from a Technical Perspective

Here’s what a data product entails in a data mesh from a technical perspective:

  1. Decentralized Ownership: Each data product is owned by a specific domain team. Applications are truly decoupled.
  2. Sourced from Operational and Analytical Systems: Data products include information from any data source, including the most critical systems and analytics/reporting platforms.
  3. Self-contained and Discoverable: A data product includes not only the raw data but also the associated metadata, documentation, and APIs.
  4. Standardized Interfaces: Data products adhere to standardized interfaces and protocols, ensuring that they can be easily accessed and utilized by other data products and consumers within the data mesh.
  5. Data Quality: Most use cases benefit from real-time data. A data product ensures data consistency across real-time and batch applications.
  6. Value-Driven: The creation and maintenance of data products are driven by business value.

In essence, a data product in a data mesh framework transforms data into a managed, high-quality asset that is easily accessible and usable across an organization, fostering a more agile and scalable data ecosystem.

Anti-Pattern: Batch Processing and Reverse ETL

The “Modern” Data Stack leverages traditional ETL tools or data streaming for ingestion into a data lake, data warehouse or lakehouse. The consequence is a spaghetti architecture with various integration tools for batch and real-time workloads mixing analytical and operational technologies:

Data at Rest and Reverse ETL

Reverse ETL is required to get information out of the data lake into operational applications and other analytical tools. As I have written about it previously, the combination of data lakes and Reverse ETL is an anti-pattern for the enterprise architecture largely due to the economic and organizational inefficiencies Reverse ETL creates. Event-driven data products enable a much simpler and more cost-efficient architecture.

One key reason batch processing and Reverse ETL patterns are still needed is the common use of the Lambda architecture: a data processing architecture that handles real-time and batch workloads separately in different layers. It is still widespread in enterprise architectures, not just for big data use cases like Hadoop/Spark and Kafka, but also for the integration with transactional systems like file-based legacy monoliths or Oracle databases.

In contrast, the Kappa Architecture handles both real-time and batch processing using a single technology stack. Learn more about “Kappa replacing Lambda Architecture” in its own article. TL;DR: The Kappa architecture becomes possible by bringing even legacy technologies into an event-driven architecture using a data streaming platform. Change Data Capture (CDC) is one of the most common helpers for this.
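To make the CDC idea more tangible, here is a minimal Python sketch that registers a Debezium source connector through the Kafka Connect REST API, so changes from a legacy database continuously flow into Kafka topics. All hostnames, credentials, table names, and the topic prefix are hypothetical, and the exact configuration keys depend on the Debezium connector and version in use:

  import requests  # assumes the 'requests' package is installed

  # Hypothetical Kafka Connect endpoint.
  CONNECT_URL = "http://localhost:8083/connectors"

  connector = {
      "name": "legacy-orders-cdc",
      "config": {
          # Illustrative Debezium MySQL settings; key names vary by connector version.
          "connector.class": "io.debezium.connector.mysql.MySqlConnector",
          "database.hostname": "legacy-db.internal",
          "database.port": "3306",
          "database.user": "cdc_user",
          "database.password": "change-me",
          "database.server.id": "5400",
          "topic.prefix": "legacy",             # change events land in topics like legacy.shop.orders
          "table.include.list": "shop.orders",
          "schema.history.internal.kafka.bootstrap.servers": "broker:9092",
          "schema.history.internal.kafka.topic": "schema-history.legacy",
      },
  }

  # Register the connector: existing rows are snapshotted once, then changes stream continuously.
  response = requests.post(CONNECT_URL, json=connector, timeout=30)
  response.raise_for_status()
  print("Connector created:", response.json()["name"])

Once such a connector runs, every insert, update, and delete in the legacy table becomes an event that any real-time or batch consumer can process.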

Traditional ELT in the Data Lake, Data Warehouse, Lakehouse

It seems like nobody builds a standalone data warehouse anymore. Everyone talks about a lakehouse that merges the data warehouse and the data lake. Whatever term you use or prefer… The integration process these days looks like the following:

ETL and ELT Data Integration to Data Lake Warehouse Lakehouse in Batch

Just ingesting all the raw data into a data warehouse / data lake / lakehouse has several challenges:

  • Slower Updates: The longer the data pipeline and the more tools are used, the slower the update of the data product.
  • Longer Time-to-Market: Development efforts are repeated because each business unit needs to do the same or similar processing steps again instead of consuming from a curated data product.
  • Increased Cost: The cash cow of analytics platforms is compute, not storage. The more your business units use dbt, the better for the analytics SaaS provider.
  • Repeating Efforts: Most enterprises have multiple analytics platforms, including different data warehouses, data lakes, and AI platforms. ELT means doing the same processing again, again, and again.
  • Data Inconsistency: Reverse ETL, Zero ETL, and other integration patterns lead to your analytical and especially operational applications seeing inconsistent information. You cannot connect a real-time consumer or mobile app API to a batch layer and expect consistent results.

Data Integration, Zero ETL and Reverse ETL with Kafka, Snowflake, Databricks, BigQuery, etc.

These disadvantages are real! I have not met a single customer in the past months who disagreed or told me these challenges do not exist. To learn more, check out my blog series about data streaming with Apache Kafka and analytics with Snowflake:

  1. Snowflake Integration Patterns: Zero ETL and Reverse ETL vs. Apache Kafka
  2. Snowflake Data Integration Options for Apache Kafka (including Iceberg)
  3. Apache Kafka + Flink + Snowflake: Cost Efficient Analytics and Data Governance

The blog series can be applied to any other analytics engine. It is a worthwhile read, no matter if you use Snowflake, Databricks, Google BigQuery, or a combination of several analytics and AI platforms.

The solution for this data mess that creates data inconsistency, outdated information, and ever-growing cost is the Shift Left Architecture.

Shift Left to Data Streaming for Operational AND Analytical Data Products

The Shift Left Architecture enables consistent information from reliable, scalable data products, reduces the compute cost, and allows much faster time-to-market for operational and analytical applications with any kind of technology (Java, Python, iPaaS, Lakehouse, SaaS, “you-name-it”) and communication paradigm (real-time, batch, request-response API):

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Shifting the data processing to the data streaming platform enables:

  • Capture and stream data continuously when the event is created
  • Create data contracts for downstream compatibility and promotion of trust with any application or analytics / AI platform
  • Continuously cleanse, curate and quality check data upstream with data contracts and policy enforcement
  • Shape data into multiple contexts on-the-fly to maximize reusability (and still allow downstream consumers to choose between raw and curated data products)
  • Build trustworthy data products that are instantly valuable, reusable and consistent for any transactional and analytical consumer (no matter if consumed in real-time or later via batch or request-response API)

While shifting to the left with some workloads, it is crucial to understand that developers, data engineers, and data scientists can usually still use their favourite interface, like SQL, or a programming language such as Java or Python.
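As a minimal sketch of what “shifting left” can look like in code, the following Python snippet (using the confluent-kafka client; broker address and topic names are hypothetical) continuously reads raw events, applies quality checks, and republishes curated events once for all downstream consumers, no matter whether they read in real-time or later in batch:

  import json
  from confluent_kafka import Consumer, Producer

  consumer = Consumer({
      "bootstrap.servers": "broker:9092",
      "group.id": "shift-left-curation",
      "auto.offset.reset": "earliest",
  })
  producer = Producer({"bootstrap.servers": "broker:9092"})
  consumer.subscribe(["orders.raw"])

  REQUIRED_FIELDS = {"order_id", "customer_id", "amount"}

  while True:
      msg = consumer.poll(1.0)
      if msg is None or msg.error():
          continue
      try:
          event = json.loads(msg.value())
      except json.JSONDecodeError:
          continue  # quality check failed: not even valid JSON
      if not REQUIRED_FIELDS.issubset(event):
          continue  # quality check failed: incomplete event
      try:
          # Curate the event once so no downstream team has to repeat this work.
          curated = {
              "order_id": str(event["order_id"]),
              "customer_id": str(event["customer_id"]),
              "amount": round(float(event["amount"]), 2),
          }
      except (TypeError, ValueError):
          continue  # quality check failed: malformed field values
      producer.produce("orders.curated", key=curated["order_id"], value=json.dumps(curated))
      producer.poll(0)  # serve delivery callbacks

In a production deployment, the same logic would typically run in Kafka Streams or Apache Flink with proper error handling, dead-letter topics, and schema validation.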

Data Streaming is the foundation of the Shift Left Architecture, enabling reliable, scalable real-time data products with good data quality. The following architecture shows how Apache Kafka and Flink connect any data source, curate data sets (aka stream processing / Streaming ETL) and share the processed events with any operational or analytical data sink:

Shift Left Architecture with Apache Kafka, Flink and Iceberg

The architecture shows an Apache Iceberg table as an alternative consumer. Apache Iceberg is an open table format designed for managing large-scale datasets in a highly performant and reliable way, providing ACID transactions, schema evolution, and partitioning capabilities. It optimizes data storage and query performance, making it ideal for data lakes and complex analytical workflows. Iceberg is evolving into the de facto standard, with support from most major vendors in the cloud and data management space, including AWS, Azure, GCP, Snowflake, Confluent, and many more to come (like Databricks after its acquisition of Tabular).

From the data streaming perspective, the Iceberg table is just a button click away from the Kafka Topic and its Schema (using Confluent’s Tableflow – I am sure other vendors will follow soon with their own solutions). The big advantage of Iceberg is that data needs to be stored only once (typically in a cost-efficient and scalable object store like Amazon S3). Each downstream application can consume the data with its own technology without any need for additional coding or connectors. This includes data lakehouses like Snowflake or Databricks AND data streaming engines like Apache Flink.
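As a hedged illustration of the “store once, consume anywhere” idea, the sketch below reads such an Iceberg table with the PyIceberg library. The catalog name, REST catalog URI, and table identifier are hypothetical, and the exact catalog configuration depends on your setup (REST, AWS Glue, Nessie, etc.):

  # Minimal read sketch with PyIceberg (plus pyarrow/pandas for to_pandas); all names are illustrative.
  from pyiceberg.catalog import load_catalog

  catalog = load_catalog(
      "analytics",                             # hypothetical catalog name
      **{"uri": "http://iceberg-rest:8181"},   # e.g., an Iceberg REST catalog endpoint
  )

  # The same table that was materialized from the Kafka topic can now be read by any engine.
  table = catalog.load_table("streaming.orders_curated")
  df = table.scan().to_pandas()
  print(df.head())

The very same files in object storage could just as well be queried by Snowflake, Databricks, or a Flink job without copying the data again.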

Video: Shift Left Architecture

I summarized the above architectures and examples for the Shift Left Architecture in a short ten-minute video if you prefer listening to content:

Apache Iceberg – The New De Facto Standard for Lakehouse Table Format?

Apache Iceberg is such a huge topic and a real game changer for enterprise architectures, end users and cloud vendors. I will write another dedicated blog, including interesting topics such as:

  • Confluent’s product strategy to embed Iceberg tables into its data streaming platform
  • Snowflake’s open source Iceberg project Polaris
  • Databricks’ acquisition of Tabular (the company behind Apache Iceberg) and the relation to Delta Lake and open sourcing its Unity Catalog
  • The (expected) future of table format standardization, catalog wars, and other additional solutions like Apache Hudi or Apache XTable for omni-directional interoperability across lakehouse table formats.

Stay tuned and subscribe to my newsletter to receive new articles.

Business Value of the Shift Left Architecture

Apache Kafka is the de facto standard for data streaming and for building a Kappa Architecture. The Data Streaming Landscape shows various open source technologies and cloud vendors. Data Streaming is a new software category. Forrester published “The Forrester Wave™: Streaming Data Platforms, Q4 2023“. The leaders are Microsoft, Google and Confluent, followed by Oracle, Amazon, Cloudera, and a few others.

Building data products more left in the enterprise architecture with a data streaming platform and technologies such as Kafka and Flink creates huge business value:

  • Cost Reduction: Reducing the compute cost in one or even multiple data platforms (data lake, data warehouse, lakehouse, AI platform, etc.).
  • Less Development Effort: Streaming ETL, data curation and data quality control already executed instantly (and only once) after the event creation.
  • Faster Time to Market: Focus on new business logic instead of doing repeated ETL jobs.
  • Flexibility: Best of breed approach for choosing the best and/or most cost-efficient technology per use case.
  • Innovation: Business units can choose any programming language, tool or SaaS to do real-time or batch consumption from data products to try and fail or scale fast.

The unification of transactional and analytical workloads is finally possible, enabling good data quality, faster time to market for innovation, and reduced cost of the entire data pipeline. Data consistency matters across all applications and databases… A Kafka Topic with a data contract (= Schema with policies) brings data consistency out of the box!
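As a small, hedged illustration of such a data contract in practice, the Python sketch below serializes events against an Avro schema managed by a schema registry, using the confluent-kafka client. The schema, topic, and endpoints are hypothetical, and compatibility or policy rules would be enforced on the registry side rather than in this snippet:

  from confluent_kafka import Producer
  from confluent_kafka.schema_registry import SchemaRegistryClient
  from confluent_kafka.schema_registry.avro import AvroSerializer  # requires the Avro extras (fastavro)
  from confluent_kafka.serialization import SerializationContext, MessageField

  # A minimal Avro schema acting as the data contract for the topic.
  schema_str = """
  {
    "type": "record",
    "name": "Order",
    "fields": [
      {"name": "order_id", "type": "string"},
      {"name": "amount", "type": "double"}
    ]
  }
  """

  schema_registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})  # hypothetical endpoint
  serializer = AvroSerializer(schema_registry, schema_str)
  producer = Producer({"bootstrap.servers": "broker:9092"})

  order = {"order_id": "o-4711", "amount": 99.90}

  # Serialization fails fast if the event violates the contract, so bad data never reaches consumers.
  payload = serializer(order, SerializationContext("orders.curated", MessageField.VALUE))
  producer.produce("orders.curated", key=order["order_id"], value=payload)
  producer.flush()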

What does your data architecture look like today? Does the Shift Left Architecture make sense to you? What is your strategy to get there? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming appeared first on Kai Waehner.

]]>
Apache Kafka in Manufacturing at Automotive Supplier Brose for Industrial IoT Use Cases https://www.kai-waehner.de/blog/2024/06/13/apache-kafka-in-manufacturing-at-automotive-supplier-brose-for-industrial-iot-use-cases/ Thu, 13 Jun 2024 07:15:57 +0000 https://www.kai-waehner.de/?p=6543 Data streaming unifies OT/IT workloads by connecting information from sensors, PLCs, robotics and other manufacturing systems at the edge with business applications and the big data analytics world in the cloud. This blog post explores how the global automotive supplier Brose deploys a hybrid industrial IoT architecture using Apache Kafka in combination with Eclipse Kura, OPC-UA, MuleSoft and SAP.

The post Apache Kafka in Manufacturing at Automotive Supplier Brose for Industrial IoT Use Cases appeared first on Kai Waehner.

]]>
Data streaming unifies OT/IT workloads by connecting information from sensors, PLCs, robotics and other manufacturing systems at the edge with business applications and the big data analytics world in the cloud. This blog post explores how the global automotive supplier Brose deploys a hybrid industrial IoT architecture using Apache Kafka in combination with Eclipse Kura, OPC-UA, MuleSoft and SAP.

Data Streaming with Apache Kafka for Industrial IoT in the Automotive Industry at Brose

Data Streaming and Industrial IoT / Industry 4.0

Data streaming with Apache Kafka plays a critical role in Industrial IoT by enabling real-time data ingestion, processing, and analysis from various industrial devices and sensors. Kafka’s high throughput and scalability ensure that it can reliably handle and integrate massive streams of data from IoT devices into analytics platforms for valuable insights. This real-time capability enhances predictive maintenance, operational efficiency, and overall automation in industrial settings.
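For illustration, here is a minimal Python producer (confluent-kafka client; broker address, topic, and machine IDs are hypothetical) that shows what such sensor ingestion can look like at the edge. In a real plant, the readings would come from a PLC, an OPC-UA server, or an MQTT gateway instead of being generated in code:

  import json
  import random
  import time
  from confluent_kafka import Producer

  producer = Producer({"bootstrap.servers": "edge-broker:9092"})  # hypothetical edge cluster

  while True:
      # In reality this reading would be pulled from a PLC or an OPC-UA/MQTT gateway.
      reading = {
          "machine_id": "press-07",
          "temperature_c": round(random.uniform(60.0, 90.0), 1),
          "vibration_mm_s": round(random.uniform(0.5, 4.0), 2),
          "ts": int(time.time() * 1000),
      }
      # Keying by machine ID keeps all events of one machine ordered within a partition.
      producer.produce("factory.sensors", key=reading["machine_id"], value=json.dumps(reading))
      producer.poll(0)
      time.sleep(1)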

Here is an exemplary hybrid industrial IoT architecture with data streaming at the edge in the factory and 5G supply chain environments synchronizing in real-time with business applications and analytics / AI platforms in the cloud:

Brose – A Global Automotive Supplier

Brose is a global automotive supplier headquartered in beautiful Franconia, Bavaria, Germany. The company has a global presence with 70 locations across 25 countries on 5 continents and about 30,000 employees.

Brose specializes in mechatronic systems for vehicle doors, seats, and electric motors. They develop and manufacture innovative products that enhance vehicle comfort, safety, and efficiency, serving major car manufacturers worldwide.

Brose Automotive Supplier Product Portfolio
Source: Brose

Brose’s Hybrid Architecture for Industry 4.0 with Eclipse Kura, OPC UA, Kafka, SAP and MuleSoft

Brose is an excellent example of combining data streaming using Confluent with other technologies like open source Eclipse Kura and OPC-UA on the OT and edge side, and IT infrastructure and cloud software like SAP, Splunk, SQL Server, AWS Kinesis and MuleSoft:

Brose IoT Architecture with Apache Kafka Eclipe Kura OPC UA SAP Mulesoft
Source: Brose

Here is how it works according to Sven Matuschzik, Director of IT-Platforms and Databases at Brose:

Regional Kafka on-premise clusters are embedded within the IIoT and production platform, facilitating seamless data flow from the shop floor to the business world in combination with other integration tools. This hybrid IoT streaming architecture connects machines to the IT infrastructure, mastering various technologies, and ensuring zero trust security with micro-segmentation. It manages latencies between sites and central IT, enables two-way communication between machines and the IT world, and maintains high data quality from the shop floor.

For more insights from Brose (and Siemens) about IoT and data streaming with Apache Kafka, listen to the following interactive discussion.

Interactive Discussion with Siemens and Brose about Data Streaming and IoT

Brose and Siemens discussed with me

  • the practical strategies employed by Brose and Siemens to integrate data streaming in IoT for real-time data utilization.
  • the challenges faced by both companies in embracing data streaming, and reveal how they overcame barriers to maximize their potential with a hybrid cloud infrastructure.
  • how these enterprise architectures will be expanded, including real-time data sharing with customers, partners, and suppliers, and the potential impact of artificial intelligence (AI), including GenAI, on data streaming efforts, providing valuable insights to drive business outcomes and operational efficiency.
  • the significance of event-driven architectures and data streaming for enhanced manufacturing processes to improve overall equipment effectiveness (OEE) and seamlessly integrate with existing IT systems like SAP ERP and Salesforce CRM to optimize their operations.

Here is the video recording with Stefan Baer from Siemens and Sven Matuschzik from Brose:

Brose Industrial IoT Webinar with Kafka Confluent
Source: Confluent

Data Streaming with Apache Kafka to Unify Industrial IoT Workloads from Edge to Cloud

Many manufacturers leverage data streaming powered by Apache Kafka to unify the OT/IT world from edge sites like factories to the data center or public cloud for analytics and business applications.

I wrote a lot about data streaming with Apache Kafka and Flink in Industry 4.0, Industrial IoT and OT/IT modernization. Here are a few of my favourite articles:

What does your IoT architecture look like? Do you already use data streaming? What are the use cases and enterprise architecture? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in Manufacturing at Automotive Supplier Brose for Industrial IoT Use Cases appeared first on Kai Waehner.

]]>
Apache Kafka and Tinybird (ClickHouse) for Streaming Analytics HTTP APIs https://www.kai-waehner.de/blog/2024/04/04/apache-kafka-and-tinybird-clickhouse-for-streaming-analytics-http-apis/ Thu, 04 Apr 2024 06:54:12 +0000 https://www.kai-waehner.de/?p=6103 Apache Kafka became the de facto standard for data streaming. However, the combination of an event-driven architecture with request-response APIs is crucial for most enterprise architectures. This blog post explores how Tinybird innovates with a REST/HTTP layer on top of the open source analytics database ClickHouse in the cloud. Integrating Kafka with Tinybird, the benefits of fully managed services like Confluent Cloud, and customer stories from Factorial and FanDuel show why Kafka and analytics databases complement each other for more innovation and faster time-to-market.

The post Apache Kafka and Tinybird (ClickHouse) for Streaming Analytics HTTP APIs appeared first on Kai Waehner.

]]>
Apache Kafka became the de facto standard for data streaming. However, the combination of an event-driven architecture with request-response APIs is crucial for most enterprise architectures. This blog post explores how Tinybird innovates with a REST/HTTP layer on top of the open source analytics database ClickHouse in the cloud. Integrating Kafka with Tinybird, the benefits of fully managed services like Confluent Cloud, and customer stories from Factorial and FanDuel show why Kafka and analytics databases complement each other for more innovation and faster time-to-market.

Streaming Analytics SQL API with Apache Kafka Confluent ClickHouse Tinybird

Tinybird = ClickHouse Analytics + HTTP APIs

Tinybird is powered by the open source database ClickHouse, but differentiates itself in the analytics market by providing a platform that exposes real-time analytics as APIs.

ClickHouse and the Overwhelming Analytics Market

ClickHouse is an open-source column-oriented database management system designed for handling large-scale analytical workloads. The internal architecture is optimized for high performance and scalability, particularly for real-time analytics and OLAP (Online Analytical Processing) queries.

ClickHouse supports SQL queries and is optimized for fast data ingestion and querying of large volumes of data. The database is commonly used for data warehousing, time-series data analysis, and ad hoc analytics in various industries, including e-commerce, finance, and telecommunications.

ClickHouse is a great technology. However, in the analytics space, each database solution competes with various other open source frameworks, commercial products and fully managed cloud services. If you evaluate ClickHouse, you probably also evaluate:

  • Analytics services of the cloud providers (like Amazon Redshift, Azure Synapse Analytics, Google BigQuery, etc.)
  • Open source frameworks like Apache Druid (Imply), Pinot (StarTree), and others
  • Data analytics platforms like Snowflake, Databricks, et al.

The overlap between these products is huge. Of course, sweet spots exist for each technology. But it is a mass market, and often only the big players grow consumption significantly while small vendors end up in a niche.

Tinybird does NOT try to compete directly with all the other analytic databases. Instead, it added an intuitive layer on top of ClickHouse to differentiate its product offering.

Tinybird Adds APIs on Top of ClickHouse for Simple Integration and Publishing of Applications

Tinybird is a real-time data pipeline platform designed to help developers build and deploy scalable data products quickly and efficiently. It offers a range of tools and features for ingesting, processing, and serving real-time data streams, including data transformation, aggregation, and analytics.

Tinybird earns its reputation for simplicity and ease of use. It enables users to create custom data pipelines, applications, and APIs without the need for extensive coding or infrastructure setup. The analytics platform focuses on building operational and user-facing use cases with requirements for fresh data, fast queries, and high concurrency. Tinybird is NOT focused on traditional BI use cases served by data warehouses.

Tinybird differentiates from the competition by being more than just a managed database. The platform provides an underlying database powered by ClickHouse. It meets performance and scale needs while focusing on solving the pain around the database: data ingestion, data publication, and development workflow.

Data Streaming to APIs in Minutes
Source: Tinybird

Tinybird helps users to productize their data as HTTP APIs for integration with external systems (lakes, data mesh, user-facing applications). The product covers the entire end-to-end experience, from data ingestion to analytics to publishing APIs. An intuitive UI for interactive development and prototyping in conjunction with a full ‘data as config’ development life cycle makes the development of analytics applications straightforward. Data integration, analytics, and publication are defined as config files, stored in a Git repository. Tools help with automatic CI/CD for testing and deployment, plus a CLI and plugins for developer IDEs.

Relation of ClickHouse and Tinybird to Data Streaming with Apache Kafka

Data streaming with Apache Kafka refers to the process of ingesting, processing, and analyzing real-time data with a distributed streaming platform. Kafka enables the creation of real-time data pipelines that can handle high volumes of data from various sources, allowing for continuous data ingestion, processing, and delivery.

Kafka became the de facto standard protocol for data streaming, like Amazon S3 is the de facto standard for object storage. However…

Apache Kafka is NOT Complex Analytics

Apache Kafka is a database (even though many people disagree or don’t like this statement). But Kafka is not the right platform for complex analytics. Kafka is complementary to other databases like MySQL, MongoDB, Elasticsearch, ClickHouse, et al.

Learn more about this in my article “When NOT to use Apache Kafka” or the following lightboard video:

To be very clear: Data streaming also includes analytics capabilities. Stream processing enables continuous processing of data in motion in real-time at scale. Use cases include simple stateless streaming ETL and advanced stateful computations, including embedding AI and machine learning into a stream processor. However, the workloads differ from analytical databases like ClickHouse. For a better understanding of when to use stream processing, check out my blog about the perfect match between Apache Kafka and stream processing with Kafka Streams or Apache Flink.

Apache Kafka is NOT Request-Response APIs

Request-response communication with REST / HTTP is simple, well understood, and supported by most technologies, products, and SaaS cloud services. In contrast, data streaming with Apache Kafka is a fundamental change toward processing data continuously.

Data streaming with Apache Kafka and request-response APIs like HTTP/REST complement each other in most enterprise architectures. I wrote a lot about this in the past:

TL;DR: The question is not if you need Kafka or APIs, but how to combine them the best way in your architecture. Vendors like Confluent provide native HTTP APIs and connectors to produce and consume with the Kafka API using HTTP. But for more advanced use cases, solutions like Tinybird are perfect in combination with Kafka.

Apache Kafka + Tinybird = Streaming Analytics APIs

The Tinybird website explains the relation between data streaming with Kafka and Tinybird very well:

“Turn your Kafka Streams into actionable API Endpoints your teams can consume. Instead of building a new consumer every time you want to make sense of your Data Streams, write SQL queries and expose them as API endpoints. Easy to maintain. Always up-to-date. Fast as can be.”

Apache Kafka Tinybird ClickHouse Integration with Kafka Connect
Source: Tinybird

The Tinybird application development process is very simple:

  1. Connect to Kafka Topics via out-of-the-box Kafka integration.
  2. Store the events in a managed, columnar data source with schemas and a data contract for good data quality.
  3. Query the events with SQL to filter, aggregate, join, and enrich for fast and scalable analytics.
  4. Publish queries as dynamic, low-latency REST/HTTP APIs that scale.
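To make step 4 above concrete, here is a hedged sketch of how any application can then consume such a published endpoint over plain HTTP. The host, pipe name, and token are placeholders, and the exact URL format depends on your Tinybird region and workspace:

  import requests  # assumes the 'requests' package is installed

  # Placeholder values - replace with your Tinybird host, pipe name, and read token.
  TINYBIRD_HOST = "https://api.tinybird.co"
  PIPE_NAME = "top_products_last_hour"
  TOKEN = "p.XXXX"

  response = requests.get(
      f"{TINYBIRD_HOST}/v0/pipes/{PIPE_NAME}.json",
      params={"token": TOKEN},
      timeout=10,
  )
  response.raise_for_status()

  # The endpoint returns the SQL query result as JSON rows, ready for a UI, microservice, or mobile app.
  for row in response.json().get("data", []):
      print(row)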

Fully Managed Cloud Services for Streaming Analytics: Confluent Cloud + Tinybird

Fully managed cloud services offer simplified operations, scalability, high availability, security, compliance, cost savings, and performance optimization. They streamline infrastructure management, ensure reliability, enhance security, reduce costs, and optimize performance for businesses. Project teams focus on business logic and much faster time-to-market of new products and innovation.

Working for Confluent, I enjoy seeing through our fast-growing “Connect with Confluent” partner program how customers build innovative applications in a cloud-native, fully managed environment, including end-to-end integration and data governance.

Confluent Cloud has fully integrated Tinybird. Developers build new applications with HTTP APIs on top of data streaming with Kafka faster than ever before. Check out this screencast to learn how you can publish an API from a Kafka stream in four minutes leveraging Confluent Cloud and Tinybird.

Tinybird Confluent Cloud Integration Connector
Source: Tinybird

Data streaming truly decoupled the business domains. In the above diagram, you see a very common scenario: various consumers of the same business information (i.e., a single Kafka Topic). Some in real-time (like Tinybird), some in near real-time or batch (like Snowflake or Databricks).

Each developer can use different programming languages, databases, or SaaS analytics platforms. Apache Kafka unifies operational and analytical workloads. As a result, a Tinybird application aggregates transactional and analytical workloads to build new APIs.

Customer Stories using Kafka in Confluent Cloud and Tinybird for Real-Time Analytics

This section explores a few real world case studies that combine Apache Kafka and ClickHouse under the hood of Confluent Cloud’s and Tinybird’s SaaS cloud solutions.

Factorial Human Resources (HR) Software: Data Freshness for Users

Factorial provides human resources (HR) software solutions for over 8,000 small and medium-sized businesses. Their platform offers features such as employee onboarding, time tracking, leave management, performance reviews, and HR analytics. Factorial streamlines HR processes, boosts employee productivity, and allows businesses to effectively manage their workforce. People leaders can focus on people, not paperwork.

Factorial HR Software for Real-Time Analytics
Source: Factorial

With data streaming and real-time analytics leveraging fully managed cloud services from Confluent and Tinybird, Factorial has improved its data freshness and reduced query latency, leading to significantly faster user feature launches. Read the detailed success story on the Confluent blog.

FanDuel: Customer Personalization and Fraud Prevention in Gambling and Sports Betting

FanDuel operates in the online gambling and sports betting industry in the United States. I had the pleasure of hosting the company as a guest speaker at executive dinner events in London.

FanDuel is a popular sports betting and daily fantasy sports company. It provides online platforms and mobile apps for users to place bets on various sports events and take part in fantasy sports contests. FanDuel has gained a reputation for its user-friendly interface, a wide range of betting options, and innovative features in the online gambling market.

The entire gambling and betting industry leverages data streaming for real-time data processing, transactional payment processing, fraud detection, customer loyalty platforms, and many other use cases. Read more about this topic in the state of data streaming for the betting industry and Kafka case studies for betting.

FanDuel leverages the combination of Confluent Cloud and Tinybird for real-time analytics use cases with user-facing applications and mobile apps.

FanDuel’s case study states: “Fanduel uses Confluent and Tinybird to power real-time personalization across all their sports betting solutions to improve time-to-first-bet and reduce the risk of fraud.”

If you need APIs on Top of Data Streaming and Analytics, choose Kafka and Tinybird

… and if you want a fully managed end-to-end data pipeline with out-of-the-box connectivity, critical SLAs and cloud-native elasticity and pricing, go with Confluent Cloud and Tinybird. Of course, the data streaming landscape 2024 is broad. Absolutely fine to evaluate other vendors, too. 🙂

The conversations I had with FanDuel at our customer dinner showed that the real world challenge is not choosing the right technologies, but making them easy to use for fast time-to-market and elastic scale with reliable fully managed cloud services.

The motivation for this blog post was my meetings with these joint customers. I will host similar customer dinners in the coming months with other Confluent partners like Rockset and StarTree. I can’t wait to hear from joint customers about the benefits of leveraging a specific analytics engine together with a fully managed data streaming platform for product innovation and better customer experiences.

Do you already use Tinybird together with Kafka? Or how do you build scalable real-time APIs on top of your favorite analytics database? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka and Tinybird (ClickHouse) for Streaming Analytics HTTP APIs appeared first on Kai Waehner.

]]>
When NOT to Use Apache Kafka? (Lightboard Video) https://www.kai-waehner.de/blog/2024/03/26/when-not-to-use-apache-kafka-lightboard-video/ Tue, 26 Mar 2024 06:45:11 +0000 https://www.kai-waehner.de/?p=6262 Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post contains a lightboard video that gives you a twenty-minute explanation of the DOs and DONTs.

The post When NOT to Use Apache Kafka? (Lightboard Video) appeared first on Kai Waehner.

]]>
Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post contains a lightboard video that gives you a twenty-minute explanation of the DOs and DONTs.

When NOT to Use Apache Kafka?

Disclaimer: This blog post shares a lightboard video that explains when NOT to use Apache Kafka. For a much more detailed and technical blog post with various use cases and case studies, check out this blog post from 2022 (which is still valid today – whenever you read it).

What is Apache Kafka, and what is it NOT?

Kafka is often misunderstood. For instance, I still hear way too often that Kafka is a message queue. Part of the reason is that some vendors only pitch it for a specific problem (such as data ingestion into a data lake or data warehouse) to sell their products. So, in short:

Kafka is…

  • a scalable real-time messaging platform to process millions of messages per second.
  • a data streaming platform for massive volumes of big data analytics and small volumes of transactional data processing.
  • a distributed storage layer that provides true decoupling for backpressure handling, support of various communication protocols, and replayability of events with guaranteed ordering.
  • a data integration framework (Kafka Connect) for streaming ETL.
  • a data processing framework (Kafka Streams) for continuous stateless or stateful stream processing.

This combination of characteristics in a single platform makes Kafka unique (and successful).
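One of the characteristics listed above, replayability with guaranteed ordering, is easy to demonstrate: a consumer can simply rewind a partition and re-read history. A minimal sketch with the confluent-kafka Python client (broker and topic names are hypothetical):

  from confluent_kafka import Consumer, TopicPartition, OFFSET_BEGINNING

  consumer = Consumer({
      "bootstrap.servers": "broker:9092",
      "group.id": "replay-demo",
      "enable.auto.commit": False,
  })

  # Assign partition 0 of a hypothetical topic and rewind to the very first event.
  consumer.assign([TopicPartition("orders.curated", 0, OFFSET_BEGINNING)])

  # Events are re-delivered in exactly the order they were originally written.
  for _ in range(100):
      msg = consumer.poll(1.0)
      if msg is None or msg.error():
          continue
      print(msg.offset(), msg.value())

  consumer.close()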

Kafka is NOT…

  • a proxy for millions of clients (like mobile apps) – but Kafka-native proxies (like REST or MQTT) exist for some use cases.
  • an API Management platform – but these tools are usually complementary and used for the creation, life cycle management, or the monetization of Kafka APIs.
  • a database for complex queries and batch analytics workloads – but good enough for transactional queries and relatively simple aggregations (especially with ksqlDB).
  • an IoT platform with features such as device management – but direct Kafka-native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the appropriate approach for (some) use cases.
  • a technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are different software!

For these reasons, Kafka is complementary, not competitive, to these other technologies. Choose the right tool for the job and combine them!

Lightboard Video: When NOT to use Apache Kafka

The following video explores the key concepts of Apache Kafka. Afterwards, the DOs and DONTs of Kafka show how to complement data streaming with other technologies for analytics, APIs, IoT, and other scenarios.

Data Streaming Vendors and Cloud Services

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations.

Plenty of vendors offer Kafka platforms and cloud services. Many complementary open source stream processing frameworks like Apache Flink and related cloud offerings emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to get market share leveraging the Kafka protocol. Check out the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook on potential new entrants in 2025.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

Apache Kafka is a Data Streaming Platform: Combine it with other Platforms when needed!

Meanwhile, over 150,000 organizations use Apache Kafka. The Kafka protocol is the de facto standard for many open source frameworks, commercial products and serverless cloud SaaS offerings.

However, Kafka is not an all-rounder for every use case. Many projects combine Kafka with other technologies, such as databases, data lakes, data warehouses, IoT platforms, and so on. Additionally, Apache Flink is becoming the de facto standard for stream processing (but Kafka Streams is not going away and is the better choice for specific use cases).

Where do you (not) use Apache Kafka? What other technologies do you combine Kafka with? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When NOT to Use Apache Kafka? (Lightboard Video) appeared first on Kai Waehner.

]]>
The State of Data Streaming for Healthcare with Apache Kafka and Flink https://www.kai-waehner.de/blog/2023/11/27/the-state-of-data-streaming-for-healthcare-in-2023/ Mon, 27 Nov 2023 13:52:35 +0000 https://www.kai-waehner.de/?p=5841 This blog post explores the state of data streaming for the healthcare industry powered by Apache Kafka and Apache Flink. IT modernization and innovation with pioneering technologies like sensors, telemedicine, or AI/machine learning are explored. I look at enterprise architectures and customer stories from Humana, Recursion, BHG (former Bankers Healthcare Group), and more. A complete slide deck and on-demand video recording are included.

The post The State of Data Streaming for Healthcare with Apache Kafka and Flink appeared first on Kai Waehner.

]]>
This blog post explores the state of data streaming for the healthcare industry. The digital disruption combined with growing regulatory requirements and IT modernization efforts require a reliable data infrastructure, real-time end-to-end observability, fast time-to-market for new features, and integration with pioneering technologies like sensors, telemedicine, or AI/machine learning. Data streaming allows integrating and correlating legacy and modern interfaces in real-time at any scale to improve most business processes in the healthcare sector much more cost-efficiently.

I look at trends in the healthcare industry to explore how data streaming helps as a business enabler, including customer stories from Humana, Recursion, BHG (former Bankers Healthcare Group), Evernorth Health Services, and more. A complete slide deck and on-demand video recording are included.

The State of Data Streaming for Healthcare in 2023 with Apache Kafka and Flink

The digitalization of the healthcare sector and its disruptive use cases are exciting. Countries where healthcare is not part of the public administration innovate quickly. However, regulation and data privacy are crucial across the world. And even innovative technologies and cloud services need to comply with the law and, in parallel, connect to legacy platforms and protocols.

Regulation and interoperability

Healthcare often does not have a choice. Government regulations must be implemented by a specific deadline. IT modernization, adoption of new technologies, and integration with the legacy world are mandatory. Many regulations demand Open APIs and interfaces. But even if not enforced, the public sector does itself a favour by adopting open technologies for data sharing between different sectors and their members.

A concrete example: Interoperability and Patient Access final rule (CMS-9115-F), as explained by the US government, “aims to put patients first, giving them access to their health information when they need it most and, in a way, they can best use it.

  • Interoperability = Interoperability is the ability of two or more systems to exchange health information and use the information once it is received.
  • Patient Access = Patient Access refers to the ability of consumers to access their health care records.

Lack of seamless data exchange in healthcare has historically detracted from patient care, leading to poor health outcomes, and higher costs. The CMS Interoperability and Patient Access final rule establishes policies that break down barriers in the nation’s health system to enable better patient access to their health information, improve interoperability and unleash innovation while reducing the burden on payers and providers.

Patients and their healthcare providers will be more informed, which can lead to better care and improved patient outcomes, while reducing burden. In a future where data flows freely and securely between payers, providers, and patients, we can achieve truly coordinated care, improved health outcomes, and reduced costs.”

Digital disruption and automated workflows

Gartner has a few interesting insights about the evolution of the healthcare sector. The digital disruption is required to handle revenue reduction and revenue reinvention because of economic pressure, scarce and expensive talent, and supply challenges:

Challenges for the Digital Disruption in the Health System

Gartner points out that real-time workflows and automation are critical across the entire health process to enable an optimal experience:

Real Time Automated Interoperable Data and Workflows

Therefore, data streaming is very helpful in implementing new digitalized healthcare processes.

Data streaming in the healthcare industry

Adopting healthcare trends like telemedicine, automated member service with Generative AI (GenAI), or automated claim processing is only possible if enterprises in the healthcare sector can provide and correlate information at the right time in the proper context. Real-time, which means using the information in milliseconds, seconds, or minutes, is almost always better than processing data later:

Use Cases for Real-Time Data Streaming in the Healthcare Industry with Apache Kafka and Flink

Data streaming combines the power of real-time messaging at any scale with storage for true decoupling, data integration, and data correlation capabilities. Apache Kafka is the de facto standard for data streaming.

The following blog series about data streaming with Apache Kafka in the healthcare industry is a great starting point to learn more about data streaming in the health sector, including a few industry-specific and Kafka-powered case studies:

The healthcare industry applies various software development and enterprise architecture trends for cost, elasticity, security, and latency reasons. The three major topics I see these days at customers are:

  • Event-driven architectures (in combination with request-response communication) to enable domain-driven design and flexible technology choices
  • Data mesh for building new data products and real-time data sharing with internal platforms and partner APIs
  • Fully managed SaaS (whenever doable from compliance and security perspective) to focus on business logic and faster time-to-market

Let’s look deeper into some enterprise architectures that leverage data streaming for healthcare use cases.

Event-driven architecture for integration and IT modernization

IT modernization requires integration between legacy and modern applications. The integration challenges include different protocols (often proprietary and complex), various communication paradigms (asynchronous, request-response, batch), and SLAs (transactions, analytics, reporting).

Here is an example of a data integration workflow combining clinical health data and claims in EDI / EDIFACT format, data from legacy databases, and modern microservices:

Public Healthcare Data Automation with Data Streaming

One of the biggest problems in IT modernization is data consistency between files, databases, messaging platforms, and APIs. That is a sweet spot for Apache Kafka: Providing data consistency between applications no matter what technology, interface or API they use.
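A hedged sketch of what one such integration step can look like in code: a small Python service (confluent-kafka client) consumes raw legacy claim records from one topic and republishes them as structured JSON events for all modern consumers. The topic names, delimiter, and field positions are purely illustrative; real EDI/EDIFACT parsing is far more involved and typically handled by dedicated libraries or integration tools:

  import json
  from confluent_kafka import Consumer, Producer

  consumer = Consumer({
      "bootstrap.servers": "broker:9092",
      "group.id": "claims-normalizer",
      "auto.offset.reset": "earliest",
  })
  producer = Producer({"bootstrap.servers": "broker:9092"})
  consumer.subscribe(["claims.legacy.raw"])  # hypothetical topic with delimited legacy records

  while True:
      msg = consumer.poll(1.0)
      if msg is None or msg.error():
          continue
      # Illustrative only: pretend each record is 'claim_id|member_id|amount'.
      parts = msg.value().decode("utf-8").strip().split("|")
      if len(parts) != 3:
          continue  # in a real pipeline, route such records to a dead-letter topic
      try:
          claim = {"claim_id": parts[0], "member_id": parts[1], "amount": float(parts[2])}
      except ValueError:
          continue  # malformed amount field
      # Downstream consumers (microservices, analytics, AI) all see the same consistent event.
      producer.produce("claims.normalized", key=claim["claim_id"], value=json.dumps(claim))
      producer.poll(0)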

Data mesh for real-time data sharing and consistency

Data sharing across business units is important for any organization. The healthcare industry has to combine very different data sets, like member and clinical data, claims and payment transactions, and 3rd party interfaces.

 

Data Streaming with Apache Kafka and Flink in the Healthcare Sector

Data consistency is one of the most challenging problems in the healthcare sector. Apache Kafka ensures data consistency across all applications and databases, whether these systems operate in real-time, near-real-time, or batch.

One sweet spot of data streaming is that you can easily connect new applications to the existing infrastructure or modernize existing interfaces, like migrating from an on-premise data warehouse to a cloud SaaS offering.

New customer stories for data streaming in the healthcare sector

The innovation is often slower in the healthcare sector. Automation and digitalization change how healthcare companies process member data, execute claim processing, integrate payment processors, or create new business models with telemedicine or sensor data in hospitals.

Most healthcare companies use a hybrid cloud approach to improve time-to-market, increase flexibility, and focus on business logic instead of operating all IT infrastructure on premises. The integration between legacy protocols like EDIFACT and modern applications is still one of the toughest challenges.

Here are a few customer stories from healthcare organizations for IT modernization and innovation with new technologies:

  • BHG Financial (formerly: Bankers Healthcare Group): Direct lender for healthcare professionals offering loans, credit card, insurance
  • Evernorth Health Services: Hybrid integration between on-premise mainframe and microservices on AWS cloud
  • Humana: Data integration and analytics at the point of care
  • Recursion: Accelerating drug discovery with a hybrid machine learning architecture

Resources to learn more

This blog post is just the starting point. Learn more about data streaming with Apache Kafka and Apache Flink in the healthcare industry in the following on-demand webinar recording, the related slide deck, and further resources, including pretty cool lightboard videos about use cases.

On-demand video recording

The video recording explores the healthcare industry’s trends and architectures for data streaming. The primary focus is the data streaming architectures and case studies.

I am excited to have presented this webinar in my interactive light board studio:

Lightboard Video about Apache Kafka and Flink in Healthcare

This creates a much better experience, especially in a time after the pandemic, when many people suffer from “Zoom fatigue”.

Check out our on-demand recording:

Video: Data Streaming in Real Life in Healthcare

Slides

If you prefer learning from slides, check out the deck used for the above recording:

Fullscreen Mode

Case studies and lightboard videos for data streaming in the healthcare industry

The state of data streaming for healthcare in 2023 is interesting. IT modernization is the most important initiative across most healthcare companies and organizations. This includes cost reduction by migrating from legacy infrastructure like the mainframe, hybrid cloud architectures with bi-directional data sharing, and innovative new use cases like telehealth.

We recorded lightboard videos showing the value of data streaming simply and effectively. These five-minute videos explore the business value of data streaming, related architectures, and customer stories. Here is an example of cost reduction through mainframe offloading.

Healthcare is just one of many industries that leverages data streaming with Apache Kafka and Apache Flink. Every month, we talk about the status of data streaming in a different industry. Manufacturing was the first. Financial services second, then retail, telcos, gaming, and so on… Check out my other blog posts.

How do you modernize IT infrastructure in the healthcare sector? Do you already leverage data streaming with Apache Kafka and Apache Flink? Maybe even in the cloud as a serverless offering? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post The State of Data Streaming for Healthcare with Apache Kafka and Flink appeared first on Kai Waehner.

]]>
Modernizing SCADA Systems and OT/IT Integration with Data Streaming https://www.kai-waehner.de/blog/2023/09/10/modernizing-scada-systems-and-ot-it-integration-with-data-streaming/ Sun, 10 Sep 2023 12:56:13 +0000 https://www.kai-waehner.de/?p=5304 SCADA control systems are a vital component of IT/OT modernization. The old IT/OT infrastructure and SCADA system are monolithic, proprietary, not scalable, and miss open APIs based on standard interfaces. This post explains the modernization of such a system based on the real-life use case of 50Hertz, a transmission system operator for electricity in Germany. A lightboard video is included.

The post Modernizing SCADA Systems and OT/IT Integration with Data Streaming appeared first on Kai Waehner.

]]>
SCADA control systems are a vital component of IT/OT modernization. The old IT/OT infrastructure and SCADA system are monolithic, proprietary, not scalable, and miss open APIs based on standard interfaces. This post explains the modernization of such a system based on the real-life use case of 50Hertz, a transmission system operator for electricity in Germany. Two common business goals drove them: Improve the Overall Equipment Effectiveness (OEE) and stay innovative. A lightboard video about the related data streaming enterprise architecture is included.

Modernization of OT IT and SCADA with Data Streaming

The State of Data Streaming for Manufacturing in 2023

The evolution of industrial IoT, manufacturing 4.0, and digitalized B2B and customer relations require modern, open, and scalable information sharing. Data streaming allows integrating and correlating data in real-time at any scale. Trends like software-defined manufacturing and data streaming help modernize and innovate the entire engineering and sales lifecycle.

I have recently presented an overview of trending enterprise architectures in the manufacturing industry and data streaming customer stories from BMW, Mercedes, Michelin, and Siemens. A complete slide deck and on-demand video recording are included:

This blog post explores one of the enterprise architectures and case studies in more detail: Modernization of legacy and proprietary monoliths and SCADA systems to a scalable, open platform with real-time data integration capabilities.

What is a SCADA System? And how does Data Streaming help?

Supervisory control and data acquisition (SCADA) is a control system architecture comprising computers, networked data communications, and graphical user interfaces for high-level supervision of machines and processes. It also covers sensors and other devices, such as programmable logic controllers, which interface with process plants or machinery.

Supervisory control and data acquisition - SCADA

Data streaming helps connect high-volume sensor data from machines, PLCs, robots, and other IoT devices. This is possible in real-time at scale with stream processing. The de facto standard for data streaming is Apache Kafka and its ecosystem, including Kafka Streams and Kafka Connect.

Enterprises leverage Apache Kafka as the next generation of Data Historians. Integrating and pre-processing the events with data streaming is a prerequisite for data correlation with information systems like the MES or ERP (that might run at the edge or more often in the cloud).
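A minimal sketch of that “next-generation data historian” idea: continuously consume raw sensor events and maintain a simple per-machine aggregate before handing curated values to the MES or ERP. In practice this logic would usually run in Kafka Streams or Apache Flink with proper time windows; the plain Python loop below (confluent-kafka client, hypothetical topics and field names) only illustrates the principle:

  import json
  from collections import defaultdict
  from confluent_kafka import Consumer, Producer

  consumer = Consumer({
      "bootstrap.servers": "broker:9092",
      "group.id": "historian-downsampler",
      "auto.offset.reset": "latest",
  })
  producer = Producer({"bootstrap.servers": "broker:9092"})
  consumer.subscribe(["factory.sensors"])  # hypothetical raw sensor topic

  counts = defaultdict(int)
  sums = defaultdict(float)

  while True:
      msg = consumer.poll(1.0)
      if msg is None or msg.error():
          continue
      event = json.loads(msg.value())
      machine = event["machine_id"]
      counts[machine] += 1
      sums[machine] += event["temperature_c"]
      # Every 60 raw readings per machine, emit one averaged value (a naive tumbling window).
      if counts[machine] == 60:
          aggregate = {"machine_id": machine, "avg_temperature_c": round(sums[machine] / 60, 2)}
          producer.produce("factory.sensors.1min", key=machine, value=json.dumps(aggregate))
          producer.poll(0)
          counts[machine] = 0
          sums[machine] = 0.0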

50Hertz: A cloud-native SCADA system built with Apache Kafka

50Hertz is a transmission system operator for electricity in Germany. The company secures the electricity supply for 18 million people in northern and eastern Germany.

The infrastructure must operate 24 hours a day, seven days a week. Various shift teams and a mission-critical SCADA infrastructure supervise and control the OT systems.

50Hertz’s next-generation Modular Control Center System (MCCS) leverages a central, scalable, event-based integration platform based on Confluent:

Cloud-native SCADA system built with Apache Kafka at 50hertz
Source: 50hertz

The first four containers include the Supervisory & Control (SCADA), Load Frequency Control (LFC), and Time Series Management & Forecasting applications. Each container can have multiple services/functions that follow the event-based microservices pattern.

50Hertz provides central governance for security, protocols, and data schemas (CIM compliant) between platform containers/modules. The cloud-native 24/7 SCADA system is developed in the cloud and deployed in safety-critical edge environments.

50Hertz presented their OT/IT and SCADA modernization leveraging data streaming with Apache Kafka at the Confluent Data in Motion tour 2021. Unfortunately, the on-demand video recording is available only in German. Therefore, in another blog post, I wrote more about the case study: “A cloud-native SCADA System for Industrial IoT built with Apache Kafka”.

Lightboard Video: How Data Streaming Modernizes SCADA and OT/IT

Here is a five-minute lightboard video that describes how data streaming helps with modernizing monolith and proprietary SCADA infrastructure and OT/IT environments:

If you liked this video, make sure to follow the YouTube channel for many more lightboard videos across all industries.

Apache Kafka glues together the old and new OT/IT World

The 50Hertz case study showed how to modernize an existing legacy infrastructure with cloud-native technologies, whether you deploy at the edge or in the public cloud. For more case studies, check out the free “The State of Data Streaming in Manufacturing” on-demand recording or read the related blog post.

A common question in these scenarios is which communication and integration protocol to use when you move away from proprietary legacy PLCs and OT interfaces. MQTT and OPC-UA have established themselves as excellent standards with different sweet spots. Data Streaming with Apache Kafka is complementary, not competitive. Learn more by reading “OPC UA, MQTT, and Apache Kafka – The Trinity of Data Streaming in IoT”.

How do you leverage data streaming in your manufacturing use cases? Do you deploy at the edge, in the cloud, or both? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Modernizing SCADA Systems and OT/IT Integration with Data Streaming appeared first on Kai Waehner.

]]>
Transforming the Global Supply Chain with Data Streaming and IoT https://www.kai-waehner.de/blog/2023/02/06/transforming-global-supply-chain-with-iot-and-data-streaming/ Mon, 06 Feb 2023 16:05:38 +0000 https://www.kai-waehner.de/?p=5108 The research company IoT Analytics found eight key technologies transforming the future of the global supply chain. This article explores how data streaming helps to innovate in this area. Real-world case studies from global players such as BMW, Bosch, and Walmart show the value of real-time data streaming to improve the supply chain by building use cases such as automated intralogistics, track and trace of vehicles, and proactive and context-specific decision-making with MES and ERP integration. 

The post Transforming the Global Supply Chain with Data Streaming and IoT appeared first on Kai Waehner.

]]>
A supply chain is a complex logistics system that converts raw materials into finished products distributed to end-consumers. The research company IoT Analytics found eight key technologies transforming the future of global supply chains. This article explores how data streaming helps to innovate in this area. Real-world case studies from global players such as BMW, Bosch, and Walmart show the value of real-time data streaming to improve the supply chain by building use cases such as automated intralogistics, track and trace of vehicles, and proactive and context-specific decision-making with MES and ERP integration.

Global Supply Chain with IoT and Data Streaming

8 key technologies transforming the future of global supply chains

“The Digital Supply Chain market is starting to accelerate, according to new research by IoT Analytics. Eight supply chain technology innovations are helping to make global supply chains more robust, including AS/RS technology, intralogistics robots, IoT track and trace, AI-enabled software, and supply chain digital twins.”

8 key technologies transforming the future of global supply chains

“26 weeks (or half a year)—that is how long companies have to wait for their semiconductor orders, on average. In some cases, it takes much longer. Before the current supply shortage, the average had been approximately 14 weeks—significantly shorter. This is just one example of supply chain issues stressing the economy worldwide.”

Data Streaming with Apache Kafka across the global supply chain

Real-time data beats slow data across global supply chains. That is true in any industry. Most modern supply chains rely on Apache Kafka, the de facto standard for data streaming, to improve the information flow across internal and external systems.

Data Streaming across the Supply Chain Workflows

I wrote about data streaming with Apache Kafka and its broader ecosystem a lot in the past:

To be clear: Data Streaming is a concept, and Apache Kafka is a technology to integrate and correlate incoming and outgoing information. Data streaming is not competing with other supply chain products or technologies. Apache Kafka is either part of the solution (e.g., within an ERP or MES cloud service) or connecting the different interfaces (e.g., sensors, robots, IT applications, analytics platforms).

With this background, let’s look at IoT Analytics’ eight key technologies transforming the future of global supply chains and how data streaming helps with each of them. The real-world examples below show that data streaming is not the key technology IoT Analytics identified in each section, but an enabling part of the overall solution or implementation.

1. Automated storage and retrieval systems

The VDMA defines intralogistics as all processes within a company related to the connection and interaction of internal systems for material flow, automated guided vehicles, logistics, production, and goods transport on the company premises.

Automated storage and retrieval systems (AS/RS) replace conveyors, forklifts, and racks. These systems are purpose-built machines with embedded software. However, data synchronization between these machines and the overall logistics process (which includes many other applications) is critical to automate and improve intralogistics as much as possible.

Intralogistics combines AS/RS systems with standard software, such as the warehouse management system (WMS), enterprise resource planning (ERP), and manufacturing execution system (MES). Most of these systems consume real-time data from various sensors and applications. Data streaming is a perfect fit for such standard software. Hence, most modern, cloud-native MES and ERP systems leverage data streaming powered by Apache Kafka.

Critical Manufacturing is a leading MES powered by Apache Kafka. It combines MES transaction workloads with big data IoT analytics. Data from AS/RS and other IoT systems is ingested, stored, processed, transformed, and analyzed in real time. Data streaming provides a durable, distributed, highly scalable, unified analytics platform for large-scale online or offline data processing embedded into the MES.

Critical Manufacturing - a Cloud-native MES powered by Apache Kafka
Source: Critical Manufacturing
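The following sketch is not from the Critical Manufacturing product, but illustrates the underlying integration pattern with the standard Kafka consumer API: AS/RS events are read from a topic and handed to MES or WMS logic. The topic name, consumer group, and broker address are hypothetical placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class AsRsEventConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // placeholder
        props.put("group.id", "mes-asrs-sync");                    // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("asrs.machine.events"));     // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Hand the event to WMS/ERP/MES workflows, e.g. updating a stock
                    // location or triggering a pick order (stubbed out here).
                    System.out.printf("AS/RS event for unit %s: %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```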

2. Sourcing software with market intelligence

IoT Analytics defines sourcing and supplier management software as a tool that helps companies find suppliers and ensure the right components are available in the right quantity to maintain their operations. The latest innovation in this segment is the addition of market intelligence that allows the procurement team to act more strategically.

Walmart leverages data streaming across the entire supply chain, including sourcing and procurement, to enable real-time inventory management, cost-efficient procurement, and a better customer experience:

Inventory Management with Apache Kafka at Walmart
Source: Walmart

Here is a quote from the VP of Walmart Cloud: “Walmart is a $500 billion in revenue company, so every second is worth millions of dollars. Having Confluent as our partner has been invaluable. Kafka and Confluent are the backbone of our digital omnichannel transformation and success at Walmart.”

3. IoT track and trace devices

In the distribution and logistics of many types of products, track and trace determines the current and past locations (and other information) of a unique item or property. IoT devices provide the sensors and connectivity modules (e.g., via a cellular network).

Data Streaming with Apache Kafka is a key enabler for IoT at Bosch Power Tools:

Data Streaming at Bosch Power Tools for IoT Track and Trace
Source: Bosch

Bosch tracks, manages, and locates tools and other equipment anytime and anywhere from the warehouse to the job site with Confluent Cloud.
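As a simplified illustration of the producer side of such a track-and-trace pipeline (not Bosch’s actual implementation), each device can publish position events keyed by the unique asset ID, so that all updates for one tool stay ordered within a partition. Topic name, asset ID, and payload format are assumptions:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AssetPositionProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One event per GPS fix, keyed by the asset ID (hypothetical ID and payload)
            String assetId = "tool-4711";
            String position = "{\"lat\":48.1371,\"lon\":11.5754,\"ts\":\"2023-02-06T10:15:00Z\"}";
            producer.send(new ProducerRecord<>("asset.position.updates", assetId, position));
        } // close() flushes the outstanding record before the process exits
    }
}
```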

4. Supply chain digital twins

Digital twin refers to a digital replica of potential and actual physical assets (physical twin), processes, people, places, systems, and devices.

The term ‘Digital Twin’ usually refers to the digital copy of a single asset. In the real world, many digital twins exist. The term ‘Digital Thread’ spans the entire life cycle of one or more digital twins.

In a supply chain context, the digital thread is a digital model of the supply chain process to enable monitoring, simulation, and predictions.

Various IoT architectures exist for building a digital twin or digital thread with data streaming.
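One common pattern, sketched below with Kafka Streams, is to materialize the latest event per asset as a KTable so the “digital twin” state becomes a continuously updated, queryable state store. The topic, store, and application names are illustrative assumptions, not taken from the Mercury Systems architecture described next:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

import java.util.Properties;

public class DigitalTwinState {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "digital-twin");        // hypothetical app ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Treat the asset event topic as a changelog: the latest value per asset ID
        // becomes the queryable digital twin state, backed by a local state store.
        KTable<String, String> assetTwin = builder.table("asset.events",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("asset-twin-store"));

        new KafkaStreams(builder.build(), props).start();
    }
}
```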

Mercury Systems is a global technology company serving the aerospace and defense industry. Mercury built a Digital Thread powered by data streaming to bring together design and product information across the product life cycle:

Digital Twin and Digital Thread at Mercury with Data Streaming
Source: Mercury Systems

The following technologies are included:

  • A Mendix-based portal with combined data from PLM/ERP/MES
  • Confluent for cloud-native and reliable real-time event streaming across applications
  • AI and ML technologies

The digital thread lets Mercury Systems view all data in one location using common tools. Further benefits of data streaming are faster time to market, the ability to scale, an improved innovation pipeline, and reduced cost of failure.

5. Intralogistics robots

Smart factories include various robots to automate shop floor processes and produce tangible goods. For instance, autonomous mobile robots (AMRs) are vehicles that use onboard sensors to move materials around a facility autonomously. Many carmakers already use these robots in their factories. I visited the Mercedes “Factory 56” in 2022, a lighthouse project of Mercedes: the factory no longer uses paper, only digital services, and AMRs drive around you while you look at the production line and workers.

Robots cannot work alone. They need to communicate with other systems and applications to bring the right parts to equipment and workers along the production line.

BMW Group leverages data streaming to connect all data from its smart factories to the cloud. Robots ingest the data into a fully managed Kafka cluster in the cloud. BMW makes all data generated by its 30+ production facilities and worldwide sales network available in real time to anyone across the global business.
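The configuration below is a minimal, generic sketch of how an edge producer typically connects to a fully managed Kafka cluster such as Confluent Cloud. The cluster address, API credentials, topic, and payload are placeholders and not taken from the BMW case study:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class EdgeToCloudProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder values: bootstrap server and API credentials depend on the cluster.
        props.put("bootstrap.servers", "<cluster>.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"<api-key>\" password=\"<api-secret>\";");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");                // favor durability for logistics data
        props.put("enable.idempotence", "true"); // avoid duplicates on retries

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Robot telemetry keyed by plant + robot ID (hypothetical topic and key scheme)
            producer.send(new ProducerRecord<>("plant.robot.telemetry",
                    "plant-01/amr-42", "{\"battery\":0.87,\"position\":\"aisle-7\"}"));
        }
    }
}
```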

BMW’s use cases include:

  • Logistics and supply chain in global plants
  • The right stock in place (physically and in ERP systems like SAP)
  • OT / IoT integration with open standards like OPC-UA
  • Just-in-time and just-in-sequence manufacturing

Here is why BMW chose data streaming for this (and many other) use cases:

  • Why Kafka? Decoupling. Transparency. Innovation.
  • Why Confluent? Stability is key in manufacturing.
  • Why Confluent Cloud? Focus on business logic.
  • Decoupling between logistics and production systems

6. AI-enabled inventory optimization

“Modern inventory planning is a very data-heavy task with companies compiling millions of data points. AI-enabled inventory optimization software is helping companies crunch these numbers much faster than before. It automates, streamlines, and controls the in- and outbound inventory flows and improves the process with AI capabilities.” as IoT Analytics describes.

AO.com is one of many retailers that leverage data streaming for real-time inventory optimization across the supply chain. The electrical retailer provides a hyper-personalized online retail experience, turning each customer visit into a one-on-one marketing opportunity. This is only possible because AO.com correlates inventory data with historical customer data and real-time digital signals like a click in the mobile app.

AO Hyperpersonalized Marketing with Real-Time Inventory
Source: AO.com

Context-specific pricing, discounts, upselling, and other techniques are only possible because of AI-powered real-time decision-making based on inventory information across the supply chain. Information from warehouses, department stores, suppliers, shipping, and similar inventory-related data sources must be correlated to maximize customer satisfaction and revenue growth and increase customer conversions.
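Here is a hedged sketch of how such a correlation can be expressed with Kafka Streams: clickstream events are joined with a continuously updated inventory table keyed by product ID. The topic names and payload format are assumptions, not AO.com’s actual implementation:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class ClickInventoryJoin {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-inventory-join"); // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Clickstream events keyed by product ID (hypothetical topic names)
        KStream<String, String> clicks = builder.stream("product.clicks");
        // Latest stock level per product ID, maintained as a table
        KTable<String, String> inventory = builder.table("inventory.levels");

        // Enrich every click with the current stock level so downstream services can
        // decide on discounts, upselling, or "only 2 left" messaging in real time.
        clicks.join(inventory, (click, stock) -> click + " | stock=" + stock)
              .to("clicks.enriched.with.inventory");

        new KafkaStreams(builder.build(), props).start();
    }
}
```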

7. Proactive field service

IoT Analytics describes a pain we have all experienced with telcos and many other industries: “IoT-based proactive field service software provides a step up from traditional field service of assets running in the field. While traditional field service software mostly works reactively by allocating field service technicians to a site after a failure has been reported, proactive field service uses IoT and predictive maintenance to send field service technicians to a remote site before the asset fails.”

British Telecom (BT) is a telco that operates in ~180 countries and is the largest telecommunications enterprise in the UK. BT leverages data streaming to expose real-time data events externally to improve field service. Consequently, this provides a better customer experience and creates additional revenue streams.

For instance, a broadband outage can be recognized when it happens, or even before it happens, because of real-time data. This enables proactive field service. Customers receive notifications in real time. Premium customers even receive extra mobile data on their phones to tether a connection while the outage persists.
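The following Kafka Streams sketch shows one way to implement such routing: (predicted) outage events are filtered and split into premium and standard notification topics. The topic names, payload format, and tier flag are illustrative assumptions, not BT’s actual implementation:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class ProactiveFieldService {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "proactive-field-service"); // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Network telemetry keyed by line/customer ID (hypothetical topic and payload format)
        KStream<String, String> events = builder.stream("network.health.events");

        // Keep only events that indicate a (predicted) outage, then route premium
        // customers to a dedicated topic that also triggers the tethering data top-up.
        events.filter((lineId, payload) -> payload.contains("OUTAGE"))
              .split()
              .branch((lineId, payload) -> payload.contains("tier=premium"),
                      Branched.withConsumer(s -> s.to("notifications.premium")))
              .defaultBranch(Branched.withConsumer(s -> s.to("notifications.standard")));

        new KafkaStreams(builder.build(), props).start();
    }
}
```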

British Telecom’s enterprise architecture comprises a hybrid and multi-cloud data streaming deployment:

British Telecom Data Streaming Enterprise Architecture
Source: British Telecom

8. Supply chain visibility software

Supply chain visibility is crucial in creating supply chain networks that survive disruptions like the global COVID-19 pandemic or the Ukraine war. Questions like “When will my supplies arrive?” or “Which alternative supplier has stock to ship?” can only be answered with real-time information across the supply chain.

BAADER is a worldwide manufacturer of innovative machinery for the food processing industry. They run an IoT-based and data-driven food value chain on Confluent Cloud.

The Kafka-based infrastructure runs as a fully managed service in the cloud to provide end-to-end supply chain visibility. A single source of truth shows the information flow across factories and regions along the food value chain:

Food Supply Chain at Baader with Confluent Cloud
Source: Baader

Business-critical operations are available 24/7 for tracking, calculations, alerts, etc.

From a technical perspective, MQTT provides connectivity to machines and GPS data from vehicles. ksqlDB continuously processes the data in motion directly after ingestion from the edge. Kafka Connect connectors integrate applications and IT systems, such as Elasticsearch, MongoDB, and AWS S3.
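As an example of that last integration step, the snippet below registers an Elasticsearch sink via the Kafka Connect REST API from Java. The Connect worker URL, topic, and connection settings are placeholders; the connector class and config keys follow the Confluent Elasticsearch sink connector and should be verified against the deployed version. This is a generic sketch, not Baader’s actual configuration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterElasticsearchSink {

    public static void main(String[] args) throws Exception {
        // Connector definition: stream a processed telemetry topic into Elasticsearch.
        String connectorJson = """
                {
                  "name": "telemetry-elasticsearch-sink",
                  "config": {
                    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
                    "topics": "vehicle.telemetry.processed",
                    "connection.url": "http://elasticsearch:9200",
                    "key.ignore": "true"
                  }
                }
                """;

        // POST the connector config to the Kafka Connect REST API (placeholder worker URL)
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```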

Data streaming optimizes the global supply chain

Innovative IoT technologies transform the global supply chain. End-to-end visibility in real time, cost reduction, and better customer experiences are the consequence.

Case studies from companies across different industries showed how innovative IoT technologies and data streaming with the de facto standard Apache Kafka enable innovation while keeping the different business units and technologies decoupled from each other. Only a scalable and distributed real-time data streaming platform empowers such innovation.

How do you leverage data streaming to improve the supply chain? What are your use cases and architectures? Which IoT technologies do you integrate with Apache Kafka? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Transforming the Global Supply Chain with Data Streaming and IoT appeared first on Kai Waehner.

]]>