Analytics Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/analytics/
Technology Evangelist - Big Data Analytics - Middleware - Apache Kafka

Powering Fantasy Sports at Scale: How Dream11 Uses Apache Kafka for Real-Time Gaming
https://www.kai-waehner.de/blog/2025/05/19/powering-fantasy-sports-at-scale-how-dream11-uses-apache-kafka-for-real-time-gaming/ (Mon, 19 May 2025)

Fantasy sports has evolved into a data-driven, real-time digital industry with high stakes and massive user engagement. At the heart of this transformation is Dream11, India’s leading fantasy sports platform, which relies on Apache Kafka to deliver instant updates, seamless gameplay, and trustworthy user experiences for over 230 million fans. This blog post explores how Dream11 leverages Kafka to meet extreme traffic demands, scale infrastructure efficiently, and maintain real-time responsiveness—even during the busiest moments of live sports.

Fantasy sports has become one of the most dynamic and data-intensive digital industries of the past decade. What started as a casual game for sports fans has evolved into a massive business, blending real-time analytics, mobile engagement, and personalized gaming experiences. At the center of this transformation is Apache Kafka—a critical enabler for platforms like Dream11, where millions of users expect live scores, instant feedback, and seamless gameplay. This post explores how fantasy sports works, why real-time data is non-negotiable, and how Dream11 has scaled its Kafka infrastructure to handle some of the world’s most demanding user traffic patterns.

Real Time Gaming with Apache Kafka Powers Dream11 Fantasy Sports

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including several success stories around gaming, loyalty platforms, and personalized advertising.

Fantasy Sports: Real-Time Gaming Meets Real-World Sports

Fantasy sports allows users to create virtual teams based on real-life athletes. As matches unfold, players earn points based on the performance of their selected athletes. The better the team performs, the higher the user’s score—and the bigger the prize.

Key characteristics of fantasy gaming:

  • Multi-sport experience: Users can play across cricket, football, basketball, and more.
  • Live interaction: Scoring is updated in real time as matches progress.
  • Contests and leagues: Players join public or private contests, often with cash prizes.
  • Peak traffic patterns: Most activity spikes in the minutes before a match begins.

This user behavior creates a unique business and technology challenge. Millions of users make critical decisions at the same time, just before the start of each game. The result: extreme concurrency, massive request volumes, and a hard dependency on data accuracy and low latency.

Real-time infrastructure isn’t optional in this model. It’s fundamental to user trust and business success.

Dream11: A Fantasy Sports Giant with Massive Scale

Founded in India, Dream11 is the largest fantasy sports platform in the country—and one of the biggest globally. With over 230 million users, it dominates fantasy gaming across cricket and 11 other sports. The platform sees traffic that rivals the world’s largest digital services.

Dream11 Mobile App
Source: Dream11

Bipul Karnanit from Dream11 presented a very interesting overview at Current 2025 in Bangalore, India. Here are a few statistics about Dream11’s scale:

  • 230M users
  • 12 sports
  • 12,000 matches/year
  • 44TB data per day
  • 15M+ peak concurrent users
  • 43M+ peak transactions/day

During major events like the IPL, Dream11 experiences hockey-stick traffic curves, where tens of millions of users log in just minutes before a match begins—making lineup changes, joining contests, and waiting for live updates.

This creates a business-critical need for:

  • Low latency
  • Guaranteed data consistency
  • Fault tolerance
  • Real-time analytics and scoring
  • High developer productivity to iterate fast

Apache Kafka at the Heart of Dream11’s Platform

To meet these demands, Dream11 uses Apache Kafka as the foundation of its real-time data infrastructure. Kafka powers the messaging between services that manage user actions, match scores, payouts, leaderboards, and more.

Apache Kafka enables:

  • Event-driven microservices
  • Scalable ingestion and processing of user and game data
  • Loose coupling between systems with data products for operational and analytical consumers
  • High throughput with guaranteed ordering and durability

Event-driven Architecture with Data Streaming for Gaming using Apache Kafka and Flink
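
To make the event-driven pattern above concrete, here is a minimal sketch of a service publishing user actions to Kafka with the standard Java producer client. The topic name, key, and JSON payload are hypothetical examples and not Dream11’s actual implementation; the acks and idempotence settings illustrate the durability and ordering guarantees mentioned above.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserActionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Durability and ordering: wait for all replicas and enable idempotent writes.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed by user ID so all events of one user stay ordered within a partition.
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "user-actions", "user-42",
                    "{\"action\":\"JOIN_CONTEST\",\"contestId\":\"c-1001\"}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace(); // real services would retry or alert here
                }
            });
        }
    }
}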

Solving Kafka Consumer Challenges at Scale

As the business grew, Dream11’s engineering team encountered challenges with Kafka’s standard consumer APIs, particularly around rebalancing, offset management, and processing guarantees under peak load.

To address these issues, Dream11 built a custom Java-based Kafka consumer library—a foundational component of its internal platform that simplifies Kafka integration across services and boosts developer productivity.

Dream11 Kafka Consumer Library:

  • Purpose: A custom-built Java library designed to handle high-volume Kafka message consumption at Dream11 scale.
  • Key Benefit: Abstracts away low-level Kafka consumer details, simplifying tasks like offset management, error handling, and multi-threading, allowing developers to focus on business logic.
  • Simple Interfaces: Provides easy-to-use interfaces for processing records.
  • Increased Developer Productivity: A standardized library leads to faster development and fewer errors.

This library plays a crucial role in enabling real-time updates and ensuring seamless gameplay—even under the most demanding user scenarios.

For deeper technical insights, including how Dream11 decoupled polling and processing, implemented at-least-once delivery, and improved throughput with custom worker pools, watch the Dream11 engineering session from Current India 2025 presented by Bipul Karnanit.
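
Dream11’s library itself is not public, but the pattern described in the talk (decoupling poll() from processing, handing records to a worker pool, and committing offsets only after successful processing for at-least-once delivery) can be sketched with the plain Kafka consumer API. The class name, topic, group ID, and pool size below are illustrative assumptions, not Dream11 code.

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DecoupledBatchConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "scoring-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Manual commits: offsets only advance after successful processing (at-least-once).
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        ExecutorService workers = Executors.newFixedThreadPool(8); // hypothetical pool size

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("match-events"));

            // Polling stays on this thread; record processing is handed to the worker pool.
            ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
            List<Future<?>> inFlight = new ArrayList<>();
            for (ConsumerRecord<String, String> record : batch) {
                inFlight.add(workers.submit(() ->
                        System.out.printf("processing %s -> %s%n", record.key(), record.value())));
            }
            for (Future<?> task : inFlight) {
                task.get(); // wait until the whole batch is processed
            }
            // Commit only after processing succeeded. A production library would also track
            // per-partition offsets, handle rebalances and retries, and route failures to a DLQ.
            consumer.commitSync();
        } finally {
            workers.shutdown();
        }
    }
}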

Fantasy Sports, Real-Time Expectations, and Business Value

Dream11’s business success is built on user trust, real-time responsiveness, and high-quality gameplay. With millions of users relying on accurate, timely updates, the platform can’t afford downtime, data loss, or delays.

Data Streaming with Apache Kafka enables Dream11 to:

  • React to user interactions instantly
  • Deliver consistent data across microservices and devices
  • Scale dynamically during live events
  • Streamline the development and deployment of new features

This is not just a backend innovation—it’s a competitive advantage in a space where milliseconds matter and trust is everything.

Dream11’s Kafka Journey: The Backbone of Fantasy Sports at Scale

Fantasy sports is one of the most demanding environments for real-time data platforms. Dream11’s approach—scaling Apache Kafka to serve hundreds of millions of events with precision—is a powerful example of aligning architecture with business needs.

As more industries adopt event-driven systems, Dream11’s journey offers a clear message: Apache Kafka is not just a messaging layer—it’s a strategic platform for building reliable, low-latency digital experiences at scale.

Whether you’re in gaming, finance, telecom, or logistics, there’s much to learn from the way fantasy sports leaders like Dream11 harness data streaming to deliver world-class services.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including several success stories around gaming, loyalty platforms, and personalized advertising.

Shift Left Architecture for AI and Analytics with Confluent and Databricks
https://www.kai-waehner.de/blog/2025/05/09/shift-left-architecture-for-ai-and-analytics-with-confluent-and-databricks/ (Fri, 09 May 2025)

Confluent and Databricks enable a modern data architecture that unifies real-time streaming and lakehouse analytics. By combining shift-left principles with the structured layers of the Medallion Architecture, teams can improve data quality, reduce pipeline complexity, and accelerate insights for both operational and analytical workloads. Technologies like Apache Kafka, Flink, and Delta Lake form the backbone of scalable, AI-ready pipelines across cloud and hybrid environments.

Modern enterprise architectures are evolving. Traditional batch data pipelines and centralized processing models are being replaced by more flexible, real-time systems. One of the driving concepts behind this change is the Shift Left approach. This blog compares Databricks’ Medallion Architecture with a Shift Left Architecture popularized by Confluent. It explains where each concept fits best—and how they can work together to create a more complete, flexible, and scalable architecture.

Shift Left Architecture with Confluent Data Streaming and Databricks Lakehouse Medallion

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Future articles will explore how these platforms affect the way businesses use data. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including more details about the shift left architecture with data streaming and lakehouses.

Medallion Architecture: Structured, Proven, but Not Always Optimal

The Medallion Architecture, popularized by Databricks, is a well-known design pattern for organizing and processing data within a lakehouse. It provides structure, modularity, and clarity across the data lifecycle by breaking pipelines into three logical layers:

  • Bronze: Ingest raw data in its original format (often semi-structured or unstructured)
  • Silver: Clean, normalize, and enrich the data for usability
  • Gold: Aggregate and transform the data for reporting, dashboards, and machine learning
Databricks Medallion Architecture for Lakehouse ETL
Source: Databricks

This layered approach is valuable for teams looking to establish governed and scalable data pipelines. It supports incremental refinement of data and enables multiple consumers to work from well-defined stages.
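
As a rough illustration of the layering, the following Spark (Java) sketch moves data from Bronze to Silver: it reads raw events from a Bronze Delta table, filters out invalid records, normalizes a column, deduplicates, and writes the result to a Silver table. The paths and column names are made up for this example.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class BronzeToSilver {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("bronze-to-silver").getOrCreate();

        // Bronze: raw, append-only events exactly as they were ingested.
        Dataset<Row> bronze = spark.read().format("delta").load("/lakehouse/bronze/orders");

        // Silver: cleaned, normalized, deduplicated records.
        Dataset<Row> silver = bronze
                .filter(col("order_id").isNotNull())
                .withColumn("amount", col("amount").cast("decimal(18,2)"))
                .dropDuplicates("order_id");

        silver.write().format("delta").mode("overwrite").save("/lakehouse/silver/orders");
    }
}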

Challenges of the Medallion Architecture

The Medallion Architecture also introduces challenges:

  • Pipeline delays: Moving data from Bronze to Gold can take minutes or longer—too slow for operational needs
  • Infrastructure overhead: Each stage typically requires its own compute and storage footprint
  • Redundant processing: Data transformations are often repeated across layers
  • Limited operational use: Data is primarily at rest in object storage; using it for real-time operational systems often requires inefficient reverse ETL pipelines.

For use cases that demand real-time responsiveness and/or critical SLAs—such as fraud detection, personalized recommendations, or IoT alerting—this traditional batch-first model may fall short. In such cases, an event-driven streaming-first architecture, powered by a data streaming platform like Confluent, enables faster, more cost-efficient pipelines by performing validation, enrichment, and even model inference before data reaches the lakehouse.

Importantly, this data streaming approach doesn’t replace the Medallion pattern—it complements it. It allows you to “shift left” critical logic, reducing duplication and latency while still feeding trusted, structured data into Delta Lake or other downstream systems for broader analytics and governance.

In other words, shifting data processing left (i.e., before it hits a data lake or Lakehouse) is especially valuable when the data needs to serve multiple downstream systems—operational and analytical alike—because it avoids duplication, reduces latency, and ensures consistent, high-quality data is available wherever it’s needed.

Shift Left Architecture: Process Earlier, Share Faster

In a Shift Left Architecture, data processing happens earlier—closer to the source, both physically and logically. This often means:

  • Transforming and validating data as it streams in
  • Enriching and filtering in real time
  • Sharing clean, usable data quickly across teams AND different technologies/applications

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

This is especially useful for:

  • Reducing time to insight
  • Improving data quality at the source
  • Creating reusable, consistent data products
  • Operational workloads with critical SLAs

How Confluent Enables Shift Left with Databricks

In a Shift Left setup, Apache Kafka provides scalable, low-latency, and truly decoupled ingestion of data across operational and analytical systems, forming the backbone for unified data pipelines.

Schema Registry and data governance policies enforce consistent, validated data across all streams, ensuring high-quality, secure, and compliant data delivery from the very beginning.
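
On the producer side, this governance typically starts with serializer configuration: every record is validated against the schema registered in Schema Registry before it reaches a topic. A minimal Java configuration could look like the following sketch; the broker address, registry URL, and the choice to disable schema auto-registration are assumptions for illustration.

import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class GovernedProducerConfig {
    public static KafkaProducer<String, GenericRecord> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // The Avro serializer validates every record against the registered schema.
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "https://schema-registry:8081");
        // Fail fast instead of silently registering new or unexpected schemas.
        props.put("auto.register.schemas", "false");
        props.put("use.latest.version", "true");
        return new KafkaProducer<>(props);
    }
}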

Apache Flink enables early data processing — closer to where data is produced. This reduces complexity downstream, improves data quality, and allows real-time decisions and analytics.

Shift Left Architecture with Confluent Databricks and Delta Lake

Data Quality Governance via Data Contracts and Schema Validation

Flink can enforce data contracts by validating incoming records against predefined schemas (e.g., using JSON Schema, Apache Avro or Protobuf with Schema Registry). This ensures structurally valid data continues through the pipeline. In cases where schema violations occur, records can be automatically routed to a Dead Letter Queue (DLQ) for inspection.

Confluent Schema Registry for good Data Quality, Policy Enforcement and Governance using Apache Kafka

Additionally, data contracts can enforce policy-based rules at the schema level—such as field-level encryption, masking of sensitive data (PII), type coercion, or enrichment defaults. These controls help maintain compliance and reduce risk before data reaches regulated or shared environments.
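
A common way to implement the Dead Letter Queue pattern described above is a Flink ProcessFunction with a side output: records that fail validation are tagged and routed to a separate sink instead of breaking the pipeline. The simplistic validation rule, example records, and print sinks below are placeholders; a real job would validate against Avro or JSON Schema and write the side output to a DLQ topic in Kafka.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class ContractValidationJob {
    // Side output channel for records violating the data contract.
    private static final OutputTag<String> DLQ = new OutputTag<>("dlq", Types.STRING);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> payments = env.fromElements(
                "{\"paymentId\":\"p1\",\"amount\":42.0}", "not-json");

        SingleOutputStreamOperator<String> valid = payments.process(
                new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<String> out) {
                        // Simplified contract check; real jobs validate against a schema.
                        if (value.startsWith("{") && value.contains("paymentId")) {
                            out.collect(value);
                        } else {
                            ctx.output(DLQ, value); // route to the dead letter queue
                        }
                    }
                });

        valid.print("valid");
        valid.getSideOutput(DLQ).print("dlq"); // in production: a Kafka sink to a DLQ topic
        env.execute("data-contract-validation");
    }
}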

Flink can perform the following tasks before data ever lands in a data lake or warehouse:

Filtering and Routing

Events can be filtered based on business rules and routed to the appropriate downstream system or Kafka topic. This allows different consumers to subscribe only to relevant data, optimizing both performance and cost.

Metric Calculation

Use Flink to compute rolling aggregates (e.g., counts, sums, averages, percentiles) over windows of data in motion. This is useful for business metrics, anomaly detection, or feeding real-time dashboards—without waiting for batch jobs.

Real-Time Joins and Enrichment

Flink supports both stream-stream and stream-table joins. This enables real-time enrichment of incoming events with contextual information from reference data (e.g., user profiles, product catalogs, pricing tables), often sourced from Kafka topics, databases, or external APIs.

Streaming ETL with Apache Flink SQL
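
The three tasks above map directly to Flink SQL. The following Java sketch assumes the source and sink tables (orders, customers, high_value_orders, revenue_per_minute, enriched_orders) were already registered, for example via CREATE TABLE ... WITH ('connector' = 'kafka', ...), and that customers is maintained as a versioned table for the temporal join. All table and column names are invented for illustration.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingEtlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // 1) Filtering and routing: only high-value orders go to a dedicated topic/table.
        tEnv.executeSql(
            "INSERT INTO high_value_orders " +
            "SELECT order_id, customer_id, amount FROM orders WHERE amount > 1000");

        // 2) Metric calculation: revenue per product over 1-minute tumbling windows.
        tEnv.executeSql(
            "INSERT INTO revenue_per_minute " +
            "SELECT product_id, window_start, SUM(amount) AS revenue " +
            "FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE)) " +
            "GROUP BY product_id, window_start, window_end");

        // 3) Real-time enrichment: join the order stream with customer reference data.
        tEnv.executeSql(
            "INSERT INTO enriched_orders " +
            "SELECT o.order_id, o.amount, c.segment, c.country " +
            "FROM orders AS o JOIN customers FOR SYSTEM_TIME AS OF o.order_time AS c " +
            "ON o.customer_id = c.customer_id");
    }
}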

By shifting this logic to the beginning of the pipeline, teams can reduce duplication, avoid unnecessary storage and compute costs in downstream systems, and ensure that data products are clean, policy-compliant, and ready for both operational and analytical use—as soon as they are created.

Example: A financial application might use Flink to calculate running balances, detect anomalies, and enrich records with reference data before pushing to Databricks for reporting and training analytic models.

In addition to enhancing data quality and reducing time-to-insight in the lakehouse, this approach also makes data products immediately usable for operational workloads and downstream applications—without building separate pipelines.

Learn more about stateless and stateful stream processing in real-time architectures using Apache Flink in this in-depth blog post.

Combining Shift Left with Medallion Architecture

These architectures are not mutually exclusive. Shift Left is about processing data earlier. Medallion is about organizing data once it arrives.

You can use Shift Left principles to:

  • Pre-process operational data before it enters the Bronze layer
  • Ensure clean, validated data enters Silver with minimal transformation needed
  • Reduce the need for redundant processing steps between layers

Confluent’s Tableflow bridges the two worlds. It converts Kafka streams into Delta tables, integrating cleanly with the Medallion model while supporting real-time flows.

Shift Left with Delta Lake, Iceberg, and Tableflow

Confluent Tableflow makes it easy to publish Kafka streams into Delta Lake or Apache Iceberg formats. These can be discovered and queried inside Databricks via Unity Catalog.

This integration:

  • Simplifies integration, governance and discovery
  • Enables live updates to AI features and dashboards
  • Removes the need to manage Spark streaming jobs

This is a natural bridge between a data streaming platform and the lakehouse.

Confluent Tableflow to Unify Operational and Analytical Workloads with Apache Iceberg and Delta Lake
Source: Confluent

AI Use Cases for Shift Left with Confluent and Databricks

The Shift Left model benefits both predictive and generative AI:

  • Model training: Real-time data pipelines can stream features to Delta Lake
  • Model inference: In some cases, predictions can happen in Confluent (via Flink) and be pushed back to operational systems instantly
  • Agentic AI: Real-time event-driven architectures are well suited for next-gen, stateful agents

Databricks supports model training and hosting via MosaicML. Confluent can integrate with these models, or run lightweight inference directly from the stream processing application.
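
As a sketch of running lightweight inference directly in the stream processing application, a Flink RichMapFunction can load a trained model once per parallel task and score every event in flight. The FraudModel interface and its trivial placeholder implementation below are hypothetical; in practice the model would be exported from the training platform (e.g., as ONNX) or called via a serving endpoint.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingInferenceJob {

    /** Hypothetical wrapper around a model trained offline and exported for streaming use. */
    interface FraudModel {
        double score(String paymentJson);
    }

    static class ScoringFunction extends RichMapFunction<String, String> {
        private transient FraudModel model;

        @Override
        public void open(Configuration parameters) {
            // Load the exported model once per parallel task (placeholder implementation).
            model = paymentJson -> paymentJson.contains("\"amount\":9999") ? 0.97 : 0.02;
        }

        @Override
        public String map(String paymentJson) {
            double risk = model.score(paymentJson);
            return paymentJson.replace("}", ",\"riskScore\":" + risk + "}");
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> payments = env.fromElements(
                "{\"paymentId\":\"p1\",\"amount\":9999}", "{\"paymentId\":\"p2\",\"amount\":12}");
        payments.map(new ScoringFunction()).print(); // in production: sink back to a Kafka topic
        env.execute("streaming-model-inference");
    }
}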

Data Warehouse Use Cases for Shift Left with Confluent and Databricks

  • Batch reporting: Continue using Databricks for traditional BI
  • Real-time analytics: Flink or real-time OLAP engines (e.g., Apache Pinot, Apache Druid) may be a better fit for sub-second insights
  • Hybrid: Push raw events into Databricks for historical analysis and use Flink for immediate feedback

Where you do the data processing depends on the use case.

Architecture Benefits Beyond Technology

Shift Left also brings architectural benefits:

  • Cost Reduction: Processing early can lower storage and compute usage
  • Faster Time to Market: Data becomes usable earlier in the pipeline
  • Reusability: Processed streams can be reused and consumed by multiple technologies/applications (not just Databricks teams)
  • Compliance and Governance: Validated data with lineage can be shared with confidence

These are important for strategic enterprise data architectures.

Bringing in New Types of Data

Shift Left with a data streaming platform supports a wider range of data sources:

  • Operational databases (like Oracle, DB2, SQL Server, Postgres, MongoDB)
  • ERP systems (SAP et al)
  • Mainframes and other legacy technologies
  • IoT interfaces (MQTT, OPC-UA, proprietary IIoT gateway, etc.)
  • SaaS platforms (Salesforce, ServiceNow, and so on)
  • Any other system that does not directly fit into the “table-driven analytics perspective” of a Lakehouse

With Confluent, these interfaces can be connected in real time, enriched at the edge or in transit, and delivered to analytics platforms like Databricks.

This expands the scope of what’s possible with AI and analytics.

Shift Left Using ONLY Databricks

A Shift Left Architecture using only Databricks is possible, too. A Databricks consultant took my Shift Left slide and adapted it accordingly:

Shift Left Architecture with Databricks and Delta Lake

 

Relying solely on Databricks for a “Shift Left Architecture” can work if all workloads (should) stay within the platform — but it’s a poor fit for many real-world scenarios.

Databricks focuses on ELT, not true ETL, and lacks native support for operational workloads like APIs, low-latency apps, or transactional systems. This forces teams to rely on reverse ETL tools – a clear anti-pattern in the enterprise architecture – just to get data where it’s actually needed. The result: added complexity, latency, and tight coupling.

The Shift Left Architecture is valuable, but in most cases it requires a modular approach, where streaming, operational, and analytical components work together — not a monolithic platform.

That said, shift left principles still apply within Databricks. Processing data as early as possible improves data quality, reduces overall compute cost, and minimizes downstream data engineering effort. For teams that operate fully inside the Databricks ecosystem, shifting left remains a powerful strategy to simplify pipelines and accelerate insight.

Meesho: Scaling a Real-Time Commerce Platform with Confluent and Databricks

Many high-growth digital platforms adopt a shift-left approach out of necessity—not as a buzzword, but to reduce latency, improve data quality, and scale efficiently by processing data closer to the source.

Meesho, one of India’s largest online marketplaces, relies on Confluent and Databricks to power its hyper-growth business model focused on real-time e-commerce. As the company scaled rapidly, supporting millions of small businesses and entrepreneurs, the need for a resilient, scalable, and low-latency data architecture became critical.

To handle massive volumes of operational events — from inventory updates to order management and customer interactions — Meesho turned to Confluent Cloud. By adopting a fully managed data streaming platform using Apache Kafka, Meesho ensures real-time event delivery, improved reliability, and faster application development. Kafka serves as the central nervous system for their event-driven architecture, connecting multiple services and enabling instant, context-driven customer experiences across mobile and web platforms.

Alongside their data streaming architecture, Meesho migrated from Amazon Redshift to Databricks to build a next-generation analytics platform. Databricks’ lakehouse architecture empowers Meesho to unify operational data from Kafka with batch data from other sources, enabling near real-time analytics at scale. This migration not only improved performance and scalability but also significantly reduced costs and operational overhead.

With Confluent managing real-time event processing and ingestion, and Databricks providing powerful, scalable analytics, Meesho is able to:

  • Deliver real-time personalized experiences to customers
  • Optimize operational workflows based on live data
  • Enable faster, data-driven decision-making across business teams

By combining real-time data streaming with advanced lakehouse analytics, Meesho has built a flexible, future-ready data infrastructure to support its mission of democratizing online commerce for millions across India.

Shift Left: Reducing Complexity, Increasing Value for the Lakehouse (and other Operational Systems)

Shift Left is not about replacing Databricks. It’s about preparing better data earlier in the pipeline—closer to the source—and reducing end-to-end complexity.

  • Use Confluent for real-time ingestion, enrichment, and transformation
  • Use Databricks for advanced analytics, reporting, and machine learning
  • Use Tableflow and Delta Lake to govern and route high-quality data to the right consumers

This architecture not only improves data quality for the lakehouse, but also enables the same real-time data products to be reused across multiple downstream systems—including operational, transactional, and AI-powered applications.

The result: increased agility, lower costs, and scalable innovation across the business.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including more details about the shift left architecture with data streaming and lakehouses.

The Past, Present, and Future of Confluent (The Kafka Company) and Databricks (The Spark Company)
https://www.kai-waehner.de/blog/2025/05/02/the-past-present-and-future-of-confluent-the-kafka-company-and-databricks-the-spark-company/ (Fri, 02 May 2025)

Confluent and Databricks have redefined modern data architectures, growing beyond their Kafka and Spark roots. Confluent drives real-time operational workloads; Databricks powers analytical and AI-driven applications. As operational and analytical boundaries blur, native integrations like Tableflow and Delta Lake unify streaming and batch processing across hybrid and multi-cloud environments. This blog explores the platforms’ evolution and how, together, they enable enterprises to build scalable, data-driven architectures. The Michelin success story shows how combining real-time data and AI unlocks innovation and resilience.

Confluent and Databricks are two of the most influential platforms in modern data architectures. Both have roots in open source. Both focus on enabling organizations to work with data at scale. And both have expanded their mission well beyond their original scope.

Confluent and Databricks are often described as serving different parts of the data architecture—real-time vs. batch, operational vs. analytical, data streaming vs. artificial intelligence (AI). But the lines are not always clear. Confluent can run batch workloads and embed AI. Databricks can handle (near) real-time pipelines. With Flink, Confluent supports both operational and analytical processing. Databricks can run operational workloads, too—if latency, availability, and delivery guarantees meet the project’s requirements. 

This blog explores where these platforms came from, where they are now, how they complement each other in modern enterprise architectures—and why their roles are future-proof in a data- and AI-driven world.

Data Streaming and Lakehouse - Comparison of Confluent with Apache Kafka and Flink and Databricks with Spark

About the Confluent and Databricks Blog Series

This article is part of a blog series exploring the growing roles of Confluent and Databricks in modern data and AI architectures:

Stay tuned for deep dives into how these platforms are shaping the future of data-driven enterprises. Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

Operational vs. Analytical Workloads

Confluent and Databricks were designed for different workloads, but the boundaries are not always strict.

Confluent was built for operational workloads—moving and processing data in real time as it flows through systems. This includes use cases like real-time payments, fraud detection, system monitoring, and streaming pipelines.

Databricks focuses on analytical workloads—enabling large-scale data processing, machine learning, and business intelligence.

That said, there is no clear black and white separation. Confluent, especially with the addition of Apache Flink, can support analytical processing on streaming data. Databricks can handle operational workloads too, provided the SLAs—such as latency, uptime, and delivery guarantees—are sufficient for the use case.

With Tableflow and Delta Lake, both platforms can now be natively connected, allowing real-time operational data to flow into analytical environments, and AI insights to flow back into real-time systems—effectively bridging operational and analytical workloads in a unified architecture.

From Apache Kafka and Spark to (Hybrid) Cloud Platforms

Both Confluent and Databricks have strong open source roots—Kafka and Spark, respectively—but have taken different branding paths.

Confluent: From Apache Kafka to a Data Streaming Platform (DSP)

Confluent is well known as “The Kafka Company.” It was founded by the original creators of Apache Kafka over ten years ago. Kafka is now widely adopted for event streaming in over 150,000 organizations worldwide. Confluent operates tens of thousands of clusters with Confluent Cloud across all major cloud providers, as well as in customers’ data centers and edge locations.

But Confluent has become much more than just Kafka. It offers a complete data streaming platform (DSP).

Confluent Data Streaming Platform (DSP) Powered by Apache Kafka and Flink
Source: Confluent

This includes:

  • Apache Kafka as the core messaging and persistence layer
  • Data integration via Kafka Connect for databases and business applications, a REST/HTTP proxy for request-response APIs and clients for all relevant programming languages
  • Stream processing via Apache Flink and Kafka Streams (read more about the past, present and future of stream processing)
  • Tableflow for native integration with lakehouses that support the open table format standard via Delta Lake and Apache Iceberg
  • 24/7 SLAs, security, data governance, disaster recovery – for the most critical workloads companies run
  • Deployment options: Everywhere (not just cloud) – SaaS, on-prem, edge, hybrid, stretched across data centers, multi-cloud, BYOC (bring your own cloud)

Databricks: From Apache Spark to a Data Intelligence Platform

Databricks has followed a similar evolution. Known initially as “The Spark Company,” it is the original force behind Apache Spark. But Databricks no longer emphasizes Spark in its branding. Spark is still there under the hood, but it’s no longer the dominant story.

Today, it positions itself as the Data Intelligence Platform, focused on AI and analytics.

Databricks Data Intelligence Platform and Lakehouse
Source: Databricks

Key components include:

  • Fully cloud-native deployment model—Databricks is now a cloud-only platform providing BYOC and Serverless products
  • Delta Lake and Unity Catalog for table format standardization and governance
  • Model development and AI/ML tools
  • Data warehouse workloads
  • Tools for data scientists and data engineers

Together, Confluent and Databricks meet a wide range of enterprise needs and often complement each other in shared customer environments from the edge to multi-cloud data replication and analytics.

Real-Time vs. Batch Processing

A major point of comparison between Confluent and Databricks lies in how they handle data processing—real-time versus batch—and how they increasingly converge through shared formats and integrations.

Data Processing and Data Sharing “In Motion” vs. “At Rest”

A key difference between the platforms lies in how they process and share data.

Confluent focuses on data in motion—real-time streams that can be filtered, transformed, and shared across systems as they happen.

Databricks focuses on data at rest—data that has landed in a lakehouse, where it can be queried, aggregated, and used for analysis and modeling.

Data Streaming versus Lakehouse

Both platforms offer native capabilities for data sharing. Confluent provides Stream Sharing, which enables secure, real-time sharing of Kafka topics across organizations and environments. Databricks offers Delta Sharing, an open protocol for sharing data from Delta Lake tables with internal and external consumers.

In many enterprise architectures, the two vendors work together. Kafka and Flink handle continuous real-time processing for operational workloads and data ingestion into the lakehouse. Databricks handles AI workloads (model training and some of the model inference), business intelligence (BI), and reporting. Both do data integration: ETL in the case of Confluent, ELT in the case of Databricks.

Many organizations still use Databricks’ Apache Spark Structured Streaming to connect Kafka and Databricks. That’s a valid pattern, especially for teams with Spark expertise.
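
For reference, that pattern looks roughly like the following in Java-based Spark Structured Streaming, reading from Kafka and appending to a Delta table. The broker address, topic, and paths are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToDeltaJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("kafka-to-delta").getOrCreate();

        // Continuously read events from a Kafka topic.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "orders")
                .option("startingOffsets", "latest")
                .load()
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp");

        // Append them to a Delta table (e.g., the Bronze layer of a Medallion lakehouse).
        StreamingQuery query = events.writeStream()
                .format("delta")
                .option("checkpointLocation", "/lakehouse/_checkpoints/orders")
                .outputMode("append")
                .start("/lakehouse/bronze/orders");

        query.awaitTermination();
    }
}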

Flink is available as a serverless offering in Confluent Cloud that can scale down to zero when idle, yet remains highly scalable—even for complex stateful workloads. It supports multiple languages, including Python, Java, and SQL. 

For self-managed environments, Kafka Streams offers a lightweight alternative to running Flink in a self-managed Confluent Platform. But be aware that Kafka Streams is limited to Java and operates as a client library embedded directly within the application. Read my dedicated article to learn about the trade-offs between Apache Flink and Kafka Streams.
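
A minimal Kafka Streams topology shows how lightweight that embedded approach is: a few lines of Java running inside the application process, with no separate cluster. Topic names and the filter condition are placeholders.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PaymentsFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments");
        // Simple stateless transformation running inside the application process.
        payments.filter((key, value) -> value.contains("\"currency\":\"EUR\""))
                .to("payments-eur");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}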

Stream and Batch Data Processing with Kafka Streams, Apache Flink and Spark

In short: use what works. If Spark Structured Streaming is already in place and meets your needs, keep it. For new use cases, however, Apache Flink or Kafka Streams might be the better choice for stream processing workloads. Either way, make sure to understand the concepts and value of stateless and stateful stream processing before building batch pipelines.

Confluent Tableflow: Unify Operational and Analytic Workloads with Open Table Formats (such as Apache Iceberg and Delta Lake)

Databricks is actively investing in Delta Lake and Unity Catalog to structure, govern, and secure data for analytical applications. The acquisition of Tabular—founded by the original creators of Apache Iceberg—demonstrates Databricks’ commitment to supporting open standards.

Confluent’s Tableflow materializes Kafka streams into Apache Iceberg or Delta Lake tables—automatically, reliably, and efficiently. This native integration between Confluent and Databricks is faster, simpler, and more cost-effective than using a Spark connector or other ETL tools.

Tableflow reads the Kafka segments, checks the schema against Schema Registry, and creates Parquet files and table metadata.

Confluent Tableflow Architecture to Integrate Apache Kafka with Iceberg and Delta Lake for Databricks
Source: Confluent

Native stream processing with Apache Flink also plays a growing role. It enables unified real-time and batch stream processing in a single engine. Flink’s ability to “shift left” data processing (closer to the source) supports early validation, enrichment, and transformation. This simplifies the architecture and reduces the need for always-on Spark clusters, which can drive up cost.

These developments highlight how Databricks and Confluent address different but complementary layers of the data ecosystem.

Confluent + Databricks = A Strategic Partnership for Future-Proof AI Architectures

Confluent and Databricks are not competing platforms—they’re complementary. While they serve different core purposes, there are areas where their capabilities overlap. In those cases, it’s less about which is better and more about which fits best for your architecture, team expertise, SLA or latency requirements. The real value comes from understanding how they work together and where you can confidently choose the platform that serves your use case most effectively.

Confluent and Databricks recently deepened their partnership with Tableflow integration with Delta Lake and Unity Catalog. This integration makes real-time Kafka data available inside Databricks as Delta tables. It reduces the need for custom pipelines and enables fast access to trusted operational data.

The architecture supports AI end to end—from ingesting real-time operational data to training and deploying models—all with built-in governance and flexibility. Importantly, data can originate from anywhere: mainframes, on-premise databases, ERP systems, IoT and edge environments or SaaS cloud applications.

With this setup, you can:

  • Feed data from 100+ Confluent sources (Mainframe, Oracle, SAP, Salesforce, IoT, HTTP/REST applications, and so on) into Delta Lake
  • Use Databricks for AI model development and business intelligence
  • Push models back into Kafka and Flink for real-time model inference with critical, operational SLAs and latency

Both directions will be supported. Governance and security metadata flows alongside the data.

Confluent and Databricks Partnership and Bidirectional Integration for AI and Analytics
Source: Confluent

Michelin: Real-Time Data Streaming and AI Innovation with Confluent and Databricks

A great example of how Confluent and Databricks complement each other in practice is Michelin’s digital transformation journey. As one of the world’s largest tire manufacturers, Michelin set out to become a data-first and digital enterprise. To achieve this, the company needed a foundation for real-time operational data movement and a scalable analytical platform to unlock business insights and drive AI initiatives.

Confluent @ Michelin: Real-Time Data Streaming Pipelines

Confluent Cloud plays a critical role at Michelin by powering real-time data pipelines across their global operations. Migrating from self-managed Kafka to Confluent Cloud on Microsoft Azure enabled Michelin to reduce operational complexity by 35%, meet strict 99.99% SLAs, and speed up time to market by up to nine months. Real-time inventory management, order orchestration, and event-driven supply chain processes are now possible thanks to a fully managed data streaming platform.

Databricks @ Michelin: Centralized Lakehouse

Meanwhile, Databricks empowers Michelin to democratize data access across the organization. By building a centralized lakehouse architecture, Michelin enabled business users and IT teams to independently access, analyze, and develop their own analytical use cases—from predicting stock outages to reducing carbon emissions in logistics. With Databricks’ lakehouse capabilities, they scaled to support hundreds of use cases without central bottlenecks, fostering a vibrant community of innovators across the enterprise.

The synergy between Confluent and Databricks at Michelin is clear:

  • Confluent moves operational data in real time, ensuring fresh, trusted information flows across systems (including Databricks).
  • Databricks transforms data into actionable insights, using powerful AI, machine learning, and analytics capabilities.

Confluent + Databricks @ Michelin = Cloud-Native Data-Driven Enterprise

Together, Confluent and Databricks allow Michelin to shift from batch-driven, siloed legacy systems to a cloud-native, real-time, data-driven enterprise—paving the road toward higher agility, efficiency, and customer satisfaction.

As Yves Caseau, Group Chief Digital & Information Officer at Michelin, summarized: “Confluent plays an integral role in accelerating our journey to becoming a data-first and digital business.”

And as Joris Nurit, Head of Data Transformation, added: “Databricks enables our business users to better serve themselves and empowers IT teams to be autonomous.”

The Michelin success story perfectly illustrates how Confluent and Databricks, when used together, bridge operational and analytical workloads to unlock the full value of real-time, AI-powered enterprise architectures.

Confluent and Databricks: Better Together!

Confluent and Databricks are both leaders in different – but connected – layers of the modern data stack.

If you want real-time, event-driven data pipelines, Confluent is the right platform. If you want powerful analytics, AI, and ML, Databricks is a great fit.

Together, they allow enterprises to bridge operational and analytical workloads—and to power AI systems with live, trusted data.

In the next post, I will explore how Confluent’s Data Streaming Platform compares to the Databricks Data Intelligence Platform for data integration and processing.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And download my free book about data streaming use cases, including technical architectures and the relation to analytical platforms like Databricks.

The Role of Data Streaming in McAfee’s Cybersecurity Evolution
https://www.kai-waehner.de/blog/2025/01/27/the-role-of-data-streaming-in-mcafees-cybersecurity-evolution/ (Mon, 27 Jan 2025)

In today’s digital landscape, cybersecurity faces mounting challenges from sophisticated threats like ransomware, phishing, and supply chain attacks. Traditional defenses like antivirus software are no longer sufficient, prompting the adoption of real-time, event-driven architectures powered by data streaming technologies like Apache Kafka and Flink. These platforms enable real-time threat detection, prevention, and response by processing massive amounts of security data from endpoints and systems. A success story from McAfee highlights how transitioning to an event-driven architecture with Kafka in Confluent Cloud has enhanced scalability, operational efficiency, and real-time protection for millions of devices. As cybersecurity threats evolve, data streaming proves essential for organizations aiming to secure their digital assets and maintain trust in an interconnected world.

In today’s digital age, cybersecurity is more vital than ever. Businesses and individuals face escalating threats such as malware, ransomware, phishing attacks, and identity theft. Combatting these challenges requires cutting-edge solutions that protect computers, networks, and devices. Beyond safeguarding digital assets, modern cybersecurity tools ensure compliance, privacy, and trust in an increasingly interconnected world.

As threats grow more sophisticated, the technologies powering cybersecurity solutions must advance to stay ahead. Data streaming technologies like Apache Kafka and Apache Flink have become foundational in this evolution, enabling real-time threat detection, prevention, and rapid response. These tools transform cybersecurity from static defenses to dynamic systems capable of identifying and neutralizing threats as they occur.

A notable example is McAfee, a global leader in cybersecurity, which has embraced data streaming to revolutionize its operations. By transitioning to an event-driven architecture powered by Apache Kafka, McAfee processes massive amounts of real-time data from millions of endpoints, ensuring instant threat identification and mitigation. This integration has enhanced scalability, reduced infrastructure complexity, and accelerated innovation, setting a benchmark for the cybersecurity industry.

Real-time data streaming is not just an advantage—it’s now a necessity for organizations aiming to safeguard digital environments against ever-evolving threats.

Data Streaming with Apache Kafka and Flink as Backbone for Real Time Cybersecurity at McAfee

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch.

Antivirus is NOT Enough: Supply Chain Attack

A supply chain attack occurs when attackers exploit vulnerabilities in an organization’s supply chain, targeting weaker links such as vendors or service providers to indirectly infiltrate the target.

For example, an attacker compromises Vendor 1, a software provider, by injecting malicious code into their product. Vendor 2, a service provider using Vendor 1’s software, becomes infected. The attacker then leverages Vendor 2’s connection to the Enterprise to access sensitive systems, even though Vendor 1 has no direct interaction with the enterprise.

The Anatomy of a Supply Chain Attack in Cybersecurity

Traditional antivirus software is insufficient to prevent such complex, multi-layered attacks. Ransomware often plays a role in supply chain attacks, as attackers use it to encrypt data or disrupt operations across compromised systems.

Modern solutions focus on real-time monitoring and event-driven architecture to detect and mitigate risks across the supply chain. These solutions utilize behavioral analytics, zero trust policies, and proactive threat intelligence to identify and stop anomalies before they escalate.

By providing end-to-end visibility, they protect organizations from cascading vulnerabilities that traditional endpoint security cannot address. In today’s interconnected world, comprehensive supply chain security is critical to safeguarding enterprises.

The Role of Data Streaming in Cybersecurity

Cybersecurity platforms must rely on real-time data for detecting and mitigating threats. Data streaming provides a backbone for processing massive amounts of security event data as it happens, ensuring swift and effective responses. My blog series on Kafka and cybersecurity looks deeply into these use cases.

Cybersecurity for Situational Awareness and Threat Intelligence in Smart Buildings and Smart City

To summarize:

  • Data Collection: A data streaming platform powered by Apache Kafka collects logs, telemetry, and other data from devices and applications in real time.
  • Data Processing: Stream processing frameworks like Kafka Streams and Apache Flink continuously process this data with low latency at scale for analytics, identifying anomalies or potential threats.
  • Actionable Insights: The processed data feeds into Security Information and Event Management (SIEM) and Security Orchestration, Automation, and Response (SOAR) systems, enabling automated responses and better decision-making.

This approach transforms static, batch-driven cybersecurity operations into dynamic, real-time processes.
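
As a simplified sketch of the detection step (not McAfee’s actual implementation), a stream processing job can count failed logins per device over a short window and emit an alert event for the SIEM when a threshold is exceeded. The topics, window size, threshold, and event format are assumptions for illustration.

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class FailedLoginDetector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "failed-login-detector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("security-events", Consumed.with(Serdes.String(), Serdes.String()))
                // Keep only failed login telemetry; the key is assumed to be the device ID.
                .filter((deviceId, event) -> event.contains("\"type\":\"LOGIN_FAILED\""))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
                .count()
                .toStream()
                // More than 5 failures per device per minute -> raise an alert for the SIEM.
                .filter((windowedDeviceId, failures) -> failures > 5)
                .map((windowedDeviceId, failures) -> KeyValue.pair(windowedDeviceId.key(),
                        "{\"alert\":\"BRUTE_FORCE_SUSPECTED\",\"failures\":" + failures + "}"))
                .to("siem-alerts", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}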

McAfee: A Real-World Data Streaming Success Story

McAfee is a global leader in cybersecurity, providing software solutions that protect computers, networks, and devices. Founded in 1987, the company has evolved from traditional antivirus software to a comprehensive suite of products focused on threat prevention, identity protection, and data security.

McAfee Antivirus and Cybersecurity Solutions
Source: McAfee

McAfee’s products cater to both individual consumers and enterprises, offering real-time protection through partnerships with global integrated service providers (ISPs) and telecom operators.

Mahesh Tyagarajan (VP, Platform Engineering and Architecture at McAfee) spoke with Confluent and Forrester about McAfee’s transition from a monolith to event-driven microservices leveraging Apache Kafka in Confluent Cloud.

Data Streaming at McAfee with Apache Kafka Leveraging Confluent Cloud

As cyber threats have grown more complex, McAfee’s reliance on real-time data streaming has become essential. The company transitioned from a monolithic architecture to a microservices-based ecosystem with the help of Confluent Cloud, powered by Apache Kafka. The fully managed data streaming platform simplified infrastructure management, boosted scalability, and accelerated feature delivery for McAfee.

Use Cases for Data Streaming

  1. Real-Time Threat Detection: McAfee processes security events from millions of endpoints, ensuring immediate identification of malware or phishing attempts.
  2. Subscription Management: Data streaming supports real-time customer notifications, updates, and billing processes.
  3. Analytics and Reporting: McAfee integrates real-time data streams into analytics systems, providing insights into user behavior, threat patterns, and operational efficiency.

Transition to an Event-Driven Architecture and Microservices

By moving to an event-driven architecture with Kafka using Confluent Cloud, McAfee:

  • Standardized its data streaming infrastructure.
  • Decoupled systems using microservices, enabling scalability and resilience.
  • Improved developer productivity by reducing infrastructure management overhead.

This transition to data streaming with a fully managed, complete and secure cloud service empowered McAfee to handle high data ingestion volumes, manage hundreds of millions of devices, and deliver new features faster.

Business Value of Data Streaming

The adoption of data streaming delivered significant business benefits:

  • Improved Customer Experience: Real-time threat detection and personalized updates enhance trust and satisfaction.
  • Operational Efficiency: Automation and reduced infrastructure complexity save time and resources.
  • Scalability: McAfee can now support a growing number of devices and data sources without compromising performance.

Data Streaming as the Backbone of an Event-Driven Cybersecurity Evolution in the Cloud

McAfee’s journey showcases the transformative potential of data streaming in cybersecurity. By leveraging Apache Kafka as fully managed cloud service as the backbone of an event-driven microservices architecture, the company has enhanced its ability to detect threats, respond in real time, and deliver exceptional customer experiences.

For organizations looking to stay ahead in the cybersecurity race, investing in real-time data streaming technologies is not just an option—it’s a necessity. To learn more about how data streaming can revolutionize cybersecurity, explore my cybersecurity blog series and follow me for updates on LinkedIn or X (formerly Twitter).

What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks
https://www.kai-waehner.de/blog/2024/10/04/what-is-microsoft-fabric-for-azure-cloud-beyond-the-buzz-and-how-it-competes-with-snowflake-and-databricks/ (Fri, 04 Oct 2024)

If you ask your favorite large language model, Microsoft Fabric appears to be the ultimate solution for any data challenge you can imagine. That’s also the impression many people get from Microsoft’s sales teams. But is it really the silver bullet it’s made out to be? This article takes a closer look, exploring the glossy marketing and sales definition of the platform and then deconstructing it from a more practical perspective. Learn what Microsoft Fabric is truly built for, and how it fits into the wider data landscape, especially in comparison to other major players in the data analytics market like Databricks and Snowflake.

If you ask your favorite large language model, Microsoft Fabric appears to be the ultimate solution for any data challenge you can imagine. That’s also the impression many people get from Microsoft’s sales teams. But is it really the silver bullet it’s made out to be? This article takes a closer look. The first part explores the glossy marketing and sales definition of the platform. The second part looks at the layers and deconstructs it from a more practical perspective. By doing so, the third part uncovers what Microsoft Fabric is truly built for, and how it fits into the wider data landscape, especially in comparison to other major players in the data analytics market like Databricks and Snowflake.

Microsoft Fabric and OneLake Azure Lakehouse vs Databricks and Snowflake Cloud

This is part one of a blog series about Microsoft Fabric and its relation to other data platforms on the Azure cloud:

  1. What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Compares (or Competes) with Snowflake and Databricks
  2. How Microsoft Fabric Complements Data Streaming (Apache Kafka, Flink, et al.)
  3. When to Choose Apache Kafka vs. Azure Event Hubs vs. Confluent Cloud for a Microsoft Fabric Lakehouse

Subscribe to my newsletter to get an email about a new blog post every few weeks.

What is Microsoft Fabric?

If you listen to Microsoft’s sales and marketing, then Microsoft Fabric is a silver bullet for every use case. Let’s take a two-step approach. Look at the sales and marketing definition. Then deconstruct it a bit from a more realistic point of view…

GenAI Definition (= Sales and Marketing)

If you ask your favorite large language model or search engine “What is Microsoft Fabric”, it tells you something like the following (based on sales and marketing content):

Microsoft Fabric is an end-to-end analytics platform designed to integrate various data services and enable businesses to manage, analyze, and act on their data seamlessly. It was launched as part of Microsoft’s data ecosystem and builds upon key features from platforms like Power BI, Azure Synapse Analytics, and Azure Data Factory.

Microsoft Fabric Marketing Website
Source: Microsoft

Here are some key aspects of Microsoft Fabric:

  • Unified Platform: It combines data engineering, data science, data warehousing, and real-time analytics into a single platform. This helps businesses eliminate the need to use multiple services for data management and analysis.
  • Lakehouse Architecture: Fabric is designed around the lakehouse concept, which merges the best of data lakes and data warehouses. It allows for both structured and unstructured data to be stored and processed together.
  • Tightly Integrated with Microsoft 365 and Azure: Microsoft Fabric connects seamlessly with other Microsoft services like Microsoft 365, Power BI, and Azure Machine Learning, enabling better collaboration, reporting, and AI-driven insights.
  • Low-code/No-code Experience: The platform provides intuitive tools for data analysts, developers, and business users, allowing non-technical users to work with data through drag-and-drop interfaces, while also enabling more complex scenarios for advanced users.
  • AI and Machine Learning Integration: Microsoft Fabric incorporates AI tools, making it easier for businesses to build predictive models and automate data-driven decisions.
  • End-to-End Security and Governance: The platform supports robust security measures and compliance requirements, offering features like data encryption, role-based access control, and regulatory compliance support.
  • Real-time Data Processing: With support for real-time analytics, Fabric enables organizations to derive insights from live data streams, improving decision-making speed and accuracy.

Microsoft Fabric is designed to streamline how businesses use data, combining the power of analytics with cloud-scale capabilities.

Wow. Just wow. Microsoft Fabric seems to be everything you ever need for your data challenges.

Microsoft Developer has an excellent 45-minute presentation about OneLake and Microsoft Fabric with a few more technical details. This video is also the source of the screenshots below.

Well, let’s dig deeper. What is Microsoft Fabric really? Let’s deconstruct it a bit…

Microsoft Fabric is a Data Analytics Platform ( = NOT for Operational / Transactional Workloads)

Microsoft Fabric is part of Microsoft’s data analytics portfolio. That’s already the first alarm signal when you consider building operational workloads. This is not a criticism, but important to understand!

Azure Data Analytics SaaS Platform

Microsoft Fabric is NOT a platform for transactional workloads like payments, fraud detection, order management or ERP integration. You should not build an operational application, such as an Azure serverless function or a self-managed Spring Boot container, on top of Fabric.

Furthermore, within the data analytics layer, the foundation of Microsoft Fabric is (only) an optimized storage layer. This storage layer, called OneLake, is a SaaS offering, i.e., the storage is part of the Microsoft tenant. Unlike many other data lakes and lakehouses, such as Databricks, you do not control or own the storage.

While the conversation is usually about cloud analytics, Microsoft Fabric is a unified analytics platform that integrates with Azure Cloud but is sold as its own offering. This allows organizations to use it across various environments, including edge and hybrid setups. For instance, Microsoft sells Fabric for hybrid IoT projects where data needs to be processed both locally and in the cloud.

OneLake – Cloud-based Storage Layer on Top of Azure Data Lake Storage (ADLS)

Microsoft OneLake is a unified, cloud-based data lake that acts as the central storage layer within Microsoft Fabric:

Microsoft OneLake is built on top of Azure Data Lake Storage (ADLS), using its scalable and secure data storage capabilities for long-term data retention. OneLake inherits ADLS’s features like hierarchical namespaces and advanced security, while adding a unified data lake experience across multiple clouds and deep integration with Microsoft’s analytics and data tools through Microsoft Fabric.

OneLake for Azure Lakehouse, Databricks with Delta Lake, Snowflake with Apache Iceberg Lakehouse
Source: Microsoft

The message is obvious: Store all data in OneLake and connect your favourite compute engines, such as Microsoft Fabric, Azure Databricks and Snowflake. Open Table Formats like Delta Lake and Apache Iceberg allow simple integration without the need to copy data again.
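
For illustration, here is a minimal PySpark sketch of what that pattern looks like in practice: an external Spark engine reads a Delta table directly from OneLake instead of copying the data. The workspace, lakehouse, and table names are placeholders, and the exact URI pattern and authentication depend on your tenant setup.

```python
# Minimal sketch: reading a Delta table stored in OneLake from an external
# Spark session (e.g., Azure Databricks). Workspace, lakehouse, and table
# names are illustrative placeholders; authentication config is omitted and
# Delta Lake support is assumed to be configured on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("onelake-read").getOrCreate()

# OneLake exposes an ADLS-compatible endpoint; the exact path depends on the
# workspace and lakehouse in your Fabric tenant.
path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "MyLakehouse.Lakehouse/Tables/customers"
)

df = spark.read.format("delta").load(path)  # open table format, no data copy
df.groupBy("country").count().show()        # 'country' is a hypothetical column
```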

Microsoft Fabric Connects to Many Existing Azure Services

On top of the storage layer OneLake, Microsoft Fabric connects to plenty of different existing Microsoft Azure services, including Power BI, Data Explorer, various Synapse services, and so on. This explains why Microsoft Fabric can magically provide every capability you are looking for, just a few months after the initial announcement.

Microsoft Fabric Architecture - OneLake as Storage Layer and Azure SaaS Analytics Cloud Services
Source: Microsoft

Here are a few integrations of Azure services into the unified storage of Microsoft Fabric:

  • Power BI: A critical component of Microsoft Fabric, enabling data visualization and business intelligence. It allows users to create interactive dashboards and reports directly from data stored in the lakehouse, providing real-time insights with minimal data movement.
  • Azure Data Explorer: Used for analyzing large volumes of streaming and historical data. Microsoft Fabric connects to Data Explorer, allowing users to perform fast, complex, real-time queries on structured and semi-structured data.
  • Azure Synapse Analytics: Fabric integrates Synapse’s data engineering capabilities, allowing users to prepare, transform, and orchestrate data pipelines. It provides a unified workspace to manage end-to-end data engineering workflows, reducing the need for complex data movement.
  • Synapse Data Warehousing: Fabric connects with Synapse’s data warehousing services, making it easy to run massively parallel processing (MPP) queries for large-scale analytics on structured data.
  • Synapse Spark Pools: Fabric integrates with Apache Spark in Synapse, supporting big data processing, AI, and machine learning workloads. Users can leverage Spark’s distributed computing power within Fabric for data transformation, advanced analytics, and machine learning.
  • Azure Machine Learning (AML): Enables data scientists to build, train, and deploy machine learning models on data stored within the Fabric lakehouse. Users can perform machine learning experiments, automate ML model training, and deploy models with a unified data platform.
  • Azure Data Factory: Used for data ingestion, ETL (extract, transform, load), and data orchestration. Fabric connects with Azure Data Factory, making it easy to create data pipelines that move and transform data from a wide variety of sources, including on-premises databases, cloud storage, and third-party systems.
  • Azure Purview: Provides a unified data catalog, allowing users to discover, classify, and govern data assets across the Fabric ecosystem. It also provides compliance and auditing capabilities.
  • Azure Event Hubs and Stream Analytics: Real-time data ingestion and analytics. Event Hubs enables streaming data ingestion from sources like IoT devices, applications, and logs, while Stream Analytics allows for real-time data querying and analysis (see the producer sketch after this list).
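
To make the streaming ingestion path more concrete: because Azure Event Hubs exposes a Kafka-compatible endpoint, any standard Kafka client can feed real-time data toward the Fabric ecosystem. The following sketch assumes a hypothetical Event Hubs namespace, hub (topic) name, and connection string.

```python
# Minimal sketch: producing an event to Azure Event Hubs via its Kafka
# endpoint using a standard Kafka client. Namespace, topic, and connection
# string are placeholders for illustration.
import json
from confluent_kafka import Producer

conf = {
    "bootstrap.servers": "my-namespace.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "$ConnectionString",
    "sasl.password": "<EVENT_HUBS_CONNECTION_STRING>",  # placeholder secret
}

producer = Producer(conf)
event = {"device_id": "sensor-42", "temperature": 21.7}
producer.produce("telemetry", value=json.dumps(event).encode("utf-8"))
producer.flush()  # block until the event is acknowledged
```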

Expect more Azure services to be integrated with Microsoft Fabric in the coming months to provide a “complete lakehouse experience”. Also expect more fancy marketing brands, such as the new “Real Time Intelligence Hub” that is built by connecting / re-using existing Microsoft Azure services.

So, what is the main idea behind building this lakehouse product and brand within Microsoft’s huge cloud portfolio?

Microsoft Fabric is a Lakehouse Competing with Snowflake and Databricks

A lakehouse is a data architecture that combines the features of data lakes and data warehouses, allowing for both structured and unstructured data to be stored and processed together. It provides the scalability and flexibility of a data lake with the data management, governance, and performance features of a data warehouse. This unified approach enables real-time analytics and machine learning on diverse types of data, reducing the need for separate infrastructures.

Most analytical data vendors are transitioning to a full-blown lakehouse. While Databricks moved from its data lake foundation powered by Apache Spark into the lakehouse, Snowflake comes from the data warehouse approach but has incorporated a lot of lakehouse features over time (even though Snowflake calls it a more general “data cloud”).

Microsoft Fabric competes with platforms like Databricks and Snowflake in the realm of data analytics, data engineering, and data warehousing by providing an integrated, cloud-native solution for data management and analytics.

Microsoft Fabric positions itself as a more holistic and integrated platform, offering a unified solution for businesses that need to handle everything from data ingestion to real-time analytics and AI. Its Microsoft ecosystem integration is a key competitive advantage.

There are also trade-offs. For instance:

  • Microsoft Fabric is only available on the Azure cloud
  • It is not yet a mature product
  • Microsoft takes a much more competitive stance toward strategic partners like Databricks

The support of open table formats like Delta Lake and Apache Iceberg is great. But this is coming to all lakehouses because of market pressure, not because the data cloud vendors like Databricks, Snowflake and now Microsoft with Fabric have a new business model. All of these vendors still want to collect all the data, store it forever, and put (their own!) compute services on top.

Microsoft Fabric is Azure’s Future Lakehouse

Microsoft Fabric’s integration with many Azure services allows it to offer a broad range of capabilities – from data ingestion, storage, and transformation to real-time analytics, machine learning, and governance. This interconnected ecosystem explains how Fabric can quickly meet diverse enterprise needs by leveraging Microsoft’s existing suite of powerful tools, providing a comprehensive data platform with minimal friction and seamless workflows.

In the end, Microsoft Fabric is a new lakehouse built on top of the optimized cloud storage OneLake. It directly competes with other lakehouses and data clouds such as Databricks and Snowflake to become the leading unified solution for all things analytics. The future will show where this competition goes. Snowflake and Databricks already have a very strong product and customer base. They will not cede ground to Microsoft Fabric voluntarily.

Microsoft Fabric includes integrations with Azure Event Hubs (based on the Kafka protocol) and is building a brand around real-time intelligence. In the next article of this blog series, I will explore how this new lakehouse on Azure cloud competes or overlaps with data streaming technologies such as Apache Kafka, Flink, et al. Spoiler: Data Streaming and Microsoft Fabric are mostly complementary and have very different sweet spots.

How do you see the future of Microsoft Fabric? Do you already use it? What is the plan in the future, also keeping in mind that you likely already have other lakehouses in your enterprise architecture? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks appeared first on Kai Waehner.

]]>
Apache Kafka and Tinybird (ClickHouse) for Streaming Analytics HTTP APIs https://www.kai-waehner.de/blog/2024/04/04/apache-kafka-and-tinybird-clickhouse-for-streaming-analytics-http-apis/ Thu, 04 Apr 2024 06:54:12 +0000 https://www.kai-waehner.de/?p=6103 Apache Kafka became the de facto standard for data streaming. However, the combination of an event-driven architecture with request-response APIs is crucial for most enterprise architectures. This blog post explores how Tinybird innovates with a REST/HTTP layer on top of the open source analytics database ClickHouse in the cloud. Integrating Kafka with Tinybird, the benefits of fully managed services like Confluent Cloud, and customer stories from Factorial and FanDuel show why Kafka and analytics databases complement each other for more innovation and faster time-to-market.

The post Apache Kafka and Tinybird (ClickHouse) for Streaming Analytics HTTP APIs appeared first on Kai Waehner.

]]>
Apache Kafka became the de facto standard for data streaming. However, the combination of an event-driven architecture with request-response APIs is crucial for most enterprise architectures. This blog post explores how Tinybird innovates with a REST/HTTP layer on top of the open source analytics database ClickHouse in the cloud. Integrating Kafka with Tinybird, the benefits of fully managed services like Confluent Cloud, and customer stories from Factorial and FanDuel show why Kafka and analytics databases complement each other for more innovation and faster time-to-market.

Streaming Analytics SQL API with Apache Kafka Confluent ClickHouse Tinybird

Tinybird = ClickHouse Analytics + HTTP APIs

Tinybird is powered by the open source database ClickHouse, but differentiates itself in the analytics market by providing a platform that exposes real-time analytics as APIs.

ClickHouse and the Overwhelming Analytics Market

ClickHouse is an open-source column-oriented database management system designed for handling large-scale analytical workloads. The internal architecture is optimized for high performance and scalability, particularly for real-time analytics and OLAP (Online Analytical Processing) queries.

ClickHouse supports SQL queries and is optimized for fast data ingestion and querying of large volumes of data. The database is commonly used for data warehousing, time-series data analysis, and ad hoc analytics in various industries, including e-commerce, finance, and telecommunications.
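
As a small illustration of the OLAP style ClickHouse is built for, the following sketch runs a time-bucketed aggregation via the clickhouse-connect Python client. Host, credentials, and the page_views table are assumptions for the example.

```python
# Minimal sketch of an OLAP-style aggregation against ClickHouse using the
# clickhouse-connect client. Connection details and the page_views table are
# illustrative placeholders.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

result = client.query(
    """
    SELECT toStartOfMinute(event_time) AS minute,
           count() AS views
    FROM page_views
    WHERE event_time > now() - INTERVAL 1 HOUR
    GROUP BY minute
    ORDER BY minute
    """
)
for minute, views in result.result_rows:
    print(minute, views)
```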

ClickHouse is a great technology. However, in the analytics space, each database solution competes with various other open source frameworks, commercial products and fully managed cloud services. If you evaluate ClickHouse, you probably also evaluate:

  • Analytics services of the cloud providers (like Amazon Redshift, Azure Synapse Analytics, Google BigQuery, etc.)
  • Open source frameworks like Apache Druid (Imply), Pinot (StarTree), and others
  • Data analytics platforms like Snowflake, Databricks, et al.

The overlap between these products is huge. Of course, sweet spots exist for each technology. But it is a mass market, and often only the big players grow consumption significantly while small vendors end up in a niche.

Tinybird does NOT try to compete directly with all the other analytic databases. Instead, it added an intuitive layer on top of ClickHouse to differentiate its product offering.

Tinybird Adds APIs on Top of ClickHouse for Simple Integration and Publishing of Applications

Tinybird is a real-time data pipeline platform designed to help developers build and deploy scalable data products quickly and efficiently. It offers a range of tools and features for ingesting, processing, and serving real-time data streams, including data transformation, aggregation, and analytics.

Tinybird earns its reputation for simplicity and ease of use. It enables users to create custom data pipelines, applications, and APIs without the need for extensive coding or infrastructure setup. The analytics platform focuses on building operational and user-facing use cases with requirements for fresh data, fast queries, and high concurrency. Tinybird is NOT focused on traditional BI use cases served by data warehouses.

Tinybird differentiates from the competition by being more than just a managed database. The platform provides an underlying database powered by ClickHouse. It meets performance and scale needs while focusing on solving the pain around the database: data ingestion, data publication, and development workflow.

Data Streaming to APIs in Minutes
Source: Tinybird

Tinybird helps users to productize their data as HTTP APIs for integration with external systems (lakes, data mesh, user-facing applications). The product covers the entire end-to-end experience, from data ingestion to analytics to publishing APIs. An intuitive UI for interactive development and prototyping in conjunction with a full ‘data as config’ development life cycle makes the development of analytics applications straightforward. Data integration, analytics, and publication are defined as config files, stored in a Git repository. Tools help with automatic CI/CD for testing and deployment, plus a CLI and plugins for developer IDEs.

Relation of ClickHouse and Tinybird to Data Streaming with Apache Kafka

Data streaming with Apache Kafka refers to the process of ingesting, processing, and analyzing real-time data with a distributed streaming platform. Kafka enables the creation of real-time data pipelines that can handle high volumes of data from various sources, allowing for continuous data ingestion, processing, and delivery.

Kafka became the de facto standard protocol for data streaming, like Amazon S3 is the de facto standard for object storage. However…

Apache Kafka is NOT Complex Analytics

Apache Kafka is a database (even though many people disagree or don’t like this statement). But Kafka is not the right platform for complex analytics. Kafka is complementary to other databases like MySQL, MongoDB, Elasticsearch, ClickHouse, et al.

Learn more about this in my article “When NOT to use Apache Kafka” or the following lightboard video:

To be very clear: Data streaming also includes analytics capabilities. Stream processing enables continuous processing of data in motion in real-time at scale. Use cases include simple stateless streaming ETL and advanced stateful computations, including embedding AI and machine learning into a stream processor. However, the workloads differ from analytical databases like ClickHouse. For a better understanding of when to use stream processing, check out my blog about the perfect match between Apache Kafka and stream processing with Kafka Streams or Apache Flink.

Apache Kafka is NOT Request-Response APIs

Request-response communication with REST / HTTP is simple, well understood, and supported by most technologies, products, and SaaS cloud services. Contrarily, data streaming with Apache Kafka is a fundamental change to process data continuously.

Data streaming with Apache Kafka and request-response APIs like HTTP/REST complement each other in most enterprise architectures. I wrote a lot about this in the past:

TL;DR: The question is not if you need Kafka or APIs, but how to combine them the best way in your architecture. Vendors like Confluent provide native HTTP APIs and connectors to produce and consume with the Kafka API using HTTP. But for more advanced use cases, solutions like Tinybird are perfect in combination with Kafka.
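
As a simple illustration of the HTTP side, the following sketch produces a record to a Kafka topic through a Kafka REST Proxy (v2 API). The proxy URL and topic name are assumptions for the example.

```python
# Minimal sketch: producing a record to Kafka over HTTP via a Kafka REST
# Proxy (v2 API). Proxy address and topic are illustrative placeholders.
import json
import requests

url = "http://localhost:8082/topics/orders"
payload = {"records": [{"value": {"order_id": "A-123", "amount": 42.5}}]}

resp = requests.post(
    url,
    data=json.dumps(payload),
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # contains partition/offset info for the written record
```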

Apache Kafka + Tinybird = Streaming Analytics APIs

The Tinybird website explains the relation between data streaming with Kafka and Tinybird very well:

“Turn your Kafka Streams into actionable API Endpoints your teams can consume. Instead of building a new consumer every time you want to make sense of your Data Streams, write SQL queries and expose them as API endpoints. Easy to maintain. Always up-to-date. Fast as can be.”

Apache Kafka Tinybird ClickHouse Integration with Kafka Connect
Source: Tinybird

The Tinybird application development process is very simple:

  1. Connect to Kafka Topics via out-of-the-box Kafka integration.
  2. Store the events in a managed, columnar data source with schemas and a data contract for good data quality.
  3. Query the events with SQL to filter, aggregate, join, and enrich data for fast and scalable analytics.
  4. Publish queries as dynamic, low-latency REST/HTTP APIs that scale (a consumption sketch follows this list).
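
To illustrate step 4, the sketch below calls a published Tinybird endpoint from a plain Python application. The pipe name, host, and token are placeholders; the exact URL depends on your Tinybird region and workspace.

```python
# Minimal sketch: consuming a published Tinybird API endpoint from any
# application. Host, pipe name, and token are illustrative placeholders.
import requests

TINYBIRD_HOST = "https://api.tinybird.co"   # region-specific host (assumption)
PIPE = "top_products_last_hour"             # hypothetical published pipe
TOKEN = "<TINYBIRD_READ_TOKEN>"             # placeholder read token

resp = requests.get(
    f"{TINYBIRD_HOST}/v0/pipes/{PIPE}.json",
    params={"limit": 10},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
for row in resp.json()["data"]:
    print(row)
```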

Fully Managed Cloud Services for Streaming Analytics: Confluent Cloud + Tinybird

Fully managed cloud services offer simplified operations, scalability, high availability, security, compliance, cost savings, and performance optimization. They streamline infrastructure management, ensure reliability, enhance security, reduce costs, and optimize performance for businesses. Project teams focus on business logic and much faster time-to-market of new products and innovation.

Working for Confluent, I enjoy our fast-growing “Connect with Confluent” partner program and seeing how customers build innovative applications in a cloud-native, fully managed environment, including end-to-end integration and data governance.

Confluent Cloud has fully integrated Tinybird. Developers build new applications with HTTP APIs on top of data streaming with Kafka faster than ever before. Check out this screencast to learn how you can publish an API from a Kafka stream in four minutes leveraging Confluent Cloud and Tinybird.

Tinybird Confluent Cloud Integration Connector
Source: Tinybird

Data streaming truly decouples the business domains. In the above diagram, you see a very common scenario: various consumers of the same business information (i.e., a single Kafka Topic). Some consume in real-time (like Tinybird), some in near real-time or batch (like Snowflake or Databricks).

Each developer can use different programming languages, databases, or SaaS analytics platforms. Apache Kafka unifies operational and analytical workloads. As a result, a Tinybird application aggregates transactional and analytical workloads to build new APIs.

Customer Stories using Kafka in Confluent Cloud and Tinybird for Real-Time Analytics

This section explores a few real world case studies that combine Apache Kafka and ClickHouse under the hood of Confluent Cloud’s and Tinybird’s SaaS cloud solutions.

Factorial Human Resources (HR) Software: Data Freshness for Users

Factorial provides human resources (HR) software solutions for over 8000 small and medium-sized businesses. Their platform offers features such as employee onboarding, time tracking, leave management, performance reviews, and HR analytics. Factorial streamlines HR processes, boosts employee productivity, and allows businesses to effectively manage their workforce. People leaders can focus on people, not paperwork.

Factorial HR Software for Real-Time Analytics
Source: Factorial

With data streaming and real-time analytics leveraging fully managed cloud services from Confluent and Tinybird, Factorial has improved its data freshness and reduced query latency, leading to significantly faster user feature launches. Read the detailed success story on the Confluent blog.

FanDuel: Customer Personalization and Fraud Prevention in Gambling and Sports Betting

FanDuel operates in the online gambling and sports betting industry in the United States. I had the pleasure of hosting the company as a guest speaker at executive dinner events in London.

FanDuel is a popular sports betting and daily fantasy sports company. It provides online platforms and mobile apps for users to place bets on various sports events and take part in fantasy sports contests. FanDuel has gained a reputation for its user-friendly interface, a wide range of betting options, and innovative features in the online gambling market.

The entire gambling and betting industry leverages data streaming for real-time data processing, transactional payment processing, fraud detection, customer loyalty platforms, and many other use cases. Read more about this topic in the state of data streaming for the betting industry and Kafka case studies for betting.

FanDuel leverages the combination of Confluent Cloud and Tinybird for real-time analytics use cases with user-facing applications and mobile apps.

FanDuel’s case study states: “Fanduel uses Confluent and Tinybird to power real-time personalization across all their sports betting solutions to improve time-to-first-bet and reduce the risk of fraud.”

If you need APIs on Top of Data Streaming and Analytics, choose Kafka and Tinybird

… and if you want a fully managed end-to-end data pipeline with out-of-the-box connectivity, critical SLAs and cloud-native elasticity and pricing, go with Confluent Cloud and Tinybird. Of course, the data streaming landscape 2024 is broad. Absolutely fine to evaluate other vendors, too. 🙂

The conversations I had with FanDuel at our customer dinner showed that the real world challenge is not choosing the right technologies, but making them easy to use for fast time-to-market and elastic scale with reliable fully managed cloud services.

The motivation for this blog post was my meetings with these joint customers. I will do similar customer dinners in the coming months with other Confluent partners like Rockset and StarTree. I can’t wait to hear from joint customers about the benefits of leveraging a specific analytics engine together with a fully managed data streaming platform for product innovation and better customer experiences.

Do you already use Tinybird together with Kafka? Or how do you build scalable real-time APIs on top of your favorite analytics database? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka and Tinybird (ClickHouse) for Streaming Analytics HTTP APIs appeared first on Kai Waehner.

]]>
How to Build a Real-Time Advertising Platform with Apache Kafka and Flink https://www.kai-waehner.de/blog/2023/09/15/how-to-build-a-real-time-advertising-platform-with-apache-kafka-and-flink/ Fri, 15 Sep 2023 06:49:27 +0000 https://www.kai-waehner.de/?p=4930 An advertising platform requires real-time capabilities to provide dynamic targeting, ad personalization, ad fraud detection, budget allocation, and event-driven marketing. This blog post explores how data streaming with Apache Kafka and Apache Flink enables context-specific advertising at any scale. Real-world success stories from Pinterest, Uber, Unity, buzzwil, and TV-Insight show different solutions and architectures for serving ads in marketing campaigns, embedded into mobile apps, and as SaaS software products.

The post How to Build a Real-Time Advertising Platform with Apache Kafka and Flink appeared first on Kai Waehner.

]]>
An advertising platform requires real-time capabilities to provide dynamic targeting, ad personalization, ad fraud detection, budget allocation, and event-driven marketing. This blog post explores how data streaming with Apache Kafka and Apache Flink enables context-specific advertising at any scale. Real-world success stories from Pinterest, Uber, Reddit, Unity, Buzzvil, and TV-Insight show different solutions and architectures for serving ads in marketing campaigns, embedded into mobile apps, and as SaaS software products.

Real-Time Advertising Platform with Apache Kafka and Flink

What is a digital advertising platform?

An advertising (ads) platform is a digital system or service that allows businesses and advertisers to create, manage, and optimize their advertising campaigns across various channels. These platforms provide tools and features to target specific audiences, allocate budgets, track performance, and measure the effectiveness of advertising efforts.

Digital Marketing

Examples of advertising platforms include Google Ads, Facebook Ads, and programmatic advertising platforms that automate ad placement across websites and apps. These platforms play a crucial role in digital marketing, enabling advertisers to reach their target audience online and achieve their marketing objectives.

Challenges of an advertising platform

  1. Competition: Advertisers often face fierce competition for ad space. This can lead to higher costs and the need for effective targeting strategies.
  2. Ad Fraud: Digital advertising is susceptible to various forms of ad fraud, including click and impression fraud. Advertisers need to implement measures to protect their campaigns from fraudulent activity.
  3. Data Privacy Regulations: Stricter data privacy regulations, such as GDPR and CCPA, impact how advertisers collect and use customer data for targeting. Advertisers must comply with these regulations to avoid legal consequences.
  4. Ad Quality and Relevance: Ensuring that ads are of high quality and relevance to the target audience is essential. Poorly designed or irrelevant ads can lead to wasted ad spend and a negative user experience.
  5. Ad Fatigue: Showing the same ads repeatedly to users can lead to ad fatigue, causing users to ignore or block the ads. Advertisers need to manage frequency and creative refresh to combat this.
  6. Measurement and Attribution: Accurately measuring the impact of advertising campaigns and attributing conversions to specific ads or channels can be challenging, especially in a multi-channel marketing environment.
  7. Platform Changes: Advertising platforms frequently update their algorithms and policies. Advertisers need to adapt to these changes and stay informed to maintain campaign effectiveness.
  8. Budget Management: Effective budget allocation across various channels and campaigns can be complex. Balancing the budget to achieve the best results is an ongoing challenge.
  9. Creative Variation: Creating and testing different ad creatives to find the most effective ones requires ongoing effort and creativity.
  10. Ad Placement: Choosing the right placements on websites, apps, and social media is crucial. Advertisers must consider where their target audience spends their time.

Navigating these challenges requires a data-driven platform and a deep understanding of the digital advertising landscape, constant monitoring, and optimization.

Why does an ads platform need to be real-time?

An advertising platform should be real-time for several important reasons:

  1. Timely Campaign Adjustments: Real-time data allows advertisers to adjust their advertising campaigns promptly. They can respond quickly to market conditions, user behavior, or campaign performance changes. For example, if a particular ad is not performing well or if a sudden surge in user interest occurs, advertisers can pause or modify their campaigns immediately to optimize results.
  2. Dynamic Targeting: Real-time data enables dynamic and precise targeting. Advertisers can adjust their targeting criteria on the fly based on real-time user actions and data, ensuring that ads are delivered to the most relevant audience at the right moment.
  3. Optimized Bidding: Real-time bidding (RTB) is a crucial component of programmatic advertising. Advertisers can bid on ad inventory in real-time based on user data, maximizing their chances of winning ad placements at the best prices.
  4. Ad Personalization: Real-time data allows for highly personalized ad experiences. Advertisers can serve ads tailored to individual user preferences and behavior, increasing the likelihood of engagement and conversion.
  5. Ad Fraud Detection: Real-time monitoring and analysis of ad traffic can help detect and prevent ad fraud as it occurs. Ad platforms can identify suspicious patterns and take action to mitigate fraud, protecting advertisers’ investments.
  6. Budget Allocation: Real-time data informs budget allocation decisions. Advertisers can allocate more budget to high-performing campaigns and reduce spending on underperforming ones in real-time, ensuring efficient use of resources.
  7. Competitive Advantage: Real-time capabilities can provide a significant advantage in a competitive advertising landscape. Advertisers who can react swiftly to market changes and trends can capture opportunities that slower competitors might miss.
  8. User Engagement: Real-time advertising can engage users at the most opportune moments. For example, an e-commerce platform can display retargeting ads to users who abandoned their shopping carts in real-time, encouraging them to complete their purchases.
  9. Event-Driven Marketing: Real-time capabilities enable event-driven marketing. Advertisers can trigger ads based on specific user actions or external events, such as holidays or significant news events, making their campaigns more relevant and timely.
  10. Measurement and Attribution: Real-time data allows for immediate measurement of ad performance and attribution of conversions. Advertisers can track which ads and channels drive results and adjust their strategies accordingly.
  11. User Experience: Real-time ads can enhance the user experience by delivering current and contextually relevant content and offers. This can improve user engagement and satisfaction.

In today’s digital advertising landscape, where user behavior and market conditions can change rapidly, real-time capabilities are essential for advertisers to stay competitive, make data-driven decisions, and maximize the impact of their advertising campaigns. Real-time advertising platforms empower advertisers to be more agile, responsive, and effective in reaching their target audience.

How does Apache Kafka help build an advertising platform?

Apache Kafka combines real-time messaging at any scale with true decoupling through its event store. The data streaming platform collects data, correlates real-time and historical events with stream processing, and shares the resulting information with downstream consumers.

Data Streaming with Apache Kafka and Apache Flink for Advertisement Platform and Ads

One of the most underestimated features is Apache Kafka’s out-of-the-box ability to ensure data consistency across real-time and non-real-time systems. The heart of the enterprise architecture is real-time, scalable, and reliable. But any near real-time, batch or request-response communication can produce or consume at its own pace with its own API or programming language.

Apache Flink is ideal for data correlation. No matter if the task is data integration (aka streaming ETL) or advanced stateful business and application logic. Apache Kafka and Apache Flink are a match made in heaven for data streaming.
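
As a minimal sketch of this combination, the following PyFlink snippet consumes raw ad click events from a Kafka topic and aggregates clicks per campaign in one-minute windows. Topic, schema, and broker address are illustrative assumptions, and the Kafka SQL connector must be available on the Flink classpath.

```python
# Minimal sketch of Kafka + Flink for an ads pipeline: consume click events
# from a Kafka topic and count clicks per campaign in one-minute windows.
# Topic name, fields, and broker address are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE ad_clicks (
        campaign_id STRING,
        user_id     STRING,
        ts          TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'ad-clicks',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

result = t_env.sql_query("""
    SELECT campaign_id,
           TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS clicks
    FROM ad_clicks
    GROUP BY campaign_id, TUMBLE(ts, INTERVAL '1' MINUTE)
""")
result.execute().print()  # continuously prints windowed aggregates
```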

Real Time Bidding and Fraud Detection Advertisement Platform with Apache Kafka and Flink

Real-world success stories show how data streaming with Kafka and Flink helps build a next-generation advertising platform. These technologies solve the abovementioned challenges to provide real-time and consistent information across all applications.

Advertising platforms are either directly embedded into customer-facing applications or built as software or SaaS products that other companies buy and leverage.

The following success stories explore ad platforms built with data streaming:

  • Pinterest: Image-sharing and natural engagement with ads (Kafka Streams).
  • Buzzvil: Lock screen advertisement for smartphones (Kafka and Confluent Cloud).
  • TV-Insight: Live decisions of regular TV ad blocks (Kafka, Flink, and Confluent Cloud).
  • Unity: Monetization network for gaming (Kafka and Confluent).
  • Uber Eats: Ads in the mobile food delivery app (Kafka, Flink, Pinot).
  • Reddit: Ad placement, including real-time budget planning without over-delivery or under-delivery (Kafka, Flink, Druid).

Pinterest – Social media natural engagement with ads

Pinterest is an American image-sharing and social media service designed to enable the saving and discovery of information (specifically “ideas”) like recipes, home, style, motivation, and inspiration on the internet.

The content of ads is very close to the actual content. Naturally, users engage with the content and ads:

Pinterest Mobile App Home Feed Search and Ads

Pinterest talked about its Kafka-powered advertising platform for the first time in 2018 at a Kafka Summit. The Ad platform leverages Kafka for the data ingestion pipeline and stream processing with Kafka Streams to enable a real-time feedback loop. Recommendation engine (via machine learning), budgeting, and new ads exploration are some of the critical use cases.

Pinterest Ads Engine built with Kafka Streams

The continuous feedback loop enables real-time updates in seconds. Stateful stream processing with Kafka Streams correlates events from users, ads, budget, and other interfaces to decide on ads serving.

Stateful Stream Processing with Kafka Streams at Pinterest

Real-time (even at an extreme scale) is critical for Pinterest. When a new ad is created, the ads platform does not know about the user engagement with this ad on different surfaces. The faster the ads platform knows about the performance of the newly created ad, the better value can be provided to the user.

There is a balance between exploiting good ads and exploring new ads. The solution was adding a boosting factor to new ads to increase the probability of winning an auction.
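
Conceptually, such an exploration boost can be as simple as multiplying a new ad's auction score by a factor that decays as impression feedback accumulates. The following sketch is my own illustration of the idea, not Pinterest's actual implementation; all numbers are made up.

```python
# Conceptual sketch (not Pinterest's code): boost new ads in the auction until
# enough impressions have been observed to trust their measured performance.
def auction_score(predicted_ctr: float, bid: float, impressions: int,
                  min_impressions: int = 1000, max_boost: float = 1.5) -> float:
    """Score used to rank ads in an auction, with an exploration boost."""
    # Boost decays linearly from max_boost to 1.0 as impressions approach the
    # threshold where real-time feedback becomes statistically meaningful.
    boost = 1.0 + (max_boost - 1.0) * max(0.0, 1.0 - impressions / min_impressions)
    return predicted_ctr * bid * boost

# A brand-new ad gets a higher score than its raw prediction alone would give.
print(auction_score(predicted_ctr=0.01, bid=2.0, impressions=0))     # boosted
print(auction_score(predicted_ctr=0.01, bid=2.0, impressions=5000))  # no boost
```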

Listen to the talk from Pinterest for more details, best practices, and lessons learned in developing and operating a scalable, real-time advertising platform with stateful stream processing using Kafka Streams.

Buzzvil – Lock screen advertising platform

Buzzvil provides a lock screen advertising platform that connects partners and advertisers:

buzzvil – AdTech for Publishers and Advertisers

Buzzvil’s advertising platform is data-driven and built with Apache Kafka in the cloud. It optimizes ad spending through automation, behavioral analytics, audience targeting, rewards programs, and more. Data streaming enables a single source of truth for real-time ad transaction data.

buzzvil - Advertisement Platform built with Apache Kafka in Confluent Cloud

They built the ad platform with Apache Kafka on fully managed Confluent Cloud to focus on business logic and faster time-to-market.

Data streaming with Apache Kafka enables 18x faster data updates for ad bidding. Confluent Cloud saves 20-30% infrastructure cost.

TV-Insight – Live decisions of regular TV ad blocks

TV-Insight developed a solution to help Joint Industry Committees (JIC), Broadcasters, and Advertisers to improve and evolve the data quality of existing TV measurement panels using return path data of connected devices.

The problem of monitoring classical TV

The essential difference between TV-Insight and all other “panel boosting” initiatives and products is that TV-Insight uses real-time data. Therefore, it can provide a live TV reach for live decisions of regular TV ad blocks.

TVI Insight Live Reach Prediction in Real-Time

The TV-Insight application collects data from the Smart TV or Set-Top Box via GDPR-compliant device tracking. The live extrapolation enables advertising optimization:

TV Insight Enterprise Architecture for Real-Time Ads

The technical architecture and data pipeline look like the following. Apache Kafka is the real-time messaging platform and event store. Stateful stream processing correlates events to calculate real-time ad serving decisions in the advertising platform.

Apache Kafka and Apache Flink for Advertisement Platform at TV Insight

Unity Ads – Monetization network for gaming

Unity is a cross-platform game engine developed by Unity Technologies. The engine has since been gradually extended to support a variety of desktop, mobile, console, and virtual reality platforms. The engine can create three-dimensional (3D) and two-dimensional (2D) games, interactive simulations, and other experiences. Industries outside video gaming have adopted the engine, such as film, automotive, architecture, engineering, and construction.

In 2019, Unity apps and content were installed 33 billion times, reaching 3 billion devices worldwide.

The 3D development platform and game engine is not the only product of Unity Technologies. Unity Ads is one of the largest monetization networks in the world:

  • Reward players for watching ads
  • Incorporate banner ads
  • Incorporate Augmented Reality (AR) ads
  • Playable ads
  • Cross-Promotions
  • IAPs (in-app purchases)

Unity is a data-driven company:

  • Averages about half a million events per second
  • Handles millions of dollars in monetary transactions
  • Data infrastructure based on Apache Kafka

A single data pipeline provides the foundational infrastructure for analytics, R&D, monetization, cloud services, etc., for real-time and batch processing leveraging Apache Kafka:

  • Real-time monetization network
  • Feed machine learning models in real-time
  • Data lake went from two-day latency down to 15 minutes

If you want to learn about Unity’s success story of migrating this platform from self-managed Kafka to the cloud, read the post on the Confluent Blog: “How Unity uses Confluent for real-time event streaming at scale“.

Uber Eats – Ads embedded into food delivery app

Uber provides an exciting capability in its food delivery app: Uber Eats allows embedded ads. With this ability came new challenges that needed to be solved at Uber, such as systems for ad auctions, bidding, attribution, reporting, and more.

Uber wrote an excellent article that focuses on how they leveraged open source technology to build Uber’s first near real-time exactly-once events processing system. Uber leverages Kafka, Flink, and Pinot for its advertising platform. This perfectly combines the right technologies.

Uber Eats Architecture of the Advertisement Platform using Kafka Flink and Pinot

As Uber writes: “With every ad served, there are corresponding events per user (impressions, clicks). The responsibility of the ad events processing system is to manage the flow of events, cleanse them, aggregate clicks and impressions, attribute them to orders, and provide this data in an accessible format for reporting and analytics as well as dependent clients (e.g., other ads systems).”

While speed, scale, and reliability are always crucial for such a system, I want to emphasize the part about accuracy and why exactly-once processing with Kafka and Flink was a critical piece of the architecture.

The Aggregation Job implemented with Apache Flink does a lot of the heavy lifting: Data cleansing, persistence for order attribution, aggregation, and record UUID generation.

Uber Aggregation Job with Apache Flink

Exactly-once with Kafka and Flink is very important, as their blog post explains: “Uber can’t afford to overcount events. Double counting clicks results in overcharging advertisers and overreporting the success of ads. Both being poor customer experiences, this requires processing events exactly-once. Uber is the marketplace in which ads are being served, therefore our ad attribution must be 100% accurate.”
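
Uber's pipeline relies on Flink's checkpoint-based two-phase commit, but the underlying building block is Kafka transactions. The following Python sketch shows the transactional produce pattern in isolation; broker, topic, and payload are placeholders.

```python
# Minimal sketch of exactly-once style writes with a transactional Kafka
# producer (confluent-kafka). This only illustrates Kafka transactions, not
# Uber's actual Flink job. Broker and topic are placeholders.
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "ad-events-aggregator-1",
    "enable.idempotence": True,
})

producer.init_transactions()
producer.begin_transaction()
try:
    aggregated = {"ad_id": "ad-42", "clicks": 17, "impressions": 950}
    producer.produce("aggregated-ad-events", value=json.dumps(aggregated).encode())
    producer.commit_transaction()   # either all records become visible, or none
except Exception:
    producer.abort_transaction()    # avoids double counting on retry
    raise
```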

Reddit – Ads Placing with Budget Planning avoiding Over- or Under-Delivery

Reddit is an American social news aggregation, content rating, and discussion website. Registered users submit content to the site such as links, text posts, images, and videos, which other members then vote up or down.

Reddit’s ads platform allows advertisers to create ad campaigns and set both daily and lifetime budgets for a campaign. Here is Reddit’s decision tree to place advertisements:

Reddit Decision Tree to Place Advertisements

The data pipeline leverages Kafka, Flink, and Druid to analyze campaign budgets in real-time. The platform leverages real-time plus historical user activity data to decide which ad to place. All within 30 milliseconds to avoid over-delivery and under-delivery (budget spent too quickly / slowly).
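
Conceptually, a pacing check boils down to comparing actual spend against an ideal spend curve before serving the next ad. The sketch below is a simplified illustration of that idea with linear pacing, not Reddit's actual logic.

```python
# Conceptual pacing sketch (not Reddit's implementation): decide whether an ad
# is eligible to serve by comparing actual spend against the ideal spend curve
# for the day, to avoid over- or under-delivery.
from datetime import datetime, timedelta

def is_eligible(daily_budget: float, spent_today: float,
                now: datetime, tolerance: float = 0.05) -> bool:
    day_start = now.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = (now - day_start) / timedelta(days=1)        # fraction of day elapsed
    ideal_spend = daily_budget * elapsed                   # linear pacing target
    return spent_today <= ideal_spend * (1.0 + tolerance)  # pause if ahead of pace

print(is_eligible(daily_budget=1000.0, spent_today=400.0,
                  now=datetime(2023, 9, 15, 12, 0)))       # True: on pace at noon
```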

Reddit Ads Serving Platform using Apache Kafka, Flink and Druid

Watch Reddit’s talk from Druid Summit “Low Latency Real-Time Ads Pacing Queries” to learn more about their ads platform and use cases.

Real-world success stories from Pinterest, Uber Eats, Reddit, Unity, Buzzvil, and TV-Insight showed how to embed real-time advertising into your applications or build a dedicated marketing product.

Data streaming with Apache Kafka and Apache Flink enables context-specific advertising at scale in real time. The cloud makes it possible to focus on business logic and faster time-to-market with a fully managed data streaming platform.

How do you leverage data streaming in marketing and advertising use cases? Do you deploy at the edge, in the cloud, or both? Or do you integrate 3rd party marketing platforms into your advertising platforms? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post How to Build a Real-Time Advertising Platform with Apache Kafka and Flink appeared first on Kai Waehner.

]]>
Apache Kafka as Mission Critical Data Fabric for GenAI https://www.kai-waehner.de/blog/2023/07/22/apache-kafka-as-mission-critical-data-fabric-for-genai/ Sat, 22 Jul 2023 10:40:59 +0000 https://www.kai-waehner.de/?p=5548 Apache Kafka serves thousands of enterprises as the mission-critical and scalable real-time data fabric for machine learning infrastructures. The evolution of Generative AI (GenAI) with large language models (LLM) like ChatGPT changed how people think about intelligent software and automation. This blog post explains the relation between data streaming and GenAI and shows the enormous opportunities and some early adopters of GenAI beyond the buzz.

The post Apache Kafka as Mission Critical Data Fabric for GenAI appeared first on Kai Waehner.

]]>
Apache Kafka serves thousands of enterprises as the mission-critical and scalable real-time data fabric for machine learning infrastructures. The evolution of Generative AI (GenAI) with large language models (LLM) like ChatGPT changed how people think about intelligent software and automation. This blog post explains the relation between data streaming and GenAI and shows the enormous opportunities and some early adopters of GenAI beyond the buzz.
Data Streaming with Apache Kafka as Data Fabric for GenAI

Generative AI (GenAI) and Data Streaming

Let’s set the context first to have the same understanding of the buzzwords.

[Note: My ChatGPT iPhone app generated this section. I slightly shortened and adjusted the content afterward. GenAI is perfect for summarizing existing content so that authors can spend time on new content (that ChatGPT does not know yet).]

Natural Language Processing (NLP)

ChatGPT, what is NLP?

NLP stands for Natural Language Processing. It is a subfield of artificial intelligence (AI) that focuses on interactions between computers and human language. NLP enables computers to understand, interpret, and generate human language in a meaningful and valuable way.

Natural Language Processing involves a range of tasks, including:

  • Text Parsing and Tokenization: Breaking down text into individual words or tokens.
  • Named Entity Recognition (NER): Identifying and classifying named entities such as people, organizations, locations, and dates in the text.
  • Sentiment Analysis: Determining the sentiment or emotional tone expressed in text, whether positive, negative, or neutral.
  • Machine Translation: Translating text from one language to another.
  • Question Answering: Building systems that can understand and answer questions posed in natural language.
  • Text Generation: Creating human-like text or generating responses to prompts.

NLP is crucial in applications such as chatbots, virtual assistants, language translation, information retrieval, sentiment analysis, and more.

GenAI = Next Generation NLP (and more)

ChatGPT, what is Generative AI?

Generative AI is a branch of artificial intelligence focused on creating models and systems capable of generating new content, such as images, text, music, or even entire virtual worlds. These models are trained on large datasets and learn patterns and structures to generate new outputs similar to the training data. That’s why the widespread buzzword is Large Language Model (LLM).

Generative AI is used for next-generation NLP and uses techniques such as generative adversarial networks (GANs), variational autoencoders (VAEs), and recurrent neural networks (RNNs). Generative AI has applications in various fields and industries, including art, design, entertainment, and scientific research.

Apache Kafka for Data Streaming

ChatGPT, what is Apache Kafka?

Apache Kafka is an open-source distributed streaming platform and became the de facto standard for event streaming. It was developed by the Apache Software Foundation and is widely used for building real-time data streaming applications and event-driven architectures. Kafka provides a scalable and fault-tolerant system for handling high volumes of streaming data.

Open Source Data Streaming in the Cloud

Kafka has a thriving ecosystem with various tools and frameworks that integrate with it, such as Apache Spark, Apache Flink, and others.

Apache Kafka is widely adopted in use cases that require real-time data streaming, such as data pipelines, event sourcing, log aggregation, messaging systems, and more.

Why Apache Kafka and GenAI?

Generative AI (GenAI) is the next-generation NLP engine that helps many real-world projects with service desk automation, customer conversations with chatbots, content moderation in social networks, and many other use cases.

Apache Kafka became the predominant orchestration layer in these machine learning platforms for integrating various data sources, processing at scale, and real-time model inference.

Data streaming with Kafka already powers many GenAI infrastructures and software products. Very different scenarios are possible:

  • Data streaming as data fabric for the entire machine learning infrastructure
  • Model scoring with stream processing for real-time productions
  • Generation of streaming data pipelines with input text or speech
  • Real-time online training of large language models

Let’s explore these opportunities for data streaming with Kafka and GenAI in more detail.

Real-time Kafka Data Hub for GenAI and other Microservices in the Enterprise Architecture

I already explored in 2017 (!) “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka“. At that time, real-world examples came from tech giants like Uber, Netflix, and Paypal.

Today, Apache Kafka is the de facto standard for building scalable and reliable machine learning infrastructures across any enterprise and industry, including:

  • Data integration from various sources (sensors, logs, databases, message brokers, APIs, etc.) using Kafka Connect connectors, fully-managed SaaS integrations, or any kind of HTTP REST API or programming language.
  • Data processing leveraging stream processing for cost-efficient streaming ETL such as filtering, aggregations, and more advanced calculations while the data is in motion (so that any downstream application gets accurate information)
  • Data ingestion for near real-time data sharing with various data warehouses and data lakes so that each analytics platform can use its product and tools.

Kafka Machine Learning Architecture for GenAI

Building scalable and reliable end-to-end pipelines is today’s sweet spot of data streaming with Apache Kafka in the AI and Machine Learning space.

Model Scoring with Stream Processing for Real-Time Predictions at any Scale

Deploying an analytic model in a Kafka application is the solution to provide real-time predictions at any scale with low latency. This is one of the biggest problems in the AI space, as data scientists primarily focus on historical data and batch model training in data lakes.

However, the model scoring for predictions needs to provide much better SLAs regarding scalability, reliability, and latency. Hence, more and more companies separate model training from model scoring and deploy the analytic model within a stream processor such as Kafka Streams, KSQL, or Apache Flink:

Data Streaming and Machine Learning with Embedded TensorFlow Model

Check out my article “Machine Learning and Real-Time Analytics in Apache Kafka Applications” for more details.
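
The pattern itself is straightforward: load a trained model into the stream processing application and score each event as it arrives. The following Python sketch mimics the idea with a plain Kafka consumer loop and a dummy scoring function; in production, this logic would typically live in Kafka Streams or Flink with a real model embedded.

```python
# Minimal sketch of embedding a model into a streaming application for
# real-time scoring. Broker, topics, and the scoring function are placeholders.
import json
from confluent_kafka import Consumer, Producer

def score(features: dict) -> float:
    # Stand-in for real model inference (e.g., a loaded TensorFlow/ONNX model).
    return 1.0 if features.get("amount", 0) > 100 else 0.0

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fraud-scoring",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["payments"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    prediction = {"payment_id": event.get("id"), "fraud_score": score(event)}
    producer.produce("fraud-scores", value=json.dumps(prediction).encode())
    producer.poll(0)  # serve delivery callbacks without blocking
```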

Dedicated model servers usually only support batch and request-response (e.g., via HTTP or gRPC). Fortunately, many solutions now also provide native integration with the Kafka protocol.

Kafka-native Model Server for Machine Learning and Model Deployment

I explored this innovation in my blog post “Streaming Machine Learning with Kafka-native Model Deployment“.

Development Tools for Generating Kafka-native Data Pipelines from Input Text or Speech

Almost every software vendor discusses GenAI to enhance its development environments and user interfaces.

For instance, GitHub is a platform and cloud-based service for software development and version control using Git. But their latest innovation is “the AI-Powered Developer Platform to Build, Scale, and Deliver Secure Software”: GitHub Copilot X. Cloud providers like AWS provide similar tools.

Similarly, look at any data infrastructure vendor like Databricks or Snowflake. The latest conferences and announcements focus on embedded capabilities around large language models and GenAI in their solutions.

The same will be true for many data streaming platforms and cloud services. Low-code/no-code tools will add capabilities to generate data pipelines from input text. One of the most straightforward applications that I see coming is generating SQL code out of user text.

For instance, “Consume data from Oracle table customer, aggregate the payments by customer, and ingest it into Snowflake”. This could create SQL code for stream processing technologies like KSQL or FlinkSQL.
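
A hypothetical output of such a prompt could look like the following generated streaming SQL, here simply held in a Python string. Table names and the overall statement are illustrative; the real result would depend on the configured source and sink connectors.

```python
# Hypothetical output of such a prompt: generated streaming SQL that reads from
# an Oracle-backed source table, aggregates payments per customer, and writes
# into a Snowflake sink table. All table names are illustrative placeholders.
generated_sql = """
INSERT INTO snowflake_customer_payments
SELECT customer_id, SUM(amount) AS total_payments
FROM oracle_customer_payments
GROUP BY customer_id
"""
# In a real pipeline, this statement would be submitted to a stream processor,
# e.g., via a PyFlink TableEnvironment: t_env.execute_sql(generated_sql)
print(generated_sql)
```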

Developer experience, faster time-to-market, and support for less technical personas are enormous advantages of embedding GenAI into Kafka development environments.

Real-time Training of Large Language Models (LLM)

AI and Machine Learning are still batch-based systems almost all of the time. Model training takes at least hours. This is not ideal, as many GenAI use cases require accurate and up-to-date information. Imagine googling for information today and not finding any data from the past week. Such a service would be impossible to use in many scenarios!

Similarly, if I ask ChatGPT today (July 2023): “What is GenAI?” – I get the following response:

As of my last update in September 2021, there is no specific information on an entity called “GenAi.” It’s possible that something new has emerged since then. Could you provide more context or clarify your question so I can better assist you?

The faster your machine learning infrastructure ingests data into model training, the better. My colleague Michael Drogalis wrote an excellent deep-technical blog post: “GPT-4 + Streaming Data = Real-Time Generative AI” to explore this topic more thoroughly.

Real Time GenAI with Data Streaming powered by Apache Kafka

This architecture is compelling because the chatbot will always have your latest information whenever you prompt it. For instance, if your flight gets delayed or your terminal changes, the chatbot will know about it during your chat session. This is entirely distinct from current approaches where the chat session must be reloaded or wait a few hours/days for new data to arrive.

LLM + Vector Database + Kafka = Real-Time GenAI

Real-time model training is still a novel approach. Many machine learning algorithms are not ready for continuous online model training today. But combining Kafka with a vector database enables using a batch-trained LLM together with real-time updates feeding up-to-date information into the LLM.

In a few years, nobody will accept an LLM like ChatGPT giving answers like “I don’t have this information; my model was trained a week ago”. It does not matter if you choose a brand new vector database like Pinecone or leverage the new vector capabilities of your installed Oracle or MongoDB storage.

Feed data into the vector database in real-time with Kafka Connect and combine it with a mature LLM to enable real-time GenAI with context-specific recommendations.
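
A minimal sketch of this pattern: a Kafka consumer picks up fresh events, computes embeddings, and upserts them into a vector store that the LLM queries at prompt time. The embedding function and the in-memory store below are placeholders for a real embedding model and a real vector database.

```python
# Minimal sketch: consume fresh events from Kafka, embed them, and keep them in
# a vector store so the LLM always has current context (RAG pattern). The
# embed() function and the in-memory "vector store" are placeholders for a real
# embedding model and a database like Pinecone, MongoDB, or Oracle with vector
# capabilities. Broker and topic are also placeholders.
import json
from confluent_kafka import Consumer

def embed(text: str) -> list[float]:
    # Placeholder: replace with a call to a real embedding model.
    return [float(ord(c) % 7) for c in text[:16]]

vector_store = []  # placeholder for a vector database client

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "genai-context-loader",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["flight-status-updates"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    text = event.get("description", "")
    vector_store.append({"id": event.get("id"), "vector": embed(text), "text": text})
    # At query time, the chatbot retrieves the nearest vectors and passes the
    # matching text into the LLM prompt as up-to-date context.
```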

Real-World Case Studies for Kafka and GenAI

This section explores how companies across different industries, such as the carmaker BMW, the online travel and booking platform Expedia, and the dating app Tinder, leverage the combination of data streaming with GenAI for reliable real-time conversational AI, NLP, and chatbots leveraging Kafka.

Two years ago, I wrote about this topic: “Apache Kafka for Conversational AI, NLP and Chatbot“. But technologies like ChatGPT make it much easier to adopt GenAI in real-world projects with much faster time-to-market and less cost and risk. Let’s explore a few of these success stories for embedding NLP and GenAI into data streaming enterprise architectures.

Disclaimer: As I want to show real-world case studies instead of visionary outlooks, I show several examples deployed in production in the last few years. Hence, the analytic models do not use GenAI, LLM, or ChatGPT as we know it from the press today. But the principles are precisely the same. The only difference is that you could use a cutting-edge model like ChatGPT with much improved and context-specific responses today.

Expedia – Conversations Platform for Better Customer Experience

Expedia is a leading online travel and booking platform. They have many use cases for machine learning. One of my favorite examples is their Conversations Platform built on Kafka and Confluent Cloud to provide an elastic cloud-native application.

The goal of Expedia’s Conversations Platform was simple: Enable millions of travelers to have natural language conversations with an automated agent via text, Facebook, or their channel of choice. Let them book trips, make changes or cancellations, and ask questions:

  • “How long is my layover?”
  • “Does my hotel have a pool?”
  • “How much will I get charged to bring my golf clubs?”

Then take all that is known about that customer across all of Expedia’s brands and apply machine learning models to immediately give customers what they are looking for in real-time and automatically, whether a straightforward answer or a complex new itinerary.

Real-time Orchestration realized in four Months

Such a platform is no place for batch jobs, back-end processing, or offline APIs. To quickly make decisions that incorporate contextual information, the platform needs data in near real-time, and it needs it from a wide range of services and systems. Meeting these needs meant architecting the Conversations Platform around a central nervous system based on Confluent Cloud and Apache Kafka. Kafka made it possible to orchestrate data from loosely coupled systems, enrich data as it flows between them so that by the time it reaches its destination, it is ready to be acted upon, and surface aggregated data for analytics and reporting.

Expedia built this platform from zero to production in four months. That’s the tremendous advantage of using a fully managed serverless event streaming platform as the foundation. The project team can focus on the business logic.

The Covid pandemic proved the idea of an elastic platform: Companies were hit with a tidal wave of customer questions, cancellations, and re-bookings. Throughout this once-in-a-lifetime event, the Conversations Platform proved up to the challenge, auto-scaling as necessary and taking off much of the load of live agents.

Expedia’s Migration from MQ to Kafka as Foundation for Real-time Machine Learning and Chatbots

As part of their conversations platform, Expedia needed to modernize their IT infrastructure, as Ravi Vankamamidi, Director of Technology at Expedia Group, explained in a Kafka Summit keynote.

Expedia’s old chatbot service relied on a legacy messaging system. It was a question-and-answer board with very limited scope for booking scenarios and could only handle two-party conversations. It could not scale to bring all the different systems into one architecture to build a powerful chatbot that is helpful for customer conversations.

I explored several times that event streaming is more than just a (scalable) message queue. Check out my old (but still accurate and relevant) Comparison between MQ and Kafka, or the newer comparison between cloud-native iPaaS and Kafka.

Expedia needed a service that was closer to travel assistance. It had to handle context-specific, multi-party, multi-channel conversations. Hence, features such as natural language processing, translation, and real-time analytics were required, and the full service needed to scale across multiple brands. Therefore, a fast and highly scalable platform with order guarantees, exactly-once semantics (EOS), and real-time data processing was needed.
The Kafka-native event streaming platform powered by Confluent was the best choice and met all requirements. The new conversations platform doubled the Net Promoter Score (NPS) within a year of the rollout and quickly proved its business value.
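
The ordering and exactly-once requirements map to standard Kafka Streams configuration. The following snippet is a minimal sketch, assuming a local broker and a hypothetical application id; per-conversation ordering comes from keying all events of a conversation with the same key so they land on the same partition.

```java
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class ExactlyOnceConfig {
    public static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "chatbot-orchestration"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed broker address
        // Exactly-once semantics (EOS): output records and consumer offsets are committed atomically.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        return props;
    }
}
```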

BMW – GenAI for Contract Intelligence, Workplace Assistance and Machine Translation

The automotive company BMW presented innovative NLP services at Kafka Summit in 2019. It is no surprise that a carmaker has various NLP scenarios. These include digital contract intelligence, workplace assistance, machine translation, and customer conversations. The latter contains multiple use cases for conversational AI:
  • Service desk automation
  • Speech analysis of customer interaction center (CIC) calls to improve the quality
  • Self-service using smart knowledge bases
  • Agent support
  • Chatbots
The text and speech data is structured, enriched, contextualized, summarized, and translated to build real-time decision support applications. Kafka is a crucial component of BMW’s ML and NLP architecture. The real-time integration and data correlation enable interactive and interoperable data consumption and usage:
NLP Service Framework Based on Data Streaming at BMW
BMW explained the key advantages of leveraging Kafka and its stream processing library Kafka Streams as the real-time integration and orchestration platform:
  • Flexible integration: Multiple supported interfaces for different deployment scenarios, including various machine learning technologies, programming languages, and cloud providers
  • Modular end-to-end pipelines: Services can be connected to provide full-fledged NLP applications (see the sketch after this list).
  • Configurability: High agility for each deployment scenario
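
As a minimal illustration of the “modular end-to-end pipelines” idea, the following Kafka Streams sketch shows a single pipeline stage that reads raw text, applies a model, and writes the result to the next topic, so each stage can be developed, deployed, and scaled independently. The topic names and the NLP function are placeholders, not BMW’s actual implementation.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class NlpPipelineStage {

    // Placeholder for a model call (translation, entity extraction, summarization, ...).
    // A real deployment would invoke a model serving endpoint or an embedded model.
    static String applyNlpModel(String text) {
        return text.toLowerCase(); // stand-in transformation
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Each stage consumes from one topic and produces to the next.
        builder.stream("raw-text", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(NlpPipelineStage::applyNlpModel)
               .to("enriched-text", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "nlp-pipeline-stage");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        new KafkaStreams(builder.build(), props).start();
    }
}
```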

Tinder – Intelligent Content Moderation, Matching and Recommendations with Kafka and GenAI

The dating app Tinder is a great example for which I can think of dozens of NLP use cases. Tinder talked at a past Kafka Summit about their Kafka-powered machine learning platform.

Tinder is a massive user of Kafka and its ecosystem for various use cases, including content moderation, matching, recommendations, reminders, and user reactivation. They used Kafka Streams as a Kafka-native stream processing engine for metadata processing and correlation in real-time at scale:

Impact of Data Streaming at Tinder
A critical use case in any dating or social platform is content moderation for detecting fakes and filtering sexual or otherwise inappropriate content. Content moderation combines NLP and text processing (e.g., for chat messages) with image processing (e.g., for selfie uploads); the metadata is processed with Kafka while the linked content is stored in a data lake. Both leverage deep learning to process high volumes of text and images. Here is what content moderation looks like in Tinder’s Kafka architecture:
Content Moderation at Tinder with Data Streaming and Machine Learning
Plenty of ways exist to process text, images, and videos with the Kafka ecosystem. I wrote a detailed article about handling large messages and files with Apache Kafka to explore the options and trade-offs.
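
One common option discussed in that article is the claim-check pattern: the large payload (e.g., an uploaded image) stays in object storage or the data lake, while Kafka carries only lightweight metadata plus a reference for the moderation service. The following producer sketch illustrates the idea; the topic name, JSON fields, and storage location are hypothetical and not Tinder’s implementation.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ModerationEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Claim check: the image itself stays in object storage / the data lake;
            // Kafka carries only the metadata and a reference to the content.
            String userId = "user-42"; // hypothetical key
            String event = "{\"type\":\"selfie-upload\","
                    + "\"contentRef\":\"s3://uploads/user-42/selfie-123.jpg\"," // hypothetical location
                    + "\"uploadedAt\":\"2024-01-01T12:00:00Z\"}";
            producer.send(new ProducerRecord<>("content-moderation-events", userId, event));
        }
    }
}
```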
Chatbots can also play a key role the other way round: more and more dating apps (and other social networks) fight against spam, fraud, and automated bots. Just like a chatbot itself, a bot detection system can analyze the data streams to identify and block a dating app’s chatbots.

Kafka as Real-Time Data Fabric for Future GenAI Initiatives

Real-time data beats slow data. Generative AI only adds value if it provides accurate and up-to-date information. Data streaming technologies such as Apache Kafka and Apache Flink enable building a reliable, scalable real-time infrastructure for GenAI. Additionally, the event-based heart of the enterprise architecture guarantees data consistency between real-time and non-real-time systems (near real-time, batch, request-response).

Early adopters like BMW, Expedia, and Tinder proved that AI and NLP integrated into a Kafka architecture add enormous business value. The evolution of AI models with ChatGPT et al. makes these use cases even more compelling across every industry.

How do you build conversational AI, chatbots, and other GenAI applications leveraging Apache Kafka? What technologies and architectures do you use? Are data streaming and Kafka part of the architecture? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka as Mission Critical Data Fabric for GenAI appeared first on Kai Waehner.

]]>
The Heart of the Data Mesh Beats Real-Time with Apache Kafka https://www.kai-waehner.de/blog/2022/07/28/the-heart-of-the-data-mesh-beats-real-time-with-apache-kafka/ Thu, 28 Jul 2022 06:08:38 +0000 https://www.kai-waehner.de/?p=4706 If there were a buzzword of the hour, it would undoubtedly be "data mesh"! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. Learn how the de facto standard for data streaming, Apache Kafka, plays a crucial role in building a data mesh.

The post The Heart of the Data Mesh Beats Real-Time with Apache Kafka appeared first on Kai Waehner.

]]>
If there were a buzzword of the hour, it would undoubtedly be “data mesh“! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. Learn how the de facto standard for data streaming, Apache Kafka, plays a crucial role in building a data mesh.

The Heart of the Data Mesh Beats Real Time with Apache Kafka

There is no single technology or product for a data mesh!

This post explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and – complemented by many other data platforms like a data warehouse, data lake, and lakehouse – solve real business problems.

There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The key outcome of a data mesh architecture is the ability to build data products with the right tool for the job. A good data mesh combines data streaming technology like Apache Kafka or Confluent Cloud with cloud-native data warehouse and data lake architectures from Snowflake, Databricks, Google BigQuery, et al.

What is a data mesh?

I won’t write yet another article describing the concepts of a data mesh. Zhamak Dehghani coined the term in 2019. The following data mesh architecture from a 30,000-foot view explains the basic idea well:

Data Mesh for Domain Driven Design Microservices Data Mart Data Streaming

I summarize data mesh with the following bullet points:

  • An architecture paradigm with several historical influences (domain-driven design, microservices, data marts, data streaming)
  • Not specific to a single technology or product; no single vendor can implement a data mesh alone
  • Handling data as a product is a fundamental change, enabling a more flexible architecture and independent solving of separate business problems
  • Decentralized services, not just analytics but also transactional workloads

Why handle data as a product?

Talking about innovative technology is insufficient to introduce a new architectural paradigm. Consequently, measuring the business value of the enterprise architecture is critical, too.

McKinsey finds that “when companies instead manage data like a consumer product—be it digital or physical—they can realize near-term value from their data investments and pave the way for quickly getting more value tomorrow. Creating reusable data products and patterns for piecing together data technologies enables companies to derive value from data today and tomorrow”:

McKinsey - Why Handle Data as a Product

For McKinsey, the benefits of this approach can be significant:

  • New business use cases can be delivered as much as 90 percent faster.
  • The total cost of ownership, including technology, development, and maintenance, can decline by 30 percent.
  • The risk and data-governance burden can be reduced.

What is data streaming with Apache Kafka and its relation to data mesh?

A data mesh enables flexibility through decentralization and best-of-breed data products. The heart of data sharing requires reliable real-time data at any scale between data producers and data consumers. Additionally, true decoupling between the decentralized data products is key to the success of the data mesh paradigm. Each domain must have access to shared data but also the ability to choose the right tool (i.e., technology, API, product, or SaaS) to solve its business problems.

That’s where data streaming fits into the data mesh story:

Flexibility through Decentralization and Best-of-Breed with Data Streaming

The de facto standard for data streaming is Apache Kafka. A cloud-native data streaming infrastructure that can link clusters with each other out-of-the-box enables building a modern data mesh. No data mesh will use just one technology or vendor. Learn from inspiring posts from your favorite data product vendors like AWS, Snowflake, Databricks, Confluent, and many more to successfully define and build your custom data mesh. A data mesh is a journey, not a big bang. And a data warehouse or data lake (or, in modern days, a lakehouse) cannot be the only infrastructure for a data mesh and its data products.

I covered how to leverage the capabilities of Apache Kafka and its ecosystem like Kafka Connect, ksqlDB, Cluster Linking, etc. to build the heart of a data mesh in a separate blog post: Streaming Data Exchange with Kafka and a Data Mesh in Motion.
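
To make the decoupling concrete, here is a minimal sketch of the consuming side of a data product: a team in one domain subscribes to a topic published by another domain and depends only on the topic and its schema contract, not on any point-to-point integration. The broker address, topic name, and consumer group are assumptions for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class DataProductConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "marketing-domain");        // hypothetical consuming domain
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The "orders" domain publishes its data product to this topic (hypothetical name).
            // The consuming domain is fully decoupled and can process the events with its own tools.
            consumer.subscribe(List.of("orders.public.order-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("order %s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```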

Example: Real-time data fabric in hybrid cloud

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

Hybrid Cloud Streaming Data Mesh powered by Apache Kafka and Cluster Linking

This example shows all the characteristics discussed in the above sections for a Data Mesh:

  • Decentralized real-time infrastructure across domains and infrastructures
  • True decoupling between domains within and between the clouds
  • Several communication paradigms, including data streaming, RPC, and batch
  • Data integration with legacy and cloud-native technologies
  • Continuous stream processing where it adds value, and batch processing in some analytics sinks

Presentation: Building a decentralized data mesh with data streaming at its heart

The following slide deck walks you through the motivation, principles, and architectures of building a real-time data mesh powered by Apache Kafka using the Kappa architecture, hybrid cloud, and stream data sharing:

The data mesh provides flexibility and freedom of technology choice for each data product

The heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. As the de facto standard for data streaming, Apache Kafka plays a crucial role in a cloud-native data mesh architecture. Nevertheless, data mesh is not bound to a specific technology. The beauty of the decentralized architecture is the freedom of technology choice for each business unit when building its data products.

Data sharing between domains within and across organizations is another aspect where data streaming helps in a data mesh. Real-time data beats slow data. That is not just true for most business problems across industries but also for replicating data between data centers, clouds, regions, or organizations. A streaming data exchange enables data sharing in real-time to build a data mesh in motion.

Did you already start building your Data Mesh? What does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Is the heart of your data mesh real-time in motion or some lakehouse at rest? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Heart of the Data Mesh Beats Real-Time with Apache Kafka appeared first on Kai Waehner.

]]>
Real Time Analytics with Apache Kafka in the Healthcare Industry https://www.kai-waehner.de/blog/2022/04/04/real-time-analytics-machine-learning-with-apache-kafka-in-the-healthcare-industry/ Mon, 04 Apr 2022 10:31:47 +0000 https://www.kai-waehner.de/?p=4414 IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. This is part four: Real-Time Analytics. Examples include Cerner, Celmatix, CDC/Centers for Disease Control and Prevention.

The post Real Time Analytics with Apache Kafka in the Healthcare Industry appeared first on Kai Waehner.

]]>
IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part four: Real-Time Analytics. Examples include Cerner, Celmatix, CDC/Centers for Disease Control and Prevention.

Real Time Analytics and Machine Learning with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains, and use cases often overlap. I categorized the real-world deployments into different technical scenarios and added a few examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Real-Time Analytics with Apache Kafka

Real-time analytics (aka stream processing, streaming analytics, or complex event processing) is a data processing technology used to collect, store, and manage continuous data streams as they are produced or received.

Stream processing has many use cases. Examples include backend processes for claims processing, billing, logistics, manufacturing, fulfillment, or fraud detection. Data processing may need to be decoupled from the frontend, where users click buttons and expect things to happen.

The de facto standard for real-time analytics is Apache Kafka. Kafka acts as a central data hub that holds shared events and keeps services in sync. Its distributed cluster technology provides availability, resiliency, and performance properties that strengthen the architecture. Developers are left to focus on writing and deploying client applications that run load-balanced and highly available.

Real Time Analytics with Data Streaming Stream Processing and Apache Kafka

Technologies for real-time analytics with the Kafka ecosystem include Kafka-native stream processing with Kafka Streams or ksqlDB, or third-party add-ons like Apache Flink, Spark Streaming, or commercial streaming analytics cloud services.

The critical difference with the Kafka ecosystem is that you leverage a single platform for data integration and processing at scale in real-time. There is no need to combine several platforms to achieve this. The result is a Kappa architecture that enables real-time but also batch workloads with a single integration architecture.
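
As a simple illustration of stream processing in this Kappa style, the following Kafka Streams sketch counts events per key in one-minute windows and publishes the results to another topic. The topic names are placeholders and not taken from any of the deployments described below.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class EventCountsPerMinute {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Count incoming events per key in tumbling one-minute windows.
        builder.stream("clinical-events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .count()
               .toStream((windowedKey, count) -> windowedKey.key()) // drop the window from the key
               .mapValues(String::valueOf)
               .to("event-counts-per-minute", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        new KafkaStreams(builder.build(), props).start();
    }
}
```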

Let’s look at a few real-world deployments in the healthcare sector.

Cerner – Sepsis Alerting in Real-Time

Cerner is a supplier of health information technology services, devices, and hardware. Roughly 30% of all US healthcare data resides in a Cerner solution.

Sepsis kills. In fact, it kills up to 52,000 people every year in the UK alone. With sepsis, the key to saving lives is early identification, especially the ability to administer antibiotics within that first critical ‘golden hour’. Quick alerts make a significant impact. Cerner’s sepsis alert, coupled with the care plans developed with the big room approach, means that patients are now 71% more likely to receive timely antibiotics.

Cerner leverages a Kafka-powered central event streaming platform for sepsis alerting in real-time to save lives. Legacy systems hit a wall that prevented faster processing and led to missed SLAs. With Kafka, data processing improved from minutes to seconds.

Real Time Sepsis Alerting at Cerner with Apache Kafka
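
Cerner has not published its implementation details, but the underlying streaming pattern can be sketched: score each incoming vital-signs event and publish only high-risk events to an alert topic. Everything in the snippet, including the topic names, JSON payload, and the trivial scoring rule, is a hypothetical illustration rather than Cerner’s actual code.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class SepsisAlertingSketch {

    // Placeholder risk scoring; a production system would apply a clinically
    // validated model to the full set of vitals and lab values.
    static double riskScore(String vitalSignsJson) {
        return vitalSignsJson.contains("\"temperature\":39") ? 0.9 : 0.1;
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Keep only high-risk events and publish them for downstream alerting.
        builder.stream("patient-vitals", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((patientId, vitals) -> riskScore(vitals) > 0.8)
               .to("sepsis-alerts", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sepsis-alerting-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        new KafkaStreams(builder.build(), props).start();
    }
}
```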

Cerner is a long-term Kafka user and early adopter in the healthcare sector. Learn more about this use case in their Kafka Summit talk from 2016.

Celmatix – Reproductive Health Care

Celmatix is a preclinical-stage biotech company that provides digital tools and genetic insights focused on fertility. They offer personalized information to disrupt how women approach their lifelong reproductive health journey.

The streaming platform provides real-time aggregation of heterogeneous data collected from Electronic Medical Records (EMRs) and genetic data collected from partners through their Personalized Reproductive Medicine (PReM) Initiative.

Proactive reproductive health decisions are enabled by real-time genomics data and by applying technologies such as big data analytics, machine learning, AI, and whole-genome DNA sequencing.

Celmatix Reproductive Health Care Electronic Medical Records EMR Processing with Apache Kafka

Data governance for security and compliance is critical in such a healthcare application. “Apache Kafka and Confluent are invaluable investments to scale the way we want to and future-proof our business,” says the lead data architect at Celmatix. Learn more in the Confluent case study.

CDC – Covid-19 Electronic Lab Reporting

The Centers for Disease Control and Prevention (CDC) built Covid-19 Electronic Lab Reporting (CELR) with the Kafka ecosystem. Use cases include case notifications, lab reporting, and healthcare interoperability.

The threat of the COVID-19 virus is tracked in real-time to provide comprehensive data for local, state, and federal responses. The application allows them to better understand locations with an increase in incidence.

With the true decoupling of the data streaming platform, the CDC can rapidly aggregate, validate, transform, and distribute laboratory testing data submitted by public health departments and other partners:

Centers for Disease Control and Prevention CDC Covid Analytics with Kafka
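
The “validate, transform, and distribute” step can be sketched with Kafka Streams: valid lab reports continue downstream for aggregation and distribution, while malformed ones are routed to a dead-letter topic for inspection and reprocessing. The topic names and the validation rule are assumptions for illustration, not the CDC’s actual pipeline.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class LabReportValidator {

    // Placeholder validation; a real pipeline would check the full reporting schema.
    static boolean isValid(String report) {
        return report != null && report.contains("\"testResult\"");
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> reports =
                builder.stream("lab-reports-raw", Consumed.with(Serdes.String(), Serdes.String()));

        // Valid reports continue downstream for aggregation and distribution.
        reports.filter((key, report) -> isValid(report))
               .to("lab-reports-validated", Produced.with(Serdes.String(), Serdes.String()));

        // Invalid reports go to a dead-letter topic for inspection and reprocessing.
        reports.filterNot((key, report) -> isValid(report))
               .to("lab-reports-dead-letter", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "lab-report-validator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        new KafkaStreams(builder.build(), props).start();
    }
}
```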

Real-Time Analytics with Kafka for Smart Healthcare Applications at any Scale

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka ecosystem for real-time analytics. Real-world deployments from Cerner, Celmatix and the Centers for Disease Control and Prevention showed how enterprises successfully deploy Kafka for different enterprise architecture use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Real Time Analytics with Apache Kafka in the Healthcare Industry appeared first on Kai Waehner.

]]>