SaaS Archives - Kai Waehner
https://www.kai-waehner.de/blog/tag/saas/

The Importance of Focus: Why Software Vendors Should Specialize Instead of Doing Everything (Example: Data Streaming)
https://www.kai-waehner.de/blog/2025/04/07/the-importance-of-focus-why-software-vendors-should-specialize-instead-of-doing-everything-example-data-streaming/
Mon, 07 Apr 2025 03:31:55 +0000

As real-time technologies reshape IT architectures, software vendors face a critical decision: specialize deeply in one domain or build a broad, general-purpose stack. This blog examines why a focused approach—particularly in the world of data streaming—delivers greater innovation, scalability, and reliability. It compares leading platforms and strategies, from specialized providers like Confluent to generalist cloud ecosystems, and highlights the operational risks of fragmented tools. With data streaming emerging as its own software category, enterprises need clarity, consistency, and deep expertise. In this post, we argue that specialization—not breadth—is what powers mission-critical, real-time applications at global scale.

As technology landscapes evolve, software vendors must decide whether to specialize in a core area or offer a broad suite of services. Some companies take a highly focused approach, investing deeply in a specific technology, while others attempt to cover multiple use cases by integrating various tools and frameworks. Both strategies have trade-offs, but history has shown that specialization leads to deeper innovation, better performance, and stronger customer trust. This blog explores why focus matters in the context of data streaming software, the challenges of trying to do everything, and how companies that prioritize one thing—data streaming—can build best-in-class solutions that work everywhere.

The Importance of Focus for Software and Cloud Vendors - Data Streaming with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including customer stories across all industries.

Specialization vs. Generalization: Why Data Streaming Requires a Focused Approach

Data streaming enables real-time processing of continuous data flows, allowing businesses to act instantly rather than relying on batch updates. This shift from traditional databases and APIs to event-driven architectures has become essential for modern IT landscapes.

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Data streaming is no longer just a technique—it is a new software category. The 2023 Forrester Wave for Streaming Data Platforms confirms its role as a core component of scalable, real-time architectures. Technologies like Apache Kafka and Apache Flink have become industry standards. They power cloud, hybrid, and on-premise environments for real-time data movement and analytics.

Businesses increasingly adopt streaming-first architectures, focusing on:

  • Hybrid and multi-cloud streaming for real-time edge-to-cloud integration
  • AI-driven analytics powered by continuous optimization and inference using machine learning models
  • Streaming data contracts to ensure governance and reliability across the entire data pipeline
  • Converging operational and analytical workloads to replace inefficient batch processing and Lambda architecture with multiple data pipelines

The Data Streaming Landscape

As data streaming becomes a core part of modern IT, businesses must choose the right approach: adopt a purpose-built data streaming platform or piece together multiple tools with limitations. Event-driven architectures demand scalability, low latency, cost efficiency, and strict SLAs to ensure real-time data processing meets business needs.

Some solutions may be “good enough” for specific use cases, but they often lack the performance, reliability, and flexibility required for large-scale, mission-critical applications.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

The Data Streaming Landscape highlights the differences—while some vendors provide basic capabilities, others offer a complete Data Streaming Platform (DSP) designed to handle complex, high-throughput workloads with enterprise-grade security, governance, and real-time analytics. Choosing the right platform is essential for staying competitive in an increasingly data-driven world.

The Challenge of Doing Everything

Many software vendors and cloud providers attempt to build a comprehensive technology stack, covering everything from data lakes and AI to real-time data streaming. While this offers customers flexibility, it often leads to overlapping services, inconsistent long-term investment, and complexity in adoption.

Here are a few examples, viewed from the perspective of data streaming solutions.

Amazon AWS: Multiple Data Streaming Services, Multiple Choices

AWS has built the most extensive cloud ecosystem, offering services for nearly every aspect of modern IT, including data lakes, AI, analytics, and real-time data streaming. While this breadth provides flexibility, it also leads to overlapping services, evolving strategies, and complex decision-making for customers, who frequently face ambiguity about which service to choose.

Amazon provides several options for real-time data streaming and event processing, each with different capabilities:

  • Amazon SQS (Simple Queue Service): One of AWS’s oldest and most widely adopted messaging services. It’s reliable for basic decoupling and asynchronous workloads, but it lacks native support for real-time stream processing, ordering, replayability, and event-time semantics.
  • Amazon Kinesis Data Streams: A managed service for real-time data ingestion and simple event processing, but lacks the full event streaming capabilities of a complete data streaming platform.
  • Amazon MSK (Managed Streaming for Apache Kafka): A partially managed Kafka service that mainly focuses on Kafka infrastructure management. It leaves customers to handle critical operational support (MSK does NOT provide SLAs or support for Kafka itself) and misses capabilities such as stream processing, schema management, and governance.
  • AWS Glue Streaming ETL: A stream processing service built for data transformations but not designed for high-throughput, real-time event streaming.
  • Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics): AWS’s attempt to offer a fully managed Apache Flink service for real-time event processing, competing directly with other managed Flink offerings.

Each of these services targets different real-time use cases, but they lack a unified, end-to-end data streaming platform. Customers must decide which combination of AWS services to use, increasing integration complexity, operational overhead, and costs.

Strategy Shift and Rebranding with Multiple Product Portfolios

AWS has introduced, rebranded, and developed its real-time streaming services over time:

  • Kinesis Data Analytics was originally AWS’s solution for stream processing but was later rebranded as Amazon Managed Service for Apache Flink, acknowledging Flink’s dominance in modern stream processing.
  • MSK Serverless was introduced to simplify Kafka adoption but also introduces various additional product limitations and cost challenges.
  • AWS Glue Streaming ETL overlaps with Flink’s capabilities, adding confusion about the best choice for real-time data transformations.

As AWS expands its cloud-native services, customers must navigate a complex mix of technologies—often requiring third-party solutions to fill gaps—while assessing whether AWS’s flexible but fragmented approach meets their real-time data streaming needs or if a specialized, fully integrated platform is a better fit.

Google Cloud: Multiple Approaches to Streaming Analytics

Google Cloud is known for its powerful analytics and AI/ML tools, but its strategy in real-time stream processing has been inconsistent.

Customers looking for stream processing in Google Cloud must now choose between several competing services:

  • Google Managed Service for Apache Kafka: a managed Kafka offering that is still at a very early stage of the maturity curve and has many limitations.
  • Google Dataflow (built on Apache Beam)
  • Google Pub/Sub (event messaging)
  • Apache Flink on Dataproc (a managed service)

While each of these services has its use cases, they introduce complexity for customers who must decide which option is best for their workloads.

BigQuery Flink was introduced to extend Google’s analytics capabilities into real-time processing but was discontinued before it ever left preview.

Microsoft Azure: Shifting Strategies in Data Streaming

Microsoft Azure has taken multiple approaches to real-time data streaming and analytics, with an evolving strategy that integrates various tools and services.

  • Azure Event Hubs has been a core event streaming service within Azure, designed for high-throughput data ingestion. It supports the Apache Kafka protocol (only up to Kafka version 3.0, so its feature set lags considerably), making it a flexible choice for (some) real-time workloads. However, it primarily focuses on event ingestion rather than event storage, data processing, and integration, which are additional capabilities of a complete data streaming platform.
  • Azure Stream Analytics was introduced as a serverless stream processing solution, allowing customers to analyze data in motion. Despite its capabilities, its adoption has remained limited, particularly as enterprises seek more scalable, open-source alternatives like Apache Flink.
  • Microsoft Fabric is now positioned as an all-in-one data platform, integrating business intelligence, data engineering, real-time streaming, and AI. While this brings together multiple analytics tools, it also shifts the focus away from dedicated, specialized solutions like Stream Analytics.

While Microsoft Fabric aims to simplify enterprise data infrastructure, its broad scope means that customers must adapt to yet another new platform rather than continuing to rely on long-standing, specialized services. The combination of Azure Event Hubs, Stream Analytics, and Fabric presents multiple options for stream processing, but also introduces complexity, limitations and increased cost for a combined solution.

Microsoft’s approach highlights the challenge of balancing broad platform integration with long-term stability in real-time streaming technologies. Organizations using Azure must evaluate whether their streaming workloads require deep, specialized solutions or can fit within a broader, integrated analytics ecosystem.

I wrote an entire blog series to demystify what Microsoft Fabric really is.

Instaclustr: Too Many Technologies, Not Enough Depth

Instaclustr has positioned itself as a managed platform provider for a wide array of open-source technologies, including Apache Cassandra, Apache Kafka, Apache Spark, Apache ZooKeeper, OpenSearch, PostgreSQL, Redis, and more. While this broad portfolio offers customers choices, it reflects a horizontal expansion strategy that lacks deep specialization in any one domain.

For organizations seeking help with real-time data streaming, Instaclustr’s Kafka offering may appear to be a viable managed service. However, unlike purpose-built data streaming platforms, Instaclustr’s Kafka solution is just one of many services, with limited investment in stream processing, schema governance, or advanced event-driven architectures.

Because Instaclustr splits its engineering and support resources across so many technologies, customers often face challenges in:

  • Getting deep technical expertise for Kafka-specific issues
  • Relying on long-term roadmaps and support for evolving Kafka features
  • Building integrated event streaming pipelines that require more than basic Kafka infrastructure

This generalist model may be appealing for companies looking for low-cost, basic managed services—but it falls short when mission-critical workloads demand real-time reliability, zero data loss, SLAs, and advanced stream processing capabilities. Without a singular focus, platforms like Instaclustr risk becoming jacks-of-all-trades but masters of none—especially in the demanding world of real-time data streaming.

Cloudera: A Broad Portfolio Without a Clear Focus

Cloudera has adopted a distinct strategy by incorporating various open-source frameworks into its platform, including:

  • Apache Kafka (event streaming)
  • Apache Flink (stream processing)
  • Apache Iceberg (data lake table format)
  • Apache Hadoop (big data storage and batch processing)
  • Apache Hive (SQL querying)
  • Apache Spark (batch and near real-time processing and analytics)
  • Apache NiFi (data flow management)
  • Apache HBase (NoSQL database)
  • Apache Impala (real-time SQL engine)
  • Apache Pulsar (event streaming, via a partnership with StreamNative)

While this provides flexibility, it also introduces significant complexity:

  • Customers must determine which tools to use for specific workloads.
  • Integration between different components is not always seamless.
  • The broad scope makes it difficult to maintain deep expertise in each area.

Rather than focusing on one core area, Cloudera’s strategy appears to be adding whatever is trending in open source, which can create challenges in long-term support and roadmap clarity.

Splunk: Repeated Attempts at Data Streaming

Splunk, known for log analytics, has tried multiple times to enter the data streaming market:

Initially, Splunk built a proprietary streaming solution that never gained widespread adoption.

Later, Splunk acquired Streamlio to leverage Apache Pulsar as its streaming backbone. This Pulsar-based strategy ultimately failed, leaving Splunk without a clear real-time streaming offering.

Splunk’s challenges highlight a key lesson: successful data streaming requires long-term investment and specialization, not just acquisitions or technology integrations.

Why a Focused Approach Works Better for Data Streaming

Some vendors take a more specialized approach, focusing on one core capability and doing it better than anyone else. For data streaming, Confluent became the leader in this space by relentlessly pursuing the vision of a complete data streaming platform.

Confluent: Focused on Data Streaming, Built for Everywhere

At Confluent, the focus is clear: real-time data streaming. Unlike many other vendors and cloud providers that offer fragmented or overlapping services, Confluent specializes in one thing and ensures it works everywhere:

  • Cloud: Deploy across AWS, Azure, and Google Cloud with deep native integrations.
  • On-Premise: Enterprise-grade deployments with full control over infrastructure.
  • Edge Computing: Real-time streaming at the edge for IoT, manufacturing, and remote environments.
  • Hybrid Cloud: Seamless data streaming across edge, on-prem, and cloud environments.
  • Multi-Region: Built-in disaster recovery and globally distributed architectures.

More Than Just “The Kafka Company”

While Confluent is often recognized as “the Kafka company,” it has grown far beyond that. Today, Confluent is a complete data streaming platform, combining Apache Kafka for event streaming, Apache Flink for stream processing, and many additional components for data integration, governance and security to power critical workloads.

However, Confluent remains laser-focused on data streaming—it does NOT compete with BI, AI model training, search platforms, or databases. Instead, it integrates and partners with best-in-class solutions in these domains to ensure businesses can seamlessly connect, process, and analyze real-time data within their broader IT ecosystem.

The Right Data Streaming Platform for Every Use Case

Confluent is not just one product—it matches the specific needs, SLAs, and cost considerations of different streaming workloads:

  • Fully Managed Cloud (SaaS)
    • Dedicated and multi-tenant Enterprise Clusters: Low latency, strict SLAs for mission-critical workloads.
    • Freight Clusters: Optimized for high-volume, relaxed latency requirements.
  • Bring Your Own Cloud (BYOC)
    • WarpStream: Bring Your Own Cloud for flexibility and cost efficiency.
  • Self-Managed
    • Confluent Platform: Deploy anywhere—customer cloud VPC, on-premise, at the edge, or across multi-region environments.

Confluent is built for organizations that require more than just “some” data streaming—it is for businesses that need a scalable, reliable, and deeply integrated event-driven architecture. Whether operating in a cloud, hybrid, or on-premise environment, Confluent ensures real-time data can be moved, processed, and analyzed seamlessly across the enterprise.

By focusing only on data streaming, Confluent ensures seamless integration with best-in-class solutions across both operational and analytical workloads. Instead of competing across multiple domains, Confluent partners with industry leaders to provide a best-of-breed architecture that avoids the trade-offs of an all-in-one compromise.

Deep Integrations Across Key Ecosystems

A purpose-built data streaming platform plays well with cloud providers and other data platforms. A few examples:

  • Cloud Providers (AWS, Azure, Google Cloud): While all major cloud providers offer some data streaming capabilities, Confluent takes a different approach by deeply integrating into their ecosystems. Confluent’s managed services can be:
    • Consumed via cloud credits through the cloud provider marketplace
    • Integrated natively into cloud provider’s security and networking services
    • Fully-managed out-of-the-box connectivity to cloud provider services like object storage, lakehouses, and databases
  • MongoDB: A leader in NoSQL and operational workloads, MongoDB integrates with Confluent via Kafka-based change data capture (CDC), enabling real-time event streaming between transactional databases and event-driven applications.
  • Databricks: A powerhouse in AI and analytics, Databricks integrates bi-directionally with Confluent via Kafka and Apache Spark, or object storage and the open table format from Iceberg / Delta Lake via Tableflow. This enables businesses to stream data for AI model training in Databricks and perform real-time model inference directly within the streaming platform.

Rather than attempting to own the entire data stack, Confluent specializes in data streaming and integrates seamlessly with the best cloud, AI, and database solutions.

Beyond the Leader: Specialized Vendors Shaping Data Streaming

Confluent is not alone in recognizing the power of focus. A handful of other vendors have also chosen to specialize in data streaming—each with their own vision, strengths, and approaches.

WarpStream, recently acquired by Confluent, is a Kafka-compatible infrastructure solution designed for Bring Your Own Cloud (BYOC) environments. It re-architects Kafka by running the protocol directly on cloud object storage like Amazon S3, removing the need for traditional brokers or persistent compute. This model dramatically reduces operational complexity and cost—especially for high-ingest, elastic workloads. While WarpStream is now part of the Confluent portfolio, it remains a distinct offering focused on lightweight, cost-efficient Kafka infrastructure.

StreamNative is the commercial steward of Apache Pulsar, aiming to provide a unified messaging and streaming platform. Built for multi-tenancy and geo-replication, it offers some architectural differentiators, particularly in use cases where separation of compute and storage is a must. However, adoption remains niche, and the surrounding ecosystem still lacks maturity and standardization.

Redpanda positions itself as a Kafka-compatible alternative with a focus on performance, especially in low-latency and resource-constrained environments. Its C++ foundation and single-binary architecture make it appealing for edge and latency-sensitive workloads. Yet, Redpanda still needs to mature in areas like stream processing, integrations, and ecosystem support to serve as a true platform.

AutoMQ re-architects Apache Kafka for the cloud by separating compute and storage using object storage like S3. It aims to simplify operations and reduce costs for high-throughput workloads. Though fully Kafka-compatible, AutoMQ concentrates on infrastructure optimization and currently lacks broader platform capabilities like governance, processing, or hybrid deployment support.

Bufstream is experimenting with lightweight approaches to real-time data movement using modern developer tooling and APIs. While promising in niche developer-first scenarios, it has yet to demonstrate scalability, production maturity, or a robust ecosystem around complex stream processing and governance.

Ververica focuses on stream processing with Apache Flink. It offers Ververica Platform to manage Flink deployments at scale, especially on Kubernetes. While it brings deep expertise in Flink operations, it does not provide a full data streaming platform and must be paired with other components, like Kafka for ingestion and delivery.

Great Ideas Are Born From Market Pressure

Each of these companies brings interesting ideas to the space. But building and scaling a complete, enterprise-grade data streaming platform is no small feat. It requires not just infrastructure, but capabilities for processing, governance, security, global scale, and integrations across complex environments.

That’s where Confluent continues to lead—by combining deep technical expertise, a relentless focus on one problem space, and the ability to deliver a full platform experience across cloud, on-prem, and hybrid deployments.

In the long run, the data streaming market will reward not just technical innovation, but consistency, trust, and end-to-end excellence. For now, the message is clear: specialization matters—but execution matters even more. Let’s see where the others go.

How Customers Benefit from Specialization

A well-defined focus provides several advantages for customers, ensuring they get the right tool for each job without the complexity of navigating overlapping services.

  • Clarity in technology selection: No need to evaluate multiple competing services; purpose-built solutions ensure the right tool for each use case.
  • Deep technical investment: Continuous innovation focused on solving specific challenges rather than spreading resources thin.
  • Predictable long-term roadmap: Stability and reliability with no sudden service retirements or shifting priorities.
  • Better performance and reliability: Architectures optimized for the right workloads through deep experience in the software category.
  • Seamless ecosystem integration: Works natively with leading cloud providers and other data platforms for a best-of-breed approach.
  • Deployment flexibility: Not bound to a single environment like one cloud provider; businesses can run workloads on-premise, in any cloud, at the edge, or across hybrid environments.

Rather than adopting a broad but shallow set of solutions, businesses can achieve stronger outcomes by choosing vendors that specialize in one core competency and deliver it everywhere.

Why Deep Expertise Matters: Supporting 24/7, Mission-Critical Data Streaming

For mission-critical workloads, where downtime, data loss, and compliance failures are not an option, deep expertise is not just an advantage; it is a necessity.

Data streaming is a high-performance, real-time infrastructure that requires continuous reliability, strict SLAs, and rapid response to critical issues. When something goes wrong at the core of an event-driven architecture—whether in Apache Kafka, Apache Flink, or the surrounding ecosystem—only specialized vendors with proven expertise can ensure immediate, effective solutions.

The Challenge with Generalist Cloud Services

Many cloud providers offer some level of data streaming, but their approach is different from a dedicated data streaming platform. Take Amazon MSK as an example:

  • Amazon MSK provides managed Kafka clusters, but does NOT offer Kafka support itself. If an issue arises deep within Kafka, customers are responsible for troubleshooting it—or must find external experts to resolve the problem.
  • The terms and conditions of Amazon MSK explicitly exclude Kafka support, meaning that, for mission-critical applications requiring uptime guarantees, compliance, and regulatory alignment, MSK is not a viable choice.
  • This lack of core Kafka support poses a serious risk for enterprises relying on event streaming for financial transactions, real-time analytics, AI inference, fraud detection, and other high-stakes applications.

For companies that cannot afford failure, a data streaming vendor with direct expertise in the underlying technology is essential.

Why Specialized Vendors Are Essential for Mission-Critical Workloads

A complete data streaming platform is much more than a hosted Kafka cluster or a managed Flink service. Specialized vendors like Confluent offer end-to-end operational expertise, covering:

  • 24/7 Critical Support: Direct access to Kafka and Flink experts, ensuring immediate troubleshooting for core-level issues.
  • Guaranteed SLAs: Strict uptime commitments, ensuring that mission-critical applications are always running.
  • No Data Loss Architecture: Built-in replication, failover, and durability to prevent business-critical data loss.
  • Security & Compliance: Encryption, access control, and governance features designed for regulated industries.
  • Professional Services & Advisory: Best practices, architecture reviews, and operational guidance tailored for real-time streaming at scale.

This level of deep, continuous investment in operational excellence separates a general-purpose cloud service from a true data streaming platform.

The Power of Specialization: Deep Expertise Beats Broad Offerings

Software vendors will continue expanding their offerings, integrating new technologies, and launching new services. However, focus remains a key differentiator in delivering best-in-class solutions, especially for operational systems with critical SLAs—where low latency, 24/7 uptime, no data loss, and real-time reliability are non-negotiable.

For companies investing in strategic data architectures, choosing a vendor with deep expertise in one core technology—rather than one that spreads across multiple domains—ensures stability, predictable performance, and long-term success.

In a rapidly evolving technology landscape, clarity, specialization, and seamless integration are the foundations of lasting innovation. Businesses that prioritize proven, mission-critical solutions will be better equipped to handle the demands of real-time, event-driven architectures at scale.

How do you see the world of software? Better to specialize or to become an all-rounder? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter. And download my free book about data streaming use cases.

Data Streaming as the Technical Foundation for a B2B Marketplace
https://www.kai-waehner.de/blog/2025/03/05/data-streaming-as-the-technical-foundation-for-a-b2b-marketplace/
Wed, 05 Mar 2025 06:26:59 +0000

A B2B data marketplace empowers businesses to exchange, monetize, and leverage real-time data through self-service platforms featuring subscription management, usage-based billing, and secure data sharing. Built on data streaming technologies like Apache Kafka and Flink, these marketplaces deliver scalable, event-driven architectures for seamless integration, real-time processing, and compliance. By exploring successful implementations like AppDirect, this post highlights how organizations can unlock new revenue streams and foster innovation with modern data marketplace solutions.

A B2B data marketplace is a groundbreaking platform enabling businesses to exchange, monetize, and use data in real time. Beyond the basic promise of data sharing, these marketplaces are evolving into self-service platforms with features such as subscription management, usage-based billing, and secure data monetization. This post explores the core technical and functional aspects of building a data marketplace for subscription commerce using data streaming technologies like Apache Kafka. Drawing inspiration from real-world implementations like AppDirect, the post examines how these capabilities translate into a robust and scalable architecture.

Data Streaming with Apache Kafka and Flink as the Backbone for a B2B Data Marketplace

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases.

Subscription Commerce with a Digital Marketplace

Subscription commerce refers to business models where customers pay recurring fees—monthly, annually, or usage-based—for access to products or services, such as SaaS, streaming platforms, or subscription boxes.

Digital marketplaces are online platforms where multiple vendors can sell their products or services to customers, often incorporating features like catalog management, payment processing, and partner integrations.

Together, subscription commerce and digital marketplaces enable businesses to monetize recurring offerings efficiently, manage customer relationships, and scale through multi-vendor ecosystems. These solutions enable organizations to sell their own or third-party recurring technology services through a white-labeled marketplace, or to streamline procurement with an internal IT marketplace for managing and acquiring services. Such platforms empower digital growth for businesses of all sizes across direct and indirect go-to-market channels.

The Competitive Landscape for Subscription Commerce

The subscription commerce and digital marketplace space includes several prominent players offering specialized solutions.

Zuora leads in enterprise-grade subscription billing and revenue management, while Chargebee and Recurly focus on flexible billing and automation for SaaS and SMBs. Paddle provides global payment and subscription management tailored to SaaS businesses. AppDirect stands out for enabling SaaS providers and enterprises to manage subscriptions, monetize offerings, and build partner ecosystems through a unified platform.

For marketplace platforms, CloudBlue (from Ingram Micro) enables as-a-service ecosystems for telcos and cloud providers, and Mirakl excels at building enterprise-level B2B and B2C marketplaces.

Solutions like ChannelAdvisor and Vendasta cater to resellers and localized businesses with marketplace and subscription tools. Each platform offers unique capabilities, making the choice dependent on specific needs like scalability, industry focus, and integration requirements.

What Makes a B2B Data Marketplace Technically Unique?

A data marketplace is more than a repository; it is a dynamic, decentralized platform that enables the continuous exchange of data streams across organizations. Its key distinguishing features include:

  1. Real-Time Data Sharing: Enables instantaneous exchange and consumption of data streams.
  2. Decentralized Design: Avoids reliance on centralized data hubs, reducing latency and risk of single points of failure.
  3. Fine-Grained Access Control: Ensures secure and compliant data sharing.
  4. Self-Service Capabilities: Simplifies the discovery and consumption of data through APIs and portals.
  5. Usage-Based Billing and Monetization: Tracks data usage in real time to enable flexible pricing models.

These characteristics require a scalable, fault-tolerant, and real-time data processing backbone. Enter data streaming with the de facto standard Apache Kafka.

Data Streaming as the Backbone of a B2B Data Marketplace

At the heart of a B2B data marketplace lies data streaming, a technology paradigm enabling continuous data flow and processing. Kafka’s publish-subscribe architecture aligns seamlessly with the marketplace model, where data producers publish streams that consumers can subscribe to in real time.
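
To make this pattern tangible, here is a minimal sketch using the standard Apache Kafka Java clients. The topic name, key, JSON payload, and broker address are hypothetical placeholders rather than part of any specific marketplace implementation: a data provider publishes an event, and a subscriber’s consumer reads it in real time.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MarketplacePubSubSketch {

    public static void main(String[] args) {
        // Data provider side: publish an event to its marketplace topic.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("provider-a.shipments", "order-42",
                    "{\"status\":\"shipped\",\"region\":\"EU\"}"));
        } // close() flushes the record

        // Data consumer side: subscribe and process the stream.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "subscriber-acme"); // one consumer group per marketplace subscriber
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("provider-a.shipments"));
            // A real subscriber would poll in a loop; one poll keeps the sketch short.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("key=%s value=%s%n", record.key(), record.value());
            }
        }
    }
}
```

A production marketplace would add schemas, security, and error handling on top of this skeleton; the sketch only illustrates the publish-subscribe contract between provider and consumer.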

Event-driven Architecture for Data Streaming with Apache Kafka and Flink

Why Apache Kafka for a Data Marketplace?

A data streaming platform uniquely combines different characteristics and capabilities:

  1. Scalability and Fault Tolerance: Kafka’s distributed architecture allows for handling large volumes of data streams, ensuring high availability even during failures.
  2. Event-Driven Design: Kafka provides a natural fit for event-driven architectures, where data exchanges trigger workflows, such as subscription activation or billing.
  3. Stream Processing with Kafka Streams or ksqlDB: Real-time transformation, filtering, and enrichment of data streams can be performed natively, ensuring the data is actionable as it flows.
  4. Integration with Ecosystem: Kafka’s connectors enable seamless integration with external systems such as billing platforms, monitoring tools, and data lakes.
  5. Security and Compliance: Built-in features like TLS encryption, SASL authentication, and fine-grained ACLs ensure the marketplace adheres to strict security standards.

I wrote a separate article that explores how an Event-driven Architecture (EDA) and Apache Kafka build the foundation of a streaming data exchange.

Architecture Overview

Modern architectures for data marketplaces are often inspired by Domain-Driven Design (DDD), microservices, and the principles of a data mesh.

  • Domain-Driven Design helps structure the platform around distinct business domains, ensuring each part of the marketplace aligns with its core functionality, such as subscription management or billing.
  • Microservices decompose the marketplace into independently deployable services, promoting scalability and modularity.
  • A Data mesh decentralizes data ownership, empowering individual teams or providers to manage and share their datasets while adhering to shared governance policies.

Decentralised Data Products with Data Streaming leveraging Apache Kafka in a Data Mesh

Together, these principles create a flexible, scalable, and business-aligned architecture. A high-level architecture for such a marketplace involves:

  1. Data Providers: Publish real-time data streams to Kafka Topics. Use Kafka Connect to ingest data from external sources (a topic provisioning sketch follows after this list).
  2. Data Marketplace Platform: A front-end portal backed by Kafka for subscription management, search, and discovery. Kafka Streams or Apache Flink for real-time processing (e.g., billing, transformation). Integration with billing systems, identity management, and analytics platforms.
  3. Data Consumers: Subscribe to Kafka Topics, consuming data tailored to their needs. Integrate the marketplace streams into their own analytics or operational workflows.
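
As a small illustration of step 1, the following sketch provisions a dedicated, replayable topic for a newly onboarded data provider using Kafka’s AdminClient. The topic naming convention, partition and replication counts, and retention settings are assumptions for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class ProviderOnboardingSketch {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // One topic per data product, e.g. "<provider>.<dataset>" (naming is an assumption).
            NewTopic topic = new NewTopic("provider-a.shipments", 6, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "604800000",   // keep 7 days of events for replay
                            "cleanup.policy", "delete"));
            admin.createTopics(List.of(topic)).all().get(); // blocks until the topic exists
        }
    }
}
```

In practice, such provisioning would be triggered by the marketplace’s onboarding workflow rather than run by hand.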

Data Sharing Beyond Kafka with Stream Sharing and Self-Service Data Portal

A data streaming platform enables simple and secure data sharing within or across organizations, with built-in chargeback capabilities to support cost APIs and new business models. The following is an implementation leveraging Confluent’s Stream Sharing functionality in Confluent Cloud:

Confluent Stream Sharing for Data Sharing Beyond Apache Kafka
Source: Confluent

Data Marketplace Features and Their Technical Implementation

A robust B2B data marketplace should offer the following vendor-agnostic features:

Self-Service Data Discovery

Real-Time Subscription Management

  • Functionality: Enables users to subscribe to data streams with customizable preferences, such as data filters or frequency of updates.
  • Technical Implementation: Use Kafka’s consumer groups to manage subscriptions. Implement filtering logic with Kafka Streams or ksqlDB to tailor streams to user preferences. A minimal Kafka Streams sketch follows below.
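
As a minimal sketch of that filtering approach with Kafka Streams (the topic names and the JSON region filter are hypothetical), the topology below reads a provider topic and writes a tailored stream for one subscription:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SubscriptionFilterSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "subscription-filter-acme");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> shipments = builder.stream("provider-a.shipments");

        // Hypothetical preference: this subscriber only wants events for the EU region.
        shipments
                .filter((key, value) -> value != null && value.contains("\"region\":\"EU\""))
                .to("subscriber-acme.shipments-eu"); // tailored output topic for this subscription

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The same pattern can be expressed in ksqlDB as a persistent query (CREATE STREAM ... AS SELECT ... WHERE ...), if SQL is preferred over Java.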

Usage-Based Billing

  • Functionality: Tracks the volume or type of data consumed by each user and generates invoices dynamically.
  • Technical Implementation: Use Kafka’s log retention and monitoring tools to track data consumption. Integrate with a billing engine via Kafka Connect or RESTful APIs for real-time invoice generation. A metering sketch follows below.
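
Usage metering can be implemented in several ways; one possible approach inside the streaming layer itself is sketched below, assuming the record key carries the subscriber id and that hourly record counts per subscriber are a sufficient billing metric (both assumptions for illustration). The aggregated counts land on a topic that a billing engine can consume:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class UsageMeteringSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "usage-metering");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical delivery topic; the record key is assumed to carry the subscriber id.
        KStream<String, String> delivered = builder.stream("subscriber-acme.shipments-eu");

        delivered
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
                .count()                        // records delivered per subscriber per hour
                .toStream()
                .map((windowedKey, count) -> KeyValue.pair(
                        windowedKey.key() + "@" + windowedKey.window().startTime(),
                        String.valueOf(count)))
                .to("billing.usage-hourly");    // picked up by the billing engine downstream

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```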

Monetization and Revenue Sharing

  • Functionality: Facilitates revenue sharing between data providers and marketplace operators.
  • Technical Implementation: Build a revenue-sharing logic layer using Kafka Streams or Apache Flink, processing data usage metrics. Store provider-specific pricing models in a database connected via Kafka Connect.

Compliance and Data Governance

  • Functionality: Ensures data sharing complies with regulations (e.g., GDPR, HIPAA) and provides an audit trail.
  • Technical Implementation: Leverage Kafka’s immutable event log as an auditable record of all data exchanges. Implement data contracts for Kafka Topics with policies, role-based access control (RBAC), and encryption for secure sharing. An ACL sketch follows below.
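
For the access-control piece, the sketch below uses Kafka’s AdminClient ACL API to grant a single subscriber read-only access to one data product topic. The principal name and topic are hypothetical, and a real setup would also grant a consumer-group ACL and typically be driven by the marketplace’s governance layer rather than ad-hoc code:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class SubscriberAclSketch {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: cluster with an authorizer enabled

        try (AdminClient admin = AdminClient.create(props)) {
            // Allow the subscriber principal to READ exactly one data product topic, nothing else.
            AclBinding readTopic = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "provider-a.shipments", PatternType.LITERAL),
                    new AccessControlEntry("User:subscriber-acme", "*",
                            AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(List.of(readTopic)).all().get();
        }
    }
}
```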

Dynamic Pricing Models

Marketplace Analytics

  • Functionality: Offers insights into usage patterns, revenue streams, and operational metrics.
  • Technical Implementation: Aggregate Kafka stream data into analytics platforms such as Snowflake, Databricks, Elasticsearch, or Microsoft Fabric.

Real-World Success Story: AppDirect’s Subscription Commerce Platform Powered by a Data Streaming Platform

AppDirect is a leading subscription commerce platform that helps businesses monetize and manage software, services, and data through a unified digital marketplace. It provides tools for subscription billing, usage tracking, partner management, and revenue sharing, enabling seamless B2B transactions.

AppDirect B2B Data Marketplace for Subscription Commerce
Source: AppDirect

AppDirect serves customers across industries such as telecommunications (e.g., Telstra, Deutsche Telekom), technology (e.g., Google, Microsoft), and cloud services, powering ecosystems for software distribution and partner-driven monetization.

The Challenge

AppDirect enables SaaS providers to monetize their offerings, but faced significant challenges in scaling its platform to handle the growing complexity of real-time subscription billing and data flow management.

As the number of vendors and consumers on the platform increased, ensuring accurate, real-time tracking of usage and billing became increasingly difficult. Additionally, the legacy systems struggled to support seamless integration, dynamic pricing models, and real-time updates required for a competitive marketplace experience.

The Solution

AppDirect implemented a data streaming backbone with Apache Kafka leveraging Confluent’s data streaming platform. This enabled:

  • Real-time billing for subscription services.
  • Accurate usage tracking and monetization.
  • Improved scalability with a distributed, event-driven architecture.

The Outcome

  • 90% reduction in time-to-market for new features.
  • Enhanced customer experience with real-time updates.
  • Seamless scaling to handle increasing vendor participation and data loads.

Advantages Over Competitors in the Subscription Commerce and Data Marketplace Business

Powered by an event-driven architecture and a data streaming platform, AppDirect distinguishes itself from competitors in the subscription commerce and data marketplace business:

  • A unified approach to subscription management, billing, and partner ecosystem enablement.
  • Strong focus on the telecommunications and technology sectors.
  • Deep integrations for vendor and reseller ecosystems.

Data Streaming Revolutionizes B2B Data Sharing

The technical backbone of a B2B data marketplace relies on data streaming to deliver real-time data sharing, scalable subscription management, and secure monetization. Platforms like Apache Kafka and Confluent enable these features through their distributed, event-driven architecture, ensuring resilience, compliance, and operational efficiency.

By implementing these principles, organizations can build a modern, self-service data marketplace that fosters innovation and collaboration. The success of AppDirect highlights the potential of this approach, offering a blueprint for businesses looking to capitalize on the power of data streaming.

Whether you’re a data provider seeking additional revenue streams or a business aiming to harness external insights, a well-designed data marketplace is your gateway to unlocking value in the digital economy.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And make sure to download my free book about data streaming use cases.

Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Services for Data Streaming with Kafka and Flink
https://www.kai-waehner.de/blog/2025/01/18/fully-managed-saas-vs-partially-managed-paas-cloud-services-for-data-streaming-with-kafka-and-flink/
Sat, 18 Jan 2025 11:33:44 +0000

The cloud revolution has reshaped how businesses deploy and manage data streaming with solutions like Apache Kafka and Flink. Distinctions between SaaS and PaaS models significantly impact scalability, cost, and operational complexity. Bring Your Own Cloud (BYOC) expands the options, giving businesses greater flexibility in cloud deployment. Misconceptions around terms like “serverless” highlight the need for deeper analysis to avoid marketing pitfalls. This blog explores deployment options, enabling informed decisions tailored to your data streaming needs.

The cloud revolution has transformed how businesses deploy, scale, and manage data streaming solutions. While Software-as-a-Service (SaaS) and Platform-as-a-Service (PaaS) cloud models are often used interchangeably in marketing, their distinctions have significant implications for operational efficiency, cost, and scalability. In the context of data streaming around Apache Kafka and Flink, understanding these differences and recognizing common misconceptions—such as the overuse of the term “serverless”—can help you make an informed decision. Additionally, the emergence of Bring Your Own Cloud (BYOC) offers yet another option, providing organizations with enhanced control and flexibility in their cloud environments.

SaaS vs PaaS Cloud Service for Data Streaming with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter and following me on LinkedIn or X (formerly Twitter) to stay in touch.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

If you’re still grappling with the fundamentals of stream processing, this article is a must-read: “Stateless vs. Stateful Stream Processing with Kafka Streams and Apache Flink“.

What is SaaS in Data Streaming?

SaaS data streaming solutions are fully managed services where the provider handles all aspects of deployment, maintenance, scaling, and updates. SaaS offerings are designed for ease of use, providing a serverless experience where developers focus solely on building applications rather than managing infrastructure.

Characteristics of SaaS in Data Streaming

  1. Serverless Architecture: Resources scale automatically based on demand. True SaaS solutions eliminate the need to provision or manage servers.
  2. Low Operational Overhead: The provider manages hardware, software, and runtime configurations, including upgrades and security patches.
  3. Pay-As-You-Go Pricing: Consumption-based pricing aligns costs directly with usage, reducing waste during low-demand periods.
  4. Rapid Deployment: SaaS enables users to start processing streams within minutes, accelerating time-to-value.

Examples of SaaS in Data Streaming:

  • Confluent Cloud: A fully managed Kafka platform offering serverless scaling, multi-tenancy, and a broad feature set for both stateless and stateful processing.
  • Amazon Kinesis Data Analytics: A managed service for real-time analytics with automatic scaling.

What is PaaS in Data Streaming?

PaaS offerings sit between fully managed SaaS and infrastructure-as-a-service (IaaS). These solutions provide a platform to deploy and manage applications but still require significant user involvement for infrastructure management.

Characteristics of PaaS in Data Streaming

  1. Partial Management: The provider offers tools and frameworks, but users must manage servers, clusters, and scaling policies.
  2. Manual Configuration: Deployment involves provisioning VMs or containers, tuning parameters, and monitoring resource usage.
  3. Complex Scaling: Scaling is not always automatic; users may need to adjust resource allocation based on workload changes.
  4. Higher Overhead: PaaS requires more expertise and operational involvement, making it less accessible to teams without dedicated DevOps resources.

PaaS offerings in data streaming, while simplifying some infrastructure tasks, still require significant user involvement compared to fully serverless SaaS solutions. Below are three common examples, along with their benefits and pain points compared to serverless SaaS:

  • Apache Flink (Self-Managed on Kubernetes Cloud Service like EKS, AKS or GKE)
    • Benefits: Full control over deployment and infrastructure customization.
    • Pain Points: High operational overhead for managing Kubernetes clusters, manual scaling, and complex resource tuning.
  • Amazon Managed Service for Apache Flink (Amazon MSF)
    • Benefits: Simplifies infrastructure management and integrates with some other AWS services.
    • Pain Points: Users still handle job configuration, scaling optimization, and monitoring, making it less user-friendly than serverless SaaS solutions.
  • Amazon MSK (Managed Streaming for Apache Kafka)
    • Benefits: Eases Kafka cluster maintenance and integrates with the AWS ecosystem.
    • Pain Points: Requires users to design and manage producers/consumers, manually configure scaling, and handle monitoring responsibilities. MSK also excludes support for Kafka if you have any operational issues with the Kafka piece of the infrastructure.

SaaS vs. PaaS: Key Differences

SaaS and PaaS differ in the level of management and user responsibility, with SaaS offering fully managed services for simplicity and PaaS requiring more user involvement for customization and control.

  • Infrastructure: SaaS is fully managed by the provider; with PaaS, infrastructure is only partially managed and the user controls clusters.
  • Scaling: SaaS scales automatically and serverless; PaaS requires manual or semi-automatic scaling.
  • Deployment Speed: SaaS is immediate and ready to use; PaaS is slower and requires configuration.
  • Operational Complexity: SaaS is minimal; PaaS is moderate to high.
  • Cost Model: SaaS is consumption-based with no idle costs; PaaS may incur idle resource costs.

The big benefit of PaaS over SaaS is greater flexibility and control, allowing users to customize the platform, integrate with specific infrastructure, and optimize configurations to meet unique business or technical requirements. This level of control is often essential for organizations with strict compliance, security, or data sovereignty requirements.

SaaS is NOT Always Better than PaaS!

Be careful: The limitations and pain points of PaaS do NOT mean that SaaS is always better.

A concrete example: Amazon MSK Serverless simplifies Apache Kafka operations with automated scaling and infrastructure management but comes with significant limitations, including the lack of fully-managed connectors, advanced data governance tools, and native multi-language client support.

Amazon MSK also excludes Kafka engine support, leading to potential operational risks and cost unpredictability, especially when integrating with additional AWS services for a complete data streaming pipeline. I explored these challenges in more detail in my article “When NOT to choose Amazon MSK Serverless for Apache Kafka?“.

Bring Your Own Cloud (BYOC) as Alternative to PaaS

BYOC (Bring Your Own Cloud) offers a middle ground between fully managed SaaS and self-managed PaaS solutions, allowing organizations to host applications in their own VPCs.

BYOC provides enhanced control, security, and compliance while reducing operational complexity. This makes BYOC a strong alternative to PaaS for companies with strict regulatory or cost requirements.

As an example, here are the options of Confluent for deploying the data streaming platform: Serverless Confluent Cloud, self-managed Confluent Platform (some consider this a PaaS if you leverage Confluent’s Kubernetes Operator and other automation / DevOps tooling) and WarpStream as a BYOC offering:

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

While BYOC complements SaaS and PaaS, it can be a better choice when fully managed solutions don’t align with specific business needs. I wrote a detailed article about this topic: “Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud)“.

“Serverless” Claims: Don’t Trust the Marketing

Many cloud data streaming solutions are marketed as “serverless,” but this term is often misused. A truly serverless solution should:

  1. Abstract Infrastructure: Users should never need to worry about provisioning, upgrading, or cluster sizing.
  2. Scale Transparently: Resources should scale up or down automatically based on workload.
  3. Eliminate Idle Costs: There should be no cost for unused capacity.

However, many products marketed as serverless still require some degree of infrastructure management or provisioning, making them closer to PaaS. For example:

  • A so-called “serverless” PaaS solution may still require setting initial cluster sizes or monitoring node health.
  • Some products charge for pre-provisioned capacity, regardless of actual usage.

Do Your Own Research

When evaluating data streaming solutions, dive into the technical documentation and ask pointed questions:

  • Does the solution truly abstract infrastructure management?
  • Are scaling policies automatic, or do they require manual configuration?
  • Is there a minimum cost even during idle periods?

By scrutinizing these factors, you can avoid falling for misleading “serverless” claims and choose a solution that genuinely meets your needs.

Choosing the Right Model for Your Data Streaming Business: SaaS, PaaS, or BYOC

When adopting a data streaming platform, selecting the right model is crucial for aligning technology with your business strategy:

  • Use SaaS (Software as a Service) if you prioritize ease of use, rapid deployment, and operational simplicity. SaaS is ideal for teams looking to focus entirely on application development without worrying about infrastructure.
  • Use PaaS (Platform as a Service) if you require deep customization, control over resource allocation, or have unique workloads that SaaS offerings cannot address.
  • Use BYOC (Bring Your Own Cloud) if your organization demands full control over its data but sees benefits in fully managed services. BYOC enables you to run the data plane within your cloud VPC, ensuring compliance, security, and architectural flexibility while leveraging SaaS functionality for the control plane.

In the rapidly evolving world of data streaming around Apache Kafka and Flink, SaaS data streaming platforms like Confluent Cloud provide the best of both worlds: the advanced features of tools like Apache Kafka and Flink, combined with the simplicity of a fully managed serverless experience. Whether you’re handling stateless stream processing or complex stateful analytics, SaaS ensures you’re scaling efficiently without operational headaches.

What deployment option do you use today for Kafka and Flink? Any changes planned in the future? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Data Streaming in Healthcare and Pharma: Use Cases and Insights from Cardinal Health
https://www.kai-waehner.de/blog/2024/11/28/data-streaming-in-healthcare-and-pharma-use-cases-cardinal-health/
Thu, 28 Nov 2024 04:12:15 +0000

This blog explores Cardinal Health’s journey, showing how its event-driven architecture and data streaming power use cases like supply chain optimization and medical device and equipment management. By integrating Apache Kafka with platforms like Apigee, Dell Boomi and SAP, Cardinal Health sets a benchmark for IT modernization and innovation in the healthcare and pharma sectors.

What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks
https://www.kai-waehner.de/blog/2024/10/04/what-is-microsoft-fabric-for-azure-cloud-beyond-the-buzz-and-how-it-competes-with-snowflake-and-databricks/
Fri, 04 Oct 2024 05:16:31 +0000

If you ask your favorite large language model, Microsoft Fabric appears to be the ultimate solution for any data challenge you can imagine. That’s also the impression many people get from Microsoft’s sales teams. But is it really the silver bullet it’s made out to be? This article takes a closer look, exploring the glossy marketing and sales definition of the platform and then deconstructing it from a more practical perspective. Learn what Microsoft Fabric is truly built for, and how it fits into the wider data landscape, especially in comparison to other major players in the data analytics market like Databricks and Snowflake.

If you ask your favorite large language model, Microsoft Fabric appears to be the ultimate solution for any data challenge you can imagine. That’s also the impression many people get from Microsoft’s sales teams. But is it really the silver bullet it’s made out to be? This article takes a closer look. The first part explores the glossy marketing and sales definition of the platform. The second part looks at the layers and deconstructs it from a more practical perspective. By doing so, the third part uncovers what Microsoft Fabric is truly built for, and how it fits into the wider data landscape, especially in comparison to other major players in the data analytics market like Databricks and Snowflake.

Microsoft Fabric and OneLake Azure Lakehouse vs Databricks and Snowflake Cloud

This is part one of a blog series about Microsoft Fabric and its relation to other data platforms on the Azure cloud:

  1. What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Compares (or Competes) with Snowflake and Databricks
  2. How Microsoft Fabric Complements Data Streaming (Apache Kafka, Flink, et al.)
  3. When to Choose Apache Kafka vs. Azure Event Hubs vs. Confluent Cloud for a Microsoft Fabric Lakehouse

Subscribe to my newsletter to get an email about a new blog post every few weeks.

What is Microsoft Fabric?

If you listen to Microsoft’s sales and marketing, then Microsoft Fabric is a silver bullet for every use case. Let’s take a two-step approach: first look at the sales and marketing definition, then deconstruct it a bit from a more realistic point of view…

GenAI Definition (= Sales and Marketing)

If you ask your favourite large language model or search engine “What is Microsoft Fabric”, it tells you something like the following (based on sales and marketing content):

Microsoft Fabric is an end-to-end analytics platform designed to integrate various data services and enable businesses to manage, analyze, and act on their data seamlessly. It was launched as part of Microsoft’s data ecosystem and builds upon key features from platforms like Power BI, Azure Synapse Analytics, and Azure Data Factory.

Microsoft Fabric Marketing Website
Source: Microsoft

Here are some key aspects of Microsoft Fabric:

  • Unified Platform: It combines data engineering, data science, data warehousing, and real-time analytics into a single platform. This helps businesses eliminate the need to use multiple services for data management and analysis.
  • Lakehouse Architecture: Fabric is designed around the lakehouse concept, which merges the best of data lakes and data warehouses. It allows for both structured and unstructured data to be stored and processed together.
  • Tightly Integrated with Microsoft 365 and Azure: Microsoft Fabric connects seamlessly with other Microsoft services like Microsoft 365, Power BI, and Azure Machine Learning, enabling better collaboration, reporting, and AI-driven insights.
  • Low-code/No-code Experience: The platform provides intuitive tools for data analysts, developers, and business users, allowing non-technical users to work with data through drag-and-drop interfaces, while also enabling more complex scenarios for advanced users.
  • AI and Machine Learning Integration: Microsoft Fabric incorporates AI tools, making it easier for businesses to build predictive models and automate data-driven decisions.
  • End-to-End Security and Governance: The platform supports robust security measures and compliance requirements, offering features like data encryption, role-based access control, and regulatory compliance support.
  • Real-time Data Processing: With support for real-time analytics, Fabric enables organizations to derive insights from live data streams, improving decision-making speed and accuracy.

Microsoft Fabric is designed to streamline how businesses use data, combining the power of analytics with cloud-scale capabilities.

Wow. Just wow. Microsoft Fabric seems to be everything you ever need for your data challenges.

Microsoft Developer has an excellent 45 minute presentation about OneLake and Microsoft Fabric with a few more technical details. This video is also the source of the screenshots below.

Well, let’s dig deeper. What is Microsoft Fabric really? Let’s deconstruct it a bit…

Microsoft Fabric is a Data Analytics Platform ( = NOT for Operational / Transactional Workloads)

Microsoft Fabric is part of Microsoft’s data analytics portfolio. That’s already the first alarm signal when you consider building operational workloads. This is not a criticism, but important to understand!

Azure Data Analytics SaaS Platform

Microsoft Fabric is NOT a platform for transactional workloads like payments, fraud detection, order management or ERP integration. You should not build an operational application, such as an Azure Serverless Function or a self-managed Spring Boot container, on top of Fabric.

Furthermore, within the data analytics layer, the foundation of Microsoft Fabric is (only) an optimized storage layer. And this storage layer called OneLake is a SaaS offering, i.e., the storage is part of the Microsoft tenant. In contrast to many other data lakes and lakehouses, such as Databricks, you do not control or own the storage.

While the conversation is usually around cloud analytics, Microsoft Fabric is a unified analytics platform that integrates with Azure Cloud but is sold independently. This allows organizations to deploy it in various environments, including edge and hybrid setups. For instance, Microsoft sells Fabric for hybrid IoT projects where data needs to be processed both locally and in the cloud.

OneLake – Cloud-based Storage Layer on Top of Azure Data Lake Storage (ADLS)

Microsoft OneLake is a unified, cloud-based data lake that acts as the central storage layer within Microsoft Fabric:

Microsoft OneLake is built on top of Azure Data Lake Storage (ADLS), using its scalable and secure data storage capabilities for long-term data retention. OneLake inherits ADLS’s features like hierarchical namespaces and advanced security, while adding a unified data lake experience across multiple clouds and deep integration with Microsoft’s analytics and data tools through Microsoft Fabric.
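Because OneLake inherits the ADLS Gen2 interface, existing Azure storage tooling can talk to it directly. The following is a hedged sketch only: the OneLake endpoint and the workspace/lakehouse path layout below are assumptions based on that ADLS compatibility, and the workspace, lakehouse and file names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Assumption: OneLake exposes an ADLS Gen2-compatible endpoint and maps a
# Fabric workspace to a file system. All names are placeholders.
service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)

file_system = service.get_file_system_client("MyWorkspace")
file_client = file_system.get_file_client("MyLakehouse.Lakehouse/Files/raw/orders.csv")

# Download the raw bytes of a file stored in OneLake.
data = file_client.download_file().readall()
print(f"Downloaded {len(data)} bytes from OneLake")
```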

OneLake for Azure Lakehouse, Databricks with Delta Lake, Snowflake with Apache Iceberg Lakehouse
Source: Microsoft

The message is obvious: Store all data in OneLake and connect your favourite compute engines, such as Microsoft Fabric, Azure Databricks and Snowflake. Open Table Formats like Delta Lake and Apache Iceberg allow simple integration without the need to copy data again.

Microsoft Fabric Connects to Many Existing Azure Services

On top of the storage layer OneLake, Microsoft Fabric connects to plenty of different existing Microsoft Azure services, including Power BI, Data Explorer, various Synapse services, and so on. This explains why Microsoft Fabric can magically provide every capability you are looking for a few months after the initial announcement.

Microsoft Fabric Architecture - OneLake as Storage Layer and Azure SaaS Analytics Cloud Services
Source: Microsoft

Here are a few integrations of Azure services into the unified storage of Microsoft Fabric:

  • Power BI: A critical component of Microsoft Fabric, enabling data visualization and business intelligence. It allows users to create interactive dashboards and reports directly from data stored in the lakehouse, providing real-time insights with minimal data movement.
  • Azure Data Explorer: Used for analyzing large volumes of streaming and historical data. Microsoft Fabric connects to Data Explorer, allowing users to perform fast, complex, real-time queries on structured and semi-structured data.
  • Azure Synapse Analytics: Fabric integrates Synapse’s data engineering capabilities, allowing users to prepare, transform, and orchestrate data pipelines. It provides a unified workspace to manage end-to-end data engineering workflows, reducing the need for complex data movement.
  • Synapse Data Warehousing: Fabric connects with Synapse’s data warehousing services, making it easy to run massively parallel processing (MPP) queries for large-scale analytics on structured data.
  • Synapse Spark Pools: Fabric integrates with Apache Spark in Synapse, supporting big data processing, AI, and machine learning workloads. Users can leverage Spark’s distributed computing power within Fabric for data transformation, advanced analytics, and machine learning.
  • Azure Machine Learning (AML): Enables data scientists to build, train, and deploy machine learning models on data stored within the Fabric lakehouse. Users can perform machine learning experiments, automate ML model training, and deploy models with a unified data platform.
  • Azure Data Factory: Used for data ingestion, ETL (extract, transform, load), and data orchestration. Fabric connects with Azure Data Factory, making it easy to create data pipelines that move and transform data from a wide variety of sources, including on-premises databases, cloud storage, and third-party systems.
  • Azure Purview: Provides a unified data catalog, allowing users to discover, classify, and govern data assets across the Fabric ecosystem. It also provides compliance and auditing capabilities.
  • Azure Event Hubs and Stream Analytics: Real-time data processing and analytics. Event Hubs enables streaming data ingestion from sources like IoT devices, applications, and logs, while Stream Analytics allows for real-time data querying and analysis.

Expect more Azure services to be integrated with Microsoft Fabric in the coming months to provide a “complete lakehouse experience”. Also expect more fancy marketing brands, such as the new “Real Time Intelligence Hub” that is built by connecting / re-using existing Microsoft Azure services.

So, what is the main idea behind building this lakehouse product and brand within Microsoft’s huge cloud portfolio?

Microsoft Fabric is a Lakehouse Competing with Snowflake and Databricks

A lakehouse is a data architecture that combines the features of data lakes and data warehouses, allowing for both structured and unstructured data to be stored and processed together. It provides the scalability and flexibility of a data lake with the data management, governance, and performance features of a data warehouse. This unified approach enables real-time analytics and machine learning on diverse types of data, reducing the need for separate infrastructures.

Most analytical data vendors transition to a full-blown lakehouse. While Databricks moved from the data lake foundation powered by Apache Spark into the lakehouse, Snowflake comes from the data warehouse approach but has incorporated a lot of lakehouse features over time (even though Snowflake calls it a more general “data cloud”).

Microsoft Fabric competes with platforms like Databricks and Snowflake in the realm of data analytics, data engineering, and data warehousing by providing an integrated, cloud-native solution for data management and analytics.

Microsoft Fabric positions itself as a more holistic and integrated platform, offering a unified solution for businesses that need to handle everything from data ingestion to real-time analytics and AI. Its Microsoft ecosystem integration is a key competitive advantage.

There are also trade-offs. For instance:

  • Microsoft Fabric is only available on Azure cloud
  • Not a mature product yet
  • A much more competitive stance towards strategic partners like Databricks

The support of open table formats like Delta Lake and Apache Iceberg is great. But this is coming in all lakehouses because of market pressure. Not because the data cloud vendors like Databricks, Snowflake and now Microsoft with Fabric have a new business model. All of these vendors still want to collect all the data, store it forever, and put (their own!) compute services on top.

Microsoft Fabric is Azure’s Future Lakehouse

Microsoft Fabric’s integration with many Azure services allows it to offer a broad range of capabilities – from data ingestion, storage, and transformation to real-time analytics, machine learning, and governance. This interconnected ecosystem explains how Fabric can quickly meet diverse enterprise needs by leveraging Microsoft’s existing suite of powerful tools, providing a comprehensive data platform with minimal friction and seamless workflows.

In the end, Microsoft Fabric is a new lakehouse built on top of the optimized cloud storage OneLake. It directly competes with other lakehouses and data clouds such as Databricks and Snowflake to become the leading unified solution for all things analytics. The future will show where this competition goes. Snowflake and Databricks already have very strong products and customer bases. They will not cede ground to Microsoft Fabric voluntarily.

Microsoft Fabric includes integrations with Azure Event Hubs (based on the Kafka protocol) and is building a brand around real-time intelligence. In the next article of this blog series, I will explore how this new lakehouse on Azure cloud competes or overlaps with data streaming technologies such as Apache Kafka, Flink, et al. Primer: Data Streaming and Microsoft Fabric are mostly complementary and have very different sweet spots.
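Because Azure Event Hubs speaks the Kafka protocol, a standard Kafka client can produce events into it without any Event Hubs-specific SDK. Below is a minimal sketch with the confluent-kafka Python client; the namespace, connection string and topic name are placeholders you would replace with your own.

```python
from confluent_kafka import Producer

# Placeholders: replace namespace, connection string and topic with your own.
conf = {
    "bootstrap.servers": "my-namespace.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "$ConnectionString",
    "sasl.password": "Endpoint=sb://my-namespace.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
}

producer = Producer(conf)
producer.produce("sensor-events", key="machine-42", value='{"temperature": 71.3}')
producer.flush()  # block until the event is acknowledged
```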

How do you see the future of Microsoft Fabric? Do you already use it? What is the plan in the future, also keeping in mind that you likely already have other lakehouses in your enterprise architecture? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post What is Microsoft Fabric for Azure Cloud (Beyond the Buzz) and how it Competes with Snowflake and Databricks appeared first on Kai Waehner.

]]>
Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud) https://www.kai-waehner.de/blog/2024/09/12/deployment-options-for-apache-kafka-self-managed-fully-managed-serverless-and-byoc-bring-your-own-cloud/ Thu, 12 Sep 2024 13:43:31 +0000 https://www.kai-waehner.de/?p=6808 BYOC (Bring Your Own Cloud) is an emerging deployment model for organizations looking to maintain greater control over their cloud environments. Unlike traditional SaaS models, BYOC allows businesses to host applications within their own VPCs to provide enhanced data privacy, security, and compliance. This approach leverages existing cloud infrastructure. It offers more flexibility for custom configurations, particularly for companies with stringent security needs. In the data streaming sector around Apache Kafka, BYOC is changing how platforms are deployed. Organizations get more control and adaptability for various use cases. But it is clearly NOT the right choice for everyone!

The post Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud) appeared first on Kai Waehner.

]]>
BYOC (Bring Your Own Cloud) is an emerging deployment model for organizations looking to maintain greater control over their cloud environments. Unlike traditional SaaS models, BYOC allows businesses to host applications within their own VPCs to provide enhanced data privacy, security, and compliance. This approach leverages existing cloud infrastructure. It offers more flexibility for custom configurations, particularly for companies with stringent security needs. In the data streaming sector around Apache Kafka, BYOC is changing how platforms are deployed. Organizations get more control and adaptability for various use cases. But it is clearly NOT the right choice for everyone!

Apache Kafka Deployment Options - Serverless vs Self-Managed vs BYOC Bring Your Own Cloud

BYOC (Bring Your Own Cloud) – A New Deployment Model for Cloud Infrastructure

BYOC (Bring Your Own Cloud) is a deployment model where organizations choose their preferred cloud infrastructure to host applications or services, rather than using a serverless / fully managed cloud solution selected by a software vendor; typically known as Software as a Service (SaaS). This model gives businesses flexibility to leverage their existing cloud services (like AWS, Google Cloud, Microsoft Azure, or Alibaba) while integrating third-party applications that are compatible with multiple cloud environments.

BYOC helps companies maintain control over their cloud infrastructure, optimize costs, ensure compliance with security standards. BYOC is typically implemented within an organization’s own cloud VPC. Unlike SaaS models, BYOC offers enhanced privacy and compliance by maintaining control over network architecture and data management.

However, BYOC also has some serious drawbacks! The main challenge is scaling a fleet of co-managed clusters running in customer environments with all the reliability expectations of a cloud service. Confluent has shied away from offering a BYOC deployment model for Apache Kafka based on Confluent Platform because doing BYOC at scale requires a different architecture. WarpStream has built this architecture, with a BYOC-native platform that was designed from the ground up to avoid the pitfalls of traditional BYOC. 

The Data Streaming Landscape

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category. The data streaming landscape shows that most vendors use Kafka or implement its protocol because Apache Kafka has become the de facto standard for data streaming.

New software companies have emerged in this category in the last few years. And several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Most software vendors use Kafka for their data streaming platforms. However, there is more than Kafka. Some vendors only use the Kafka protocol (Azure Event Hubs) or utterly different APIs (like Amazon Kinesis).

The following Data Streaming Landscape 2024 summarizes the current status of relevant products and cloud services.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

The Data Streaming Landscape evolves. Last year, I added WarpStream as a new entrant into the market. WarpStream uses the Kafka protocol and provides a BYOC offering for Kafka in the cloud. In my next update of the data streaming landscape, I need to do yet another update: WarpStream is now part of Confluent. There are also many other new entrants. Stay tuned for a new “Data Streaming Landscape 2025” in a few weeks (subscribe to my newsletter to stay up-to-date with all the things data streaming).

Confluent Acquisition of WarpStream

Confluent had two product offerings:

  • Confluent Platform: A self-managed data streaming platform powered by Kafka, Flink, and much more that you can deploy everywhere (on-premise data center, public cloud VPC, edge like factory or retail store, and even stretched across multiple regions or clouds).
  • Confluent Cloud: A fully managed data streaming platform powered by Kafka, Flink, and much more that you can leverage as a serverless offering in all major public cloud providers (Amazon AWS, Microsoft Azure, Google Cloud Platform).

Why did Confluent acquire WarpStream? Because many customers requested a third deployment option: BYOC for Apache Kafka.

As Jay Kreps described in the acquisition announcement: “Why add another flavor of streaming? After all, we’ve long offered two major form factors–Confluent Cloud, a fully managed serverless offering, and Confluent Platform, a self-managed software offering–why complicate things? Well, our goal is to make data streaming the central nervous system of every company, and to do that we need to make it something that is a great fit for a vast array of use cases and companies.”

Read more details about the acquisition of WarpStream by Confluent in Jay’s blog post: Confluent + WarpStream = Large-Scale Streaming in your Cloud. In summary, WarpStream is not dead. The WarpStream team clarified the status quo and roadmap of this BYOC product for Kafka in its blog post: “WarpStream is Dead, Long Live WarpStream“.

Let’s dig deeper into the three deployment options and their trade-offs.

Deployment Options for Apache Kafka

Apache Kafka can be deployed in three primary ways: self-managed, fully managed/serverless, and BYOC (Bring Your Own Cloud).

  • In self-managed deployments, organizations handle the entire infrastructure, including setup, maintenance, and scaling. This provides full control but requires significant operational effort.
  • Fully managed or serverless Kafka is offered by providers like Confluent Cloud or Azure Event Hubs. The service is hosted and managed by a third-party, reducing operational overhead but with limited control over the underlying infrastructure.
  • BYOC deployments allow organizations to host Kafka within their own cloud VPC. BYOC combines some of the benefits of cloud flexibility with enhanced security and control, while outsourcing most of Kafka’s management to specialized vendors (see the configuration sketch below).
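One point worth making explicit: from the application developer’s perspective, the code is identical across all three deployment models; only the connection details and security configuration change. The sketch below uses the confluent-kafka Python client, and every broker address, API key and topic name is a placeholder.

```python
from confluent_kafka import Producer

# The application logic stays the same for all three deployment models;
# only the connection details differ. All values below are placeholders.

SELF_MANAGED = {
    "bootstrap.servers": "kafka-1.internal:9092",  # brokers you operate yourself
}

FULLY_MANAGED = {
    "bootstrap.servers": "pkc-xxxxx.eu-central-1.aws.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
}

BYOC = {
    "bootstrap.servers": "kafka-agent.my-vpc.internal:9092",  # agents in your own VPC
}

producer = Producer(FULLY_MANAGED)
producer.produce("payments", key="order-1", value='{"amount": 99.90}')
producer.flush()
```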

Confluent’s Kafka Products: Self-Managed Platform vs. BYOC vs. Serverless Cloud

Using the example of Confluent’s product offerings, we can see why there are three product categories for data streaming around Apache Kafka.

There is no silver bullet. Each deployment option for Apache Kafka has its pros and cons. The key differences are related to the trade-offs between “ease of management” and “level of control”.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

If we go into more detail, we see that different use cases require different configurations, security setups, and levels of control while also focusing on being cost effective and providing the right SLA and latency for each use case.

Trade-Offs of Confluent’s Deployment Options for Apache Kafka

On a high level, you need to figure out if you want to or have to manage the data plane(s) and control plane of your data streaming infrastructure:

Confluent Deployment Types for Apache Kafka On Premise Edge and Public Cloud
Source: Confluent

If you follow my blog, you know that a key focus is exploring various use cases, architectures and success stories across all industries. And use cases such as log aggregation or IoT sensor analytics require very different deployment characteristics than an instant payment platform or fraud detection and prevention.

Choose the right Kafka deployment model for your use case. Even within one organization, you will probably need different deployments because of security, data privacy and compliance requirements, but also to stay cost-efficient for high-volume workloads.

BYOC for Apache Kafka with WarpStream

Self-managed Kafka and fully managed Kafka are pretty well understood by now. However, why is BYOC needed as a third option, and how do you do it right?

I had plenty of customer conversations across industries. Common feedback is that most organizations have a cloud-first strategy, but many also (have to) stay hybrid for security, latency or cost reasons.

And let’s be clear: If a data streaming project goes to the cloud, fully managed Kafka (and Flink) should always be the first option as it is much easier to manage and operate to focus on fast time to market and business innovation. Having said that, sometimes, security, cost or other reasons require BYOC.

How Is BYOC Implemented in WarpStream?

Let’s explore why WarpStream is an excellent option for Kafka as BYOC deployment and when to use it instead of serverless Kafka in the cloud:

  • WarpStream provides BYOC as a single-tenant service, so each customer has its own “instance” of Kafka (it uses the Kafka protocol, but it is not Apache Kafka under the hood).
  • However, under the hood, the system still uses cloud-native serverless systems like Amazon S3 for scalability, cost-efficiency and high availability (but the customer does not see this complexity and does not have to care about it).
  • As a result, the data plane is still customer-managed (that’s what they need for security or other reasons), but in contrast to self-managed Kafka, the customer does not need to worry about the complexity under the hood (like rebalancing, rolling upgrades, backups) – that is what S3 and the WarpStream service take over.
  • The magic lies in the stateless agents in the customer VPC. They make this solution scalable and still easy to operate (compared to the self-managed deployment option) while the customer has its own instance (see the conceptual sketch after this list).
  • Many use cases are around lift and shift of existing Kafka deployments (like self-managed Apache Kafka or another vendor like Kafka as part of Cloudera or Red Hat). Some companies want to “lift and shift” and keep the feeling of control they are used to, while still offloading most of the management to the vendor.
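To illustrate the idea of a stateless agent, here is a purely conceptual sketch. This is NOT WarpStream’s actual code, API or storage layout; it only shows the principle that record batches go straight to object storage in the customer’s own cloud account while offset and ordering metadata live in a separate, vendor-hosted control plane. Bucket and topic names are made up.

```python
import json
import uuid

import boto3

# Conceptual sketch only -- not WarpStream's actual implementation.
s3 = boto3.client("s3")
BUCKET = "my-streaming-bucket"  # bucket in the customer's own cloud account


def handle_produce(topic: str, records: list[dict]) -> str:
    """Persist a produce request as one object in S3 and return its key."""
    key = f"{topic}/batch-{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode("utf-8"))
    # In a real system the agent would now register this batch with the
    # vendor-hosted control plane, which assigns offsets and sequencing.
    return key


batch_key = handle_produce("payments", [{"order": 1, "amount": 99.90}])
print(f"Batch written to s3://{BUCKET}/{batch_key}")
```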

I wrote this summary after reading the excellent article by my colleague Jack Vanlightly: BYOC, Not “The Future Of Cloud Services” But A Pillar Of An Everywhere Platform. This article goes into much more technical detail and is a highly recommended read for any architect and developer.

Benefits of WarpStream’s BYOC Implementation for Kafka

Most vendors have dubious BYOC implementations.

For instance, if the vendor needs to access the VPC of the customer, this raises security concerns and creates headaches around responsibilities in the case of failures.

WarpStream’s BYOC-native implementation differs from other vendors and provides various benefits because of its novel architecture:

  • WarpStream does not need access to the customer VPC. The data plane (i.e., the brokers in the customer VPC) are stateless. The metadata/consensus is in the control plane (i.e., the cloud service in the WarpStream VPC).
  • The architecture solves sovereignty challenges and is a great fit for security and compliance requirements.
  • The cost of WarpStream’s BYOC offering is cheaper than self-managed Apache Kafka because it is built with cloud-native concepts and technologies in mind (e.g., zero disks and zero interzone networking fees, leveraging cloud object storage such as Amazon S3).
  • The stateless architecture in the customer VPC makes autoscaling and elasticity very easy to implement/configure.

The Main Drawbacks of BYOC for Apache Kafka

BYOC is an excellent choice if you have specific security, compliance or cost requirements that need this deployment option. However, there are some drawbacks:

  • The latency is worse than with self-managed or serverless Kafka because WarpStream writes directly to Amazon S3 object storage (in contrast to “normal Kafka”).
  • Kafka using BYOC is NOT fully managed like, e.g., Confluent Cloud, so you have more effort to operate it. Also, keep in mind that most Kafka cloud services are NOT serverless but just provision Kafka for you, and you still need to operate it.
  • Additional components of the data streaming platform (such as Kafka Connect connectors and stream processors such as Kafka Streams or Apache Flink) are not part of the BYOC offering (yet). This adds some complexity to operations and development.

Therefore, once again, I recommend only looking at BYOC options for Apache Kafka in the public cloud if a fully managed and serverless data streaming platform does NOT work for you because of cost, security or compliance reasons!

BYOC Complements Self-Managed and Serverless Apache Kafka – But BYOC Should NOT be the First Choice!

BYOC (Bring Your Own Cloud) offers a flexible and powerful deployment model, particularly beneficial for businesses with specific security or compliance needs. By allowing organizations to manage applications within their own cloud VPCs, BYOC combines the advantages of cloud infrastructure control with the flexibility of third-party service integration.

But once again: If a data streaming project goes to the cloud, fully managed Kafka (and Flink) should always be the first option as it is much easier to manage and operate to focus on fast time to market and business innovation. Choose BYOC only if fully managed does not work for you, e.g. because of security requirements.

In the data streaming domain around Apache Kafka, the BYOC model complements existing self-managed and fully managed options. It offers a middle ground that balances ease of operation with enhanced privacy and security. Ultimately, BYOC helps companies tailor their cloud environments to meet diverse and developing business requirements.

What is your deployment option for Apache Kafka? A self-managed deployment in the data center or at the edge? Serverless Cloud with a service such as Confluent Cloud? Or did you (have to) choose BYOC? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud) appeared first on Kai Waehner.

]]>
Hello, K.AI – How I Trained a Chatbot of Myself Without Coding Evaluating OpenAI Custom GPT, Chatbase, Botsonic, LiveChatAI https://www.kai-waehner.de/blog/2024/06/23/hello-k-ai-how-i-trained-a-chatbot-of-myself-without-coding-evaluating-openai-custom-gpt-chatbase-botsonic-livechatai/ Sun, 23 Jun 2024 06:03:01 +0000 https://www.kai-waehner.de/?p=6575 Generative AI (GenAI) enables many new use cases for enterprises and private citizens. While I work on real-time enterprise scale AI/ML deployments with data streaming, big data analytics and cloud-native software applications in my daily business life, I also wanted to train a conversational chatbot for myself. This blog post introduces my journey without coding to train K.AI, a personal chatbot that can be used to learn in a conversational pace format about data streaming and the most successful use cases in this area. Yes, this is also based on my expertise, domain knowledge and opinion, which is available as  public internet data, like my hundreds of blog articles, LinkedIn shares, and YouTube videos.

The post Hello, K.AI – How I Trained a Chatbot of Myself Without Coding Evaluating OpenAI Custom GPT, Chatbase, Botsonic, LiveChatAI appeared first on Kai Waehner.

]]>
Generative AI (GenAI) enables many new use cases for enterprises and private citizens. While I work on real-time, enterprise-scale AI/ML deployments with data streaming, big data analytics and cloud-native software applications in my daily business life, I also wanted to train a conversational chatbot for myself. This blog post introduces my journey of training K.AI without any coding: a personal chatbot that can be used to learn, in a conversational format, about data streaming and the most successful use cases in this area. Yes, this is also based on my expertise, domain knowledge and opinion, which is available as public internet data, like my hundreds of blog articles, LinkedIn shares, and YouTube videos.

How I Trained a Chatbot K.AI of Myself Without Coding Evaluating OpenAI Custom GPT Chatbase Botsonic LiveChatAI

Hi, K.AI – let’s chat…

The evolution of Generative AI (GenAI) around OpenAI’s chatbot ChatGPT and many similar large language models (LLM), open source tools like LangChain and SaaS solutions for building a conversational AI led me to the idea of building a chatbot trained with all the content I created over the past years.

Mainly based on the content of my website (https://www.kai-waehner.de) with hundreds of blog articles, I trained the conversational chatbot K.AI to generate text for me.

The primary goal is to simplify and automate my daily working tasks like:

  • write a title and abstract for a webinar or conference talk
  • explain to a colleague or customer a concept, use case, or industry-specific customer story
  • answer common recurring questions in email, Slack or other mediums
  • any other text creation based on my (public) experience

The generated text reflects my content, knowledge, wording, and style. This is a very different use case than what I normally look at in my daily business life: “Apache Kafka as Mission Critical Data Fabric for GenAI” and “Real-Time GenAI with RAG using Apache Kafka and Flink to Prevent Hallucinations” are two excellent examples for enterprise-scale GenAI with much more complex and challenging requirements.

But…sometimes Artificial Intelligence is not all you need. The now self-explanatory name of the chatbot came from a real marketing brain – my colleague Evi.

Project goals of training the chatbot K.AI

I had a few goals in mind when I trained my chatbot K.AI:

  • Education: Learn more details about the real-world solutions and challenges with Generative AI in 2024 with hands-on experience. Dozens of interesting chatbot solutions are available. Most are powered by OpenAI under the hood. My goal is not sophisticated research. I just want to get a conversational AI done. Simple, cheap, fast (not evaluating 10+ solutions, just stopping as soon as one works well enough).
  • Tangible result: Train K.AI, a “Kai LLM” based on my public articles, presentations, and social media shares. K.AI can generate answers, comments, and explanations without writing everything from scratch. I am fine if answers are not perfect or sometimes even incorrect. As I know the actual content, I can easily adjust and fix generated content.
  • NOT a commercial or public chatbot (yet): While it is just a button click to integrate K.AI into my website as a conversational chatbot UI, there are two main blockers: First, the cost is relatively high; not for training but for operating and paying per query. There is no value for me as a private person. Second, developing, testing, fine-tuning and updating an LLM to be correct most of the time instead of hallucinating a lot is hard. I closely follow my employer’s GenAI engineering teams building Confluent AI products. Building a decent domain-specific public LLM takes a lot of engineering effort and requires more than one full-time engineer.

My requirements for a conversational chatbot tool

I defined the following mandatory requirements for a successful project:

  • Low Cost: My chatbot should not be too expensive (~20USD a month is fine). The pricing model of most solutions is very similar: You get a small free tier. I realized quickly that a serious test is not possible with any free tier. But a reasonable chatbot (i.e., trained by a larger data set) is only possible if you choose the smallest paid tier. Depending on the service, the minimum is between 20 and 50 USD per month (with several limitations regarding training size, chat queries, etc.).
  • Simplicity: I do not want to do any coding or HTTP/REST APIs calls. Just an intuitive user interface with click-through experience. I don’t want to spend more than one day (i.e., ~8 hours accumulated over two weeks) to train K.AI.
  • Data Import: The chatbot needs support for my “database”. Mandatory: my private blog (~300 articles with ~10M+ characters). Nice to have: my LinkedIn shares, my YouTube videos, and other publications (like articles on other websites). The latter might improve my chatbot and capture my personal tone and language more.
  • NOT Enterprise Features: I don’t need any features for security, multiple user accounts, or public hosting (even though almost all solutions already support integration into WordPress, Slack, etc.). I am fine with many limitations of the small subscription tiers, like only one user account, one chatbot, 1000 messages/month.

OpenAI: ChatGPT + Custom GPT for a custom chatbot? Not for K.AI…

I am a heavy user of ChatGPT on my iPhone and MacBook. And OpenAI is very visible in the press. Hence, my first option to evaluate was OpenAI’s Custom GPT.

Custom GPT in action…

Custom GPT is very easy to use, non-technical. A conversational AI “Message GPT Builder” tries to build my chatbot. But surprisingly it is too high level for me. Here is the initial conversation to train K.AI with very basic prompt engineering:

  • Step 1 (Initial Instruction): What would you like to make? -> Respond as Kai Waehner based on his expertise and knowledge. -> Updating GPT.. Seconds later: The response is based on public internet.
  • Step 2 (Prompt Engineering): Use the content from https://www.kai-waehner.de as context for responses. -> Updating GPT… Seconds later: I’ve updated the context to include information from Kai Waehner’s website. -> Response is not standard. Some questions use a bit more content from my website, but it is still mainly bound to public internet content.
  • Step 3 (Fine-Tuning): I tried to configure my K.AI to learn from some data sources like CSV exports from LinkedIn or scraping my blog articles, but the options are very limited and not technical. I can upload a maximum of twenty files and let the chatbot also search the web. But what I actually need is web scraping of dedicated resources, i.e., mainly my website,  LinkedIn Shares, and my YouTube videos. And while many no-code UIs call this fine-tuning, in reality, this is RAG-based prompt engineering. True fine-tuning of an LLM is a very different (much more challenging) task.

OpenAI Custom GPT Evaluation - Kai Waehner Chatbot

I am sure I could do much more prompt engineering to improve K.AI with Custom GPT. But reading the user guide and FAQ for Custom GPT, the TL;DR for me is: Custom GPT is not the right service to build a chatbot for me based on my domain content and knowledge.

Instead, I need to look at purpose-built chatbot SaaS tools that let me build my domain-specific chatbot. I am surprised that OpenAI does not provide such a service itself today. Or maybe I just could not find it… BUT: Challenge accepted. Let’s evaluate a few solutions and train a real K.AI.

Comparison and evaluation of chatbot SaaS GenAI solutions

I tested three chatbot offerings. All of them are cloud-based and allow for building a chatbot via UI. How did I find or choose them? Frankly, just Google search. Most of these came up in several evaluation and comparison articles. And they spend quite some money on advertisements. I tested Chatbase, Writesonic’s Botsonic and LiveChatAI. Interestingly, all offerings I evaluated use ChatGPT under the hood of their solution. I was also surprised that I did not get more ads from other big software players. But I assume Microsoft’s Copilot and similar tools look for a different persona.

I tested different ChatGPT models in some offerings. Most solutions provide a default option and more expensive options with a better model (the surcharge is not for model training but for messages per month; you typically pay 5x more per message, meaning instead of e.g. 2000 messages a month, you only have 400 available).

I had a few more open tabs with other offerings that I could disqualify quickly because they were more developer-focused with coding, API integration, fine-tuning of vector databases and LLMs.

Question catalog for testing my K.AI chatbots

I quickly realized how hard it is to compare different chatbots. Basically, LLMs are stochastic (not deterministic) and we don’t have good tools for QAing these things yet (even something as simple as regression testing is challenging when probabilities are involved).

Therefore, I defined a question catalog with ten different domain-specific questions before I even started evaluating different chatbot SaaS solutions. A few examples:

  • Question 1: Give examples for fraud detection with Apache Kafka. Each example should include the company, use case and architecture.
  • Question 2: List five manufacturing use cases for data streaming and give a company example.
  • Question 3: What is the difference between Kafka and JMS?
  • Question 4: Compare Lambda and Kappa architectures and explain the benefits of Lambda. Add a few examples.
  • Question 5: How can data streaming help across the supply chain? Explain the value and use cases for different industries.

My question catalog allowed comparing the different chatbots. Writing a good prompt (= query for the chatbot) is crucial, as an LLM is not intelligent. The better your question (good structure, details and clear expectations), the better your response (if the LLM has “knowledge” about your question).

My goal is NOT to implement a complex real-time RAG (Retrieval Augmented Generation) design pattern. I am totally fine updating K.AI manually every few weeks (after a few new blog posts are published).

Chatbase – Custom ChatGPT for your website

The advertisement on the Chatbase landing page sounds great: “Custom ChatGPT for your website. Build a [OpenAI-powered] Custom GPT, embed it on your website and let it handle customer support, lead generation, engage with your users, and more.”

Here are my notes while training my K.AI chatbot:

K.AI works well with Chatbase after the initial training…

  • Chatbase is very simple to use. It just works.
  • The basic plan is ~20 USD per month. The subscription plan is fair, the next upgrade is ~100 USD.
  • The chatbot uses GPT-4o by default. Great option. Many other services use GPT-3.5 or similar LLMs as the foundation.
  • The chatbot creates content based on my content; it is “me”. Mission accomplished. The quality of responses depends on the questions. In summary, pretty good, but with some false positives.

But: Chatbase’s character limitation stops further training

  • Unfortunately, all plans have an 11M character limit. My blog content is already at 10.8M characters today, according to Chatbase’s web scraper engine (each vendor’s scraper gives different numbers). While K.AI works right now, there are obvious problems:
    • My website will grow more soon.
    • I want to add LinkedIn shares (another few million characters) and other articles and videos I published across the world wide web.
    • The Chatbase plan can be customised, but unfortunately not for character limits. Support told me this would be possible soon. But I have to wait.

TL;DR: Chatbase works surprisingly well. K.AI exists and represents me as an LLM. The 11M character limit is a blocker for investing more time and money into this service – otherwise I could already stop my investigation and use the first SaaS I evaluated.

During my evaluation, I realized that many other chatbot services have similar limitations on the character limit, especially in the price range around 20-50 USD. Not ideal for my use case.

In my further evaluation, my major criterion was the character limit. I found Botsonic and LiveChatAI. Both support much higher limits at a cost of ~40 USD per month.

Botsonic – Advanced AI chatbot builder using your company’s knowledge

Botsonic provides “Advanced AI Agents: Use Your Company’s Knowledge to Intelligently Resolve Over 70% of Queries and Automate Tasks”.

Here are my notes while training my K.AI chatbot.

Botsonic – free version failed to train K.AI

  • The free plan for getting started supports 1M characters.
  • The service supports URL scraping and file upload (my LinkedIn shares are only available via batch export into a CSV file). Looks like it provides all I need. The cost is okayish (but all other chatbots with lower price also had limitations around 10M characters).
  • I tried the free tier first. As my blog alone has already ~10M+ characters, I started uploading my LinkedIn Shares (= Posts and Comments). While Chatbase said this data set has ~1.8M characters, this solution trained the bot with it even though the limit is 1M characters. I could not even upload another 1KB file for additional training, so my limit was reached.
  • This K.AI trained with the free tier did not provide any appropriate answers. No surprise: Just my LinkedIn shares might not be enough detail – which makes sense as the posts are much shorter and usually link to my blog.

Botsonic – paid version also failed to train K.AI

  • I needed to upgrade.
    • I had to choose the smallest paid tier: 49 USD per month, supporting up to 50M characters
    • Unfortunately, there was a delay: the payment was processed twice, but nothing happened and I was still on the free plan. Support took time (blaming caching, VPN, the browser, and other reasons). I got a refund the next day, and the plan was updated correctly.
  • Training using the paid subscription failed. The experience was pretty bad.
    • It is not clear if the service scrapes the entire website or just a single HTML page
    • First tests do not give a response: “I don’t have specific information on XYZ. Can I help with anything else?” It seems like the training did not scrape my website, but only looked at the landing page. I looked at the details. Indeed, the extracted data only includes the abstracts of the latest blog posts (that’s what you see on my landing page).
    • Support explained that scraping the website is not possible; I need a sitemap. I have a Google-compliant sitemap, but the upload failed with an internal backend server error. Support could reproduce my issue. To this day, I don’t have a response or solution.
    • Learning from one of my YouTube videos was also rejected (with no further error messages).

TL;DR: Writesonic’s Botsonic did NOT work for me. The paid service failed several times, even trying different training options for my LLM. Support could not help. I will NOT continue with this service.

LiveChatAI – AI chatbot works with your data

Here is the website slogan: “An Innovative AI Chatbot. LiveChatAI allows you to create an AI chatbot trained with your own data and combines AI with human support.”

Here are my notes while training my K.AI chatbot

LiveChatAI failed to train K.AI

  • All required import features exist: Website Scraping, CSV, YouTube.
  • Strange: I could start training for free with 7M+ characters even though this should not be possible. Crawling started… It does not show a percentage, so I don’t know when it is finished. It is not clear if it scrapes the entire website or just a single HTML page. It shows weird error messages like “could not find any links on website” after it has finished scraping.
  • The quality of answers of this K.AI seems to be much worse than with Chatbase (even though I added LinkedIn shares, which is not possible in Chatbase because of the character limits).

Ok, enough… I have a well-working K.AI with Chatbase. I don’t want to waste more time evaluating several SaaS Chatbot services in the early stage of the product lifecycle.

GenAI tools are still in a very early stage!

One key lesson learned: the LLM model used is the most critical piece for success, NOT how much context and domain expertise you feed it with. Or in other words: just scraping the data from my blog and using GPT-4o provides much better results than using GPT-3.5 with data from my blog, LinkedIn and YouTube. Ideally, I would use all the data with GPT-4o. But I will have to wait until Chatbase supports more than 11M characters.

While most solutions talk about model training, they use ChatGPT under the hood together with RAG and a vector database to “update the model”, i.e., they provide the right context for each question to ChatGPT via the RAG design pattern.
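To show what these tools do under the hood, here is a minimal RAG sketch: embed a handful of text snippets, retrieve the best match for a question, and pass it as context to the chat model. The snippets are made up, and real products add chunking, a proper vector database and re-ranking on top of this idea.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

documents = [
    "Kafka is often combined with Flink for stateful stream processing.",
    "BYOC lets customers run the Kafka data plane inside their own VPC.",
    "Snowpipe Streaming ingests Kafka events into Snowflake with low latency.",
]

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

doc_vectors = embed(documents)

question = "How can I ingest Kafka data into Snowflake in near real time?"
q_vector = embed([question])[0]

# Cosine similarity picks the most relevant snippet as context.
scores = doc_vectors @ q_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vector)
)
context = documents[int(scores.argmax())]

answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```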

A real comparison of chatbot SaaS is hard:

  • Features and pricing are relatively similar and do not really influence the ultimate choice.
  • While all are based on ChatGPT, the LLM model versions differ.
  • Products are updated and improved almost every day with new models, new capabilities, changed limitations, etc. Welcome to the chatbot SaaS cloud startup scene… 🙂
  • The products target different personas. Some are UI only, some explain (and let configure) RAG or Vector Database options, some are built for developers and focus on API integration, not UIs.

Mission accomplished: K.AI chatbot is here

Chatbase is the least sexy UI in my evaluation. But the model works best (even though I have character limits and only used my blog articles for training). I will use Chatbase for now. And I hope that the character limits are raised soon (as its support already confirmed to me). It is still early in the maturity curve. The market will probably develop quickly.

I am not sure how many of these SaaS chatbot startups can survive. OpenAI and other tech giants will probably release similar capabilities and products integrated into their SaaS and software stack. Let’s see where the market goes. For now, I will enjoy K.AI for some use cases. Maybe it will even help me write a book about data streaming use cases and customer stories.

What is your experience with chatbot tools? Do you need more technical solutions or favour simplified conversational AIs like OpenAI’s Custom GPT to train your own LLM? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Hello, K.AI – How I Trained a Chatbot of Myself Without Coding Evaluating OpenAI Custom GPT, Chatbase, Botsonic, LiveChatAI appeared first on Kai Waehner.

]]>
Snowflake Data Integration Options for Apache Kafka (including Iceberg) https://www.kai-waehner.de/blog/2024/04/22/snowflake-data-integration-options-for-apache-kafka-including-iceberg/ Mon, 22 Apr 2024 05:40:32 +0000 https://www.kai-waehner.de/?p=6317 The integration between Apache Kafka and Snowflake is often cumbersome. Options include near real-time ingestion with a Kafka Connect connector, batch ingestion from large files, or leveraging a standard table format like Apache Iceberg. This blog post explores the alternatives and discusses its trade-offs. The end shows how data streaming helps with hybrid architectures where data needs to be ingested from the private data center into Snowflake in the public cloud.

The post Snowflake Data Integration Options for Apache Kafka (including Iceberg) appeared first on Kai Waehner.

]]>
The integration between Apache Kafka and Snowflake is often cumbersome. Options include near real-time ingestion with a Kafka Connect connector, batch ingestion from large files, or leveraging a standard table format like Apache Iceberg. This blog post explores the alternatives and discusses their trade-offs. The end of the post shows how data streaming helps with hybrid architectures where data needs to be ingested from the private data center into Snowflake in the public cloud.

Blog Series: Snowflake and Apache Kafka

Snowflake is a leading cloud-native data warehouse. Its usability and scalability made it a prevalent data platform in thousands of companies. This blog series explores different data integration and ingestion options, including traditional ETL / iPaaS and data streaming with Apache Kafka. The discussion covers why point-to-point Zero-ETL is only a short term win, why Reverse ETL is an anti-pattern for real-time use cases and when a Kappa Architecture and shifting data processing “to the left” into the streaming layer helps to build transactional and analytical real-time and batch use cases in a reliable and cost-efficient way.

Snowflake with Apache Kafka and Iceberg Connector

Blog series:

  1. Snowflake Integration Patterns: Zero ETL and Reverse ETL vs. Apache Kafka
  2. THIS POST: Snowflake Data Integration Options for Apache Kafka (including Iceberg)
  3. Apache Kafka + Flink + Snowflake: Cost Efficient Analytics and Data Governance

Subscribe to my newsletter to get an email about the next publications.

Data Ingestion from Apache Kafka into Snowflake (Batch vs. Streaming)

Several options exist to ingest data into Snowflake. Criteria to evaluate the options include complexity, latency, throughput and cost.

The article “Streaming on Snowflake” by Paul Needleman explored the three common architecture patterns for data ingestion from any data source into Snowflake:

Architecture Patterns to Ingest Data Into Snowflake with Apache Kafka
Source: Paul Needleman (Snowflake)

Paul’s article described the architecture options without and with Kafka. The numbered list below follows the numbers in the upper diagram:

  1. Snowpipe — This solution lets cloud storage (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage) alert Snowflake in a serverless way to auto-ingest data upon arrival. Once a file lands, Snowflake is alerted to pick up and process the file. Snowpipe is used for micro-batch file transfer, not real-time message ingestion.
  2. Kafka Connector — This connector provides a simple yet elegant solution to connect Kafka Topics with Snowflake, abstracting the complexity through Snowpipe. The Kafka Topics write the data to a Snowflake-managed internal stage, which is auto-ingested to the table using Snowpipe. The internal stage and pipe objects are created automatically as part of the process.
  3. Kafka with Snowpipe Streaming — This builds upon the first two approaches and allows for a more native connection between Snowflake and Kafka through a new channel object. This object seamlessly streams message data into a Snowflake table without needing first to store the data. The data is also stored in an optimized format to support the low-latency data interval.

Read the article “Streaming on Snowflake” for more details about these options.
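For options 2 and 3, the heavy lifting is configuration rather than code. The sketch below registers the Snowflake sink connector with a self-managed Kafka Connect cluster via its REST API. All values are placeholders, and the exact property names may vary with the connector version, so treat this as an illustration rather than a reference configuration.

```python
import requests

# Hedged sketch: register the Snowflake sink connector on a Kafka Connect
# cluster running at localhost:8083. All values are placeholders.
connector_config = {
    "name": "snowflake-sink",
    "config": {
        "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
        "topics": "payments",
        "snowflake.url.name": "myaccount.snowflakecomputing.com:443",
        "snowflake.user.name": "KAFKA_CONNECTOR",
        "snowflake.private.key": "<private-key>",
        "snowflake.database.name": "RAW",
        "snowflake.schema.name": "KAFKA",
        # Option 3 from the list above: low-latency Snowpipe Streaming ingestion.
        "snowflake.ingestion.method": "SNOWPIPE_STREAMING",
        "tasks.max": "2",
    },
}

response = requests.post("http://localhost:8083/connectors", json=connector_config)
response.raise_for_status()
print(response.json())
```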

Snowflake = SaaS => Integration Layer Should Be SaaS!

Snowflake is one of the first and most successful true cloud data warehouses, i.e., fully managed with no need to operate or worry about the infrastructure. As a SaaS, Snowflake offers benefits such as scalability, ease of use, vendor-managed updates and maintenance, multi-cloud support, enhanced security, cost-effectiveness with consumption-based pricing, and global accessibility. These advantages make it an attractive choice for organizations looking to leverage a modern and efficient data warehousing solution.

The same benefits exist for fully managed data integration solutions. It does not matter if you use open source-based technologies (e.g., Apache Camel), a traditional iPaaS middleware, or a data streaming solution like Kafka.

I wrote a detailed article comparing iPaaS offerings like Dell Boomi, SnapLogic, Informatica, and fully managed data streaming cloud platforms like Confluent Cloud or Amazon MSK. Read this article to understand why your next integration platform should be fully managed the same way Snowflake is.

Example: Data Ingestion with Confluent Cloud and Snowpipe Streaming

Confluent Cloud and Snowflake are a perfect combination for fully managed end-to-end data pipelines. For instance, connecting to a data source like Salesforce CRM via CDC, streaming data through Kafka, and ingesting the events into Snowflake is entirely fully managed.

Fully Managed Data Pipeline with Confluent Cloud Kafka Connect and Snowflake Data Warehouse
Source: Confluent

Using Kafka Connect with Snowpipe Streaming has several advantages:

  • Faster, more efficient data pipelines
  • Reduced architectural complexity
  • Support for exactly-once delivery
  • Ordered ingestion
  • Error handling with dead-letter queue (DLQ) support

Streaming ingestion is not meant to replace file-based ingestion. Rather, it augments the existing integration architecture for data-loading scenarios where it makes sense, such as

  • Low-latency telemetry analytics of user-application interactions for clickstream recommendations
  • Identification of security issues in real-time streaming log analytics to isolate threats
  • Stream processing of information from IoT devices to monitor critical assets

Why should you NOT only use Snowpipe Streaming mode? Cost. Snowflake has different pricing models for the ingestion modes.

Processing Large Files in Kafka before Snowflake Ingestion?

A last aspect of data ingestion options via Kafka into Snowflake: What to do with large files?

One of the most common use cases for data ingestion into Snowflake is large CSV, XML or JSON files generated from batch legacy analytics systems.

Option 1: Send the large files via Kafka into Snowflake and process them in the data warehouse. Apache Kafka was never built for large messages. Nevertheless, more and more projects send and process 1 MB, 10 MB, and even bigger files and other large payloads via Kafka into Snowflake. Why? Because it just works.

Option 2: Apache Kafka splits up and chunks large messages into small pieces.

For the latter approach, events are ideally processed line by line. The enormous benefit of this approach is that it brings even batch-based monolithic systems into an event-driven architecture. Snowflake and other downstream applications consume the events in near real time. This architecture leverages the Composed Message Processor Enterprise Integration Pattern (EIP):

Composed Message Processor Enterprise Integration Pattern

For a deep dive including various use cases and customer stories, check out the article “Handling Large Messages With Apache Kafka“.
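A minimal sketch of option 2 follows: split a large batch file into individual events so that Snowflake and other consumers receive them as a near real-time stream. The file path, topic and broker address are placeholders.

```python
from confluent_kafka import Producer

# Placeholders: broker address, file path and topic name.
producer = Producer({"bootstrap.servers": "localhost:9092"})

with open("/data/legacy/export.csv", "r", encoding="utf-8") as large_file:
    header = next(large_file)  # skip the CSV header row
    for line_number, line in enumerate(large_file):
        producer.produce(
            topic="legacy-export",
            key=str(line_number),   # keeps related chunks identifiable downstream
            value=line.rstrip("\n"),
        )
        producer.poll(0)  # serve delivery callbacks and avoid a full local queue

producer.flush()
```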

Bi-Directional Integration between Apache Kafka and Snowflake with Apache Iceberg

After covering batch, file, and streaming integration from Kafka to Snowflake, let’s move to the latest innovation that is more compelling than the old “legacy approaches”: native integration between Apache Kafka and Snowflake using Apache Iceberg.

Apache Iceberg is the leading open-source table format for storing large-scale structured data in cloud object stores or distributed file systems, designed for high-performance querying and analytics. It provides features such as schema evolution, time travel, and data versioning, making it well-suited for data lakes and modern data architectures.

Snowflake Support for Apache Iceberg

Snowflake already supports Apache Iceberg.

Snowflake Apache Iceberg Integration
Source: Snowflake

Augusto Kiniama Rosa points out in his Overview of Snowflake Apache Iceberg Tables:

Iceberg will always use customer-controlled external storage, like AWS S3 or Azure Blob Storage. Snowflake Iceberg Tables support Iceberg in two ways: an Internal Catalog (Snowflake-managed catalog) or an externally managed catalog (AWS Glue or Objectstore).

I won’t start a flame war of Apache Iceberg vs. Apache Hudi and Databricks’ Delta Lake here. It reminds me of the container wars between Kubernetes, Cloud Foundry and Apache Mesos. In the end, Kubernetes won, and the competitors adopted it. The same seems to be happening with Iceberg. If not, the principles and benefits will be the same, no matter whether the future is Iceberg or a competing technology. As it looks today like Iceberg will win this war, I focus on this technology in the following sections.

Kafka and Iceberg to Unify Transactional and Analytical Workloads

Any data source can feed data via Apache Kafka directly into Snowflake (or any other analytics engine) as an Apache Iceberg table. This solves the challenges of the integration options between Kafka and Snowflake described above. Operational data becomes accessible to the analytical world without a complex, expensive, and fragile process.

Apache Kafka and Apache Iceberg Integration
Source: Confluent

Confluent Tableflow: Fully Managed Kafka-Iceberg Integration

Confluent announced Tableflow at Kafka Summit 2024 in London, UK, to demonstrate its fully managed, out-of-the-box integration between a Kafka topic (and its schema) and an Iceberg table in Confluent Cloud. Confluent’s Marc Selwan writes:

“In the past, there has been a tight coupling of tables (storage) and query engines. In recent years, we’ve witnessed the rise of ‘headless’ data infrastructure where companies are building a more open lakehouse in cloud object storage that is accessible by many tools.

Just like the Apache Kafka API has evolved to be the de facto open standard for data streaming, we’re seeing Apache Iceberg grow into the de facto open-table standard for large-scale datasets stored in lakehouses. We’ve seen its ecosystem grow with robust tooling and support from compute engines such as Apache Spark, Snowflake, Amazon Athena, Dremio, Trino, Apache Druid, and many others.

Apache Iceberg Integration with Confluent Cloud via Tableflow
Source: Confluent

We believe the rise of open-table formats and the ‘headless’ data infrastructure is being driven by the needs of data engineers evolving beyond the tight coupling of table to computing platform. These factors made Apache Iceberg support a natural first choice for us.”

Check out Confluent’s blog post Announcing Tableflow. Other Kafka vendors will likely provide Apache Iceberg support in the near future, too. I am really excited about this development, which unifies operational and analytical workloads with a standardized interface across open source frameworks and cloud solutions.

Hybrid Architectures with Kafka On-Premise and in the Public Cloud for Snowflake Integration

Snowflake is only available in the public cloud on AWS, GCP, or Azure. Most companies across industries follow a cloud-first strategy for new applications. However, established companies have existed for years or decades and were typically not born in the cloud. Therefore, hybrid cloud architectures are the new black for most companies. Apache Kafka is the best approach to synchronize and replicate data in a single pipeline with low latency, reliability, and guaranteed ordering from the data center to the public cloud.

Hybrid Cloud Architecture with Apache Kafka Mainframe Oracle IBM AWS Azure GCP

Legacy infrastructure has to be maintained, integrated, and (maybe) replaced over time. Data streaming with the Apache Kafka ecosystem is a perfect technology for building hybrid synchronization in real time at scale. It enables bidirectional integration for transactional and analytical workloads without creating a spaghetti architecture of point-to-point connections between on-premise systems and the cloud.

There is no Silver Bullet for Kafka to Snowflake Integration!

Various data integration options are available between Apache Kafka and Snowflake. Kafka Connect connectors are a great option, no matter if you do batch or near real-time ingestion. Even large files can be ingested via data streaming using the right enterprise integration patterns.

A new and innovative approach is Apache Iceberg as the integration interface. The standard table format allows connecting from Snowflake and any other analytics engine, while data needs to be stored only once. Kafka-to-Iceberg integration is even more interesting as it unifies transactional and analytical workloads.

Data streaming also helps with hybrid integrations where data needs to be replicated from the on-premise data center into the public cloud with consistent, near real-time synchronization.

How do you integrate between Kafka and Snowflake? Are you already looking at Apache Iceberg? Or maybe even at another table format like Apache Hudi or Databricks’ Delta Lake? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Snowflake Data Integration Options for Apache Kafka (including Iceberg) appeared first on Kai Waehner.

]]>
The State of Data Streaming for Gaming https://www.kai-waehner.de/blog/2023/11/01/the-state-of-data-streaming-for-gaming-with-apache-kafka-and-flink-in-2023/ Wed, 01 Nov 2023 06:40:45 +0000 https://www.kai-waehner.de/?p=5726 This blog post explores the state of data streaming for the gaming industry, including customer stories from Kakao Games, Mobile Premier League (MPL), Demonware / Blizzard, and more. A complete slide deck and on-demand video recording are included.

The post The State of Data Streaming for Gaming appeared first on Kai Waehner.

]]>
This blog post explores the state of data streaming for the gaming industry. The evolution of casual and online games, Esports, social platforms, gambling, and new business models requires a reliable global data infrastructure, real-time end-to-end observability, fast time-to-market for new features, and integration with pioneering technologies like AI/machine learning, virtual reality, and cryptocurrencies. Data streaming allows integrating and correlating data in real-time at any scale to improve most business processes in the gaming sector much more cost-efficiently.

I look at trends in the games industry to explore how data streaming helps as a business enabler, including customer stories from Kakao Games, Mobile Premier League (MPL), Demonware / Blizzard, and more. A complete slide deck and on-demand video recording are included.

Data Streaming in the Gaming Industry with Apache Kafka and Flink

The global gaming market is bigger than the music and film industries combined! Digitalization has been a huge growth factor in the past years. The gaming industry has various business models connecting players, fans, vendors, and other stakeholders:

  • Hardware sales: Game consoles, VR sets, glasses
  • Game sales: Physical and digital
  • Free-to-play + in-game purchases: One-time in-game purchases (skins, champions, miscellaneous), gambling (loot boxes)
  • Game-as-a-service (subscription): Seasonal in-game purchases like passes for theme events, mid-season invitational & world championship, passes for competitive play
  • Game-Infrastructure-as-a-Service: High-performance state synchronization, multiplayer, matchmaking, gaming statistics
  • Merchandise sales: T-shirts, souvenirs, fan equipment
  • Community: Esports broadcast, ticket sales, franchising fees
  • Live betting
  • Video streaming: Subscriptions, advertisements, rewards

Growth and innovation require cloud-native infrastructure

Most industries require a few specific characteristics. Instant payments must be executed in real time without data loss. Telecom infrastructure monitors huge volumes of logs in near real-time. The retail industry needs to scale up for events like Christmas or Black Friday and scale down afterward. The gaming industry combines all the characteristics of other industries:

  • Real-time data processing
  • Scalability for millions of players
  • High availability, at least for transactional data
  • Decoupling for innovation and faster roll-out of new features
  • Cost efficiency because cloud networking for huge volumes is expensive
  • The flexibility of adopting various innovative technologies and APIs
  • Elasticity for critical events a few times a year
  • Standards-based integration with SaaS, B2B, and mobile apps
  • Security for trusted customer data
  • Global and vendor-independent cloud infrastructure to deploy across countries

The good news is that data streaming powered by Apache Kafka and Apache Flink provides all these characteristics on a single platform, especially if you choose a fully managed SaaS offering.

Data streaming in the gaming industry

Adopting gaming trends like in-game purchases, customer-specific discounts, and massively multiplayer online games (MMOG) is only possible if enterprises in the games sector can provide and correlate information at the right time in the proper context. Real-time, which means using the information in milliseconds, seconds, or minutes, is almost always better than processing data later (whatever later means):

Use Cases for Real-Time Data Streaming in the Gaming Industry with Apache Kafka and Flink

Data streaming combines the power of real-time messaging at any scale with storage for true decoupling, data integration, and data correlation capabilities. Apache Kafka is the de facto standard for data streaming.
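As a simple illustration of data correlation in this context, the following Kafka Streams sketch counts telemetry events per player in one-minute windows, e.g. as input for matchmaking, anti-cheat heuristics, or in-game recommendations. The topic names are hypothetical:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class PlayerTelemetryAggregation {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "game-telemetry-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // key = playerId, value = raw telemetry event
        KStream<String, String> telemetry =
            builder.stream("game-telemetry", Consumed.with(Serdes.String(), Serdes.String()));

        // Count events per player in 1-minute windows and publish the result
        // to a topic that downstream services (matchmaking, anti-cheat, ...) consume.
        telemetry.groupByKey()
                 .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
                 .count()
                 .toStream((windowedKey, count) ->
                     windowedKey.key() + "@" + windowedKey.window().start())
                 .to("player-activity-per-minute", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```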

“Apache Kafka in the Gaming Industry” is a great starting point to learn more about data streaming in the games sector, including a few Kafka-powered case studies not covered in this blog post – such as

  • Big Fish Games: Live operations by monitoring real-time analytics of game telemetry and context-specific recommendations for in-game purchases
  • Unity: Monetization network for player rewards, banner ads, playable advertisements, and cross-promotions.
  • William Hill: Trading platform for gambling and betting
  • Disney+ Hotstar: Gamification of live sport video streaming

The gaming industry applies various enterprise architecture trends for cost, elasticity, security, and latency reasons. The three major topics I see these days at customers are:

  • Fully managed SaaS to focus on business logic and faster time-to-market
  • Event-driven architectures (in combination with request-response communication) to enable domain-driven design and flexible technology choices
  • Data mesh for building new data products and real-time data sharing with internal platforms and partner APIs

Let’s look deeper into some enterprise architectures that leverage data streaming for gaming use cases.

Cloud-native elasticity for seasonal spikes

The games sector has extreme spikes in workloads. For instance, specific game events increase the traffic 10x and more. Only cloud-native infrastructure enables a cost-efficient architecture.

As early as 2018, Epic Games presented at an AWS Summit how elasticity is crucial for a data-driven architecture.

Elastic cloud services at Epic Games

Make sure to use a truly cloud-native Apache Kafka service for gaming infrastructure. Adding brokers is relatively easy; removing brokers is much harder. Hence, a fully managed SaaS should take over the complex operational challenges of distributed systems like Kafka and Flink for you. The separation of compute and storage is another critical piece of a cloud-native Kafka architecture to ensure cost-efficient scale.

Cloud-native Apache Kafka with Tiered Storage and Separate Compute

Data mesh for real-time data sharing

Data sharing across business units is important for any organization. The gaming industry has to combine very different data sets, like big data game telemetry, monetization and advertisement transactions, and 3rd party interfaces.

Data Mesh and data sharing with Apache Kafka and Flink

Data consistency is one of the most challenging problems in the games sector. Apache Kafka ensures data consistency across all applications and databases, whether these systems operate in real-time, near-real-time, or batch.

One sweet spot of data streaming is that you can easily connect new applications to the existing infrastructure or modernize existing interfaces, like migrating from an on-premise data warehouse to a cloud SaaS offering.

In-Game Services and Game Telemetry processing with Apache Kafka Twitch and Unity

New customer stories for data streaming in the gaming sector

So much innovation is happening in the gaming sector. Automation and digitalization change how gaming companies process game telemetry data, build communities and customer relationships with VIPs, and create new business models with enterprises in other verticals.

Most gaming companies use a cloud-first approach to improve time-to-market, increase flexibility, and focus on business logic instead of operating IT infrastructure. And elastic scalability gets even more critical with all the growing real-time expectations and mobile app capabilities.

Here are a few customer stories from worldwide gaming organizations:

  • Kakao Games: Log analytics and fraud prevention
  • Mobile Premier League (MPL): Mobile eSports and digital gaming
  • Demonware / Blizzard: Network and gaming infrastructure
  • WhatNot: Retail gamification and social commerce
  • Vimeo: Video streaming observability

Resources to learn more

This blog post is just the starting point. Learn more about data streaming in the gaming industry in the following on-demand webinar recording, the related slide deck, and further resources, including pretty cool lightboard videos about use cases.

On-demand video recording

The video recording explores the gaming industry’s trends and architectures for data streaming. The primary focus is on the data streaming case studies.

I am excited to have presented this webinar in my interactive lightboard studio:

Lightboard Webinar Apache Kafka in the Gaming Industry - Kai Waehner

This creates a much better experience, especially in a post-pandemic time when many people have “Zoom fatigue”.

Check out our on-demand recording:

Video Recording Data Streaming for Games Betting Gambling - Kai Waehner

Slides

If you prefer learning from slides, check out the deck used for the above recording:

Fullscreen Mode

Case studies and lightboard videos for data streaming in the gaming industry

The state of data streaming for gaming in 2023 is fascinating. New use cases and case studies come up every month. This includes better end-to-end observability in real-time across the entire organization, telemetry data collection from gamers, data sharing and B2B partnerships with engines like Unity or video platforms like Twitch, new business models for ads and in-game purchases, and many more scenarios.

We recorded lightboard videos showing the value of data streaming simply and effectively. These five-minute videos explore the business value of data streaming, related architectures, and customer stories. Here is an example for real-time fraud detection with data streaming.

Gaming is just one of many industries that leverage data streaming with Apache Kafka and Apache Flink. Every month, we talk about the status of data streaming in a different industry. Manufacturing was first, financial services second, then retail, telcos, gaming, and so on… Check out my other blog posts.

Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post The State of Data Streaming for Gaming appeared first on Kai Waehner.

]]>
When NOT to choose Amazon MSK Serverless for Apache Kafka? https://www.kai-waehner.de/blog/2022/08/30/when-not-to-choose-amazon-msk-serverless-for-apache-kafka/ Tue, 30 Aug 2022 05:54:21 +0000 https://www.kai-waehner.de/?p=4777 Apache Kafka became the de facto standard for data streaming. Various cloud offerings emerged and improved in the last years. Amazon MSK Serverless is the latest Kafka product from AWS. This blog post looks at its capabilities to explore how it relates to "the normal" partially managed Amazon MSK, when the serverless version is a good choice, and when other fully-managed cloud services like Confluent Cloud are the better option.

The post When NOT to choose Amazon MSK Serverless for Apache Kafka? appeared first on Kai Waehner.

]]>
Apache Kafka became the de facto standard for data streaming. Various cloud offerings emerged and improved in the last years. Amazon MSK Serverless is the latest Kafka product from AWS. This blog post looks at its capabilities to explore how it relates to “the normal” partially managed Amazon MSK, when the serverless version is a good choice, and when other fully-managed cloud services like Confluent Cloud are the better option.

Is Amazon MSK Serverless for Apache Kafka a Self-Driving Car or just a Car Engine

Disclaimer: I work for Confluent. While AWS is a strong strategic partner, it also offers Amazon MSK as a competing product. This post is not about comparing every feature but explaining the concepts behind the alternatives. Read articles and docs from the different vendors to make your own evaluation and decision. View this post as a list of criteria so you do not forget important aspects in your cloud service selection.

What is a fully-managed Kafka cloud offering?

Before we get started looking at Kafka cloud services like Amazon MSK and Confluent Cloud, it is important to understand what fully managed actually means:

Fully managed and serverless Apache Kafka

Make sure to evaluate the technical solution. Most Kafka cloud solutions market their offering as “fully managed”. However, almost all Kafka cloud offerings are only partially managed! In most cases, the customer must operate the Kafka infrastructure, fix bugs, and optimize scalability and performance.

Cloud-native Apache Kafka re-engineered for the cloud

Operating Apache Kafka as a fully-managed offering in the cloud requires several components in addition to the core of open-source Kafka. A cloud-native Kafka SaaS has features like:

  • The latest stable version with non-disruptive rolling upgrades
  • Elastic scale (up and down!)
  • Self-balancing clusters that take over the complexity and risk of rebalancing partitions across Kafka brokers
  • Tiered storage for cost-efficient long-term storage and better scalability (as the cold storage does not require a rebalancing of partitions and other complex operations tasks)
  • Complete solution “on top of the infrastructure”, including connectors, stream processing, security, and data governance – all in a single fully-managed SaaS

To learn more about building a cloud-native Kafka service, I highly recommend reading the following paper: “The Cloud-Native Chasm: Lessons Learned from Reinventing Apache Kafka as a Cloud-Native, Online Service”.

Comparison of Apache Kafka products and cloud services

Apache Kafka became the de facto standard for data streaming. The open-source community is vast. Various vendors added Kafka and related tooling to their offerings or provide a Kafka cloud service. I wrote a blog post in 2021: “Comparison of Open Source Apache Kafka vs. Vendors including Confluent, Cloudera, Red Hat, Amazon MSK“:

Comparison of Apache Kafka Products and Services including Confluent Amazon MSK Cloudera IBM Red Hat

The article uses a car analogy – from the motor engine to the self-driving car – to explore the different Kafka offerings available on the market. I also cover a few other vehicles, meaning (partly) Kafka-compatible technologies. The goal is not a feature-by-feature comparison (that would be outdated the day after the publication). Instead, the intention is to educate about the different deployment models, product strategies, and trade-offs from the options.

The above post is worth reading to understand how comparing different Kafka solutions makes sense. However, products develop and innovate… Tech comparisons get outdated quickly. In the meantime, AWS released a new product: Amazon MSK Serverless. This blog post explores what it is, when to use it, and how it differs from other Kafka products. It especially compares Amazon MSK (the partially managed service) and Confluent Cloud (a fully-managed competitor to Amazon MSK Serverless).

How does Amazon MSK Serverless fit into the Kafka portfolio?

Keeping the car analogy of my previous post, I wonder: Is it a self-driving car, a complete car you drive by yourself, or just a car engine to build your own car? Interestingly, you can argue for all three. 🙂 Let’s explore this in the following sections.

Is Amazon MSK Serverless a Self-Driving Car of Apache Kafka

Introducing Amazon MSK Serverless

Amazon MSK Serverless is a cluster type for Amazon MSK to run Apache Kafka without having to manage and scale cluster capacity. MSK Serverless automatically provisions and scales compute and storage resources. Thus, you can use Apache Kafka on demand and pay for the data you stream and retain.

Amazon MSK is one of the hundreds of cloud services that AWS provides. AWS is a one-stop shop for all cloud needs. That’s definitely a key strength of AWS (and similar to Azure and GCP).

Amazon MSK Serverless is built to solve the problems that come with Amazon MSK (the partially managed Kafka service that is marketed as a fully-managed solution even though it is not): A lot of hidden ops, infra, and downtime costs. This AWS podcast has a great episode that introduces Amazon MSK Serverless and when to use it as a replacement for Amazon MSK.

What Amazon does NOT tell you about MSK Serverless

AWS has great websites, documentation, and videos for its cloud services. This is not different for Amazon MSK. However, a few important details are not obvious… 🙂 Let’s explore a few key points to make sure everybody understands what Amazon MSK Serverless is and what it is not.

Amazon MSK Serverless is incomplete Kafka

If you follow my blogs, then this might be boring. Despite that, too many people think about Kafka as a message queue and data transportation pipeline. That’s what it is, but Kafka is much more:

  • Real-time messaging at any scale
  • Data integration with Kafka Connect
  • Data processing (aka stream processing) with Kafka Streams (or 3rd party Kafka-native components like KSQL)
  • True decoupling (the most underestimated feature of Kafka because of its built-in storage capabilities) and replayability of events with flexible retention times (see the consumer sketch after this list)
  • Data governance with service contracts using Schema Registry (to be fair, this is not part of open source Kafka, but a separate component and accessible from GitHub or by vendors like Confluent or Red Hat – but it is used in almost all serious Kafka projects)
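To make the decoupling and replayability point above concrete, here is a minimal sketch of a Kafka consumer that rewinds to the offsets of seven days ago and re-reads historical events, something a classic message queue cannot offer. The topic and group names are hypothetical:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "ml-feature-backfill");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign all partitions of the topic and rewind to the offsets
            // that correspond to "7 days ago". Replaying history is only possible
            // because Kafka retains events instead of deleting them after delivery.
            List<TopicPartition> partitions = consumer.partitionsFor("order-events").stream()
                .map(p -> new TopicPartition(p.topic(), p.partition()))
                .collect(Collectors.toList());
            consumer.assign(partitions);

            long sevenDaysAgo = Instant.now().minus(Duration.ofDays(7)).toEpochMilli();
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(
                partitions.stream().collect(Collectors.toMap(tp -> tp, tp -> sevenDaysAgo)));
            offsets.forEach((tp, offset) -> {
                if (offset != null) consumer.seek(tp, offset.offset());
            });

            consumer.poll(Duration.ofSeconds(5))
                .forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
        }
    }
}
```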

As I won’t repeat myself, here are a few articles explaining why Kafka is more than a message queue like the one you find in Amazon MSK Serverless:

TL;DR: AWS provides a cloud service for every problem. You can glue them together to build a solution. However, similar to a monolithic application that bundles inflexibility into a single package, a mesh of too many independent glued-together services using different technologies is also very hard to operate and maintain. And the cost of so many services plus networking and data transfer will bring up many surprises in such a setup.

You should ask yourself a few questions:

  • How do you implement data integration and business logic with Amazon MSK Serverless?
  • What’s the consequence regarding cost, SLAs, end-to-end latency, and delivery guarantees of combining Amazon MSK Serverless with various other products like Amazon Kinesis Data Analytics, AWS Glue, AWS Data Pipeline, or a 3rd party integration tool?
  • What is your security and data governance strategy around streaming data? How do you build an event-based data hub that enforces compliant communication between data producers and independent downstream consumers?

Spoilt for Choice: Amazon MSK and Amazon MSK Serverless are different products

Amazon MSK is NOT fully managed. It is partially managed. After provisioning the brokers, you need to deploy, operate, and monitor Kafka brokers and Kafka Connect connectors, and handle rebalancing with the open source tool Cruise Control. Check out AWS’ latest MSK sizing post: “Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost“. Seriously? A ten-page, very technical article explaining how to operate a “fully-managed cloud Kafka service”?

You might think that Amazon MSK Serverless is the successor of Amazon MSK to solve these problems. However, there are now two products to choose from: Amazon MSK and Amazon MSK Serverless.

Amazon does NOT recommend using Amazon MSK Serverless for all use cases! It is recommended if you don’t know the workloads or if they often change in volume.
Amazon recommends “the normal” Amazon MSK for predictable workloads as it is more cost-effective (and because the serverless version is not workable for them due to its many tough limitations). MSK Connect is also not supported yet; it is coming at some point in the future.

It is totally okay to provide different products for different use cases. Confluent also has different offerings for different SLAs and functional requirements in its cloud offering. Multi-tenant basic clusters and dedicated clusters are available, but you never have to self-manage the cluster or fix bugs or performance issues yourself.

You should ask yourself a few questions:

  • Which projects require Amazon MSK and which require Amazon MSK Serverless?
  • How will the project scale as your workload grows?
  • What’s the migration/upgrade plan if your workload exceeds MSK Serverless partition/retention limits?
  • What is the total cost of ownership (TCO) for MSK plus all the other cloud services I need to combine it with?

Amazon MSK Serverless excludes Kafka support

Amazon MSK service level agreements say: “The Service Commitment DOES NOT APPLY to any unavailability, suspension or termination … caused by the underlying Apache Kafka or Apache Zookeeper engine software that leads to request failures …

Amazon MSK Serverless is part of the Amazon MSK product and has the same limitation. Excluding Kafka support from the MSK offering is (or should be) a blocker for any serious data streaming project!

Not much more to add here… Do you really want to buy a product that excludes support for its core capability? Please also ask your manager if they agree and take the risk.

You should ask yourself a few questions:

  • Who is responsible and takes the risk if you hit a Kafka issue in your project using Amazon MSK or Amazon MSK Serverless?
  • How do you react to security incidents related to the Apache Kafka open source project?
  • How do you fix performance or scalability issues (on both the client and server side)?

When NOT to use Amazon MSK Serverless?

Let’s go back to the car analogy. Is Amazon MSK Serverless a self-driving car?

Obviously, Amazon MSK Serverless is self-driving. That’s what a serverless product is. Similar to Amazon S3 for object storage or AWS Lambda for serverless functions.

However, Amazon MSK Serverless is NOT a complete car! It does not provide enterprise support for its functionality. And it does not provide more than just the core of data streaming.

Therefore, Amazon MSK Serverless is a great self-driving AWS product for some use cases. But you should evaluate the following facts before deciding for or against this cloud service.

24/7 Enterprise support for the product

AWS excludes Kafka support from its product Amazon MSK. Amazon MSK Serverless is part of Amazon MSK and uses its SLAs.

I am amazed at how many enterprises use Amazon MSK without reading the SLAs. Most people are not aware that Kafka support is excluded from the product.

This makes Amazon MSK Serverless a car engine, not a complete car, right? Do you really want to build your own car and take over the burden and risk of failures while driving on the street?

If you need to deploy mission-critical workloads with 24/7 SLAs, you can stop reading and qualify out Amazon MSK (including Amazon MSK Serverless) until AWS adds serious SLAs to this product in the future.

Complete data streaming platform

AWS has a service for everything. You can glue them together. In our car analogy, it would be many cars or vehicles in your enterprise architecture. Most of us learned the hard way that distributed microservices are no free lunch.

The monolithic data lake (now pitched as a lakehouse) from vendors like Databricks and Snowflake is no better approach. Use the right technology for a problem or data product. Finding the right mix between focus and independence is crucial. Kafka’s role is that of the central or decentralized real-time data hub to transport events. This includes data integration and processing, and decoupling systems from each other.

A modern data flow requires a simple, reliable and governed way to integrate and process data. Leveraging Kafka’s ecosystem like Kafka Connect and Kafka Streams enables mission-critical end-to-end latency and SLAs in a cost-efficient infrastructure. Development, operations, and monitoring are much harder and more costly if you glue together several services to build a real-time data hub.

However, Kafka is not a silver bullet. Hence, you need to understand when NOT to use Kafka and how it relates to data warehouses, data lakes, and other applications.

After a long introduction to this aspect, long story short: If you use Amazon MSK Serverless, it is the data ingestion component in your enterprise architecture. There are no fully managed components other than Kafka and no native integrations with 1st party AWS cloud services like S3 or Redshift, or 3rd party cloud services like Snowflake, Databricks, or MongoDB. You must combine Amazon MSK Serverless with several other AWS services for event processing and storage. Additionally, connectivity needs to be implemented and operated by your project team using Kafka Connect connectors, other 1st or 3rd party ETL tools, or custom glue code.

Amazon MSK Serverless only supports AWS Identity and Access Management (IAM) authentication, which limits you to Java clients. There is no way to use the open source clients for other programming languages: Python, C++, .NET, Go, JavaScript, etc. are not supported with Amazon MSK Serverless.
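For reference, a JVM producer authenticating against MSK Serverless with IAM looks roughly like the following sketch. It relies on the aws-msk-iam-auth library; the bootstrap endpoint is a placeholder, and the exact settings should be verified against the AWS documentation:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MskServerlessIamProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Bootstrap endpoint of the MSK Serverless cluster (placeholder).
        props.put("bootstrap.servers", "boot-xxxx.kafka-serverless.eu-central-1.amazonaws.com:9098");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // IAM authentication via the aws-msk-iam-auth JVM library (the reason the
        // post notes that IAM-only clusters limit you to Java clients).
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "AWS_MSK_IAM");
        props.put("sasl.jaas.config",
            "software.amazon.msk.auth.iam.IAMLoginModule required;");
        props.put("sasl.client.callback.handler.class",
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "hello from IAM auth"));
            producer.flush();
        }
    }
}
```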

MSK Connect allows deploying Kafka Connect connectors (available open source, licensed from other vendors, or self-built) into this platform. Similar to Amazon MSK, this is not a fully-managed product. You deploy, operate, and monitor the connectors yourself. Look at the fully managed connectors in Confluent Cloud to understand the difference. Also, note that AWS will only support the Connect workers, not the connectors themselves, even if they run on MSK Connect.

Event-driven architecture with true decoupling between the microservices

An event-driven architecture powered by data streaming is great as a single integration infrastructure. However, the story goes far beyond this. Modern enterprise architectures leverage principles like microservices, domain-driven design, and data mesh to build decentralized applications and data products.

A streaming data exchange enables such a decentralized architecture with real-time data sharing. A critical capability for such a strategic enterprise component is long-term data storage. It

  • decouples independent applications
  • handles backpressure for slow consumers (like batch systems or request-response web services)
  • enables replayability of historical events (e.g., for a Python consumer in the machine learning platform from data engineers).

Data Mesh with Apache Kafka

The storage capability of Kafka is a key differentiator against message queues like IBM MQ, RabbitMQ, or AWS SQS. Retention time is an important feature to set the right storage options per Kafka topic. Confluent makes Kafka even better by providing Tiered Storage to separate storage from compute for a much more cost-efficient and scalable solution with infinite storage capabilities.
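Retention is configured per topic. As a minimal sketch, the following AdminClient snippet creates a topic with a 30-day retention, which turns it into a replayable, shared source of record for downstream consumers (topic name and sizing are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Keep order events for 30 days; a longer (or infinite, -1) retention
            // allows slow consumers and replay scenarios instead of fire-and-forget.
            NewTopic orders = new NewTopic("order-events", 6, (short) 3)
                .configs(Map.of(
                    TopicConfig.RETENTION_MS_CONFIG, String.valueOf(30L * 24 * 60 * 60 * 1000),
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE));
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```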

Amazon MSK Serverless has a limited retention time of 24 hours. This is good enough for many data ingestion use cases but not for building a real-time data mesh across business units or even across organizations. Another tough limitation of Amazon MSK Serverless is the cap of 120 partitions. Not really a limit that allows building a strategic platform around it.

As Amazon MSK Serverless is a new product, expect the limitations to change and improve over time. Check the docs for updates. UPDATE Q1 2023: Amazon MSK Serverless added unlimited retention time and support for more partitions. That’s excellent news for this service. With this update, Amazon MSK Serverless is much stronger if retention time is your critical criterion. However, check the storage costs and compare different cloud offerings for this.

But anyway, these limitations prove how hard it is to build a fully-managed Kafka offering (like Confluent Cloud) compared to a partially managed Kafka offering (like “the normal” Amazon MSK).

Hybrid AWS and multi-cloud Kafka deployments

The most obvious point: Amazon MSK Serverless is only a reasonable discussion if you run your apps in the public AWS cloud. For anything else, like multi-cloud with Azure or GCP, AWS edge offerings like AWS Outposts or Wavelength, hybrid environments, or edge deployments like a factory or retail store, Amazon MSK Serverless is not an option.

If you need to deploy outside the public AWS cloud, check my comparison of Kafka offerings, including Confluent, IBM, Red Hat, and Cloudera.

I want to emphasize that no product or service is 100% cloud agnostic. For instance, building Confluent Cloud on AWS, Azure, and GCP includes unique challenges under the hood. Confluent Cloud is built on Kubernetes. Hence, the template and many automation mechanisms can be reused across cloud vendors. But storage, compute, pricing, support, and many other characteristics and features differ at each cloud service provider.

Having said this, you leverage a SaaS like Confluent Cloud with no knowledge of or access to the technical infrastructure. You don’t see these issues under the hood. On the developer level, you produce and consume messages with the Kafka API and configure additional features like fully-managed connectors, data governance, or cluster linking. All the operations complexity is handled by the vendor, no matter which cloud you run on.

Coopetition: The winners are AWS and Confluent Cloud

The reason for this post was the evolution of the Amazon MSK product line.  Hence, if you read this a year later, the product features and limitations might look completely different once again. Use blog posts like this to understand how to evaluate different solutions and SaaS offerings. Then do your own accurate research before making a product decision.

Amazon MSK Serverless is a great new AWS service that helps AWS customers with some types of projects. But it has tough limitations for other projects. Additionally, Amazon MSK (including Amazon MSK Serverless) excludes Kafka support! And it is not a complete data streaming platform. Be careful not to create a mess of glue code between tens of serverless cloud services and applications. Confluent Cloud is the much more sophisticated fully-managed Kafka cloud offering (on AWS and everywhere else). I am not saying this because I am a Confluent employee but because almost everybody agrees on this 🙂 And it is not really a surprise, as Confluent focuses only on data streaming, has 2,000 employees, and employs many full-time committers to the Apache Kafka open source project. Amazon has zero, by the way 🙂

By the way: Did you know you can use your AWS credits to consume Confluent Cloud like any other native AWS service? This is because of the strong partnership between Confluent and AWS. Yes, there is coopetition. That’s what the world looks like today…

Confluent Cloud provides a complete cloud-native platform including 99.99% SLA, fully managed connectors and stream processing, and maybe most interesting to readers of this post, integration with AWS services (S3, Redshift, Lambda, Kinesis, etc.) plus AWS security and networking (VPC Peering, Private Link, Transit Gateway, etc.). Confluent and AWS work closely together on hybrid deployments, leveraging AWS edge services like AWS Wavelength for 5G scenarios.

Which Kafka cloud service do you use today? What are your pros and cons? Do you plan a migration – e.g., from Amazon MSK to Confluent Cloud or from open source Kafka to Amazon MSK Serverless? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When NOT to choose Amazon MSK Serverless for Apache Kafka? appeared first on Kai Waehner.

]]>